Btrfs: broken by design?
Edward Shishkin, perhaps better known for his efforts to keep reiser4 development alive, first posted some concerns on June 3. It seems he ran a simple test: create a new Btrfs filesystem, then create 2048-byte files until space runs out. Others have talked about suboptimal space efficiency in Btrfs before, but Edward was still surprised that he was able to use only 17% of the nominal space in the filesystem before it was reported as being full. Such poor efficiency was, according to Edward, evidence that Btrfs was "broken by design" and should not be used:
Part of the problem comes down to the use of "inline extents" in Btrfs. The core data structure on a Btrfs filesystem is a B-tree which provides access to all of the objects stored in the filesystem. For larger files, the actual file data is stored in extents, which are pointed to from within the tree. Small extents, though, can be stored in the tree itself, hopefully yielding both better space efficiency and better performance. If these extents are sized inconveniently, though, they can waste a lot of space: there is only room for one 2048-byte inline extent in a B-tree node, leaving 1800 bytes or so of unused space. That is a lot of internal fragmentation.
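The arithmetic behind that waste can be sketched with assumed round numbers; the leaf size and per-leaf overhead below are illustrative figures, not Btrfs's exact on-disk layout:

```python
# Rough arithmetic for inline-extent fragmentation in a B-tree leaf.
# LEAF_SIZE and OVERHEAD are assumed round numbers for illustration,
# not the exact Btrfs on-disk format.
LEAF_SIZE = 4096        # a common metadata leaf size
OVERHEAD = 250          # assumed header/item bookkeeping per leaf
INLINE_EXTENT = 2048    # the file size used in Edward's test

usable = LEAF_SIZE - OVERHEAD
extents_per_leaf = usable // INLINE_EXTENT
wasted = usable - extents_per_leaf * INLINE_EXTENT

print(extents_per_leaf)  # 1 -- only one 2048-byte extent fits per leaf
print(wasted)            # 1798 -- roughly the "1800 bytes or so" above
```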
As noted in Chris Mason's response, there are two approaches which can be taken to mitigate this kind of problem. One is to turn off inline extents altogether; Btrfs has a max_inline= mount option which can be used for just that purpose. The other approach would be to allow inline extents to be split between tree nodes so that the pieces could be sized to fill those nodes exactly. Btrfs cannot do that, and probably will not be able to anytime soon:
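For those wanting to try the first approach, the mount option takes a byte limit; the device and mount point below are placeholders, of course:

```shell
# Disable inline extents entirely by capping their maximum size at zero.
# /dev/sdb and /mnt are placeholder names for your own setup.
mount -o max_inline=0 /dev/sdb /mnt

# Or keep inline extents for very small files while capping them
# below the problematic 2048-byte size:
mount -o max_inline=512 /dev/sdb /mnt
```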
Chris also noted that most of the other variable-size items stored in B-tree nodes - extended attributes, for example - can be split between nodes if need be. So these items should not cause fragmentation problems; it's mainly the inline extents which are at fault there.
But, as Edward pointed out, there's more to the problem than inline extents. In his investigations, he's found numerous places where groups of nearly-empty nodes exist; some were less than 1% utilized. That, in all likelihood, is the real source of the worst space utilization problems. To Edward, this behavior is another sign that the algorithms used in Btrfs are all wrong and in need of a redesign.
Chris sees it a little differently, though:
He has promised to track it down and post a fix. Between the bug fix and turning off inline extents (or, at least, reducing their maximum size), it is hoped that the worst space utilization problems in Btrfs will be no more.
That fix has not been posted as of this writing, so its effectiveness cannot yet be judged. But, chances are, this is not a case of a filesystem needing a fundamental redesign. Instead, all it needs is more extensive testing, some performance tuning, and, inevitably, some bug fixes. The good news is that the process seems to be working as it should be: these problems have been found before any sort of wide-scale deployment of this very new filesystem.
Posted Jun 24, 2010 8:09 UTC (Thu)
by mjthayer (guest, #39183)
[Link]
Posted Jun 24, 2010 10:50 UTC (Thu)
by NRArnot (subscriber, #3033)
[Link] (11 responses)
Ext2 simply can't store a file using less than 4K. Btrfs can. Am I right to think that a filesystem full of 2KB files is something close to Btrfs's worst case, and that Btrfs would handle a vast number of really tiny files (hundreds or tens of bytes) rather well? (After the bug is fixed, of course.)
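The packing intuition behind that question can be put in rough numbers; the per-item overhead here is an assumed figure, not Btrfs's actual item header size:

```python
# Compare ext2 and btrfs storage for many tiny (100-byte) files,
# using simplified, assumed overheads rather than real on-disk formats.
FILE_SIZE = 100
EXT2_BLOCK = 4096       # ext2 rounds every file's data up to a full block
BTRFS_LEAF = 4096
ITEM_OVERHEAD = 50      # assumed per-item bookkeeping inside a leaf

ext2_files_per_4k = 1                                   # one file per data block
btrfs_files_per_leaf = BTRFS_LEAF // (FILE_SIZE + ITEM_OVERHEAD)

print(ext2_files_per_4k)     # 1
print(btrfs_files_per_leaf)  # 27 -- many tiny files packed into one leaf
```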
Posted Jun 24, 2010 11:00 UTC (Thu)
by ekj (guest, #1524)
[Link] (4 responses)
That is unlikely to happen with real workloads, but it should still be impossible. Especially since pathological workloads can often be provoked deliberately by an attacker. (This is typical of race conditions, for example: even race conditions that would "almost never" happen in normal situations become a problem if an attacker can deliberately create them.)
Posted Jun 24, 2010 11:34 UTC (Thu)
by nye (subscriber, #51576)
[Link]
Posted Jun 24, 2010 11:41 UTC (Thu)
by NRArnot (subscriber, #3033)
[Link] (2 responses)
Posted Jun 24, 2010 12:08 UTC (Thu)
by ekj (guest, #1524)
[Link]
But you're right, that (pathological) case is likely the lower bound for utilization in ext2.
Posted Jun 24, 2010 20:50 UTC (Thu)
by intgr (subscriber, #39733)
[Link]
Posted Jun 27, 2010 12:28 UTC (Sun)
by nix (subscriber, #2304)
[Link] (4 responses)
(some of us also have hardware RAID so every byte spent backing up metadata is a byte wasted, but I'm sure btrfs can turn that off: and admittedly this is not a common case.)
Posted Jun 27, 2010 21:48 UTC (Sun)
by dark (guest, #8483)
[Link] (3 responses)
Posted Jun 27, 2010 22:02 UTC (Sun)
by dlang (guest, #313)
[Link] (2 responses)
Posted Jun 27, 2010 23:02 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Posted Jun 29, 2010 20:26 UTC (Tue)
by masoncl (subscriber, #47138)
[Link]
Posted Jul 28, 2010 23:05 UTC (Wed)
by sbergman27 (guest, #10767)
[Link]
Posted Jun 24, 2010 12:37 UTC (Thu)
by masoncl (subscriber, #47138)
[Link] (1 responses)
Edward's 17% number didn't factor in the part where btrfs duplicates all btree blocks by default, so the actual fragmentation caused by the btree is much lower.
All filesystems suffer from the problem of partially used data blocks. The btrfs btree does allow partially used btree leaves, but the higher-level nodes are much more traditional (fixed record length) blocks. So the worst case in btrfs is to have the directory items in one block, the inode in another block, and the inline file data extent in a third block.
This means our worst case is similar to the standard case for ext*. Ext will always get more than one inode in a block, but it'll also always have separate blocks for inodes, directories and file data.
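The effect of that default btree-block duplication on the measured number can also be sketched with assumed round figures (not real Btrfs space accounting):

```python
# How btree-block duplication compounds inline-extent fragmentation.
# All numbers are assumed round figures for illustration only.
LEAF = 4096      # one metadata leaf
DATA = 2048      # worst case: a single inline extent per leaf
COPIES = 2       # btrfs duplicates btree blocks by default

raw_consumed = LEAF * COPIES
efficiency = DATA / raw_consumed
print(f"{efficiency:.0%}")   # 25% -- per-leaf overhead drags it lower still
```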
I do appreciate Edward's comments, he has been working on filesystems for a long time and contributed the btrfs grubv1 support.
Posted Jun 24, 2010 16:08 UTC (Thu)
by ikm (guest, #493)
[Link]
Posted Jun 24, 2010 12:52 UTC (Thu)
by johnflux (guest, #58833)
[Link] (6 responses)
> If you decide to base your file system on some algorithms then please use the original ones from proper academic papers. DO NOT modify the algorithms in solitude: this is very fragile thing! All such modifications must be reviewed by specialists in the theory of algorithms. Such review can be done in various scientific magazines of proper level.
Anyone have any further input on this?
Posted Jun 24, 2010 13:12 UTC (Thu)
by NRArnot (subscriber, #3033)
[Link] (1 responses)
I can't prove that my head won't fall off before I finish typing this. It's thermodynamically possible, but so unlikely that it won't happen even given a googolplex of universes to try it out in.
Posted Jun 25, 2010 14:09 UTC (Fri)
by tialaramex (subscriber, #21167)
[Link]
In reality, what happens is that we put up with a negligible chance of failure. Failures of this sort are vastly outweighed by more pragmatic problems (imagine if the generals managed their agreement, and then one found his men had developed dysentery and could not fight), so we don't worry about it very much.
Posted Jun 24, 2010 13:31 UTC (Thu)
by masoncl (subscriber, #47138)
[Link] (3 responses)
Ohad Rodeh's paper was really about snapshot reference counting. He did go into details of btree balancing but this was mostly about how top-down balancing is best suited to the snapshot reference counting problem.
We've definitely changed some of the fundamentals around snapshot reference counting too, and we've continued to discuss these with Dr Rodeh (he's really good to work with). His work was presented as a starting point for people interested in snapshotting.
There are also a lot of ideas from reiserfsv3 and xfs and ext in Btrfs. To me, filesystems are really just a collection of tradeoffs and optimizations for specific goals. We trade CPU for disk IO, or CPU and disk IO for features, and these tradeoffs change as we go.
When designing btrfs we took a big pile of known solutions for filesystem problems, crammed them together, and then fixed up the new problems that resulted. Somehow I missed the scientific magazine step that all the other filesystems have followed, but I'll do what I can to keep improving the FS.
Posted Jun 24, 2010 16:40 UTC (Thu)
by bronson (subscriber, #4806)
[Link]
Can a quote of the week appear in the previous week's comments?
Posted Jun 26, 2010 13:52 UTC (Sat)
by johnflux (guest, #58833)
[Link] (1 responses)
Out of interest, what are your reasons for not having a peer-reviewed article on this? Lack of familiarity with the process and/or time would be my guess?
Especially since you benefited from reading a published paper, it would be nice for you to write up your changes. If you just don't want to go through that bother, maybe a "comment" would be nice? These are half-page comments on existing papers, but they also get peer reviewed and published.
Actually, you mentioned that you worked with the author of the original paper. Maybe he would be interested in publishing your work on your behalf.
Posted Jun 29, 2010 20:40 UTC (Tue)
by masoncl (subscriber, #47138)
[Link]
Partially...the part in question is the btree indexing. This is important because it is a core part of the filesystem but it is also the least exotic part of the FS.
The part that is much more performance critical is how do we use that index to track block reference counts and maintain snapshot integrity. Much of my initial work here was expanded by Yan Zheng, and I think his improvements are a better topic for research papers.
Here is a paper where they tried to replace the btrfs back reference tracking and measured some differences:
http://www.usenix.org/events/fast10/tech/full_papers/mack...
Another key part of btrfs is how do we spread checksumming or compression across the CPUs and still maintain good IO ordering.
Or, how do we track free space and not explode (both Josef Bacik and Yan Zheng have spent considerable time on this part).
If I were going to invest time in writing the papers, I'd pick one of those three because they are newer problems where more active research is going on.
Not really -- that's why ext2 also has inode quotas/file quotas.
Hmm, where are you getting the 40% number? Is that real metadata, or are the inline extents also counted as metadata because they are stored in leaf nodes?