Btrfs: broken by design?
Edward Shishkin, perhaps better known for his efforts to keep reiser4 development alive, first posted some concerns on June 3. It seems he ran a simple test: create a new Btrfs filesystem, then create 2048-byte files until space runs out. Others have talked about suboptimal space efficiency in Btrfs before, but Edward was still surprised that he was able to use only 17% of the nominal space in the filesystem before it was reported as being full. Such poor efficiency was, according to Edward, evidence that Btrfs was "broken by design" and should not be used:
Part of the problem comes down to the use of "inline extents" in Btrfs. The core data structure on a Btrfs filesystem is a B-tree which provides access to all of the objects stored in the filesystem. For larger files, the actual file data is stored in extents, which are pointed to from within the tree. Small extents, though, can be stored in the tree itself, hopefully yielding both better space efficiency and better performance. If these extents are sized inconveniently, though, they can waste a lot of space: there is only room for one 2048-byte inline extent in a B-tree node, leaving 1800 bytes or so of unused space. That is a lot of internal fragmentation.
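The arithmetic behind that waste can be sketched with assumed round numbers; the leaf size and per-leaf overhead below are illustrative figures, not Btrfs's exact on-disk layout:

```python
# Rough arithmetic for inline-extent fragmentation in a B-tree leaf.
# LEAF_SIZE and OVERHEAD are assumed round numbers for illustration,
# not the exact Btrfs on-disk format.
LEAF_SIZE = 4096        # a common metadata leaf size
OVERHEAD = 250          # assumed header/item bookkeeping per leaf
INLINE_EXTENT = 2048    # the file size used in Edward's test

usable = LEAF_SIZE - OVERHEAD
extents_per_leaf = usable // INLINE_EXTENT
wasted = usable - extents_per_leaf * INLINE_EXTENT

print(extents_per_leaf)  # 1 -- only one 2048-byte extent fits per leaf
print(wasted)            # 1798 -- roughly the "1800 bytes or so" above
```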
As noted in Chris Mason's response, there are two approaches which can be taken to mitigate this kind of problem. One is to turn off inline extents altogether; Btrfs has a max_inline= mount option which can be used for just that purpose. The other approach would be to allow inline extents to be split between tree nodes so that the pieces could be sized to fill those nodes exactly. Btrfs cannot do that, and probably will not be able to anytime soon:
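For those wanting to try the first approach, the mount option takes a byte limit; the device and mount point below are placeholders, of course:

```shell
# Disable inline extents entirely by capping their maximum size at zero.
# /dev/sdb and /mnt are placeholder names for your own setup.
mount -o max_inline=0 /dev/sdb /mnt

# Or keep inline extents for very small files while capping them
# below the problematic 2048-byte size:
mount -o max_inline=512 /dev/sdb /mnt
```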
Chris also noted that most of the other variable-size items stored in B-tree nodes - extended attributes, for example - can be split between nodes if need be. So these items should not cause fragmentation problems; it's mainly the inline extents which are at fault there.
But, as Edward pointed out, there's more to the problem than inline extents. In his investigations, he's found numerous places where groups of nearly-empty nodes exist; some were less than 1% utilized. That, in all likelihood, is the real source of the worst space utilization problems. To Edward, this behavior is another sign that the algorithms used in Btrfs are all wrong and in need of a redesign.
Chris sees it a little differently, though:
He has promised to track it down and post a fix. Between the bug fix and turning off inline extents (or, at least, reducing their maximum size), it is hoped that the worst space utilization problems in Btrfs will be no more.
That fix has not been posted as of this writing, so its effectiveness cannot yet be judged. But, chances are, this is not a case of a filesystem needing a fundamental redesign. Instead, all it needs is more extensive testing, some performance tuning, and, inevitably, some bug fixes. The good news is that the process seems to be working as it should be: these problems have been found before any sort of wide-scale deployment of this very new filesystem.
Posted Jun 24, 2010 8:09 UTC (Thu)
by mjthayer (guest, #39183)
[Link]
Posted Jun 24, 2010 10:50 UTC (Thu)
by NRArnot (subscriber, #3033)
[Link] (11 responses)
Ext2 simply can't store a file using less than 4K. Btrfs can. Am I right to think that a filesystem full of 2KB files is something close to Btrfs's worst case, and that Btrfs would handle a vast number of really tiny files (hundreds or tens of bytes) rather well? (After the bug is fixed, of course.)
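The packing intuition behind that question can be put in rough numbers; the per-item overhead here is an assumed figure, not Btrfs's actual item header size:

```python
# Compare ext2 and btrfs storage for many tiny (100-byte) files,
# using simplified, assumed overheads rather than real on-disk formats.
FILE_SIZE = 100
EXT2_BLOCK = 4096       # ext2 rounds every file's data up to a full block
BTRFS_LEAF = 4096
ITEM_OVERHEAD = 50      # assumed per-item bookkeeping inside a leaf

ext2_files_per_4k = 1                                   # one file per data block
btrfs_files_per_leaf = BTRFS_LEAF // (FILE_SIZE + ITEM_OVERHEAD)

print(ext2_files_per_4k)     # 1
print(btrfs_files_per_leaf)  # 27 -- many tiny files packed into one leaf
```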
Posted Jun 24, 2010 11:00 UTC (Thu)
by ekj (guest, #1524)
[Link] (4 responses)
That is unlikely to happen with real workloads, but it should still be impossible. Especially since pathological workloads can often be provoked deliberately by an attacker. (This is typical of race conditions, for example: even race conditions that would "almost never" happen in normal situations become a problem if an attacker can deliberately create them.)
Posted Jun 24, 2010 11:34 UTC (Thu)
by nye (subscriber, #51576)
[Link]
Posted Jun 24, 2010 11:41 UTC (Thu)
by NRArnot (subscriber, #3033)
[Link] (2 responses)
Posted Jun 24, 2010 12:08 UTC (Thu)
by ekj (guest, #1524)
[Link]
But you're right, that (pathological) case is likely the lower bound for utilization in ext2.
Posted Jun 24, 2010 20:50 UTC (Thu)
by intgr (subscriber, #39733)
[Link]
Posted Jun 27, 2010 12:28 UTC (Sun)
by nix (subscriber, #2304)
[Link] (4 responses)
(some of us also have hardware RAID so every byte spent backing up metadata is a byte wasted, but I'm sure btrfs can turn that off: and admittedly this is not a common case.)
Posted Jun 27, 2010 21:48 UTC (Sun)
by dark (guest, #8483)
[Link] (3 responses)
Posted Jun 27, 2010 22:02 UTC (Sun)
by dlang (guest, #313)
[Link] (2 responses)
Posted Jun 27, 2010 23:02 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Posted Jun 29, 2010 20:26 UTC (Tue)
by masoncl (subscriber, #47138)
[Link]
Posted Jul 28, 2010 23:05 UTC (Wed)
by sbergman27 (guest, #10767)
[Link]
Posted Jun 24, 2010 12:37 UTC (Thu)
by masoncl (subscriber, #47138)
[Link] (1 responses)
Edward's 17% number didn't factor in the part where btrfs duplicates all btree blocks by default, so the actual fragmentation caused by the btree is much lower.
All filesystems suffer from the problem of partially used data blocks. The btrfs btree does allow partially used btree leaves, but the higher-level nodes are much more traditional (fixed record length) blocks. So the worst case in btrfs is to have the directory items in one block, the inode in another block, and the inline file data extent in a third block.
This means our worst case is similar to the standard case for ext*. Ext will always get more than one inode in a block, but it'll also always have separate blocks for inodes, directories and file data.
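The effect of that default btree-block duplication on the measured number can also be sketched with assumed round figures (not real Btrfs space accounting):

```python
# How btree-block duplication compounds inline-extent fragmentation.
# All numbers are assumed round figures for illustration only.
LEAF = 4096      # one metadata leaf
DATA = 2048      # worst case: a single inline extent per leaf
COPIES = 2       # btrfs duplicates btree blocks by default

raw_consumed = LEAF * COPIES
efficiency = DATA / raw_consumed
print(f"{efficiency:.0%}")   # 25% -- per-leaf overhead drags it lower still
```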
I do appreciate Edward's comments, he has been working on filesystems for a long time and contributed the btrfs grubv1 support.
Posted Jun 24, 2010 16:08 UTC (Thu)
by ikm (guest, #493)
[Link]
Posted Jun 24, 2010 12:52 UTC (Thu)
by johnflux (guest, #58833)
[Link] (6 responses)
> If you decide to base your file system on some algorithms then please use the original ones from proper academic papers. DO NOT modify the algorithms in solitude: this is very fragile thing! All such modifications must be reviewed by specialists in the theory of algorithms. Such review can be done in various scientific magazines of proper level.
Anyone have any further input on this?
Posted Jun 24, 2010 13:12 UTC (Thu)
by NRArnot (subscriber, #3033)
[Link] (1 responses)
I can't prove that my head won't fall off before I finish typing this. It's thermodynamically possible, but so unlikely that it won't happen even given a googolplex of universes to try it out in.
Posted Jun 25, 2010 14:09 UTC (Fri)
by tialaramex (subscriber, #21167)
[Link]
In reality, what happens is that we put up with a negligible chance of failure. Failures of this sort are vastly outweighed by more pragmatic problems (imagine if the generals managed their agreement, and then one found his men had developed dysentery and could not fight), so we don't worry about it very much.
Posted Jun 24, 2010 13:31 UTC (Thu)
by masoncl (subscriber, #47138)
[Link] (3 responses)
Ohad Rodeh's paper was really about snapshot reference counting. He did go into details of btree balancing but this was mostly about how top-down balancing is best suited to the snapshot reference counting problem.
We've definitely changed some of the fundamentals around snapshot reference counting too, and we've continued to discuss these with Dr Rodeh (he's really good to work with). His work was presented as a starting point for people interested in snapshotting.
There are also a lot of ideas from reiserfsv3 and xfs and ext in Btrfs. To me, filesystems are really just a collection of tradeoffs and optimizations for specific goals. We trade CPU for disk IO, or CPU and disk IO for features, and these tradeoffs change as we go.
When designing btrfs we took a big pile of known solutions for filesystem problems, crammed them together, and then fixed up the new problems that resulted. Somehow I missed the scientific magazine step that all the other filesystems have followed, but I'll do what I can to keep improving the FS.
Posted Jun 24, 2010 16:40 UTC (Thu)
by bronson (subscriber, #4806)
[Link]
Can a quote of the week appear in the previous week's comments?
Posted Jun 26, 2010 13:52 UTC (Sat)
by johnflux (guest, #58833)
[Link] (1 responses)
Out of interest, what are your reasons for not having a peer-reviewed article on this? Lack of familiarity with the process and/or time would be my guess?
Especially since you benefited from reading a published paper, it would be nice for you to write up your changes. If you just don't want to go through that bother, maybe a "comment" would be nice? These are half-page comments on existing papers, but they also get peer reviewed and published.
Actually, you mentioned that you worked with the author of the original paper. Maybe he would be interested in publishing your work on your behalf.
Posted Jun 29, 2010 20:40 UTC (Tue)
by masoncl (subscriber, #47138)
[Link]
Partially...the part in question is the btree indexing. This is important because it is a core part of the filesystem but it is also the least exotic part of the FS.
The part that is much more performance critical is how do we use that index to track block reference counts and maintain snapshot integrity. Much of my initial work here was expanded by Yan Zheng, and I think his improvements are a better topic for research papers.
Here is a paper where they tried to replace the btrfs back reference tracking and measured some differences:
http://www.usenix.org/events/fast10/tech/full_papers/mack...
Another key part of btrfs is how do we spread checksumming or compression across the CPUs and still maintain good IO ordering.
Or, how do we track free space and not explode (both Josef Bacik and Yan Zheng have spent considerable time on this part).
If I were going to invest time in writing the papers, I'd pick one of those three because they are newer problems where more active research is going on.
Not really -- that's why ext2 also has inode quotas/file quotas.
Hmm, where are you getting the 40% number? Is that real metadata, or are the inline extents also counted as metadata because they are stored in leaf nodes?