LWN: Comments on "A short history of btrfs" https://lwn.net/Articles/342892/ This is a special feed containing comments posted to the individual LWN article titled "A short history of btrfs". en-us Sat, 25 Oct 2025 02:57:02 +0000 Sat, 25 Oct 2025 02:57:02 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net A short history of btrfs https://lwn.net/Articles/695255/ https://lwn.net/Articles/695255/ joel_unix <div class="FormattedComment"> A correction to the article. It is better to be explicit about the fact that it's "B+trees" you're talking about when you say "leaves of the tree are linked together".<br> RE:<br> "To start with, btrees in their native form are wildly incompatible with COW. The leaves of the tree are linked together"<br> </div> Sun, 24 Jul 2016 17:09:10 +0000 "Check back in two years and see if I got any of these predictions right! " https://lwn.net/Articles/453984/ https://lwn.net/Articles/453984/ tjwoods <div class="FormattedComment"> It's two years later now... I'd love to see an update of this article with a review of how btrfs is progressing.<br> </div> Wed, 03 Aug 2011 18:24:27 +0000 A short history of btrfs https://lwn.net/Articles/347126/ https://lwn.net/Articles/347126/ bill_mcgonigle <div class="FormattedComment"> Great article. It was just the right mix of history, details and refresher for a lapsed CS geek to understand btrfs. The comparison with ZFS cleared up quite a bit of confusion I had in that area.<br> <p> </div> Fri, 14 Aug 2009 08:22:49 +0000 2 years before bytfs is the default filesystem? https://lwn.net/Articles/346186/ https://lwn.net/Articles/346186/ stanrbt <div class="FormattedComment"> And I forgot to add that the Linux "leadership" can help speed up the maturing of btrfs, the question is do you want to do so ? <br> <p> Or do you prefer to play the "Cat on a hot tin roof" game ? 
Please do not try digging in the IBM/Microsoft way - evolution will swipe you away since you are lacking their backers.<br> <p> Just to avoid any speculation - I have nothing to do with Oracle. I am simply a manager who has been crazy about open source for a looong time (not just squeezing money out of it). Not many of us around, right ? Have you ever wondered why ?<br> </div> Mon, 10 Aug 2009 15:23:18 +0000 2 years before bytfs is the default filesystem? https://lwn.net/Articles/346182/ https://lwn.net/Articles/346182/ stanrbt <div class="FormattedComment"> Quite a spiteful &amp; exposing comment, I think. What is evolution ? Improvement of processes. In other words it is not only possible to build &amp; construct much faster and more efficiently today than say 10 years ago, it is mandatory if we are really trying to improve/advance/develop.<br> <p> Maybe tytso is getting old ? Maybe he does not have the needed analytical skills to understand that Oracle is currently the only IT company which has resources to act pro-actively in this environment, everyone else is acting re-actively and things in the industry worsen by the day.<br> <p> It is not the amount of money you throw at a project which makes it successful. Sure, you need some "minimum amount of money" but the approach &amp; the skills you apply are much more important. The "approach" part is the critical one.<br> <p> I absolutely agree that the success of btrfs depends on much more than just Oracle's muscle but this is what Leadership is all about - give the example, break the ice, change the reality. All of us can talk spiteful rubbish.<br> </div> Mon, 10 Aug 2009 14:51:53 +0000 Free space management? https://lwn.net/Articles/346179/ https://lwn.net/Articles/346179/ Yenya <div class="FormattedComment"> FWIW, I have reported this as <a href="http://bugzilla.kernel.org/show_bug.cgi?id=13957">http://bugzilla.kernel.org/show_bug.cgi?id=13957</a>. 
However, I can no longer use this HW for testing, sorry.<br> </div> Mon, 10 Aug 2009 14:05:08 +0000 2 years before bytfs is the default filesystem? https://lwn.net/Articles/345981/ https://lwn.net/Articles/345981/ tytso <div class="FormattedComment"> For community distributions? Maybe. Time will tell. I'm supportive of btrfs, and helped to rally industry support for it behind the scenes, but I'm also realistic about how long it takes to bring a file system to production status. Sun had around two dozen engineers (from what I could tell) working five years (2000-2005) before ZFS was released --- and then it was another three years or so before system administrators really trusted it for critical production systems at least in an enterprise data center. Consider the following report by a system administrator in 2007: <a href="http://forestlaw.blogspot.com/2007/03/solaris-zfs-not-ready-for-production.html">http://forestlaw.blogspot.com/2007/03/solaris-zfs-not-rea...</a><br> <p> &lt;i&gt;Btrfs is heading for 1.0, a little more than 2 years since the first announcement. This is much faster than many file systems veterans - including myself - expected, especially given that during most of that time, btrfs had only one full-time developer. &lt;/i&gt;<br> <p> Actually, from what I can tell, btrfs is a bit behind schedule. It was supposed to be format-stable by December 2008, and it's not quite format stable yet. Last I heard it still panic'ed on ENOSPC. And its userspace tools are still quite primitive at this point in time.<br> <p> Can it be ready for community distributions in two years; probably, but a lot of work needs to be put into it between now and then. 
And from developers beyond just those at Oracle.<br> <p> </div> Sat, 08 Aug 2009 04:08:51 +0000 A short history of btrfs https://lwn.net/Articles/345186/ https://lwn.net/Articles/345186/ topher <blockquote><em>Some database rows will be updated, and most DBMS do that in-place.</em></blockquote> This is actually incorrect. There are almost no databases that actually do that anymore. All modern (decent) databases will create a new row when performing an update, and then mark the old row as deleted internally. This is significantly safer, performs better, and allows you to support other important transactional features. Tue, 04 Aug 2009 15:41:11 +0000 A short history of btrfs https://lwn.net/Articles/345086/ https://lwn.net/Articles/345086/ wmf <div class="FormattedComment"> COW is also crash-safe without the overhead of journaling.<br> </div> Tue, 04 Aug 2009 00:51:52 +0000 Solid-state drives https://lwn.net/Articles/345061/ https://lwn.net/Articles/345061/ SEJeff <div class="FormattedComment"> btrfs already supports exactly what you asked:<br> <a href="http://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Multiple_Devices">http://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_M...</a><br> <p> By default data is striped across multiple disks and metadata is mirrored. You can even mirror metadata on a single disk to prevent bad blocks from corrupting your data. 
Chris and company did a great job in designing this beast.<br> <p> <p> </div> Mon, 03 Aug 2009 21:01:30 +0000 Solid-state drives https://lwn.net/Articles/345059/ https://lwn.net/Articles/345059/ dlang <div class="FormattedComment"> what I was thinking of is the ability to use different levels of raid for different types of data.<br> say raid1 for metadata, raid0 for any files in /tmp, raid5/6 for everything else.<br> </div> Mon, 03 Aug 2009 20:41:26 +0000 Solid-state drives https://lwn.net/Articles/345058/ https://lwn.net/Articles/345058/ SEJeff <div class="FormattedComment"> If you raid drives with btrfs it lets you raid the metadata. Actually, doesn't it do this by default now?<br> </div> Mon, 03 Aug 2009 20:31:39 +0000 A short history of btrfs https://lwn.net/Articles/344921/ https://lwn.net/Articles/344921/ jamaica <div class="FormattedComment"> Heise wrote an article about btrfs in a style like: I got up this morning, formatted a btrfs partition, used this command to do that and that command to do this.<br> <p> The article is available in German <a rel="nofollow" href="http://www.heise.de/open/Das-Dateisystem-Btrfs--/artikel/141341">http://www.heise.de/open/Das-Dateisystem-Btrfs--/artikel/...</a><br> and English <a rel="nofollow" href="http://www.h-online.com/open/The-Btrfs-file-system--/features/113738">http://www.h-online.com/open/The-Btrfs-file-system--/feat...</a><br> </div> Sun, 02 Aug 2009 08:59:43 +0000 Solid-state drives https://lwn.net/Articles/344897/ https://lwn.net/Articles/344897/ dlang <div class="FormattedComment"> the idea of allowing the metadata to exist on different media is both very interesting and very scary.<br> <p> on the interesting side, the fact that you can now buy _very_ high speed drives (up to 64G of battery backed ram fast enough that it needs multiple SATA cables to saturate it) means that variations on the space allocation policies can lead to very significant speed-ups.<br> <p> however, on the scary side, you have the fact 
that you now need all these different drives to remain operational and available or you run the risk of losing everything.<br> <p> a follow-up for after btrfs gets going solidly and reliably would be to look into a multi-drive version that could maintain multiple copies of metadata on different drives<br> </div> Sun, 02 Aug 2009 01:02:25 +0000 Solid-state drives https://lwn.net/Articles/344895/ https://lwn.net/Articles/344895/ cjb <div class="FormattedComment"> <font class="QuotedText">&gt; What's btrfs' SSD story? ZFS gets some amazing performance improvements if you add an SSD for the ZIL or as L2ARC, does the btrfs architecture have any similar parts that could be farmed out to an SSD?</font><br> <p> Yes. There are a couple of ideas here -- one (as yet unimplemented) is that you could use the SSD as a caching frontend to larger, slower media behind it. Another is that you could use exclusively the SSD to store metadata items; btrfs doesn't care how far away from data you put its metadata, or even if they end up on the same device.<br> </div> Sun, 02 Aug 2009 00:08:35 +0000 A short history of btrfs https://lwn.net/Articles/344886/ https://lwn.net/Articles/344886/ efexis <div class="FormattedComment"> Have you seen the comparative difference in TCO of your average Linux zealot compared to your average Windows or Apple zealot? I'm sure there's a microsoft advert out there somewhere about it :-)<br> <p> <p> </div> Sat, 01 Aug 2009 21:39:09 +0000 A short history of btrfs https://lwn.net/Articles/344867/ https://lwn.net/Articles/344867/ graydon <p>A couple points. I'm not expert enough to be sure these answer your question:</p> <ul> <li>As far as I know, slab allocators are based on a regional grouping: "this entire range of memory is dedicated to objects of size N bytes", say. 
The Lea allocator keeps a freelist <em>that can hold</em> objects of N bytes, but there's nothing saying those objects are either allocated from an N-byte-object slab initially, nor that they're <em>exactly</em> N bytes. Just that that's the right power-of-two freelist to put them on. It also coalesces adjacent allocations into larger blocks and moves them to larger freelists whenever it can.</li> <li>Furthermore -- here I am getting more sketchy in my understanding -- I <em>think</em> that the design of a slab-like allocator implies having a certain degree of co-operation from your allocation client. The client has to be willing to ask for memory from slabs, and the memory it asks for has to (commonly) be in slab-entry-sized units. This is fine -- actually a correct design observation that slab allocation was invented to take advantage of -- when your client is a kernel with a fixed-at-compile-time assortment of struct sizes it's going to be (mostly) allocating. It is not really a correct design observation for things like "general-purpose mallocs" or "general-purpose filesystems", where the clients are unknown and aren't interested in confining themselves to slab-appropriate behavior.</li> </ul> <p>IOW I think both are <em>trying</em> to "reduce fragmentation", but with a different assumed type of workload: a slab allocator eliminates external fragmentation almost entirely <em>if the client can co-operate</em>, but a more general allocator like Lea malloc lacks that co-operation promise, and so takes a simpler approach that (on average) keeps both internal and external fragmentation "under control", but with less of a guarantee about it.</p> <p>I believe, if I'm reading correctly, that btrfs will be more in the Lea-malloc camp: possibly subject to more external fragmentation (say if you let a bunch of leaves from the middle of blocks drop to 0-refcount), but possibly less internally fragmenting than ZFS (due to the absence of slab size-grouping), so better on 
average for the general workload of a filesystem. Allocators are always tuned to workload assumptions.</p> Sat, 01 Aug 2009 17:27:45 +0000 A short history of btrfs https://lwn.net/Articles/344859/ https://lwn.net/Articles/344859/ Trelane <div class="FormattedComment"> Seconded. *This* is why I ponied up for LWN instead of e.g. Ars Technica. (That, and the commenters aren't all MSFT or Apple zealots).<br> </div> Sat, 01 Aug 2009 13:11:09 +0000 A short history of btrfs https://lwn.net/Articles/344704/ https://lwn.net/Articles/344704/ sitaram <div class="FormattedComment"> Val joins Jon and others in being yet another reason to subscribe to LWN.<br> <p> Thanks very much indeed -- you write well...<br> </div> Fri, 31 Jul 2009 11:16:00 +0000 Solid-state drives https://lwn.net/Articles/344527/ https://lwn.net/Articles/344527/ TRS-80 What's btrfs' SSD story? ZFS gets some amazing performance improvements if you add an SSD for the ZIL or as L2ARC, does the btrfs architecture have any similar parts that could be farmed out to an SSD? Fri, 31 Jul 2009 01:03:34 +0000 A short history of btrfs https://lwn.net/Articles/344488/ https://lwn.net/Articles/344488/ joib <div class="FormattedComment"> Great article, thanks!<br> <p> Though I'm a bit confused why the SLAB allocator style approach in ZFS would be such a problem from a fragmentation standpoint. In the VM, SLAB was introduced to reduce fragmentation in the first place, after all. Not to mention, many user-space malloc() implementations use somewhat similar strategies to cope with arbitrary sized requests, e.g. the Doug Lea allocator in glibc malloc maintains lists of objects in increasing size of powers-of-two all the way up to the mmap limit, and each allocation is rounded up to the nearest power-of-2, all this mainly to reduce fragmentation (or something like that, it's been a while since I looked into it). 
Heck, back in fs land, one thing that ext4 got from Lustre was having separate large- and small-file areas on disk, to reduce fragmentation, and it wouldn't surprise me if btrfs did something similar as well. <br> <p> Anyway, from the article it naively sounds like the real problem is that the SLAB's are pre-allocated, perhaps even at fs creation time, rather than as needed from free space, which again naively would do away with a lot of the requirement to coalesce or split objects?<br> </div> Thu, 30 Jul 2009 16:15:28 +0000 A short history of btrfs https://lwn.net/Articles/344370/ https://lwn.net/Articles/344370/ alkby <div class="FormattedComment"> I don't get one thing about this COW thing. Why it's so good ? That's convenient for snapshots, but <br> if I don't use snapshots and have huge files and often rewrite their pieces then they'll become <br> highly fragmented, right ?<br> <p> Imagine a database data file that is organized as a btree itself. Some database rows will be <br> updated, and most DBMS do that in-place. So our modern COW filesystem will gradually <br> fragment that data file. And when DBMS will do, say a range scan in btree it will expect mostly <br> linear (and fast) disk IO, while in fact it will get random (and dead slow) IO. Maybe I've missed <br> something important here?<br> </div> Thu, 30 Jul 2009 08:00:56 +0000 Free space management? https://lwn.net/Articles/343640/ https://lwn.net/Articles/343640/ masoncl <div class="FormattedComment"> Both the kernel.org bugzilla and the btrfs mailing list are good resources (linux-<br> btrfs@vger.kernel.org)<br> </div> Mon, 27 Jul 2009 23:01:20 +0000 Free space management? https://lwn.net/Articles/343629/ https://lwn.net/Articles/343629/ Yenya <div class="FormattedComment"> Thanks for the explanation!<br> <p> BTW, where should I post issues with BTRFS? I have a testing BTRFS over 8 drives (both data and metadata mirrored), and recently one of the drives died. 
Now when I want to access the BTRFS volume, add a new drive, or whatever, it crashes on me (2.6.30, I think; I can compile any kernel<br> you point me at, though). Is the kernel bugzilla the right place?<br> </div> Mon, 27 Jul 2009 22:02:04 +0000 Free space management? https://lwn.net/Articles/343526/ https://lwn.net/Articles/343526/ masoncl <div class="FormattedComment"> In general, the allocator is going to be one of the most complex parts of a cow based <br> filesystem. Log structured filesystems have segments, where they force reasonably sized <br> units of the disk to become free and then allocate out of that.<br> <p> Btrfs uses a btree to record which extents are in use, and free space is anything that isn't <br> recorded in that btree. So if it says we have an extent allocated from [16-20k ] and one at <br> [24-28k], then we have 4k free from 20k to 24k.<br> <p> The part where btrfs gets complex in a hurry is that every extent in use in the filesystem is <br> recorded in this btree, including the extents that make up that btree. All of the btrees in <br> btrfs are managed with COW.<br> <p> This doesn't spiral out of control because COW is strictly done as copy on write. Meaning <br> that as long as we haven't yet written a block we are allowed to keep changing it. The end <br> result is that we allocate blocks to hold new copies of the btree blocks and then are able to <br> make a number of modifications to those new blocks before they get written to disk.<br> <p> As extents are freed we are able to reuse them, so we don't just keep walking to the end of <br> the drive. 
Btrfs does still have problems with ENOSPC, but this is mostly an accounting <br> problem of making sure there are enough free extents to make all of the btree modifications <br> we've promised we're going to make.<br> <p> </div> Mon, 27 Jul 2009 14:38:17 +0000 A short history of btrfs https://lwn.net/Articles/343465/ https://lwn.net/Articles/343465/ eparis123 <div class="FormattedComment"> Thanks a lot for this article.<br> <p> The explanations and the linked pdfs were all outrageous. <br> </div> Sun, 26 Jul 2009 19:23:42 +0000 Free space management? https://lwn.net/Articles/343429/ https://lwn.net/Articles/343429/ bronson <div class="FormattedComment"> Good question. Last I heard, btrfs just keeps writing data toward the end of the disk until it runs out of disk. Then it panics.<br> <p> This appears to have been fixed, and doesn't involve anything so crufty as a vacuum running in the background, but I can't find specifics. Does anybody have up to date info?<br> </div> Sat, 25 Jul 2009 21:28:23 +0000 A short history of btrfs https://lwn.net/Articles/343428/ https://lwn.net/Articles/343428/ bronson <div class="FormattedComment"> So, you're going to arrange some way of recording average %sy time as a single number (there are lots of different ways of doing this), then somehow graph it that makes sense to regular people?<br> <p> Just timing a file decompression is a lot easier for all involved, no? It's not a great benchmark, true, but it will quickly and reliably tell you if one FS requires more CPU than another. And that's the most important thing.<br> <p> </div> Sat, 25 Jul 2009 21:20:11 +0000 Free space management? https://lwn.net/Articles/343410/ https://lwn.net/Articles/343410/ Yenya <div class="FormattedComment"> How do COW btree-based filesystems manage the free disk space? 
I presume it is not in the btree itself, because for allocation of the new block it would be necessary to - well - allocate a COW block for the new version of (at least) the btree leaf.<br> </div> Sat, 25 Jul 2009 16:01:44 +0000 A short history of btrfs https://lwn.net/Articles/343314/ https://lwn.net/Articles/343314/ jlokier <div class="FormattedComment"> I'm intrigued by the idea that "btrees and extents seemed fundamentally incompatible with copy-on-write" so recently.<br> <p> B+Trees are an old technique, fairly traditional by now. Next-leaf links never did make much sense - as Ohad Rodeh's PDF explains, they don't help you prefetch in parallel. (By the way, Microsoft patented previous-leaf links!) Omitting them is cheap if you have enough memory to keep all blocks on the path from leaf to root in memory, which we certainly do these days - the path is bounded by tree depth and so quite small. Like all tree structures with only child links, copy-on-write is straightforward: Functional programming languages have always implemented copy-on-write for tree structures. What the PDF calls lazy reference counting (compared with, say, WAFL's method), would be called ordinary reference counting when describing a simple tree data structure in memory.<br> <p> Btrfs is clearly a big step forward in Linux filesystems, and there is undoubtedly a lot of other design in it which is not covered by the academic papers. I'm looking forward to using it, especially the reflink feature and snapshots.<br> <p> But I'm really surprised that the underlying tree structure and algorithms turn out to be new, or at least not well described in open literature already.<br> </div> Fri, 24 Jul 2009 20:23:51 +0000 A short history of btrfs https://lwn.net/Articles/343306/ https://lwn.net/Articles/343306/ Bayes <div class="FormattedComment"> Great article! 
Thanks for the B+ trees lesson, very interesting stuff.<br> </div> Fri, 24 Jul 2009 19:03:54 +0000 A short history of btrfs https://lwn.net/Articles/343240/ https://lwn.net/Articles/343240/ dlang <div class="FormattedComment"> they do have a comment section on their website<br> </div> Fri, 24 Jul 2009 09:44:31 +0000 A short history of btrfs https://lwn.net/Articles/343227/ https://lwn.net/Articles/343227/ nix Hard: I still don't have email thanks to the implosion of Zetnet, sorry, Breathe, sorry, they went bust and cut all their *other* customers off from their IMAP mailservers, the new company is called Breathe now. <p> Moving to <a href="http://www.aaisp.net.uk/">a decent ISP</a> as soon as BT get around to it... but that's a week away, plus another week for the new MX record to propagate around. Three weeks without properly-working email, sigh. Fri, 24 Jul 2009 07:34:15 +0000 A short history of btrfs https://lwn.net/Articles/343222/ https://lwn.net/Articles/343222/ PaulWay <div class="FormattedComment"> Echoing the other positive comments, this is a really great article. Awesome writing, Val!<br> </div> Fri, 24 Jul 2009 07:07:04 +0000 error bars https://lwn.net/Articles/343149/ https://lwn.net/Articles/343149/ dlang <div class="FormattedComment"> you need to tell them, not us ;-)<br> </div> Thu, 23 Jul 2009 22:31:49 +0000 error bars https://lwn.net/Articles/343137/ https://lwn.net/Articles/343137/ tialaramex <div class="FormattedComment"> Still no error bars on their charts. Yes, the error bars would probably mean most results found nothing. That's a good thing!<br> <p> I'd like to see more investigation. I think that would follow from narrowing results down to only those that were significant. If you find 500 tiny differences between two things, most of which are just measurement noise, you have no reason to investigate further. But if you make one big significant finding you can do a whole article about what it means - why is the Frooqux significantly faster ? 
Is it the same on an AMD machine ? In OpenSolaris ? With a different network card ?<br> </div> Thu, 23 Jul 2009 22:11:22 +0000 The problem is not with benchmark itself... https://lwn.net/Articles/343091/ https://lwn.net/Articles/343091/ khim <blockquote>Uh, having the cpu loaded will find problems if the FS code itself is too cpu hoggy....</blockquote> <p>The problem is not with a low-level CPU-bound benchmark but with an average taken from many different benchmarks without care or thought. In the end you are getting the average temperature of hospice patients: some are having high fever, some are in the morgue already, so in the end the average temperature is useless.</p> <p>If you plan to mix a lot of different benchmarks you must be ready to carefully study the results, separate expected results from unexpected ones, cluster them in groups (by relevance to this or that real-world task), etc. Otherwise it's just a pointless race where the winner is more-or-less random.</p> <p>And it's also pointless to try to fix the situation by adding more benchmarks to the mix: when you mix a lot of different kinds of food - you are getting a pile of garbage as a result and if you'll add some more dishes - you'll just get a bigger pile of garbage.</p> Thu, 23 Jul 2009 17:50:42 +0000 A short history of btrfs https://lwn.net/Articles/343084/ https://lwn.net/Articles/343084/ jengelh <div class="FormattedComment"> Uh, extracting a tarball with lots of files and watching the %sy time is likely to do the same.<br> </div> Thu, 23 Jul 2009 17:04:13 +0000 A short history of btrfs https://lwn.net/Articles/343083/ https://lwn.net/Articles/343083/ kjp <div class="FormattedComment"> Uh, having the cpu loaded will find problems if the FS code itself is too cpu hoggy....<br> </div> Thu, 23 Jul 2009 16:59:59 +0000 A short history of btrfs https://lwn.net/Articles/343051/ https://lwn.net/Articles/343051/ dlang <div class="FormattedComment"> they just announced that they are about to start using a new version of 
their benchmarks, so this is the perfect time to jump in and try to improve things.<br> <p> <a href="http://www.phoronix.com/scan.php?page=article&amp;item=pts_20_details&amp;num=1">http://www.phoronix.com/scan.php?page=article&amp;item=pt...</a><br> </div> Thu, 23 Jul 2009 15:26:27 +0000