LWN: Comments on "Btrfs at Facebook" https://lwn.net/Articles/824855/ This is a special feed containing comments posted to the individual LWN article titled "Btrfs at Facebook". en-us Sat, 20 Sep 2025 22:19:18 +0000 Sat, 20 Sep 2025 22:19:18 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Btrfs at Facebook https://lwn.net/Articles/829578/ https://lwn.net/Articles/829578/ flussence <div class="FormattedComment"> Sounds like email, or at least the variants that don&#x27;t stuff multi-gigabyte mailboxes into a *single* file...<br> </div> Wed, 26 Aug 2020 11:45:44 +0000 Btrfs at Facebook https://lwn.net/Articles/829317/ https://lwn.net/Articles/829317/ anselm <p> Things are maybe not so bad if – like Facebook – you have actual file system developers on your payroll. </p> Fri, 21 Aug 2020 17:18:19 +0000 Btrfs at Facebook https://lwn.net/Articles/829316/ https://lwn.net/Articles/829316/ rep_movsd <div class="FormattedComment"> Back in 2003-2004 I was on a small team that built an instant messenger client called Rediff Bol 2.0<br> <p> Version 1.0 had been a huge success in India, but they wanted a snazzy UI like Yahoo.<br> Their message server used this same retarded technique of saving messages as files.... on a shared NFS mount.<br> <p> I won&#x27;t go into the horrors of such a design idea (which to be fair did have one thing going for it - simplicity) but it&#x27;s insane to use a filesystem as a database.<br> <p> I&#x27;m sure FB devs had their reasons (maybe legacy), but fundamentally a crap idea. Filesystems are designed to be moderately efficient at various sizes of files.<br> I don&#x27;t think any FS dev in their wildest nightmares would have considered a use case of zillions of files which are smaller than a floppy disk sector <br> </div> Fri, 21 Aug 2020 16:31:55 +0000 Btrfs at Facebook https://lwn.net/Articles/828582/ https://lwn.net/Articles/828582/ redneb This is not true anymore, it was a bug that was fixed in 4.14 (11/2017). 
There are <a rel="nofollow" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=21634a19f6467674ef67fba9714c835a1c0a1e67">two</a> <a rel="nofollow" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4330e183c9537df20952d4a9ee142c536fb8ae54">commits</a> about this and the <a rel="nofollow" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=66ba772ee3119849fcdd8ac9766c6c25ede4a982">pull request</a> says: "degraded read-write mount is allowed if all the raid profile constraints are met, now based on more accurate check". Tue, 11 Aug 2020 22:17:34 +0000 Btrfs at Facebook https://lwn.net/Articles/825949/ https://lwn.net/Articles/825949/ anton <blockquote>Basic striping and mirroring work well, and have done so for years</blockquote> If only btrfs was user-friendly when a drive of a mirror pair fails; instead it gives you one shot at <a href="http://www.complang.tuwien.ac.at/anton/btrfs-raid1.html">dealing with the problem</a>, and on the next reboot becomes irreversibly read-only; that was certainly the state of the FAQ when we installed a new server earlier this year. So we decided to try out ZFS instead, and its handling of a simulated disk failure and replacement was very smooth, hardly perceptible. 
Sun, 12 Jul 2020 15:35:46 +0000 Btrfs at Facebook https://lwn.net/Articles/825795/ https://lwn.net/Articles/825795/ khim <p><b>Useful</b> automatic deduplication is work in progress.</p> <p>You can run <a href="https://github.com/markfasheh/duperemove/wiki">duperemove</a> in background and everything would be nicely deduplicated but performance would be so awful that you just don't want to do that.</p> <p>But if that's a developer's system you can do that while the developer is sleeping - then it works nicely.</p> Thu, 09 Jul 2020 20:51:52 +0000 Btrfs at Facebook https://lwn.net/Articles/825531/ https://lwn.net/Articles/825531/ intelfx <div class="FormattedComment"> <font class="QuotedText">&gt; Then there is the issue of RAID support, another longstanding problem area for Btrfs. Basic striping and mirroring work well, and have done so for years, he said. RAID 5 and 6 have &quot;lots of edge cases&quot;, though, and have been famously unreliable. These problems, too, are on his list, but solving them will require &quot;lots of work&quot; over the next year or two.</font><br> <p> I&#x27;d be *really* pleased if this was to happen.<br> </div> Wed, 08 Jul 2020 10:22:18 +0000 Btrfs at Facebook https://lwn.net/Articles/825436/ https://lwn.net/Articles/825436/ jezuch <div class="FormattedComment"> Git actually deduplicates even binary files - that&#x27;s a convenient side effect of content-addressable storage. Btrfs is in theory able to share extents between identical copies of files, but I think you have to tell it these are copies - automatic deduplication is a work in progress I think?<br> </div> Tue, 07 Jul 2020 12:15:56 +0000 Btrfs at Facebook https://lwn.net/Articles/825434/ https://lwn.net/Articles/825434/ TMM <div class="FormattedComment"> I don&#x27;t work for facebook, but given that they said they used cheap flash drives I&#x27;m assuming that they are also using somewhat cheap raid controllers. 
The controller where I saw this was a highpoint rocketraid controller which did the same thing. I imagine that more of the cheaper (soft)raid controllers, which are actually mostly just sata controllers, do this to write their raid metadata.<br> </div> Tue, 07 Jul 2020 03:37:41 +0000 Btrfs at Facebook https://lwn.net/Articles/825335/ https://lwn.net/Articles/825335/ ncm <div class="FormattedComment"> Or, indeed, just about any conventional database.<br> </div> Sun, 05 Jul 2020 21:51:07 +0000 Btrfs at Facebook https://lwn.net/Articles/825333/ https://lwn.net/Articles/825333/ ncm <div class="FormattedComment"> Facebook seems to prefer SSD, for Reasons, but a TB SSD is pretty normal now. FB can certainly afford a thousand per developer, and something to plug them all into, what with large multiple $millions annual profit per employee. Not spending as much per head as could possibly be useful would be irresponsible even for a wildly less profitable enterprise. If they don&#x27;t *all* have 3990Xes with 1TB RAM and 3 4k screens yet, heads should roll.<br> <p> But as a practical matter, I gather from discussion with core devs (a couple of years ago) that it is part of FB&#x27;s received engineering culture to copy dependencies, so there are, e.g., hundreds or thousands of copies of libz scattered throughout their repository, in many, many versions, and likewise now, I expect, libzstd. Btrfs would be able to share blocks for them when unpacked, but maybe not so much so as represented within git?<br> <p> Imagine the engineering effort just to consolidate all those libz instances, identifying version, bug, and local patch dependencies, while every week more appear. 
Imagine the effort just to bring them all up to the current release: not a thing to make your CV shine.<br> </div> Sun, 05 Jul 2020 21:02:01 +0000 Btrfs at Facebook https://lwn.net/Articles/825322/ https://lwn.net/Articles/825322/ caliloo <div class="FormattedComment"> Whatsapp was built with erlang, i wouldn’t be surprised this has something to do with their method of message storage....<br> </div> Sun, 05 Jul 2020 19:02:19 +0000 Btrfs at Facebook https://lwn.net/Articles/825292/ https://lwn.net/Articles/825292/ josefbacik <div class="FormattedComment"> There’s an index log that compresses well. Mostly they’re in it for the snapshotting.<br> </div> Sun, 05 Jul 2020 02:10:48 +0000 Btrfs at Facebook https://lwn.net/Articles/825288/ https://lwn.net/Articles/825288/ gus3 <div class="FormattedComment"> I just (an hour or so ago) bought a 4T USB drive for backups for less than $100. My old backup drive is showing some age, so it was time.<br> <p> That&#x27;s big enough for four 800GB containers mapped to virtual space, plus filesystem overhead. The only choke-points would be the USB3 connection and head seek time on the platters. Of course, SSD would eliminate the latter.<br> <p> It isn&#x27;t bleeding edge, or even leading edge anymore. 
It&#x27;s COTS (commodity, off-the-shelf).<br> </div> Sat, 04 Jul 2020 22:03:19 +0000 Btrfs at Facebook https://lwn.net/Articles/825199/ https://lwn.net/Articles/825199/ pizza <div class="FormattedComment"> <font class="QuotedText">&gt; For a complex system, there will be different levels of design and taking into account low-level technical details for the high-level design would be wrong.</font><br> <p> I&#x27;m not talking about &quot;low level details&quot;, I&#x27;m talking about being aware of what is or isn&#x27;t possible (or practical) to implement.<br> <p> Outside of a handful of fields (mostly to do with advanced mathematics) those &quot;designs&quot; still need to be realized in the real world.<br> </div> Sat, 04 Jul 2020 12:04:27 +0000 Btrfs at Facebook https://lwn.net/Articles/825250/ https://lwn.net/Articles/825250/ scott@deltaex.com <div class="FormattedComment"> Compressing ciphertext is effectively the same as cracking it.<br> </div> Fri, 03 Jul 2020 22:52:04 +0000 Btrfs at Facebook https://lwn.net/Articles/825248/ https://lwn.net/Articles/825248/ nivedita76 <div class="FormattedComment"> I didn&#x27;t understand it that way: they&#x27;re not talking about compression issues when discussing the messages, but problems with storing them along with the metadata instead of in separate blocks. My guess is that if text messages are compressed at all, the messages are compressed, then encrypted and sent; and decrypted and decompressed on the recipient&#x27;s phone. The larger ones for photos, videos, audio would already be in a compressed format anyway.<br> </div> Fri, 03 Jul 2020 22:49:10 +0000 Page tables https://lwn.net/Articles/825222/ https://lwn.net/Articles/825222/ corbet The kernel was (and still is) built for a given page-table depth; the compiler then shorts out the code for the levels that aren't actually used. 
Fri, 03 Jul 2020 16:58:42 +0000 Btrfs at Facebook https://lwn.net/Articles/825218/ https://lwn.net/Articles/825218/ shemminger <div class="FormattedComment"> 800G is huge. All of Windows is a single git repo of 300G with 35M files.<br> Microsoft built Git Virtual Filesystem to deal with that, and avoid having to do massive downloads and checkouts.<br> </div> Fri, 03 Jul 2020 16:51:27 +0000 Btrfs at Facebook https://lwn.net/Articles/825215/ https://lwn.net/Articles/825215/ champtar <div class="FormattedComment"> Encrypted content compresses really poorly if at all, hence my question: how can btrfs compression help? See chatcannon for an explanation.<br> </div> Fri, 03 Jul 2020 16:11:08 +0000 Btrfs at Facebook https://lwn.net/Articles/825205/ https://lwn.net/Articles/825205/ scientes <div class="FormattedComment"> <font class="QuotedText">&gt; But Btrfs, too, has run into scalability problems with this workload. Messages are tiny, compressed files; they are small enough that the message text is usually stored with the file&#x27;s metadata rather than in a separate data extent. That leads to filesystems with hundreds of gigabytes of metadata and high levels of fragmentation. These problems have been addressed, Bacik said, and it&#x27;s &quot;relatively smooth sailing&quot; now. That said, there are still some issues left to be dealt with, and WhatsApp may not make the switch to Btrfs in the end. </font><br> <p> This sounds like a perfect use-case of kyotocabinet. I have heard reports that it is rock solid with huge loads, and you also put kyototycoon on top for tcp access (including a memcached-compatible mode).<br> </div> Fri, 03 Jul 2020 14:29:27 +0000 Btrfs at Facebook https://lwn.net/Articles/825204/ https://lwn.net/Articles/825204/ Wol <div class="FormattedComment"> In this case it sounds like &quot;the computer is the database&quot;, along the lines of the AS/400 or Pick, would be a good implementation ... 
:-)<br> <p> Cheers,<br> Wol<br> </div> Fri, 03 Jul 2020 14:26:08 +0000 Page tables https://lwn.net/Articles/825202/ https://lwn.net/Articles/825202/ Wol <div class="FormattedComment"> Was the third level emulated in software for processors that didn&#x27;t have it in hardware? I&#x27;m sure I remember something of the sort - that it was better to have an &quot;ideal&quot; design and emulate round practical short-comings.<br> <p> Cheers,<br> Wol<br> </div> Fri, 03 Jul 2020 14:23:57 +0000 Btrfs at Facebook https://lwn.net/Articles/825200/ https://lwn.net/Articles/825200/ kevincox <div class="FormattedComment"> It doesn&#x27;t seem large to me. They have a lot of projects going on, and I don&#x27;t know how much history this contains. I wouldn&#x27;t be surprised if a number of assets such as images were checked in as well so that they can be built into the final site/whatever. Then I would guess there are 27 versions of jQuery and a couple of compiled binaries.<br> <p> None of these things are huge but when you sum them up across a large company 800GB seems like they were really keeping themselves in check on the average.<br> </div> Fri, 03 Jul 2020 13:59:42 +0000 Btrfs at Facebook https://lwn.net/Articles/825191/ https://lwn.net/Articles/825191/ champtar <div class="FormattedComment"> That makes more sense, thanks<br> </div> Fri, 03 Jul 2020 13:35:17 +0000 Btrfs at Facebook https://lwn.net/Articles/825186/ https://lwn.net/Articles/825186/ Karellen <blockquote>These machines contain the entire source code for the web site; it is a struggle to get the whole thing to fit into the 800GB allotted for each and still leave room for some actual work to be done.</blockquote> <p><em>800GB‽‽</em> <p>Am I just getting old, or is that actually somewhat on the large side? Fri, 03 Jul 2020 13:20:42 +0000 Page tables https://lwn.net/Articles/825187/ https://lwn.net/Articles/825187/ corbet Digging through the code...v1.0 had two-level page tables internally. 
The third level was added once processors started supporting it. The same is true of the fourth and fifth levels. Fri, 03 Jul 2020 13:18:19 +0000 Btrfs at Facebook https://lwn.net/Articles/825185/ https://lwn.net/Articles/825185/ judas_iscariote <div class="FormattedComment"> wut? whatsapp messages leave the user device encrypted.. you need the recipient&#x27;s private key to decrypt them.<br> </div> Fri, 03 Jul 2020 13:10:13 +0000 Btrfs at Facebook https://lwn.net/Articles/825184/ https://lwn.net/Articles/825184/ cpitrat <div class="FormattedComment"> I guess WoL has a point. The issue here is that we&#x27;re saying &quot;design&quot; for different things. For a complex system, there will be different levels of design and taking into account low-level technical details for the high-level design would be wrong.<br> <p> But in this case, we&#x27;re talking about deciding how to store the individual messages. It&#x27;s not even about design, it&#x27;s a technical decision to store them as individual files (if that&#x27;s really what is done, that&#x27;s where I have a doubt).<br> </div> Fri, 03 Jul 2020 13:02:55 +0000 Btrfs at Facebook https://lwn.net/Articles/825163/ https://lwn.net/Articles/825163/ pizza <div class="FormattedComment"> <font class="QuotedText">&gt; Actually, not taking into account the limitations of the tools is GOOD design.</font><br> <p> I&#x27;m not so sure about that.<br> <p> Because ignoring reality is how we get designs that are predicated on the likes of spherical cows and other things that don&#x27;t actually exist, and result in overly complex implementations with more exceptions than rules. 
If it can even be implemented at all.<br> <p> <p> </div> Fri, 03 Jul 2020 12:44:49 +0000 Btrfs at Facebook https://lwn.net/Articles/825162/ https://lwn.net/Articles/825162/ Wol <div class="FormattedComment"> Actually, not taking into account the limitations of the tools is GOOD design.<br> <p> Not taking into account the limitations when implementing the design is bad engineering.<br> <p> I think Linus has commented on various occasions that when he&#x27;s taken things like processor limitations into account for the basic design, the resulting implementation has been, shall we say, suboptimal.<br> <p> If on the other hand the design *ignores* the limitations, and then the implementation works round the limitations, the end result is much better (plus, of course, as the tools improve the workarounds can be ripped out. If the workarounds are part of the design, then adapting to better tools is something that rarely gets done).<br> <p> For example, I think the early processors only had a two-level page table. The best design turned out to be three-level, and when the early code using the two-level hardware was replaced by a three-level design that took advantage of the hardware for two of them, the result was a major improvement.<br> <p> Cheers,<br> Wol<br> </div> Fri, 03 Jul 2020 12:25:42 +0000 Btrfs at Facebook https://lwn.net/Articles/825158/ https://lwn.net/Articles/825158/ ibukanov <div class="FormattedComment"> Engineers at Facebook are not stupid. I suspect there is a missing piece of information about this particular case that justifies using the file system for this and that outweighs the harm of storing billions of tiny files.<br> </div> Fri, 03 Jul 2020 10:11:01 +0000 Btrfs at Facebook https://lwn.net/Articles/825154/ https://lwn.net/Articles/825154/ chatcannon <div class="FormattedComment"> My guess: The message metadata (sender, recipient, timestamp, possibly a sequential ID, etc) are not encrypted. 
Most messages are short enough that the message metadata are larger than the cyphertext of the message content, and that all of this (message metadata plus message content) is small enough to fit in the filesystem metadata space rather than needing a block of its own.<br> </div> Fri, 03 Jul 2020 07:20:47 +0000 Btrfs at Facebook https://lwn.net/Articles/825151/ https://lwn.net/Articles/825151/ cpitrat <div class="FormattedComment"> Well, not taking into account the limitations of the tools you use in your design is a bad design. If you plan to store billions of files of a few bytes and your rationale is &quot;a good filesystem should handle this well, all filesystems are bad&quot;, this is not what I&#x27;d call good engineering ...<br> </div> Fri, 03 Jul 2020 06:36:45 +0000 Btrfs at Facebook https://lwn.net/Articles/825142/ https://lwn.net/Articles/825142/ champtar <div class="FormattedComment"> Am I the only one understanding that WhatsApp messages are stored unencrypted? If they are encrypted I&#x27;m not sure how Btrfs compression can help?<br> </div> Fri, 03 Jul 2020 01:03:55 +0000 Btrfs at Facebook https://lwn.net/Articles/825132/ https://lwn.net/Articles/825132/ Tov <div class="FormattedComment"> &quot;You mean each message is a separate file? It sounds like a poor design ...&quot;<br> <p> Not if you really think about it. A message is a single entity, which should be &quot;filed&quot; individually. However, metadata overhead and poor performance of older file systems have taught us to implement even more layers of indirection to paste over poor file system design...<br> </div> Thu, 02 Jul 2020 21:50:13 +0000 Btrfs at Facebook https://lwn.net/Articles/825130/ https://lwn.net/Articles/825130/ cpitrat <div class="FormattedComment"> &quot;Messages are tiny, compressed files; they are small enough that the message text is usually stored with the file&#x27;s metadata rather than in a separate data extent.&quot;<br> <p> You mean each message is a separate file? 
It sounds like a poor design ...<br> </div> Thu, 02 Jul 2020 20:59:00 +0000 Btrfs at Facebook https://lwn.net/Articles/825122/ https://lwn.net/Articles/825122/ gbalane <div class="FormattedComment"> Thanks for the update. One question from the last update: does FB still use read-only stores on Btrfs for serving the news feed?<br> </div> Thu, 02 Jul 2020 19:16:11 +0000 Btrfs at Facebook https://lwn.net/Articles/825120/ https://lwn.net/Articles/825120/ Kamilion <div class="FormattedComment"> <font class="QuotedText">&gt; Experience has led him to assume that such things are Btrfs bugs, but this time it turned out that the RAID controller was writing some random data to the middle of the disk on every reboot. This problem had been happening for years, silently corrupting filesystems; Btrfs flagged it almost immediately. </font><br> <p> Which RAID controller would this be? If it&#x27;s an LSI, I&#x27;d really love to know, and I doubt Broadcom would care about LSI&#x27;s reputation at this point... Hopefully it&#x27;s not an adaptec...<br> </div> Thu, 02 Jul 2020 19:02:28 +0000