Btrfs at Facebook
Every Facebook service, Bacik began, runs within a container; among other things, that makes it easy to migrate services between machines (or even between data centers). Facebook has a huge number of machines, so it is impossible to manage them in any sort of unique way; the company wants all of these machines to be as consistent as possible. It should be possible to move any service to any machine at any time. The company will, on occasion, bring down entire data centers to test how well its disaster-recovery mechanisms work.
Faster testing and more
All of these containerized services are using Btrfs for their root filesystem. The initial use case within Facebook, though, was for the build servers. The company has a lot of code, implementing the web pages, mobile apps, test suites, and the infrastructure to support all of that.
The Facebook workflow dictates that nobody commits code directly to a repository. Instead, there is a whole testing routine that is run on each change first. The build system will clone the company repository, apply the patch, build the system, and run the tests; once that is done, the whole thing is cleaned up in preparation for the next patch to test. That cleanup phase, as it turns out, is relatively slow, averaging two or three minutes to delete large directory trees full of code. Some tests can take ten minutes to clean up, during which that machine is unavailable to run the next test.
The end result was that developers were finding that it would take hours to get individual changes through the process. So the infrastructure team decided to try using Btrfs. Rather than creating a clone of the repository, the test system just makes a snapshot, which is a nearly instantaneous operation. After the tests are run, the snapshot is deleted, which also appears to be instantaneous from a user-space point of view. There is, of course, a worker thread actually cleaning up the snapshot in the background, but cleaning up a snapshot is a lot faster than removing directories from an ordinary filesystem. This change saved a lot of time in the build system and reduced the capacity requirement — the number of machines needed to do builds and testing — by one-third.
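As a rough sketch of that cycle (not Facebook's actual tooling), the clone/test/cleanup sequence reduces to a couple of btrfs-progs invocations. The paths and the test command below are hypothetical; the example assumes a Btrfs subvolume holding the checked-out repository at /repo and the btrfs-progs tools installed:

```python
# Sketch of a snapshot-based test cycle.  /repo is assumed to be a Btrfs
# subvolume containing the repository; the paths and the test command are
# placeholder values, not Facebook's real build system.
import subprocess

REPO = "/repo"                 # base subvolume with the checked-out source
WORK = "/repo-snapshots/test"  # throwaway snapshot used for one test run

def run(*cmd):
    subprocess.run(cmd, check=True)

def test_patch(patch_path):
    # A writable snapshot of the repository is effectively instantaneous;
    # no file data is copied.
    run("btrfs", "subvolume", "snapshot", REPO, WORK)
    try:
        run("git", "-C", WORK, "apply", patch_path)
        run("make", "-C", WORK, "test")   # stand-in for the real test suite
    finally:
        # Deleting the snapshot returns immediately; the actual cleanup
        # happens in a background worker inside the filesystem.
        run("btrfs", "subvolume", "delete", WORK)
```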
After that experiment worked out so well, the infrastructure team switched fully to Btrfs; the entire build system now uses it. It turns out that there was another strong reason to switch to Btrfs: its support for on-disk compression. The point here is not just saving storage space, but also extending the lifetime of the storage itself. Facebook spends a lot of money on flash storage — evidently inexpensive, low-quality flash storage at that. The company would like this storage to last as long as possible, which implies minimizing the number of write cycles performed. Source code tends to compress well, so compression reduces the number of blocks written considerably, slowing the process of wearing out the storage devices.
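For readers who want to try the same trick, compression can be enabled filesystem-wide at mount time or per directory. The device and mount point in this sketch are hypothetical, and zstd is only one of the supported algorithms (zlib and lzo are the others):

```python
# Two standard ways to enable Btrfs compression; the device and mount point
# are placeholders.  compress=zstd requires a 4.14 or later kernel.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

# 1. Filesystem-wide, at mount time:
run("mount", "-o", "compress=zstd", "/dev/sdb1", "/mnt/build")

# 2. Per directory; newly written files under it inherit the property:
run("btrfs", "property", "set", "/mnt/build/src", "compression", "zstd")
```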
This work, Bacik said, was done by the infrastructure team without any sort of encouragement from Facebook's Btrfs developers; indeed, they didn't even know that it was happening. He was surprised by how well it worked in the end.
Then, there is the case of virtual machines for developers. Facebook has an "unreasonable amount" of engineering staff, Bacik said; each developer has a virtual machine for their work. These machines contain the entire source code for the web site; it is a struggle to get the whole thing to fit into the 800GB allotted for each and still leave room for some actual work to be done. Once again, compression helps to save the day, and works well, though he did admit that there have been some "ENOSPC issues" (problems that result when the filesystem runs out of available space).
Another big part of the Facebook code base is the container system itself, known internally as "tupperware". Containers in this system use Btrfs for the root filesystem, a choice that enables a number of interesting things. The send and receive mechanism can be used both to build the base image and to enable fast, reliable upgrades. When a task is deployed to a container, a snapshot is made of the base image (running a version of CentOS) and the specific container is loaded on top. When that service is done, cleanup is just a matter of deleting the working subvolume and returning to the base snapshot.
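A minimal sketch of that pattern, with invented subvolume names and none of tupperware's real machinery, might look like the following; it assumes the base image arrives as a btrfs send stream and is kept as a read-only subvolume:

```python
# Rough illustration of the container pattern described above: a read-only
# base image is distributed with send/receive, each task runs in a writable
# snapshot of it, and cleanup is just a subvolume deletion.  All paths and
# names are illustrative, not Facebook's actual layout.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

def fetch_base_image(stream_path):
    # Replay a "btrfs send" stream built elsewhere; receive requires the
    # source to have been a read-only snapshot.
    with open(stream_path, "rb") as stream:
        subprocess.run(["btrfs", "receive", "/containers/images"],
                       stdin=stream, check=True)

def start_task(name):
    # Writable snapshot of the read-only base image for this task.
    run("btrfs", "subvolume", "snapshot",
        "/containers/images/centos-base", f"/containers/run/{name}")

def stop_task(name):
    # Cleanup: drop the working subvolume; the base image is untouched.
    run("btrfs", "subvolume", "delete", f"/containers/run/{name}")
```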
Additionally, Btrfs compression once again reduces write I/O, helping Facebook to make the best use of cheap flash drives. Btrfs is also the only Linux filesystem that works with the io.latency and io.cost (formerly io.weight) block I/O controllers. These controllers don't work with ext4 at all, he said, and there are some problems still with XFS; he has not been able to invest the effort to make things work better on those filesystems.
An in-progress project concerns the WhatsApp service. WhatsApp messages are normally stored on the users' devices, but they must be kept centrally when the user is offline. Given the number of users, that's a fair amount of storage. Facebook is using XFS for this task, but has run into unspecified "weird scalability issues". Btrfs compression can help here as well, and snapshots will be useful for cleaning things up.
But Btrfs, too, has run into scalability problems with this workload. Messages are tiny, compressed files; they are small enough that the message text is usually stored with the file's metadata rather than in a separate data extent. That leads to filesystems with hundreds of gigabytes of metadata and high levels of fragmentation. These problems have been addressed, Bacik said, and it's "relatively smooth sailing" now. That said, there are still some issues left to be dealt with, and WhatsApp may not make the switch to Btrfs in the end.
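The inlining behavior in question is governed by the max_inline mount option (files below that threshold, 2048 bytes by default on recent kernels, are stored directly in the metadata B-tree). The following is a hypothetical illustration, with placeholder paths and device, of how one might observe or disable it; it is not a description of WhatsApp's setup:

```python
# Hypothetical illustration of inline extents: files below the max_inline
# threshold end up inside the metadata b-tree rather than in a separate
# data extent.  Paths and the device name are placeholders.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

# Mounting with inlining disabled entirely is one way to trade small-file
# performance against metadata growth:
#   run("mount", "-o", "max_inline=0", "/dev/sdc1", "/mnt/msgs")

# With default settings, a tiny file should be stored inline:
with open("/mnt/msgs/tiny.txt", "w") as f:
    f.write("short message")
run("sync")

# filefrag reports such extents with the "inline" flag:
run("filefrag", "-v", "/mnt/msgs/tiny.txt")
```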
The good, the bad, and the unresolved
Bacik concluded with a summary of what has worked well and what has not. He told the story of tracking down a bug where Btrfs kept reporting checksum errors when working with a specific RAID controller. Experience has led him to assume that such things are Btrfs bugs, but this time it turned out that the RAID controller was writing some random data to the middle of the disk on every reboot. This problem had been happening for years, silently corrupting filesystems; Btrfs flagged it almost immediately. That is when he started to think that, perhaps, it's time to start trusting Btrfs a bit more.
Another unexpected benefit was the help Btrfs has provided in tracking down microarchitectural processor bugs. Btrfs tends to stress the system's CPU more than other filesystems; features like checksumming, compression, and work offloaded to threads tend to keep things busy. Facebook, which builds its own hardware, has run into a few CPU problems that have been exposed by Btrfs; that made it easy to create reproducers to send to CPU vendors in order to get things fixed.
In general, he said, he has spent a lot of time trying to track down systemic problems in the filesystem. Being a filesystem developer, he is naturally conservative; he worries that "the world will burn down" and it will all be his fault. In almost every case, these problems have turned out to have their origin in the hardware or other parts of the system. Hardware, he said, is worse than Btrfs when it comes to quality.
What he was most happy with, though, was perhaps the fact that most Btrfs use cases in the company have been developed naturally by other groups. He has never gone out of his way to tell other teams that they need to use Btrfs, but they have chosen it for its merits anyway.
All is not perfect, however. At the top of his list were bugs that manifest when Btrfs runs out of space, a problem that has plagued Btrfs since the beginning; "it's not Btrfs if there isn't an ENOSPC issue", he said, adding that he has spent most of his career chasing these problems. There are still a few bad cases in need of fixing, but these are rare occurrences at this point. He is relatively happy, finally, with the state of ENOSPC handling.
There have been some scalability problems that have come up, primarily with the WhatsApp workload as described above. These bugs have highlighted some interesting corner cases, he said, but haven't been really difficult to work out. There were also a few "weird, one-off issues" that mostly affected block subsystem maintainer Jens Axboe; "we like Jens, so we fixed it".
At the top of the list of problems still needing resolution is quota groups; their overhead is too high, he said, and things just break at times. He plans to solve these problems within the next year. There are users who would like to see encryption at the subvolume level; that would allow the safe stacking of services with differing privacy requirements. Omar Sandoval is working on that feature now.
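For reference, the quota groups being discussed are managed with the btrfs quota and qgroup commands; this sketch uses invented paths and limits:

```python
# Sketch of basic quota-group usage; the mount point, subvolume, and limit
# are invented for illustration.  Enabling quotas turns on the per-subvolume
# accounting whose overhead was mentioned above.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

run("btrfs", "quota", "enable", "/mnt/containers")

# Cap the space referenced by one subvolume at 10GiB:
run("btrfs", "qgroup", "limit", "10G", "/mnt/containers/run/task1")

# Show per-qgroup referenced and exclusive usage:
run("btrfs", "qgroup", "show", "/mnt/containers")
```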
Then there is the issue of RAID support, another longstanding problem area for Btrfs. Basic striping and mirroring work well, and have done so for years, he said. RAID 5 and 6 have "lots of edge cases", though, and have been famously unreliable. These problems, too, are on his list, but solving them will require "lots of work" over the next year or two.
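For context, the RAID profiles in question are chosen at mkfs time (and can be converted later with btrfs balance). The device names below are placeholders, and the two layouts are alternatives rather than steps to run in sequence:

```python
# Two alternative multi-device layouts; device names are placeholders.
# raid1 mirroring is among the long-stable configurations, while raid5 and
# raid6 are the profiles with the known edge cases mentioned above.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

def make_mirrored_fs():
    # Mirror both data and metadata across two devices.
    run("mkfs.btrfs", "-d", "raid1", "-m", "raid1", "/dev/sdb", "/dev/sdc")

def make_parity_fs():
    # Parity RAID for data with mirrored metadata, a common way to limit
    # exposure to the raid5/6 problems, across four devices.
    run("mkfs.btrfs", "-d", "raid5", "-m", "raid1",
        "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde")
```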
| Index entries for this article | |
| --- | --- |
| Kernel | Btrfs |
| Kernel | Filesystems/Btrfs |
| Conference | Open Source Summit North America/2020 |
Posted Jul 2, 2020 19:02 UTC (Thu)
by Kamilion (subscriber, #42576)
[Link] (1 responses)
Which RAID controller would this be? If it's an LSI, I'd really love to know, and I doubt Broadcom would care about LSI's reputation at this point... Hopefully it's not an adaptec...
Posted Jul 7, 2020 3:37 UTC (Tue)
by TMM (subscriber, #79398)
[Link]
Posted Jul 2, 2020 19:16 UTC (Thu)
by gbalane (subscriber, #130561)
[Link]
Posted Jul 2, 2020 20:59 UTC (Thu)
by cpitrat (subscriber, #116459)
[Link] (14 responses)
You mean each message is a separate file? It sounds like a poor design ...
Posted Jul 2, 2020 21:50 UTC (Thu)
by Tov (subscriber, #61080)
[Link] (10 responses)
Not if you really think about it. A message is a single entity, which should be "filed" individually. However, metadata overhead and poor performance of older file systems have taught us to implement even more layers of indirection to paste over poor file system design...
Posted Jul 3, 2020 6:36 UTC (Fri)
by cpitrat (subscriber, #116459)
[Link] (9 responses)
Posted Jul 3, 2020 10:11 UTC (Fri)
by ibukanov (subscriber, #3942)
[Link]
Posted Jul 3, 2020 12:25 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (7 responses)
Not taking into account the limitations when implementing the design is bad engineering.
I think Linus has commented on various occasions that when he's taken things like processor limitations into account for the basic design, the resulting implementation has been, shall we say, suboptimal.
If on the other hand the design *ignores* the limitations, and then the implementation works round the limitations, the end result is much better (plus, of course, as the tools improve the workarounds can be ripped out. If the workarounds are part of the design, then adapting to better tools is something that rarely gets done).
For example, I think the early processors only had a two-level page table. The best design turned out to be three-level, and when the early code using the two-level hardware was replaced by a three-level design that took advantage of the hardware for two of them, the result was a major improvement.
Cheers,
Wol
Posted Jul 3, 2020 12:44 UTC (Fri)
by pizza (subscriber, #46)
[Link] (3 responses)
I'm not so sure about that.
Because ignoring reality is how we get designs that are predicated on the likes of spherical cows and other things that don't actually exist, and result in overly complex implementations with more exceptions than rules. If it can even be implemented at all.
Posted Jul 3, 2020 13:02 UTC (Fri)
by cpitrat (subscriber, #116459)
[Link] (2 responses)
But in this case, we're talking about deciding how to store the individual messages. It's not even about design, it's a technical decision to store them as individual files (if that's really what is done, that's where I have a doubt).
Posted Jul 3, 2020 14:26 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Jul 4, 2020 12:04 UTC (Sat)
by pizza (subscriber, #46)
[Link]
I'm not talking about "low level details", I'm talking about being aware of what is or isn't possible (or practical) to implement.
Outside of a handful of fields (mostly to do with advanced mathematics) those "designs" still need to be realized in the real world.
Posted Jul 3, 2020 13:18 UTC (Fri)
by corbet (editor, #1)
[Link] (2 responses)
Digging through the code... v1.0 had two-level page tables internally. The third level was added once processors started supporting it. The same is true of the fourth and fifth levels.
Posted Jul 3, 2020 14:23 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (1 responses)
Cheers,
Wol
Posted Jul 3, 2020 16:58 UTC (Fri)
by corbet (editor, #1)
[Link]
The kernel was (and still is) built for a given page-table depth; the compiler then shorts out the code for the levels that aren't actually used.
Posted Aug 21, 2020 16:31 UTC (Fri)
by rep_movsd (guest, #100040)
[Link] (1 responses)
Their message server used this same retarded technique of saving messages as files... on a shared NFS mount.
Version 1.0 had been a huge success in India, but they wanted a snazzy UI like Yahoo.
I won't go into the horrors of such a design idea (which to be fair did have one thing going for it - simplicity) but it's insane to use a filesystem as a database.
I'm sure FB devs had their reasons (maybe legacy), but fundamentally a crap idea. Filesystems are designed to be moderately efficient at various sizes of files. I don't think any FS dev in their wildest nightmares would have considered a use case of zillions of files which are smaller than a floppy disk sector.
Posted Aug 21, 2020 17:18 UTC (Fri)
by anselm (subscriber, #2796)
[Link]
Things are maybe not so bad if – like Facebook – you have actual file system developers on your payroll.
Posted Aug 26, 2020 11:45 UTC (Wed)
by flussence (guest, #85566)
[Link]
Posted Jul 3, 2020 1:03 UTC (Fri)
by champtar (subscriber, #128673)
[Link] (7 responses)
Posted Jul 3, 2020 7:20 UTC (Fri)
by chatcannon (subscriber, #122400)
[Link] (1 responses)
Posted Jul 3, 2020 13:35 UTC (Fri)
by champtar (subscriber, #128673)
[Link]
Posted Jul 3, 2020 13:10 UTC (Fri)
by judas_iscariote (guest, #47386)
[Link] (2 responses)
Posted Jul 3, 2020 16:11 UTC (Fri)
by champtar (subscriber, #128673)
[Link] (1 responses)
Posted Jul 3, 2020 22:52 UTC (Fri)
by scott@deltaex.com (guest, #139665)
[Link]
Posted Jul 3, 2020 22:49 UTC (Fri)
by nivedita76 (subscriber, #121790)
[Link]
Posted Jul 5, 2020 2:10 UTC (Sun)
by josefbacik (subscriber, #90083)
[Link]
Posted Jul 3, 2020 13:20 UTC (Fri)
by Karellen (subscriber, #67644)
[Link] (6 responses)
> These machines contain the entire source code for the web site; it is a struggle to get the whole thing to fit into the 800GB allotted for each and still leave room for some actual work to be done.
800GB‽‽
Am I just getting old, or is that actually somewhat on the large side?
Posted Jul 3, 2020 13:59 UTC (Fri)
by kevincox (subscriber, #93938)
[Link]
None of these things are huge but when you sum them up across a large company 800GB seems like they were really keeping themselves in check on the average.
Posted Jul 3, 2020 16:51 UTC (Fri)
by shemminger (subscriber, #5739)
[Link]
Microsoft built Git Virtual Filesystem to deal with that, and avoid having to do massive downloads and checkouts.
Posted Jul 4, 2020 22:03 UTC (Sat)
by gus3 (guest, #61103)
[Link] (3 responses)
That's big enough for four 800GB containers mapped to virtual space, plus filesystem overhead. The only choke-points would be the USB3 connection and head seek time on the platters. Of course, SSD would eliminate the latter.
It isn't bleeding edge, or even leading edge anymore. It's COTS (commodity, off-the-shelf).
Posted Jul 5, 2020 21:02 UTC (Sun)
by ncm (guest, #165)
[Link] (2 responses)
But as a practical matter, I gather from discussion with core devs (a couple of years ago) that it is part of FB's received engineering culture to copy dependencies, so there are, e.g., hundreds or thousands of copies of libz scattered throughout their repository, in many, many versions, and likewise now, I expect, libzstd. Btrfs would be able to share blocks for them when unpacked, but maybe not so much so as represented within git?
Imagine the engineering effort just to consolidate all those libz instances, identifying version, bug, and local patch dependencies, while every week more appear. Imagine the effort just to bring them all up to the current release: not a thing to make your CV shine.
Posted Jul 7, 2020 12:15 UTC (Tue)
by jezuch (subscriber, #52988)
[Link] (1 responses)
Posted Jul 9, 2020 20:51 UTC (Thu)
by khim (subscriber, #9252)
[Link]
Useful automatic deduplication is work in progress. You can run duperemove in the background and everything would be nicely deduplicated, but performance would be so awful that you just don't want to do that. But if it's a developer's system you can do that while the developer is sleeping - then it works nicely.
Posted Jul 3, 2020 14:29 UTC (Fri)
by scientes (guest, #83068)
[Link] (1 responses)
This sounds like a perfect use-case of kyotocabinet. I have heard reports that it is rock solid with huge loads, and you also put kyototycoon on top for tcp access (including a memcached-compatible mode).
Posted Jul 5, 2020 21:51 UTC (Sun)
by ncm (guest, #165)
[Link]
Posted Jul 5, 2020 19:02 UTC (Sun)
by caliloo (subscriber, #50055)
[Link]
Posted Jul 8, 2020 10:22 UTC (Wed)
by intelfx (subscriber, #130118)
[Link]
I'd be *really* pleased if this was to happen.
Posted Jul 12, 2020 15:35 UTC (Sun)
by anton (subscriber, #25547)
[Link] (1 responses)
> Basic striping and mirroring work well, and have done so for years
If only btrfs was user-friendly when a drive of a mirror pair fails; instead it gives you one shot at dealing with the problem, and on the next reboot becomes irreversibly read-only; that was certainly the state of the FAQ when we installed a new server earlier this year. So we decided to try out ZFS instead, and its handling of a simulated disk failure and replacement was very smooth, hardly perceptible.
Posted Aug 11, 2020 22:17 UTC (Tue)
by redneb (guest, #140746)
[Link]
This is not true anymore, it was a bug that was fixed in 4.14 (11/2017). There are two commits about this and the pull request says: "degraded read-write mount is allowed if all the raid profile constraints are met, now based on more accurate check".