Btrfs at Facebook
Every Facebook service, Bacik began, runs within a container; among other things, that makes it easy to migrate services between machines (or even between data centers). Facebook has a huge number of machines, so it is impossible to manage them in any sort of unique way; the company wants all of these machines to be as consistent as possible. It should be possible to move any service to any machine at any time. The company will, on occasion, bring down entire data centers to test how well its disaster-recovery mechanisms work.
Faster testing and more
All of these containerized services are using Btrfs for their root filesystem. The initial use case within Facebook, though, was for the build servers. The company has a lot of code, implementing the web pages, mobile apps, test suites, and the infrastructure to support all of that.
The Facebook workflow dictates that nobody commits code directly to a repository. Instead, there is a whole testing routine that is run on each change first. The build system will clone the company repository, apply the patch, build the system, and run the tests; once that is done, the whole thing is cleaned up in preparation for the next patch to test. That cleanup phase, as it turns out, is relatively slow, averaging two or three minutes to delete large directory trees full of code. Some tests can take ten minutes to clean up, during which that machine is unavailable to run the next test.
The end result was that developers were finding that it would take hours to get individual changes through the process. So the infrastructure team decided to try using Btrfs. Rather than creating a clone of the repository, the test system just makes a snapshot, which is a nearly instantaneous operation. After the tests are run, the snapshot is deleted, which also appears to be instantaneous from a user-space point of view. There is, of course, a worker thread actually cleaning up the snapshot in the background, but cleaning up a snapshot is a lot faster than removing directories from an ordinary filesystem. This change saved a lot of time in the build system and reduced the capacity requirement — the number of machines needed to do builds and testing — by one-third.
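As a rough sketch of that cycle (not Facebook's actual tooling), the clone/test/cleanup sequence reduces to a couple of btrfs-progs invocations. The paths and the test command below are hypothetical; the example assumes a Btrfs subvolume holding the checked-out repository at /repo and the btrfs-progs tools installed:

```python
# Sketch of a snapshot-based test cycle.  /repo is assumed to be a Btrfs
# subvolume containing the repository; the paths and the test command are
# placeholder values, not Facebook's real build system.
import subprocess

REPO = "/repo"                 # base subvolume with the checked-out source
WORK = "/repo-snapshots/test"  # throwaway snapshot used for one test run

def run(*cmd):
    subprocess.run(cmd, check=True)

def test_patch(patch_path):
    # A writable snapshot of the repository is effectively instantaneous;
    # no file data is copied.
    run("btrfs", "subvolume", "snapshot", REPO, WORK)
    try:
        run("git", "-C", WORK, "apply", patch_path)
        run("make", "-C", WORK, "test")   # stand-in for the real test suite
    finally:
        # Deleting the snapshot returns immediately; the actual cleanup
        # happens in a background worker inside the filesystem.
        run("btrfs", "subvolume", "delete", WORK)
```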
After that experiment worked out so well, the infrastructure team switched fully to Btrfs; the entire build system now uses it. It turns out that there was another strong reason to switch to Btrfs: its support for on-disk compression. The point here is not just saving storage space, but also extending the lifetime of the storage itself. Facebook spends a lot of money on flash storage — evidently inexpensive, low-quality flash storage at that. The company would like this storage to last as long as possible, which implies minimizing the number of write cycles performed. Source code tends to compress well, so compression reduces the number of blocks written considerably, slowing the process of wearing out the storage devices.
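For readers who want to try the same trick, compression can be enabled filesystem-wide at mount time or per directory. The device and mount point in this sketch are hypothetical, and zstd is only one of the supported algorithms (zlib and lzo are the others):

```python
# Two standard ways to enable Btrfs compression; the device and mount point
# are placeholders.  compress=zstd requires a 4.14 or later kernel.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

# 1. Filesystem-wide, at mount time:
run("mount", "-o", "compress=zstd", "/dev/sdb1", "/mnt/build")

# 2. Per directory; newly written files under it inherit the property:
run("btrfs", "property", "set", "/mnt/build/src", "compression", "zstd")
```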
This work, Bacik said, was done by the infrastructure team without any sort of encouragement from Facebook's Btrfs developers; indeed, they didn't even know that it was happening. He was surprised by how well it worked in the end.
Then, there is the case of virtual machines for developers. Facebook has an "unreasonable amount" of engineering staff, Bacik said; each developer has a virtual machine for their work. These machines contain the entire source code for the web site; it is a struggle to get the whole thing to fit into the 800GB allotted for each and still leave room for some actual work to be done. Once again, compression helps to save the day, and works well, though he did admit that there have been some "ENOSPC issues" (problems that result when the filesystem runs out of available space).
Another big part of the Facebook code base is the container system itself, known internally as "tupperware". Containers in this system use Btrfs for the root filesystem, a choice that enables a number of interesting things. The send and receive mechanism can be used both to build the base image and to enable fast, reliable upgrades. When a task is deployed to a container, a snapshot is made of the base image (running a version of CentOS) and the specific container is loaded on top. When that service is done, cleanup is just a matter of deleting the working subvolume and returning to the base snapshot.
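A minimal sketch of that pattern, with invented subvolume names and none of tupperware's real machinery, might look like the following; it assumes the base image arrives as a btrfs send stream and is kept as a read-only subvolume:

```python
# Rough illustration of the container pattern described above: a read-only
# base image is distributed with send/receive, each task runs in a writable
# snapshot of it, and cleanup is just a subvolume deletion.  All paths and
# names are illustrative, not Facebook's actual layout.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

def fetch_base_image(stream_path):
    # Replay a "btrfs send" stream built elsewhere; receive requires the
    # source to have been a read-only snapshot.
    with open(stream_path, "rb") as stream:
        subprocess.run(["btrfs", "receive", "/containers/images"],
                       stdin=stream, check=True)

def start_task(name):
    # Writable snapshot of the read-only base image for this task.
    run("btrfs", "subvolume", "snapshot",
        "/containers/images/centos-base", f"/containers/run/{name}")

def stop_task(name):
    # Cleanup: drop the working subvolume; the base image is untouched.
    run("btrfs", "subvolume", "delete", f"/containers/run/{name}")
```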
Additionally, Btrfs compression once again reduces write I/O, helping Facebook to make the best use of cheap flash drives. Btrfs is also the only Linux filesystem that works with the io.latency and io.cost (formerly io.weight) block I/O controllers. These controllers don't work with ext4 at all, he said, and there are some problems still with XFS; he has not been able to invest the effort to make things work better on those filesystems.
An in-progress project concerns the WhatsApp service. WhatsApp messages are normally stored on the users' devices, but they must be kept centrally when the user is offline. Given the number of users, that's a fair amount of storage. Facebook is using XFS for this task, but has run into unspecified "weird scalability issues". Btrfs compression can help here as well, and snapshots will be useful for cleaning things up.
But Btrfs, too, has run into scalability problems with this workload. Messages are tiny, compressed files; they are small enough that the message text is usually stored with the file's metadata rather than in a separate data extent. That leads to filesystems with hundreds of gigabytes of metadata and high levels of fragmentation. These problems have been addressed, Bacik said, and it's "relatively smooth sailing" now. That said, there are still some issues left to be dealt with, and WhatsApp may not make the switch to Btrfs in the end.
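The inlining behavior in question is governed by the max_inline mount option (files below that threshold, 2048 bytes by default on recent kernels, are stored directly in the metadata B-tree). The following is a hypothetical illustration, with placeholder paths and device, of how one might observe or disable it; it is not a description of WhatsApp's setup:

```python
# Hypothetical illustration of inline extents: files below the max_inline
# threshold end up inside the metadata b-tree rather than in a separate
# data extent.  Paths and the device name are placeholders.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

# Mounting with inlining disabled entirely is one way to trade small-file
# performance against metadata growth:
#   run("mount", "-o", "max_inline=0", "/dev/sdc1", "/mnt/msgs")

# With default settings, a tiny file should be stored inline:
with open("/mnt/msgs/tiny.txt", "w") as f:
    f.write("short message")
run("sync")

# filefrag reports such extents with the "inline" flag:
run("filefrag", "-v", "/mnt/msgs/tiny.txt")
```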
The good, the bad, and the unresolved
Bacik concluded with a summary of what has worked well and what has not. He told the story of tracking down a bug where Btrfs kept reporting checksum errors when working with a specific RAID controller. Experience has led him to assume that such things are Btrfs bugs, but this time it turned out that the RAID controller was writing some random data to the middle of the disk on every reboot. This problem had been happening for years, silently corrupting filesystems; Btrfs flagged it almost immediately. That is when he started to think that, perhaps, it's time to start trusting Btrfs a bit more.
Another unexpected benefit was the help Btrfs has provided in tracking down microarchitectural processor bugs. Btrfs tends to stress the system's CPU more than other filesystems; features like checksumming, compression, and work offloaded to threads tend to keep things busy. Facebook, which builds its own hardware, has run into a few CPU problems that have been exposed by Btrfs; that made it easy to create reproducers to send to CPU vendors in order to get things fixed.
In general, he said, he has spent a lot of time trying to track down systemic problems in the filesystem. Being a filesystem developer, he is naturally conservative; he worries that "the world will burn down" and it will all be his fault. In almost every case, these problems have turned out to have their origin in the hardware or other parts of the system. Hardware, he said, is worse than Btrfs when it comes to quality.
What he was most happy with, though, was perhaps the fact that most Btrfs use cases in the company have been developed naturally by other groups. He has never gone out of his way to tell other teams that they need to use Btrfs, but they have chosen it for its merits anyway.
All is not perfect, however. At the top of his list were bugs that manifest when Btrfs runs out of space, a problem that has plagued Btrfs since the beginning; "it's not Btrfs if there isn't an ENOSPC issue", he said, adding that he has spent most of his career chasing these problems. There are still a few bad cases in need of fixing, but these are rare occurrences at this point. He is relatively happy, finally, with the state of ENOSPC handling.
There have been some scalability problems that have come up, primarily with the WhatsApp workload as described above. These bugs have highlighted some interesting corner cases, he said, but haven't been really difficult to work out. There were also a few "weird, one-off issues" that mostly affected block subsystem maintainer Jens Axboe; "we like Jens, so we fixed it".
At the top of the list of problems still needing resolution is quota groups; their overhead is too high, he said, and things just break at times. He plans to solve these problems within the next year. There are users who would like to see encryption at the subvolume level; that would allow the safe stacking of services with differing privacy requirements. Omar Sandoval is working on that feature now.
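For reference, the quota groups being discussed are managed with the btrfs quota and qgroup commands; this sketch uses invented paths and limits:

```python
# Sketch of basic quota-group usage; the mount point, subvolume, and limit
# are invented for illustration.  Enabling quotas turns on the per-subvolume
# accounting whose overhead was mentioned above.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

run("btrfs", "quota", "enable", "/mnt/containers")

# Cap the space referenced by one subvolume at 10GiB:
run("btrfs", "qgroup", "limit", "10G", "/mnt/containers/run/task1")

# Show per-qgroup referenced and exclusive usage:
run("btrfs", "qgroup", "show", "/mnt/containers")
```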
Then there is the issue of RAID support, another longstanding problem area for Btrfs. Basic striping and mirroring work well, and have done so for years, he said. RAID 5 and 6 have "lots of edge cases", though, and have been famously unreliable. These problems, too, are on his list, but solving them will require "lots of work" over the next year or two.
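For context, the RAID profiles in question are chosen at mkfs time (and can be converted later with btrfs balance). The device names below are placeholders, and the two layouts are alternatives rather than steps to run in sequence:

```python
# Two alternative multi-device layouts; device names are placeholders.
# raid1 mirroring is among the long-stable configurations, while raid5 and
# raid6 are the profiles with the known edge cases mentioned above.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

def make_mirrored_fs():
    # Mirror both data and metadata across two devices.
    run("mkfs.btrfs", "-d", "raid1", "-m", "raid1", "/dev/sdb", "/dev/sdc")

def make_parity_fs():
    # Parity RAID for data with mirrored metadata, a common way to limit
    # exposure to the raid5/6 problems, across four devices.
    run("mkfs.btrfs", "-d", "raid5", "-m", "raid1",
        "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde")
```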
| Index entries for this article | |
| --- | --- |
| Kernel | Btrfs |
| Kernel | Filesystems/Btrfs |
| Conference | Open Source Summit North America/2020 |
Posted Jul 2, 2020 19:02 UTC (Thu)
by Kamilion (subscriber, #42576)
[Link] (1 responses)
Which RAID controller would this be? If it's an LSI, I'd really love to know, and I doubt Broadcom would care about LSI's reputation at this point... Hopefully it's not an adaptec...
Posted Jul 7, 2020 3:37 UTC (Tue)
by TMM (subscriber, #79398)
[Link]
Posted Jul 2, 2020 19:16 UTC (Thu)
by gbalane (subscriber, #130561)
[Link]
Posted Jul 2, 2020 20:59 UTC (Thu)
by cpitrat (subscriber, #116459)
[Link] (14 responses)
You mean each message is a separate file? It sounds like a poor design ...
Posted Jul 2, 2020 21:50 UTC (Thu)
by Tov (subscriber, #61080)
[Link] (10 responses)
Not if you really think about it. A message is a single entity, which should be "filed" individually. However, metadata overhead and poor performance of older file systems have taught us to implement even more layers of indirection to paste over poor file system design...
Posted Jul 3, 2020 6:36 UTC (Fri)
by cpitrat (subscriber, #116459)
[Link] (9 responses)
Posted Jul 3, 2020 10:11 UTC (Fri)
by ibukanov (subscriber, #3942)
[Link]
Posted Jul 3, 2020 12:25 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (7 responses)
Not taking into account the limitations when implementing the design is bad engineering.
I think Linus has commented on various occasions that when he's taken things like processor limitations into account for the basic design, the resulting implementation has been, shall we say, suboptimal.
If on the other hand the design *ignores* the limitations, and then the implementation works round the limitations, the end result is much better (plus, of course, as the tools improve the workarounds can be ripped out. If the workarounds are part of the design, then adapting to better tools is something that rarely gets done).
For example, I think the early processors only had a two-level page table. The best design turned out to be three-level, and when the early code using the two-level hardware was replaced by a three-level design that took advantage of the hardware for two of them, the result was a major improvement.
Cheers,
Wol
Posted Jul 3, 2020 12:44 UTC (Fri)
by pizza (subscriber, #46)
[Link] (3 responses)
I'm not so sure about that.
Because ignoring reality is how we get designs that are predicated on the likes of spherical cows and other things that don't actually exist, and result in overly complex implementations with more exceptions than rules. If it can even be implemented at all.
Posted Jul 3, 2020 13:02 UTC (Fri)
by cpitrat (subscriber, #116459)
[Link] (2 responses)
But in this case, we're talking about deciding how to store the individual messages. It's not even about design, it's a technical decision to store them as individual files (if that's really what is done, that's where I have a doubt).
Posted Jul 3, 2020 14:26 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Jul 4, 2020 12:04 UTC (Sat)
by pizza (subscriber, #46)
[Link]
I'm not talking about "low level details", I'm talking about being aware of what is or isn't possible (or practical) to implement.
Outside of a handful of fields (mostly to do with advanced mathematics) those "designs" still need to be realized in the real world.
Posted Jul 3, 2020 13:18 UTC (Fri)
by corbet (editor, #1)
[Link] (2 responses)
Digging through the code... v1.0 had two-level page tables internally. The third level was added once processors started supporting it. The same is true of the fourth and fifth levels.
Posted Jul 3, 2020 14:23 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (1 responses)
Cheers,
Wol
Posted Jul 3, 2020 16:58 UTC (Fri)
by corbet (editor, #1)
[Link]
The kernel was (and still is) built for a given page-table depth; the compiler then shorts out the code for the levels that aren't actually used.
Posted Aug 21, 2020 16:31 UTC (Fri)
by rep_movsd (guest, #100040)
[Link] (1 responses)
Their message server used this same retarded technique of saving messages as files... on a shared NFS mount.
Version 1.0 had been a huge success in India, but they wanted a snazzy UI like Yahoo.
I won't go into the horrors of such a design idea (which to be fair did have one thing going for it - simplicity) but it's insane to use a filesystem as a database.
I'm sure FB devs had their reasons (maybe legacy), but fundamentally a crap idea. Filesystems are designed to be moderately efficient at various sizes of files. I don't think any FS dev in their wildest nightmares would have considered a use case of zillions of files which are smaller than a floppy disk sector.
Posted Aug 21, 2020 17:18 UTC (Fri)
by anselm (subscriber, #2796)
[Link]
Things are maybe not so bad if – like Facebook – you have actual file system developers on your payroll.
Posted Aug 26, 2020 11:45 UTC (Wed)
by flussence (guest, #85566)
[Link]
Posted Jul 3, 2020 1:03 UTC (Fri)
by champtar (subscriber, #128673)
[Link] (7 responses)
Posted Jul 3, 2020 7:20 UTC (Fri)
by chatcannon (subscriber, #122400)
[Link] (1 responses)
Posted Jul 3, 2020 13:35 UTC (Fri)
by champtar (subscriber, #128673)
[Link]
Posted Jul 3, 2020 13:10 UTC (Fri)
by judas_iscariote (guest, #47386)
[Link] (2 responses)
Posted Jul 3, 2020 16:11 UTC (Fri)
by champtar (subscriber, #128673)
[Link] (1 responses)
Posted Jul 3, 2020 22:52 UTC (Fri)
by scott@deltaex.com (guest, #139665)
[Link]
Posted Jul 3, 2020 22:49 UTC (Fri)
by nivedita76 (subscriber, #121790)
[Link]
Posted Jul 5, 2020 2:10 UTC (Sun)
by josefbacik (subscriber, #90083)
[Link]
Posted Jul 3, 2020 13:20 UTC (Fri)
by Karellen (subscriber, #67644)
[Link] (6 responses)
> These machines contain the entire source code for the web site; it is a struggle to get the whole thing to fit into the 800GB allotted for each and still leave room for some actual work to be done.
800GB‽‽
Am I just getting old, or is that actually somewhat on the large side?
Posted Jul 3, 2020 13:59 UTC (Fri)
by kevincox (subscriber, #93938)
[Link]
None of these things are huge but when you sum them up across a large company 800GB seems like they were really keeping themselves in check on the average.
Posted Jul 3, 2020 16:51 UTC (Fri)
by shemminger (subscriber, #5739)
[Link]
Microsoft built Git Virtual Filesystem to deal with that, and avoid having to do massive downloads and checkouts.
Posted Jul 4, 2020 22:03 UTC (Sat)
by gus3 (guest, #61103)
[Link] (3 responses)
That's big enough for four 800GB containers mapped to virtual space, plus filesystem overhead. The only choke-points would be the USB3 connection and head seek time on the platters. Of course, SSD would eliminate the latter.
It isn't bleeding edge, or even leading edge anymore. It's COTS (commodity, off-the-shelf).
Posted Jul 5, 2020 21:02 UTC (Sun)
by ncm (guest, #165)
[Link] (2 responses)
But as a practical matter, I gather from discussion with core devs (a couple of years ago) that it is part of FB's received engineering culture to copy dependencies, so there are, e.g., hundreds or thousands of copies of libz scattered throughout their repository, in many, many versions, and likewise now, I expect, libzstd. Btrfs would be able to share blocks for them when unpacked, but maybe not so much so as represented within git?
Imagine the engineering effort just to consolidate all those libz instances, identifying version, bug, and local patch dependencies, while every week more appear. Imagine the effort just to bring them all up to the current release: not a thing to make your CV shine.
Posted Jul 7, 2020 12:15 UTC (Tue)
by jezuch (subscriber, #52988)
[Link] (1 responses)
Posted Jul 9, 2020 20:51 UTC (Thu)
by khim (subscriber, #9252)
[Link]
Useful automatic deduplication is work in progress. You can run duperemove in the background and everything would be nicely deduplicated, but performance would be so awful that you just don't want to do that. But if it's a developer's system you can do that while the developer is sleeping - then it works nicely.
Posted Jul 3, 2020 14:29 UTC (Fri)
by scientes (guest, #83068)
[Link] (1 responses)
This sounds like a perfect use-case of kyotocabinet. I have heard reports that it is rock solid with huge loads, and you also put kyototycoon on top for tcp access (including a memcached-compatible mode).
Posted Jul 5, 2020 21:51 UTC (Sun)
by ncm (guest, #165)
[Link]
Posted Jul 5, 2020 19:02 UTC (Sun)
by caliloo (subscriber, #50055)
[Link]
Posted Jul 8, 2020 10:22 UTC (Wed)
by intelfx (subscriber, #130118)
[Link]
I'd be *really* pleased if this was to happen.
Posted Jul 12, 2020 15:35 UTC (Sun)
by anton (subscriber, #25547)
[Link] (1 responses)
> Basic striping and mirroring work well, and have done so for years
If only btrfs was user-friendly when a drive of a mirror pair fails; instead it gives you one shot at dealing with the problem, and on the next reboot becomes irreversibly read-only; that was certainly the state of the FAQ when we installed a new server earlier this year. So we decided to try out ZFS instead, and its handling of a simulated disk failure and replacement was very smooth, hardly perceptible.
Posted Aug 11, 2020 22:17 UTC (Tue)
by redneb (guest, #140746)
[Link]
This is not true anymore, it was a bug that was fixed in 4.14 (11/2017). There are two commits about this and the pull request says: "degraded read-write mount is allowed if all the raid profile constraints are met, now based on more accurate check".