Btrfs and high-speed devices
At LinuxCon North America in Toronto, Chris Mason relayed some of the experiences that his employer, Facebook, has had using Btrfs, especially with regard to its performance on high-speed solid-state storage devices (SSDs). While Mason was the primary developer early on in the history of Btrfs, he is one of a few maintainers of the filesystem now, and the project has seen contributions from around 70 developers throughout the Linux community in the last year.
![Chris Mason](https://static.lwn.net/images/2016/lcna-mason-sm.jpg)
He is on the kernel team at Facebook; one of the main reasons the company wanted to hire him was that it wanted to use Btrfs in production. Being able to use Btrfs in that kind of environment is also the primary reason he chose to take the job, he said. As the company is rolling Btrfs out, it is figuring out which features it wants to use and finding the things that work well and those that do not.
Mason went through the usual list of high-level Btrfs features, including efficient writable snapshots, internal RAID with restriping, online device management, online scrubbing that checks in the background whether the CRCs still match the data as it was originally written, and so on. The CRCs, which cover both data and metadata, are a feature that "saved us a lot of pain" at Facebook, he said.
The Btrfs CRC checking means that a read from a corrupted sector will cause an I/O error rather than return garbage. Facebook had some storage devices that would appear to store data correctly in a set of logical block addresses (LBAs) until the next reboot, at which point reads to those blocks would return GUID partition table (GPT) data instead. He did not name the device maker because it turned out to actually be a BIOS problem. In any case, the CRCs allowed the Facebook team to quickly figure out that the problem was not in Btrfs when it affected thousands of machines as they were rebooted for a kernel upgrade.
Volume management in Btrfs is done in terms of "chunks", which are normally 1GB in size. That is part of what allows the filesystem to handle differently sized devices for RAID volumes, for example. Volumes can have specific chunks reserved for data or metadata and different RAID levels can be applied to each (e.g. RAID-1 for the metadata and RAID-5 for the data).
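In command-line terms, that kind of split looks something like the following sketch; the device names and mount point are only examples, not anything from the talk:

```
# Three-device filesystem with RAID-1 metadata and RAID-5 data
# (device names and mount point are placeholders).
mkfs.btrfs -m raid1 -d raid5 /dev/sdb /dev/sdc /dev/sdd
mount /dev/sdb /mnt/pool

# Show how data, metadata, and system chunks are allocated.
btrfs filesystem df /mnt/pool

# Restripe an existing filesystem to the same layout, online.
btrfs balance start -mconvert=raid1 -dconvert=raid5 /mnt/pool
```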
But Btrfs has had some lock-contention problems; it still has some of them, he said, though there are improvements coming. The filesystem is optimized for use on SSDs, but he ran an fs_mark benchmark in a virtual machine (for comparative rather than hard numbers) creating zero-length files and found that XFS could create roughly four times the number of files per second (33,000 versus 9,000). That was "not good", but before he started tuning Btrfs, he wanted to make XFS go as fast as he could.
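The exact fs_mark invocation was not given in the talk; a zero-length-file run of this general shape illustrates the idea, with the path and counts being guesses:

```
# Many threads creating zero-length files as fast as possible; the exact
# parameters Mason used were not given, so these are illustrative only.
fs_mark -d /mnt/test -n 100000 -s 0 -t 16 -S 0
```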
To that end, he looked at what XFS was blocked on, which turned out to be locks for allocating filesystem objects. By increasing the allocation groups in the filesystem when it was created (from four to sixteen to match the number of CPUs in his test system), he could increase its performance to 200,000 file-creations per second. At that point, it was mostly CPU bound and the function using the most CPU was one that could not be easily tweaked away with a mkfs option.
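The allocation-group count is set at mkfs time; something along these lines matches the tuning he described (the device name is a placeholder):

```
# Sixteen allocation groups to match the number of CPUs, spreading
# the allocation locking across more independent groups.
mkfs.xfs -d agcount=16 /dev/sdX
```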
So then he turned to Btrfs. Using perf, he was able to see that there was lock contention on the B-tree locks. The Btrfs B-tree stores all of its data in the leaves of the tree; when it is updating the tree, it has to lock non-leaf nodes on the way to the leaf, starting with the root node. For some operations, those locks have to be held as it traverses the tree. Hopefully only the leaf needs to be locked, but sometimes that is not the case and, since everything starts at the root, it is not surprising that there is contention for that lock.
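The talk did not show the specific perf commands; a generic system-wide profile like the one below is one way to surface that sort of contention, with heavy CPU time under the B-tree locking paths being the tell-tale sign:

```
# Profile the whole system while the benchmark runs, then look at
# where the CPU time is going.
perf record -a -g -- sleep 30
perf report
```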
As an experiment to make Btrfs go faster, he used the subvolume feature to effectively create more root nodes. Instead of the usual one volume (with one root node), he created sixteen subvolumes so that there was one per CPU, each with its own root node and lock. That allowed Btrfs to get close to the XFS performance at 175,000 file-creations per second.
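A sketch of that experiment, assuming a Btrfs filesystem mounted at a hypothetical /mnt/test and a sixteen-CPU machine, with each benchmark thread then pointed at its own subvolume:

```
# One subvolume (and thus one B-tree root and root lock) per CPU.
for i in $(seq 0 15); do
    btrfs subvolume create /mnt/test/sub$i
done
```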
But the goal was to make the filesystem faster without resorting to subvolumes, which led to a new B-tree locking scheme. By default, Btrfs has 16KB nodes, which is not changing, but instead of being treated as a single group, each node will now be broken up into sixteen groups, each with its own lock.
He has not yet picked the best number of groups for each node, but the change allows a default Btrfs filesystem to create 90,000 files per second. There are a lot of assumptions in Btrfs that there is only one lock per node, which he is working on removing. In addition, Btrfs switched to reader/writer locks a while back and it turns out that those perform worse than expected, so he will be looking into that.
By some other measures, though, Btrfs compares favorably with XFS on the benchmark. XFS writes 120MB/second and does 3000 I/O operations/second (IOPS) for the benchmark, while Btrfs does 50MB/second and 300 IOPS to accomplish the same amount of work. That means that Btrfs is ordering things better and doing less I/O, Mason said.
The Gluster workloads at Facebook, which use rotational storage, are extremely sensitive to metadata latency to the point where one node's high latency can make the entire cluster slower than it should be. In the past, the company has used flashcache (which is similar to bcache) for both XFS and Btrfs to cache some data and metadata on SSDs, which improves the metadata latencies, but not enough.
To combat that, he has a set of patches to automatically put the Btrfs metadata on SSDs. The block layer provides information on whether the storage is rotational; for now, his patch assumes that if it is not rotational then it is fast. The patch has made a huge difference in the latencies and requires less flash storage (e.g. 450GB for a 40TB filesystem) for Facebook's file workload, which consists of a wide variety of file sizes. "You will need a lot more metadata if you have all 2KB files", he said.
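The rotational hint that the patches key off of is already visible from user space; the device names below are just examples:

```
# 1 means the block layer considers the device rotational (a disk),
# 0 means non-rotational (flash).
cat /sys/block/sda/queue/rotational
cat /sys/block/nvme0n1/queue/rotational
```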
That patch set is small (73 lines of code added), which is nice, he said. It is not entirely complete, though, as btrfs-progs needs changes to support it, but that should be a similarly sized change.
Another bottleneck he has encountered is in using the trim (or discard) command to tell SSDs about blocks that are no longer in use by the filesystem. That allows the flash translation layer to ignore those blocks when it is doing garbage collection and should, in theory, provide better performance. But many devices are slow when handling trim commands. Both XFS and Btrfs keep lists of blocks to trim, submit them as trim commands, and then must wait for those commands to complete during transaction commits, which stalls new filesystem operations. Those stalls can be huge, on the order of "tens of seconds", he said.
Ric Wheeler spoke up to say that trim is simply a request that the drive is free to ignore. He suggested that trim should not be performed during regular filesystem operations. Ted Ts'o agreed and said that the best practice for ext4 and probably other filesystems was to run the fstrim batch-trimming command regularly out of cron.
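A minimal sketch of that practice is a weekly entry in /etc/crontab; the schedule and path here are arbitrary choices:

```
# Batch discard of all mounted filesystems that support it, run weekly,
# instead of trimming during normal filesystem operation.
0 3 * * 0  root  /sbin/fstrim -av
```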
In answer to a question, Mason said that the disadvantages of not trimming are device-dependent. In some cases, it may reduce the lifetime of the device or add latencies during garbage collection, but it may also do nothing. Wheeler pointed out that if you are using thin provisioning, though, failing to trim could cause the storage to run out of space when there is actually space available.
Though it is not a flash-specific change, there have been some problems with large (> 16TB) Btrfs filesystems because of the free-space cache. Originally, free extents were not tracked at all, which required scanning the entire filesystem at mount time and was slow. When the free-space cache was added, it was kept per block group, and large filesystems have a lot of block groups, which meant more cache had to be written out on each commit. In the 4.5 kernel, Omar Sandoval added a new free-space cache (which can be enabled with -o space_cache=v2) that is "dramatically faster", with commit latencies dropping from four seconds to zero.
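Turning it on is a mount option; the first mount with the option converts the cache, and the device and mount point below are placeholders:

```
# Enable the new free-space tree on an existing filesystem.
mount -o space_cache=v2 /dev/sdX /mnt/bigfs
```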
For the near future, he plans to finalize the new B-tree locking and improve some fsync() bottlenecks, though he thinks that the new space cache will help there. There are also some other spinlocks slowing things down that he wants to look at.
He mentioned a few of the tools that he uses to find bottlenecks. Perf is the right tool when processing is "pegged in the CPU", but finding problems when things are blocking is much harder. For that, he recommended BPF and BCC. In particular, Brendan Gregg's offcputime BPF script is useful to show both kernel and application stack traces to help show the reasons why a process is blocked. In fact, Facebook likes offcputime so much that fellow Btrfs maintainer Josef Bacik has created a way to aggregate the output of the program across multiple systems.
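As an illustration (the install path for the BCC tools varies by distribution), a typical offcputime run looks like this:

```
# 30 seconds of off-CPU stack traces; -K limits output to kernel stacks,
# -p narrows it to a single process (the PID shown is a placeholder).
/usr/share/bcc/tools/offcputime -K 30
/usr/share/bcc/tools/offcputime -p 1234 30
```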
There were a few questions at the end of the session. One person asked whether Mason had seen any uptake of Btrfs for smaller devices. Mason said that the filesystem "needs love and care" when it is being used, which is why Facebook can use it. Someone with an ARM background would need to be working on Btrfs upstream in order to provide that kind of care if it were to be adopted on ARM-powered devices, he said.
Another asked how much faster the current design of Btrfs could go. Mason seemed quite optimistic that it could go "much faster". The metadata format is flexible, so "if things are broken, we can fix them".
The last two questions concerned two different benchmarks, both of which would be interesting, but neither of which Mason has run. Flashcache versus bcache would likely produce similar numbers, he thought, but flashcache worked for Facebook so there was no need to try bcache. He also has not run benchmarks against ZFS; when he started Btrfs, ZFS was not available on Linux. There is no reason not to do so now, he said, but he hasn't, though he would be interested in the results.
[I would like to thank the Linux Foundation for travel assistance to
Toronto for LinuxCon North America.]
| Index entries for this article | |
|---|---|
| Kernel | Btrfs | 
| Kernel | Filesystems/Btrfs | 
| Conference | LinuxCon North America/2016 | 
Insane number of files created per second

Posted Aug 25, 2016 2:47 UTC (Thu) by pr1268 (guest, #24648) [Link] (10 responses)

Is it just me, or does the creation of e.g. 175,000 or 200,000 files per second on a single computer (albeit one with multiple cores) on (presumably) a single filesystem seem just the slightest bit outrageous? Granted, this is Facebook, and their data storage requirements are colossal (to say the least), but I would just imagine a company, even of FB's size, distributing massive file creation of that order across at least, say, 10 or 20 computers. ;-)

P.S. A huge thanks to Mr. Mason for his contributions to BTRFS and Linux in general.
Insane number of files created per second

Posted Aug 25, 2016 5:21 UTC (Thu) by dgc (subscriber, #6611) [Link]

> on a single computer (albeit one with multiple cores) on (presumably) a single
> filesystem seem just the slightest bit outrageous?

On a desktop computer it's overkill. For high performance workloads on storage that can do millions of IOPS, it is considered "barely sufficient".

-Dave.
Insane number of files created per second

Posted Aug 25, 2016 7:51 UTC (Thu) by farnz (subscriber, #17727) [Link]

To put that sort of number into context, it's on the order of 1 file per second for each person in 0.02% of Facebook's daily active user base. Spread that userbase across 5,000 machines, and it's still only one file per second per user per machine - and that's assuming it's the sort of data that doesn't benefit from locality of access, so can be spread sensibly.

Gives you a sense of how badly intuition can break down at unusual scales...
Insane number of files created per second

Posted Aug 25, 2016 18:00 UTC (Thu) by josefbacik (subscriber, #90083) [Link] (2 responses)
Insane number of files created per second

Posted Aug 27, 2016 12:07 UTC (Sat) by walex (guest, #69836) [Link] (1 responses)

«we don't personally _need_ 200k files/sec, the workload quickly shows us where we have pain points that would cause us problems with real world workloads.»

That is about the absolute speed of metadata operations, and that's not the real issue. The real issue being described here is that metadata operations don't scale with hardware capacity, regardless of the absolute speed desired; that is, the real issue is about design.

The real issue exists because it is relatively easy to have scalable data speeds: just choose a domain that is "easily parallelizable" and throw more disks, more RAM, and more threads at it. For data, RAID scales up speed pretty nicely. By contrast, metadata operations are not easily parallelizable, because there are dependencies across metadata, both structural dependencies and ordering dependencies, and therefore fairly fine-grained locking must be used (ordering) and RAID does not work as well (structural).

The biggest problem with hard-to-parallelize metadata is not even file creation rates, it is whole-tree scans, like fsck or rsync scans. I have seen a lot of cases where some "clever" person designed a storage subsystem for average data workloads, and those turned into catastrophes during peak metadata workloads, which must happen quite periodically, one way or another.
Insane number of files created per second

Posted Aug 29, 2016 17:19 UTC (Mon) by SEJeff (guest, #51588) [Link]
Insane number of files created per second

Posted Aug 25, 2016 18:42 UTC (Thu) by ott (guest, #110845) [Link] (3 responses)
Insane number of files created per second

Posted Aug 25, 2016 23:00 UTC (Thu) by gerdesj (subscriber, #5446) [Link] (2 responses)

I've just checked out LibreOffice a few times and copied it around a bit. As a long-time Gentoo user on a fair few systems, I'm quite familiar with how some pretty large software projects behave from source to binary.

LO is "only" about 82,000 files at 2.2GB. On my laptop with a reasonably modern Core i7 quad core + HT, with 16GB RAM and 1 x SSD + 1 x spinning disc, it takes a fair old while to compile and needs rather a lot of space. On previous laptops it used to be an overnight thing. Checkout times are the least of my worries.

Back to your assertion: even if you are checking out over a 10GB/s connection, I doubt that the fs is holding you back. What kind of projects involve millions of files? Also, what sort of repo are you using?

Cheers

Jon
Insane number of files created per second

Posted Aug 26, 2016 3:25 UTC (Fri) by ott (guest, #110845) [Link]

It was just to say, it's not "outrageous" to have a directory with millions of files, which could be checked out all at once. The FS performance definitely plays a major role there.
Insane number of files created per second

Posted Aug 26, 2016 15:55 UTC (Fri) by cwillu (guest, #67268) [Link]

Grepping a tree for the first time today? Atime updates. "make clean" in a large repository? Combinations of mtime and deletions. "apt-get upgrade" with a bunch of pending updates? Oh, you better believe there's a shittonne (SI technical unit) of file creations, fsyncs, mv's and other metadata updates.

It ends up being a major but sometimes hidden determinant of how fast you can get shit done.
Insane number of files created per second

Posted Sep 7, 2016 21:44 UTC (Wed) by Pc5Y9sbv (guest, #41328) [Link]

A single microscope slide image might have 100k to 200k tiles in it, totalling a few hundred GB of space. We often want to unpack and host each tile as an individual JPEG file on a static HTTP file server, where a client-side pan/zoom viewer can retrieve just the tiles it needs as a user navigates the viewport. If we are transcoding the tiles, we may be CPU limited, but if we are simply extracting them without changing the codec format, we are limited by the metadata rates of the filesystem.

Conversely, time-series imagery might be produced as a sequence of image frames from data-acquisition tools and later multiplexed and/or re-compressed into a movie container format. An hour at 60 fps is 216k frames. However, scientists may want to apply other batch-processing steps to each image frame before converting it to a movie file for archiving or distribution. These jobs could run much faster than real time, and the metadata rates can become the bottleneck. Such processing is often too exploratory or ad hoc to justify a custom, tuned implementation where you would get your hands on libraries of all needed algorithms, plan your buffer pipeline, and avoid bouncing data through external commands with file I/O.
Btrfs and high-speed devices

Posted Aug 26, 2016 16:07 UTC (Fri) by bob.joe (guest, #110687) [Link]