XFS: the filesystem of the future?
Posted Jan 22, 2012 15:31 UTC (Sun) by dgc (subscriber, #6611)
There are several issues that I can see. The big one is that BTRFS metadata trees grow very large, and as they grow larger they get slower because it takes more IO to get to any given piece of metadata in the filesystem. When you have a metadata tree that contains 150GB of metadata (what the 8-thread benchmarks I was running ended up with), finding things can take some time and burn a lot of IO and CPU.
This shows up with workloads like the directory traversal - btrfs has a lot more dependent reads to get the directory data from the tree than XFS or ext4 and so is significantly slower at such operations. Whether that can be fixed or not is an open question. Rebalancing (expensive) and larger btree block sizes (mkfs option) are probably ways of reducing the impact of this problem, but metadata tree growth can't actually be avoided.
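A rough way to see why tree size and block size both matter: every level of a btree is one more dependent read on a cold lookup, and the number of levels depends on how many items the tree holds and how many fit in each block. The sketch below is only an illustration; the item count and fanout values are made-up assumptions, not measured Btrfs figures.

```python
def btree_height(n_items: int, fanout: int) -> int:
    """Return the smallest number of levels h such that fanout**h >= n_items.

    Each level costs one dependent read when nothing is cached, so the
    height is a crude proxy for the IO needed to reach one item.
    """
    if n_items < 1 or fanout < 2:
        raise ValueError("need n_items >= 1 and fanout >= 2")
    height = 1
    capacity = fanout
    while capacity < n_items:
        capacity *= fanout
        height += 1
    return height


# Hypothetical numbers: one billion metadata items, and two block sizes
# that hold roughly 100 and 400 items per block respectively.
small_blocks = btree_height(10**9, 100)
large_blocks = btree_height(10**9, 400)
print(small_blocks, large_blocks)
```

With these assumed numbers the larger blocks save one level per lookup, which is the kind of gain a larger btree block size can offer; it does not stop the tree from growing.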
Another problem is that as the BTRFS filesystem ages, it becomes fragmented due to all the COW that is done. Sequential read IO performance will degrade over time as the data gets moved around more widely. Indeed, as the filesystem fills from the bottom up, the distance between where the file data was first written and where the next COW block is written will increase. Hence on spinning rust, seek times will also increase when reading, because the physical distance between sequential data increases as the filesystem ages. Automatic defrag is the usual way to fix this, but that can be expensive if it occurs at the wrong time...
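The aging effect can be sketched with a toy model. It assumes a file that starts as one contiguous extent, random single-block overwrites, and a COW allocator that always takes the next free block at the end of the used space. Real allocators are far more sophisticated, so treat this as an illustration of the mechanism only.

```python
import random


def count_extents(physical):
    """Count runs of physically contiguous blocks, walking in logical order."""
    if not physical:
        return 0
    breaks = sum(1 for a, b in zip(physical, physical[1:]) if b != a + 1)
    return 1 + breaks


def simulate(n_blocks, n_overwrites, cow, seed=0):
    """Overwrite random blocks of an initially contiguous file.

    With cow=True each overwrite is relocated to the next free physical
    block; with cow=False the block is rewritten in place, so the
    logical-to-physical mapping never changes.
    """
    rng = random.Random(seed)
    physical = list(range(n_blocks))  # logical block i lives at physical i
    next_free = n_blocks              # toy allocator: append-only
    for _ in range(n_overwrites):
        logical = rng.randrange(n_blocks)
        if cow:
            physical[logical] = next_free
            next_free += 1
    return count_extents(physical)


print("in place:", simulate(1000, 200, cow=False))
print("COW:     ", simulate(1000, 200, cow=True))
```

The in-place case stays at a single extent no matter how many overwrites happen, while the COW case splits into more extents as overwrites accumulate, and the relocated blocks land progressively further from the original allocation.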
Then there is the amount of IO that BTRFS does - for a COW filesystem that is supposed to be able to do sequential write IOs, it does an awful lot of small writes and a lot of seeks. Indeed, the limiting performance in all my testing was that BTRFS rapidly got IOPS bound at about 6000 IOPS - sometimes even on single threaded workloads. Part of that is the RAID1 metadata, but even when I turned that off it still drove the disk way harder than XFS and was IOPS bound more than half the time. I'm sure this is fixable to some extent, but I'd suggest there's lots of work to be done here because it ties into the transaction reservation subsystem and how it drives writeback.
[ As an aside, that was one of the big changes I talked about for XFS - making metadata writeback scale. In most cases for XFS, that is driven by the transaction reservation subsystem, just as it is in BTRFS. It's not a simple problem to solve :/ ]
The last thing I'll mention briefly, because I've already said some stuff about it, is the scalability of the data transformation algorithms in BTRFS. There is already considerable effort going into reducing the overhead of transformations, but the problem may not be solvable for everyone - you can only make compression/CRCs/etc so fast and use only so much memory.
I could keep going, but this will give you an idea of some of the problems that are apparent from the scalability testing I was doing....
Posted Jan 22, 2012 20:53 UTC (Sun) by DiegoCG (subscriber, #9198)
I understand the concerns about fragmentation of data due to COW - my workstation runs on top of Btrfs and some files are so fragmented that they don't seem like files anymore (.mozilla/firefox/loderdap.default/urlclassifier3.sqlite: 1145 extents found).
But COW data fragmentation isn't just the reverse of a coin - I guess you could say that non-COW filesystems such as XFS also suffer "write fragmentation" (although I don't know how much of a real problem it is). From this point of view, using COW or not for data may be mostly a matter of policy. And since Btrfs can disable data COW not just for the entire filesystem, but for individual files/directories/subvolumes, it doesn't really seem a real problem - "if it hurts, don't do it". And the same applies for data checksums and the rest of data transformations.
As for the issue of making metadata writeback scalable, that probably was the most interesting part of your talk. I imagined it as a particular case of softupdates.
Posted Jan 22, 2012 23:49 UTC (Sun) by dgc (subscriber, #6611)
It might be butter, but it is not magic. :)
> But COW data fragmentation isn't just the reverse of a coin - I guess you
> could say that non-COW filesystems such as XFS also suffer "write
> fragmentation" (although I don't know how much of a real problem it is).
What I described is more of an "overwrite fragmentation" problem which non-COW filesystems do not suffer from at all for data or metadata. They just overwrite in place so if the initial allocation is contiguous, it remains that way for the life of the file/metadata. Hence you don't get the same age based fragmentation and the related metadata explosion problems on non-COW filesystems.
> From this point of view, using COW or not for data may be mostly a matter
> of policy. And since Btrfs can disable data COW not just for the entire
> filesystem, but for individual files/directories/subvolumes, it doesn't
> really seem a real problem - "if it hurts, don't do it". And the same
> applies for data checksums and the rest of data transformations.
Sure, you can use nodatacow on BTRFS, but then you are overwriting in place and BTRFS cannot do snapshots or any data transforms (even CRCs, IIRC) on such files. IOWs, you have a file that behaves exactly like it is on a traditional filesystem and you have none of the features or protections that made you want to use BTRFS in the first place. IOWs, you may as well use XFS to store nodatacow files because it will be faster and scale better. :P
Posted Jan 23, 2012 14:38 UTC (Mon) by masoncl (subscriber, #47138)
I only partially agree on the crcs. The Intel crc32c optimizations do make it possible for a reasonably large server to scale to really fast storage. But the part where we hand IO off to threads introduces enough latency to be noticeable in some benchmarks on fast SSDs.
Also, since we have to store the crc for each 4KB block, we do end up tracking much more metadata on the file with crcs on (this is a much bigger factor than the computation time).
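To put a rough number on that metadata cost, here is a back-of-the-envelope sketch. The 4KB block size comes from the comment; the 4 bytes per checksum is an assumption (the raw width of a crc32c value), and any per-item or tree overhead on top of that is deliberately ignored.

```python
BLOCK_SIZE = 4 * 1024  # 4KB data blocks, as stated in the comment
CSUM_SIZE = 4          # assumption: raw 32-bit crc32c, no item/tree overhead


def checksum_bytes(file_size: int) -> int:
    """Bytes of raw checksum data needed to cover file_size bytes of data."""
    blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    return blocks * CSUM_SIZE


one_tib = 2**40
print(checksum_bytes(one_tib) / 2**30, "GiB of checksums per TiB of data")
```

Under these assumptions the checksums alone come to about 0.1% of the data size, so a multi-terabyte file carries gigabytes of extra metadata that has to be indexed, read, and written back.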
With all of that said, there's no reason Btrfs with crcs off can't be as fast as XFS for huge files on huge arrays. Today though, XFS has decades of practice and infrastructure in those workloads.
Posted Jan 22, 2012 20:45 UTC (Sun) by kleptog (subscriber, #1183)
Databases care about reliability and speed. One thing they do is preallocate space for journals to ensure it exists after a crash. Data is updated in place. On a COW filesystem these otherwise efficient methods turn your nice sequentially allocated tables into enormously fragmented files. Of course, SSDs will make fragmentation moot, but the $/GB for spinning disks is still a lot lower.
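The preallocate-then-update-in-place pattern looks roughly like the sketch below. It assumes Linux and uses Python's `os.posix_fallocate` and `os.pwrite`; the function names `preallocate_journal` and `overwrite_record` are hypothetical, and real databases do considerably more (alignment, direct IO, record framing).

```python
import os


def preallocate_journal(path: str, size: int) -> None:
    """Reserve `size` bytes for the journal up front so that later writes
    inside that range cannot fail for lack of space."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
        os.fsync(fd)
    finally:
        os.close(fd)


def overwrite_record(path: str, offset: int, payload: bytes) -> None:
    """Rewrite part of the preallocated file at a fixed offset."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, payload, offset)
        os.fsync(fd)
    finally:
        os.close(fd)
```

On an overwrite-in-place filesystem the second function keeps reusing the blocks reserved by the first. On a COW filesystem a later overwrite may be written somewhere else, which is the source of the fragmentation described above.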
I guess this is because databases and filesystems are trying to solve some of the same problems. A few years ago I actually expected filesystems to export transaction-like features to userspace programs (apparently NTFS does, but that's no good on Linux), but I see no movement on that front.
For example, the whole issue of whether to flush files on rename becomes moot if the program can simply make clear that this is supposed to be an atomic update of the file. This would give the filesystem the necessary information to know that it can defer the writes to the new file, just as long as the rename comes after. Right now there's no way to indicate that.
If you have transactions you don't need to rename at all, just start a transaction, rewrite the file and commit. Much simpler.
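For comparison, this is roughly what the rename-based atomic update looks like today, with the application forcing the flush itself because it has no way to express the ordering to the filesystem. It is a minimal sketch assuming a POSIX system; `atomic_write` is a hypothetical helper name, not an existing API.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new
    contents, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be in the same directory (same filesystem),
    # otherwise the rename would not be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data must be on disk before the rename
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Flush the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The `fsync` before the rename is the flush the comment refers to: if the application could declare "this is an atomic update", the filesystem would be free to defer the data writes as long as they reach disk before the rename does.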
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds