> I didn't know that Btrfs needed more IO for metadata...
It might be butter, but it is not magic. :)
> But COW data fragmentation isn't just the reverse of a coin - I guess you
> could say that non-COW filesystems such as XFS also suffer "write
> fragmentation" (although I don't know how much of a real problem it is).
What I described is more of an "overwrite fragmentation" problem, which non-COW filesystems do not suffer from at all for data or metadata. They just overwrite in place, so if the initial allocation is contiguous, it remains that way for the life of the file/metadata. Hence you don't get the same age-based fragmentation and the related metadata explosion problems on non-COW filesystems.
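A toy model can make the difference concrete. The sketch below is only an illustration under simplifying assumptions: a file starts as one contiguous run of blocks, overwrites hit random single blocks, and the hypothetical COW allocator always hands out the next free block at the end of the device. It is not how Btrfs or XFS actually allocate space; the names `count_extents` and `simulate` are made up for this example.

```python
import random


def count_extents(block_map):
    """Count runs of physically contiguous blocks in a logical->physical map."""
    if not block_map:
        return 0
    extents = 1
    for prev, cur in zip(block_map, block_map[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def simulate(num_blocks, overwrites, cow, seed=0):
    """Overwrite random blocks of an initially contiguous file.

    cow=False: overwrite in place, so the mapping never changes.
    cow=True:  each overwrite is redirected to a newly allocated block
               (assumed allocator: next free block past everything used).
    Returns the number of extents needed to describe the file afterwards.
    """
    rng = random.Random(seed)
    block_map = list(range(num_blocks))  # logical block i lives at physical i
    next_free = num_blocks               # toy allocator state
    for _ in range(overwrites):
        logical = rng.randrange(num_blocks)
        if cow:
            block_map[logical] = next_free
            next_free += 1
        # in place: the data lands on the same physical block, nothing to update
    return count_extents(block_map)


if __name__ == "__main__":
    print("in-place extents:", simulate(1024, 500, cow=False))
    print("COW extents:     ", simulate(1024, 500, cow=True))
```

In this model the in-place file stays at a single extent no matter how many overwrites happen, while the COW file's extent count grows as it ages, and every extra extent is extra metadata to track.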
> From this point of view, using COW or not for data may be mostly a matter
> of policy. And since Btrfs can disable data COW not just for the entire
> filesystem, but for individual files/directories/subvolumes, it doesn't
> really seem a real problem - "if it hurts, don't do it". And the same
> applies for data checksums and the rest of data transformations.
Sure, you can use nodatacow on BTRFS, but then you are overwriting in place and BTRFS cannot do snapshots or any data transforms (even CRCs, IIRC) on such files. IOW, you have a file that behaves exactly as if it were on a traditional filesystem, and you have none of the features or protections that made you want to use BTRFS in the first place. You may as well use XFS to store nodatacow files, because it will be faster and scale better. :P
Posted Jan 23, 2012 14:38 UTC (Mon) by masoncl (subscriber, #47138)
Most of the metadata reads come from tracking the extents. All the backrefs we store for each extent get expensive in these workloads. We're definitely reducing this, starting with just using bigger btree blocks. Longer term, if that doesn't resolve things, we'll make an optimized extent record just for the metadata blocks.
I only partially agree on the crcs. The Intel crc32c optimizations do make it possible for a reasonably large server to scale to really fast storage. But the part where we hand IO off to threads introduces enough latency to be noticeable in some benchmarks on fast SSDs.
Also, since we have to store the crc for each 4KB block, we do end up tracking much more metadata on the file with crcs on (this is a much bigger factor than the computation time).
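A back-of-the-envelope calculation shows how the per-block crcs add up. This sketch assumes only what is stated above (one crc per 4KB block) plus the fact that crc32c is a 32-bit value, so 4 bytes each. It counts the raw checksum payload only and ignores btree item headers and node overhead, so treat the result as a lower bound. The function names are made up for illustration.

```python
BLOCK_SIZE = 4096   # one crc per 4KB block, as described above
CRC32C_BYTES = 4    # crc32c is a 32-bit checksum


def checksum_count(file_size):
    """Number of per-block checksums needed for a file of file_size bytes."""
    return -(-file_size // BLOCK_SIZE)  # ceiling division


def checksum_bytes(file_size):
    """Raw checksum payload in bytes (excludes any btree overhead)."""
    return checksum_count(file_size) * CRC32C_BYTES


if __name__ == "__main__":
    one_tib = 1 << 40
    print("checksums for a 1 TiB file:", checksum_count(one_tib))
    print("raw checksum bytes:        ", checksum_bytes(one_tib))
```

For a 1 TiB file that works out to 2^28 checksums and 1 GiB of raw checksum data, all of which has to be read, written, and cached as metadata alongside the file data.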
With all of that said, there's no reason Btrfs with crcs off can't be as fast as XFS for huge files on huge arrays. Today, though, XFS has decades of practice and infrastructure in those workloads.