
The Btrfs inode-number epic (part 2: solutions)


Posted Aug 26, 2021 5:47 UTC (Thu) by dgc (subscriber, #6611)
In reply to: The Btrfs inode-number epic (part 2: solutions) by wazoox
Parent article: The Btrfs inode-number epic (part 2: solutions)

Yup. I stand by that statement. I've run tests, measured numbers and observed behaviours. I also have a fair understanding of the differences in architectures and implementations. This thread is probably enlightening about the architectural deficiencies within btrfs, which would appear to be unfixable:

https://lore.kernel.org/linux-btrfs/20210121222051.GB4626...

From a filesystem design perspective, COW metadata creates really nasty write amplification and memory footprint problems for pure metadata updates such as updating object reference counts during a snapshot. First the tree has to be stabilised (run all pending metadata COW and write it back), then a reference count update has to be run, which then COWs every metadata block in the currently referenced metadata root tree. The metadata amplification is such that with enough previous snapshots, a new snapshot with just a few tens of MB of changed user data can amplify into tens of GB of internal metadata COW.....

That explained why user data write COW performance on btrfs degraded quickly as the snapshot count increased in this specific stress workload (1000 snapshots with 10,000 random 4kB overwrites per snapshot, so 40GB of total user data written).
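
For reference, a minimal sketch of that kind of stress workload, driving fio and the snapshot command from Python -- the mount point, file sizes and fio options here are illustrative assumptions, not the exact configuration used for the numbers above:

    # Sketch of the snapshot stress workload: per iteration, 10,000 random
    # 4kB overwrites into a test file, then a snapshot of the subvolume
    # holding it. 1000 iterations x 10,000 writes x 4kB ~= 40GB of user data.
    # Paths, fio options and the snapshot command are assumptions; the
    # snapshot step would differ for dm-snapshot, bcachefs or XFS+reflink.
    import subprocess

    MNT = "/mnt/test"            # assumed mount point of the fs under test
    ITERATIONS = 1000            # number of snapshots
    WRITES_PER_ITER = 10_000     # random 4kB overwrites per snapshot

    for i in range(ITERATIONS):
        subprocess.run([
            "fio", "--name=overwrite",
            f"--filename={MNT}/subvol/testfile",
            "--rw=randwrite", "--bs=4k",
            f"--io_size={WRITES_PER_ITER * 4}k",  # 10,000 * 4kB per iteration
            "--size=1g",                          # assumed working-set size
            "--ioengine=psync",
        ], check=True)
        subprocess.run(["btrfs", "subvolume", "snapshot",
                        f"{MNT}/subvol", f"{MNT}/snap-{i}"], check=True)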

In comparison, dm-snapshot performance on this workload is deterministic and constant as snapshot count increases, same as bcachefs. Bcachefs performed small COW 5x faster than dm-snapshot (largely because of dm-snapshot's write amplification from its 64kB minimum COW block size). At 1 snapshot, btrfs COW is about 80% the speed of bcachefs. At 10 snapshots, bcachefs and dm-snapshot performance is unchanged and btrfs has degraded to about the same speed as dm-snapshot. At 100 snapshots, btrfs is bouncing between 1-5% of the sustained user data write speed of bcachefs, and less than a quarter of the speed of dm-snapshot, and it doesn't regain any of the original performance as the snapshot count increases further.

That can be seen in workload runtimes - it ran in 20 minutes on bcachefs with each snapshot taking less than 30ms. It ran in about 40 minutes on dm-snapshot, with each snapshot taking less than 20ms. It took 5 hours for XFS+loopback+reflink to run (basically the XFS subvol architecture as a 10 line shell hack) because reflink on an image file with 2 million extents takes ~15s. It took about 9 hours for btrfs to run - a combination of slow user IO (sometimes getting only *200* 4kB write IOPS from fio for minutes at a time) and the time to run the btrfs snapshot command increasing linearly with snapshot count, taking ~70s to run by the 1000th snapshot.
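
The XFS+loopback+reflink "subvol architecture as a 10 line shell hack" is presumably something along these lines (sketched here in Python; the paths, image size and freeze step are my assumptions): each "subvolume" is a filesystem image loop-mounted from a reflink-capable XFS host filesystem, and a "snapshot" is just a reflink copy of that image file.

    # Rough sketch of the XFS+loopback+reflink "subvolume" hack.
    # All paths, sizes and consistency handling are illustrative assumptions.
    import subprocess

    HOST = "/mnt/xfs"               # reflink-capable XFS host filesystem
    IMG = f"{HOST}/subvol.img"      # the "subvolume" is a filesystem image
    MNT = "/mnt/subvol"             # where that image is loop-mounted

    def setup():
        subprocess.run(["truncate", "-s", "100G", IMG], check=True)  # sparse image file
        subprocess.run(["mkfs.xfs", "-f", IMG], check=True)          # fs inside the image
        subprocess.run(["mount", "-o", "loop", IMG, MNT], check=True)

    def snapshot(n):
        # Freeze the inner filesystem so the image is consistent (assumed;
        # the original hack may handle this differently), reflink-copy the
        # image so the copy shares all of its extents, then unfreeze.
        subprocess.run(["xfs_freeze", "-f", MNT], check=True)
        subprocess.run(["cp", "--reflink=always", IMG, f"{HOST}/snap-{n}.img"],
                       check=True)
        subprocess.run(["xfs_freeze", "-u", MNT], check=True)

The ~15s per snapshot quoted above is the cost of that reflink copy once the image file has accumulated 2 million extents.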

Sustained IO rates under that workload: bcachefs: ~200 write IOPS, 100MB/s. XFS+reflink: ~15k write IOPS, 60MB/s. dm-snapshot: ~10k/10k read/write IOPS, 650/650 read/write MB/s. btrfs: 10-150k write IOPS, 5-10k read IOPS, 0.5-3.2GB/s write, 50MB/s read (9 hours averaging over 1GB/s of writes will make a serious dent in the production lifetime of most SSDs).

Write amplification as a factor of storage capacity used by that workload: bcachefs: 1.02, xfs+loop+reflink: 1.1, btrfs: ~4.5, dm-snapshot: 17 (because 64kB/4kB = a minimum 16x write amplification for every random 4kB IO).
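
As a back-of-the-envelope check using only the figures above (~40GB of user data written, times the per-filesystem amplification factors), the implied volume of data actually hitting the device works out roughly as follows:

    # Back-of-the-envelope: device writes implied by ~40GB of user data and
    # the write amplification factors quoted above.
    user_data_gb = 1000 * 10_000 * 4e3 / 1e9   # 1000 snapshots * 10,000 overwrites * 4kB = 40GB
    factors = {"bcachefs": 1.02, "xfs+loop+reflink": 1.1, "btrfs": 4.5, "dm-snapshot": 17}
    for fs, amp in factors.items():
        print(f"{fs:>17}: ~{user_data_gb * amp:.0f} GB written to the device")
    # dm-snapshot's floor is 64kB / 4kB = 16x: every random 4kB write forces
    # a full 64kB chunk copy, so the measured 17x is close to that minimum.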

Memory footprint: bcachefs: ~2GB. XFS+reflink: ~2.5GB. dm-snapshot: ~2.5GB. btrfs: used all of the 16GB of RAM and was swapping; with writeback throttling on both the root device (swap) and the target device (btrfs IO), userspace was getting blocked for tens of seconds at a time waiting on memory reclaim, swap, IO throttling, etc.

Sure, it's a worst case workload, but the point of running "worst case" workloads is finding out how the implementation handles those situations. It's the "worst case" workloads that generate all the customer support and escalation pain for engineering teams that have to make those subsystems work for their customers. Given that btrfs falls completely apart and makes the machine barely usable in scenarios that bcachefs does not even blink at, it's a fair indication of which filesystem architecture handles stress and adverse conditions/workloads better.

bcachefs also scales better than btrfs. btrfs *still* has major problems with btree lock contention. Even when you separate the namespace btrees by directing threads to different subvolumes, it just moves the lock contention to the next btree in the stack - which IIRC is the global chunk allocation btree. I first reported these scalability problems with btrfs over a decade ago, and they have never been addressed. IOWs, btrfs still generally shows the same negative scaling at concurrency levels as low as 4 threads (i.e. 4 threads is slower than 1 thread, despite burning 4 CPUs trying to do work) as it did a decade ago. In comparison, bcachefs concurrency under the same workloads and without using any subvolume tricks ends up scaling similarly to ext4 (i.e. limited by VFS inode cache hash locking at ~8 threads and 4-6x the performance of a single thread).

I can go on, but I've got lots of numbers from many different workloads that basically say the same thing - if you have sustained IO and/or concurrency in your workload, btrfs ends up at the bottom of the pack for many important metrics - IO behaviour, filesystem memory footprint, CPU efficiency, scalability, average latency, long tail latency, etc. In some cases, btrfs is a *long* way behind the pack. And the comparison only gets worse for btrfs if you start to throw fsync() operations into the workload mix....

I'm definitely not saying that bcachefs is perfect - far from it - but I am using bcachefs as a baseline to demonstrate that the poor performance and scalability of btrfs isn't "just what you get from COW filesystems". Competition is good - bcachefs shows that a properly designed and architected COW filesystem can perform extremely well under what are typically called "adverse workload conditions" for COW filesystems. As such, my testing really only serves to highlight the deficiencies in existing upstream snapshot solutions, and so...

"As such, if you want a performant, scalable, robust snapshotting
subvolume capable filesystem, bcachefs is the direction you should
be looking. All of the benefits of integrated subvolume snapshots,
yet none of the fundamental architectural deficiencies and design
flaws that limit the practical usability of btrfs for many important
workloads."

-Dave.



The Btrfs inode-number epic (part 2: solutions)

Posted Aug 30, 2021 9:25 UTC (Mon) by jezuch (subscriber, #52988)

I think I've seen a couple of mentions of bcachefs and it seemed interesting at that time, but I completely missed the point at which it transitioned from "a cool prototype" to "production-ready, stable and mature". Is it already? Perhaps it needs a PR department of sorts, because it sounds really amazing ;)

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 31, 2021 1:27 UTC (Tue) by zblaxell (subscriber, #26385)

You haven't missed it--that point is still in the future. Subvols on bcachefs are only a few months old, and much closer to the "cool prototype" end of the spectrum than the other end. My last test run of bcachefs ended after 29 minutes with a readonly splat from the 'cp' command.

The Btrfs inode-number epic (part 2: solutions)

Posted Sep 3, 2021 2:06 UTC (Fri) by flussence (guest, #85566)

Speaking of cool-but-unreleased filesystems, has anyone heard from Tux3 lately? It was showing some mythical benchmark numbers where an in-memory loopback was outperforming tmpfs, then it fell off the face of the earth.

The Btrfs inode-number epic (part 2: solutions)

Posted Sep 7, 2021 14:31 UTC (Tue) by nye (subscriber, #51576)

The mailing list has had nothing but "is this project alive?" type messages for the last couple of years. The latest one of those to get an answer was last year:

> I can't say very active though, we are working on spare time. Recently,
> we are working for snapshot prototype, and inode container improvement

I'd say the last time it looked even vaguely healthy was 2014, and even that was after a couple of very light years, so I think it is probably never going to see the light of day, sadly.

