The Btrfs inode-number epic (part 2: solutions)
Posted Aug 24, 2021 16:16 UTC (Tue)
by wazoox (subscriber, #69624)
Parent article: The Btrfs inode-number epic (part 2: solutions)
"As such, if you want a performant, scalable, robust snapshotting
subvolume capable filesystem, bcachefs is the direction you should
be looking. All of the benefits of integrated subvolume snapshots,
yet none of the fundamental architectural deficiencies and design
flaws that limit the practical usability of btrfs for many important
workloads."
Posted Aug 24, 2021 18:00 UTC (Tue)
by sub2LWN (subscriber, #134200)
[Link]
Posted Aug 24, 2021 18:22 UTC (Tue)
by mbunkus (subscriber, #87248)
[Link] (1 responses)
Do you have any insight into their plans, timeframes, goals for getting it into mainline? I see that an unsuccessful attempt was made in December 2020, but after that… not easy to find more information for an outsider like me.
Posted Aug 25, 2021 14:52 UTC (Wed)
by wazoox (subscriber, #69624)
[Link]
There was some problem, then no news... OTOH at the time snapshots weren't even functional.
Posted Aug 26, 2021 5:47 UTC (Thu)
by dgc (subscriber, #6611)
[Link] (4 responses)
https://lore.kernel.org/linux-btrfs/20210121222051.GB4626...
From a filesystem design perspective, COW metadata creates really nasty write amplification and memory footprint problems for pure metadata updates such as updating object reference counts during a snapshot. First the tree has to be stabilised (run all pending metadata COW and write it back), then a reference-count update has to be run, which then COWs every metadata block in the currently referenced metadata root tree. The metadata amplification is such that, with enough previous snapshots, a new snapshot with just a few tens of MB of changed user data can amplify into tens of GB of internal metadata COW.
That explained why user data COW write performance on btrfs degraded so quickly as the snapshot count increased in this specific stress workload (1000 snapshots with 10,000 random 4kB overwrites per snapshot, so 40GB of total user data written).
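A workload of roughly that shape can be approximated with a small fio-plus-snapshot loop; the following is only a sketch (scratch device, file size, and fio options are assumptions, not the actual job file used for these numbers):
# 1000 iterations of 10,000 random 4kB overwrites (~40MB each) followed by a snapshot.
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /mnt/test
btrfs subvolume create /mnt/test/subvol
# Lay the test file out once; 4GB is an arbitrary choice.
fio --name=prefill --filename=/mnt/test/subvol/testfile --size=4g --rw=write --bs=1m
for i in $(seq 1 1000); do
    fio --name=overwrite --filename=/mnt/test/subvol/testfile --size=4g \
        --rw=randwrite --bs=4k --io_size=40m --overwrite=1 --ioengine=psync
    # Watch how long each snapshot takes as the snapshot count grows.
    /usr/bin/time -f "snapshot $i: %es" \
        btrfs subvolume snapshot /mnt/test/subvol "/mnt/test/snap.$i"
done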
In comparison, dm-snapshot performance on this workload is deterministic and constant as the snapshot count increases, same as bcachefs. Bcachefs performed small COW 5x faster than dm-snapshot (largely because of dm-snapshot's write amplification from its 64kB minimum COW block size). At 1 snapshot, btrfs COW is about 80% of the speed of bcachefs. At 10 snapshots, bcachefs and dm-snapshot performance is unchanged, while btrfs has degraded to about the same speed as dm-snapshot. At 100 snapshots, btrfs is bouncing between 1% and 5% of the sustained user data write speed of bcachefs, and less than a quarter of the speed of dm-snapshot, and it doesn't regain any of the original performance as the snapshot count increases further.
That can be seen in the workload runtimes: it ran in 20 minutes on bcachefs, with each snapshot taking less than 30ms. It ran in about 40 minutes on dm-snapshot, with each snapshot taking less than 20ms. It took 5 hours for XFS+loopback+reflink to run (basically the XFS subvol architecture as a 10-line shell hack), because a reflink of an image file with 2 million extents takes ~15s. It took about 9 hours for btrfs to run - a combination of slow user IO (sometimes getting only *200* 4kB write IOPS from fio for minutes at a time) and the time to run the btrfs snapshot command increasing linearly with snapshot count, taking ~70s by the 1000th snapshot.
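The XFS+loopback+reflink setup ("the XFS subvol architecture as a 10-line shell hack") presumably looks something like the sketch below; device names, sizes, and the freeze step are assumptions, since the actual script isn't shown here:
# Host filesystem: reflink-capable XFS. Each "subvolume" is an XFS image file
# mounted over loopback; a "snapshot" is a reflink copy of the (frozen) image.
mkfs.xfs -f -m reflink=1 /dev/vdc
mount /dev/vdc /mnt/host
truncate -s 100G /mnt/host/subvol.img
mkfs.xfs -f /mnt/host/subvol.img
mkdir -p /mnt/subvol
mount -o loop /mnt/host/subvol.img /mnt/subvol
# ... run the random-overwrite workload against /mnt/subvol ...
xfs_freeze -f /mnt/subvol                          # quiesce the inner filesystem
cp --reflink=always /mnt/host/subvol.img "/mnt/host/snap-$(date +%s).img"
xfs_freeze -u /mnt/subvol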
Sustained IO rates under that workload:
- bcachefs: ~200 write IOPS, 100MB/s
- XFS+reflink: ~15k write IOPS, 60MB/s
- dm-snapshot: ~10k/10k read/write IOPS, 650/650 read/write MB/s
- btrfs: 10-150k write IOPS, 5-10k read IOPS, 0.5-3.2GB/s write, 50MB/s read (9 hours averaging over 1GB/s of writes will make a serious dent in the production lifetime of most SSDs)
Write amplification, as a factor of storage capacity used by that workload:
- bcachefs: 1.02
- XFS+loop+reflink: 1.1
- btrfs: ~4.5
- dm-snapshot: 17 (64kB/4kB = a minimum of 16x write amplification for every random 4kB IO)
Memory footprint:
- bcachefs: ~2GB
- XFS+reflink: ~2.5GB
- dm-snapshot: ~2.5GB
- btrfs: used all of the 16GB of RAM and was swapping; with writeback throttling on both the root device (swap) and the target device (btrfs IO), userspace was getting blocked for tens of seconds at a time waiting on memory reclaim, swap, IO throttling, etc.
Sure, it's a worst case workload, but the point of running "worst case" workloads is finding out how the implementation handles those situations. It's the "worst case" workloads that generate all the customer support and escalation pain for engineering teams that have to make those subsystems work for their customers. Given that btrfs falls completely apart and makes the machine barely usable in scenarios that bcachefs does not even blink at, it's a fair indication of which filesystem architecture handles stress and adverse conditions/workloads better.
bcachefs also scales better than btrfs. btrfs *still* has major problems with btree lock contention. Even when you separate the namespace btrees by directing threads to different subvolumes, it just moves the lock contention to the next btree in the stack - which IIRC is the global chunk allocation btree. I first reported these scalability problems with btrfs over a decade ago, and they have never been addressed. IOWs, btrfs still generally shows the same negative scaling at concurrency levels as low as 4 threads (i.e. 4 threads is slower than 1 thread, despite burning 4 CPUs trying to do work) as it did a decade ago. In comparison, bcachefs concurrency under the same workloads, and without using any subvolume tricks, ends up scaling similarly to ext4 (i.e. limited by VFS inode cache hash locking at ~8 threads, and 4-6x the performance of a single thread).
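That kind of negative scaling can be seen with even a crude file-creation microbenchmark run at increasing thread counts; the sketch below is illustrative only (it is not the benchmark behind the numbers above, and paths and counts are arbitrary):
# Create many small files from N concurrent shells, each in its own directory
# (make the directories btrfs subvolumes instead to separate the namespace btrees).
# Negative scaling shows up when the wall-clock time at nthreads=4 exceeds nthreads=1.
nthreads=4
files_per_thread=100000
for t in $(seq 1 "$nthreads"); do mkdir -p "/mnt/test/dir$t"; done
time (
    for t in $(seq 1 "$nthreads"); do
        (
            for i in $(seq 1 "$files_per_thread"); do
                echo data > "/mnt/test/dir$t/file$i"
            done
        ) &
    done
    wait
)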
I can go on, but I've got lots of numbers from many different workloads that basically say the same thing - if you have sustained IO and/or concurrency in your workload, btrfs ends up at the bottom of the pack for many important metrics: IO behaviour, filesystem memory footprint, CPU efficiency, scalability, average latency, long-tail latency, etc. In some cases, btrfs is a *long* way behind the pack. And the comparison only gets worse for btrfs if you start to throw fsync() operations into the workload mix....
I'm definitely not saying that bcachefs is perfect - far from it - but I am using bcachefs as a baseline to demonstrate that the poor performance and scalability of btrfs isn't "just what you get from COW filesystems". Competition is good - bcachefs shows that a properly designed and architected COW filesystem can perform extremely well under what are typically called "adverse workload conditions" for COW filesystems. As such, my testing really only serves to highlight the deficiencies in existing upstream snapshot solutions, and so...
"As such, if you want a performant, scalable, robust snapshotting
-Dave.
Posted Aug 30, 2021 9:25 UTC (Mon)
by jezuch (subscriber, #52988)
[Link] (3 responses)
Posted Aug 31, 2021 1:27 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link]
Posted Sep 3, 2021 2:06 UTC (Fri)
by flussence (guest, #85566)
[Link] (1 responses)
Posted Sep 7, 2021 14:31 UTC (Tue)
by nye (subscriber, #51576)
[Link]
> I can't say very active though, we are working on spare time. Recently, we are working for snapshot prototype, and inode container improvement
I'd say the last time it looked even vaguely healthy was 2014, and even that was after a couple of very light years, so I think it is probably never going to see the light of day, sadly.
https://lkml.org/lkml/2020/10/27/3684
The Btrfs inode-number epic (part 2: solutions)
Posted Aug 24, 2021 23:07 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link]
Quote from bcachefs.org:
The functionality and userspace interface for snapshots and subvolumes are roughly modelled after btrfs...
I wouldn't expect anything different. For over a decade, btrfs has had the only viable implementation of this interface to build on in Linux. Even if other filesystems implement subvols and snapshots, they'll be strongly compelled to follow whatever trail btrfs blazes for them now.
I find all the worst-case O(N) searching for N snapshots in the design doc concerning.
This is what the bcachefs 'snapshot' branch does today:
# bcachefs subvolume create foo
# date > foo/bar
# bcachefs subvolume snapshot foo quux
# find -ls
4096 0 drwxr-xr-x 3 root root 0 Aug 24 18:40 .
4098 0 drwxr-xr-x 2 root root 0 Aug 24 18:40 ./foo
4099 1 -rw-r--r-- 1 root root 29 Aug 24 18:40 ./foo/bar
4098 0 drwxr-xr-x 2 root root 0 Aug 24 18:40 ./quux
4099 1 -rw-r--r-- 1 root root 29 Aug 24 18:40 ./quux/bar
4097 0 drwx------ 2 root root 0 Aug 24 18:40 ./lost+found
# stat foo/bar quux/bar
File: foo/bar
Size: 29 Blocks: 1 IO Block: 512 regular file
Device: fd04h/64772d Inode: 4099 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-08-24 18:40:37.278101823 -0400
Modify: 2021-08-24 18:40:37.290101816 -0400
Change: 2021-08-24 18:40:37.290101816 -0400
Birth: -
File: quux/bar
Size: 29 Blocks: 1 IO Block: 512 regular file
Device: fd04h/64772d Inode: 4099 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-08-24 18:40:37.278101823 -0400
Modify: 2021-08-24 18:40:37.290101816 -0400
Change: 2021-08-24 18:40:37.290101816 -0400
Birth: -
Duplicate st_dev and st_ino: it's worse than btrfs. On the other hand:
# date > foo/second1
# date > quux/second2
# ls -li */second*
4100 -rw-r--r-- 1 root root 29 Aug 24 18:44 foo/second1
4101 -rw-r--r-- 1 root root 29 Aug 24 18:44 quux/second2
bcachefs will always give new files unique inode numbers, even in different subvols, because the code for creating a new file obtains a globally unique inode number. Possible point for bcachefs here--in this situation, btrfs uses a per-subvol inode number allocator, which would have given both new files inode 4100.
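For reference, the equivalent btrfs experiment looks like this (commands only, no captured output; the mount point and names are illustrative). Per the behaviour described above, stat should report a different st_dev but the same st_ino for foo/bar and quux/bar, and the two new files may well end up with the same inode number:
cd /mnt/btrfs                                   # an existing btrfs mount, illustrative
btrfs subvolume create foo
date > foo/bar
btrfs subvolume snapshot foo quux
stat -c '%d %i %n' foo/bar quux/bar             # different device numbers, same inode number
date > foo/second1
date > quux/second2
ls -li */second*                                # per-subvolume allocators can hand out the same number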