The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 16:16 UTC (Tue) by wazoox (subscriber, #69624)
Parent article: The Btrfs inode-number epic (part 2: solutions)

Dave Chinner, in a discussion about snapshots, just wrote this on the XFS mailing list:

"As such, if you want a performant, scalable, robust snapshotting
subvolume capable filesystem, bcachefs is the direction you should
be looking. All of the benefits of integrated subvolume snapshots,
yet none of the fundamental architectural deficiencies and design
flaws that limit the practical usability of btrfs for many important
workloads."


The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 18:00 UTC (Tue) by sub2LWN (subscriber, #134200) [Link]

Here's a link to the quoted linux-xfs mail: https://lwn.net/ml/linux-xfs/20210823231235.GK3657114%40d...

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 18:22 UTC (Tue) by mbunkus (subscriber, #87248) [Link] (1 responses)

I've looked into bcachefs a couple of times over the years now, and I'd really like to give it a serious go — but not at the cost of having to build my own kernel (as none of the distros relevant to me seem to carry it) and rescue ISO.

Do you have any insight into their plans, timeframes, goals for getting it into mainline? I see that an unsuccessful attempt was made in December 2020, but after that it's not easy for an outsider like me to find more information.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 25, 2021 14:52 UTC (Wed) by wazoox (subscriber, #69624) [Link]

Here is the latest request from Kent for mainlining:
https://lkml.org/lkml/2020/10/27/3684

There were some problems, then no news... OTOH, at the time snapshots weren't even functional.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 23:07 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

Quote from bcachefs.org:

The functionality and userspace interface for snapshots and subvolumes are roughly modelled after btrfs...

I wouldn't expect anything different. For over a decade, btrfs has had the only viable implementation of this interface to build on in Linux. Even if other filesystems implement subvols and snapshots, they'll be strongly compelled to follow whatever trail btrfs blazes for them now.

I find all the worst-case O(N) searching for N snapshots in the design doc concerning.

This is what the bcachefs 'snapshot' branch does today:

# bcachefs subvolume create foo
# date > foo/bar
# bcachefs subvolume snapshot foo quux
# find -ls
     4096      0 drwxr-xr-x   3 root     root            0 Aug 24 18:40 .
     4098      0 drwxr-xr-x   2 root     root            0 Aug 24 18:40 ./foo
     4099      1 -rw-r--r--   1 root     root           29 Aug 24 18:40 ./foo/bar
     4098      0 drwxr-xr-x   2 root     root            0 Aug 24 18:40 ./quux
     4099      1 -rw-r--r--   1 root     root           29 Aug 24 18:40 ./quux/bar
     4097      0 drwx------   2 root     root            0 Aug 24 18:40 ./lost+found
# stat foo/bar quux/bar
  File: foo/bar
  Size: 29              Blocks: 1          IO Block: 512    regular file
Device: fd04h/64772d    Inode: 4099        Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-08-24 18:40:37.278101823 -0400
Modify: 2021-08-24 18:40:37.290101816 -0400
Change: 2021-08-24 18:40:37.290101816 -0400
 Birth: -
  File: quux/bar
  Size: 29              Blocks: 1          IO Block: 512    regular file
Device: fd04h/64772d    Inode: 4099        Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-08-24 18:40:37.278101823 -0400
Modify: 2021-08-24 18:40:37.290101816 -0400
Change: 2021-08-24 18:40:37.290101816 -0400
 Birth: -

Duplicate st_dev and st_ino; it's worse than btrfs. On the other hand:
# date > foo/second1
# date > quux/second2
# ls -li */second*
4100 -rw-r--r-- 1 root root 29 Aug 24 18:44 foo/second1
4101 -rw-r--r-- 1 root root 29 Aug 24 18:44 quux/second2

bcachefs will always give new files unique inode numbers, even in different subvols, because the code for creating a new file obtains a globally unique inode number. Possible point for bcachefs here--in this situation, btrfs uses a per-subvol inode number allocator, which would have given both new files inode 4100.
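
To make the collision concrete: any tool that identifies files by the (st_dev, st_ino) pair (hard-link detection in backup, dedupe and copy tools, for example) will conclude from the stat output above that foo/bar and quux/bar are the same file. A quick way to see it; the two output lines below are reconstructed from the numbers stat already printed, not from a fresh run:

# stat -c 'dev=%d ino=%i %n' foo/bar quux/bar
dev=64772 ino=4099 foo/bar
dev=64772 ino=4099 quux/bar

That identical (device, inode) pair for two distinct files in different subvolumes is exactly the class of problem the article describes for btrfs.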

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 26, 2021 5:47 UTC (Thu) by dgc (subscriber, #6611) [Link] (4 responses)

Yup. I stand by that statement. I've run tests, measured numbers and observed behaviours. I also have a fair understanding of differences in architectures and implementations. This thread is probably enlightening about the architectural deficiencies within btrfs, which would appear to be unfixable:

https://lore.kernel.org/linux-btrfs/20210121222051.GB4626...

From a filesystem design perspective, COW metadata creates really nasty write amplification and memory footprint problems for pure metadata updates such as updating object reference counts during a snapshot. First the tree has to be stabilised (run all pending metadata COW and write it back), then a reference count update has to be run which then COWs every metadata block in the currently referenced metadata root tree. The metadata amplification is such that with enough previous snapshots, a new snapshot with just a few tens of MB of changed user data can amplify into 10s of GB of internal metadata COW.....
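
To put rough numbers on that (illustrative figures only, not measurements): assuming btrfs's default 16kB metadata nodes, a shared tree of a million nodes that all have to be COWed and have their reference counts updated is already ~16GB of metadata writes, triggered by a snapshot whose user-visible change might be only a few tens of MB of file data.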

That explained why user data write COW performance on btrfs degraded quickly as the snapshot count increased in this specific stress workload (1000 snapshots with 10,000 random 4kB overwrites per snapshot, so 40GB of total user data written).
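
Roughly, that workload can be reproduced with an fio job plus a snapshot loop along these lines; the file name, file size, mount point and job options are illustrative guesses, not the actual test harness:

#!/bin/sh
# Assumes a btrfs filesystem mounted at /mnt with a subvolume at /mnt/subvol.
# Each pass does 10,000 random 4kB overwrites (40MB) of a 4GB test file
# (fio lays the file out on the first pass), then snapshots the subvolume;
# 1000 passes is ~40GB of user data written in total.
cat > overwrite.fio <<'EOF'
[overwrite]
filename=/mnt/subvol/testfile
size=4g
rw=randwrite
bs=4k
io_size=40m
ioengine=libaio
iodepth=16
direct=1
EOF

for i in $(seq 1 1000); do
    fio overwrite.fio > /dev/null
    time btrfs subvolume snapshot /mnt/subvol "/mnt/snap-$i"
done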

In comparison, dm-snapshot performance on this workload is deterministic and constant as the snapshot count increases, same as bcachefs. Bcachefs performed small COW 5x faster than dm-snapshot (largely because of dm-snapshot's write amplification from its 64kB minimum COW block size). At 1 snapshot, btrfs COW is about 80% the speed of bcachefs. At 10 snapshots, bcachefs and dm-snapshot performance is unchanged, and btrfs has degraded to about the same speed as dm-snapshot. At 100 snapshots, btrfs is bouncing between 1% and 5% of the sustained user data write speed of bcachefs, and less than a quarter of the speed of dm-snapshot, and it doesn't regain any of the original performance as the snapshot count increases further.

That can be seen in workload runtimes - it ran in 20 minutes on bcachefs with each snapshot taking less than 30ms. It ran in about 40 minutes on dm-snapshot, with each snapshot taking less than 20ms. It took 5 hours for XFS+loopback+reflink to run (basically the XFS subvol architecture as a 10-line shell hack) because reflink on an image file with 2 million extents takes ~15s. It took about 9 hours for btrfs to run - a combination of slow user IO (sometimes getting only *200* 4kB write IOPS from fio for minutes at a time) and the time to run the btrfs snapshot command increasing linearly with snapshot count, taking ~70s by the 1000th snapshot.

Sustained IO rates under that workload: bcachefs: ~200 write IOPS, 100MB/s. XFS+reflink: ~15k write IOPS, 60MB/s. dm-snapshot: ~10k/10k read/write IOPS, 650/650 read/write MB/s. btrfs: 10-150k write IOPS, 5-10k read IOPS, 0.5-3.2GB/s write, 50MB/s read (9 hours averaging over 1GB/s write will make a serious dent in the production lifetime of most SSDs).

Write amplification as a factor of storage capacity used by that workload: bcachefs: 1.02; xfs+loop+reflink: 1.1; btrfs: ~4.5; dm-snapshot: 17 (because 64kB/4kB = a minimum 16x write amplification for every random 4kB IO).

Memory footprint: bcachefs: ~2GB. XFS+reflink: ~2.5GB. dm-snapshot: ~2.5GB. btrfs: used all of the 16GB of RAM and was swapping, with writeback throttling on both the root device (swap) and the target device (btrfs IO); userspace was getting blocked for tens of seconds at a time waiting on memory reclaim, swap, IO throttling, etc.

Sure, it's a worst case workload, but the point of running "worst case" workloads is finding out how the implementation handles those situations. It's the "worst case" workloads that generate all the customer support and escalation pain for engineering teams that have to make those subsystems work for their customers. Given that btrfs falls completely apart and makes the machine barely usable in scenarios that bcachefs does not even blink at, it's a fair indication of which filesystem architecture handles stress and adverse conditions/workloads better.

bcachefs also scales better than btrfs. btrfs *still* has major problems with btree lock contention. Even when you separate the namespace btrees by directing threads to different subvolumes, it just moves the lock contention to the next btree in the stack - which IIRC is the global chunk allocation btree. I first reported these scalability problems with btrfs over a decade ago, and they have never been addressed. IOWs, btrfs still generally shows the same negative scaling at concurrency levels as low as 4 threads (i.e. 4 threads is slower than 1 thread, despite burning 4 CPUs trying to do work) as it did a decade ago. In comparison, bcachefs concurrency under the same workloads and without using any subvolume tricks ends up scaling similarly to ext4 (i.e. limited by VFS inode cache hash locking at ~8 threads and 4-6x the performance of a single thread).
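
A common way to generate that kind of metadata-heavy concurrency is fs_mark with one thread per directory (or per subvolume); the invocation below is only an illustration of the shape of such a test, not the exact one being referred to here:

#!/bin/sh
# 4 concurrent threads, each creating 100,000 zero-length files in its
# own directory; negative scaling shows up when the 4-thread aggregate
# rate drops below what a single thread achieves.
mkdir -p /mnt/test/0 /mnt/test/1 /mnt/test/2 /mnt/test/3
fs_mark -S 0 -s 0 -n 100000 \
        -d /mnt/test/0 -d /mnt/test/1 -d /mnt/test/2 -d /mnt/test/3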

I can go on, but I've got lots of numbers from many different workloads that basically say the same thing - if you have sustained IO and/or concurrency in your workload, btrfs ends up at the bottom of the pack for many important metrics - IO behaviour, filesystem memory footprint, CPU efficiency, scalability, average latency, long tail latency, etc. In some cases, btrfs is a *long* way behind the pack. And the comparison only gets worse for btrfs if you start to throw fsync() operations into the workload mix....

I'm definitely not saying that bcachefs is perfect - far from it - but I am using bcachefs as a baseline to demonstrate that the poor performance and scalability of btrfs isn't "just what you get from COW filesystems". Competition is good - bcachefs shows that a properly designed and architected COW filesystem can perform extremely well under what are typically called "adverse workload conditions" for COW filesystems. As such, my testing really only serves to highlight the deficiencies in existing upstream snapshot solutions, and so...

"As such, if you want a performant, scalable, robust snapshotting
subvolume capable filesystem, bcachefs is the direction you should
be looking. All of the benefits of integrated subvolume snapshots,
yet none of the fundamental architectural deficiencies and design
flaws that limit the practical usability of btrfs for many important
workloads."

-Dave.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 30, 2021 9:25 UTC (Mon) by jezuch (subscriber, #52988) [Link] (3 responses)

I think I've seen a couple of mentions of bcachefs and it seemed interesting at that time, but I completely missed the point at which it transitioned from "a cool prototype" to "production-ready, stable and mature". Is it already? Perhaps it needs a PR department of sorts, because it sounds really amazing ;)

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 31, 2021 1:27 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

You haven't missed it--that point is still in the future. Subvols on bcachefs are only a few months old, and much closer to the "cool prototype" end of the spectrum than the other end. My last test run of bcachefs ended after 29 minutes with a readonly splat from the 'cp' command.

The Btrfs inode-number epic (part 2: solutions)

Posted Sep 3, 2021 2:06 UTC (Fri) by flussence (guest, #85566) [Link] (1 responses)

Speaking of cool-but-unreleased filesystems, has anyone heard from Tux3 lately? It was showing some mythical benchmark numbers where an in-memory loopback was outperforming tmpfs, then it fell off the face of the earth.

The Btrfs inode-number epic (part 2: solutions)

Posted Sep 7, 2021 14:31 UTC (Tue) by nye (subscriber, #51576) [Link]

The mailing list has had nothing but "is this project alive?" type messages for the last couple of years. The latest one of those to get an answer was last year:

> I can't say very active though, we are working on spare time. Recently,
> we are working for snapshot prototype, and inode container improvement

I'd say the last time it looked even vaguely healthy was 2014, and even that came after a couple of very quiet years, so I think it is probably never going to see the light of day, sadly.

