The Btrfs inode-number epic (part 2: solutions)
Take 1: internal mounts
Brown's first patch set attempted to resolve these problems by creating the concept of "internal mounts"; these would be automatically created by Btrfs for each visible subvolume. The automount mechanism is used to make those mounts appear. With a bit of tweaking, the kernel's NFS server daemon (nfsd) can recognize these special mounts and allow them to be crossed without an explicit mount on the remote side. With this setup, the device numbers shown by the system are as expected, and inode numbers are once again unique within a given mount.
At first glance, this patch set seemed like a good solution to the problem. When presented with a description of this approach back in July, filesystem developer Christoph Hellwig responded: "This is what I've been asking for for years". With these changes, Btrfs appears to be a bit less weird, and some longstanding problems are finally resolved.
This patch set quickly ran into trouble, though. Al Viro pointed out that the mechanism for querying device numbers could generate I/O while holding a lock that does not allow for such actions, thus deadlocking the system; without that query, though, the scheme for getting the device number from the filesystem will not work. One potential alternative, providing a separate superblock for each internal mount that would contain the needed information, is even worse. Many operations in the kernel's virtual filesystem layer involve iterating through the full list of mounted superblocks; adding thousands of them for Btrfs subvolumes would create a number of new performance problems that would take massive changes to fix.
Additionally, Amir Goldstein noted that the new mount structure could create trouble for overlayfs; it would also break some of his user-space tools. There is also the little issue of how all those internal mounts would show up in /proc/mounts; on systems with large numbers of subvolumes, that would turn /proc/mounts into a huge, unwieldy mess that could also expose the names of otherwise private subvolumes.
Take 2: file handles
Brown concluded that "with its current framing the problem is unsolvable". Specifically, the problem is the 64 bits set aside for the inode number, which are not enough for Btrfs even now. The problem gets worse with overlayfs, which must combine inode numbers from multiple filesystems, yielding something that is necessarily larger than any one filesystem's numbers. Brown described the current solution in overlayfs as "it over-loads the high bits and hopes the filesystem doesn't use them", which seems less than fully ideal. But, as long as inode numbers are limited to any fixed size, there is no way around the problem, he said.
It would be better, he continued, to use the file handle provided by many filesystems, primarily for use with NFS; a file's handle can be obtained with name_to_handle_at(). The handle is of arbitrary length, and it includes a generation number, which handily gets around the problems of inode-number reuse when a file is deleted. If user space were to use handles rather than inode numbers to check whether two files are the same, a lot of problems would go away.
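To make the idea concrete, here is a minimal user-space sketch (not code from the patches under discussion) that compares two paths by file handle instead of by device and inode number; note Viro's later objection that a handle mismatch is not absolute proof that two files differ:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Fetch the handle for a path; the caller must free() the result. */
static struct file_handle *get_handle(const char *path)
{
    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    int mount_id;

    if (!fh)
        return NULL;
    fh->handle_bytes = MAX_HANDLE_SZ;
    if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
        perror(path);
        free(fh);
        return NULL;
    }
    return fh;
}

int main(int argc, char **argv)
{
    struct file_handle *a, *b;

    if (argc != 3)
        return 2;
    a = get_handle(argv[1]);
    b = get_handle(argv[2]);
    if (!a || !b)
        return 2;
    /*
     * Matching handles mean the same object; a full check would also
     * compare the mount ID (or fsid) to rule out identical handles
     * coming from two different filesystems.
     */
    if (a->handle_type == b->handle_type &&
        a->handle_bytes == b->handle_bytes &&
        memcmp(a->f_handle, b->f_handle, a->handle_bytes) == 0)
        printf("same file\n");
    else
        printf("different files\n");
    return 0;
}
```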
Of course, some new problems would also materialize, mostly in the form of the need to make a lot of changes to user-space interfaces and programs. No files exported by the kernel (/proc files, for example) use handles now, so a set of new files that included the handles would have to be created. Any program that looks at inode numbers would have to be updated. The result would be a lot of broken user-space tools. Brown has repeatedly insisted that breaking things may be possible (and necessary):
If you refuse to risk breaking anything, then you cannot make progress. Providing people can choose when things break, and have advanced warning, they often cope remarkable well.
Incompatible changes remain a hard sell, though. Beyond that, to get the full benefit from the change, Btrfs would have to be changed to stop using artificial device numbers for subvolumes, which is not a small change either. And, as Viro pointed out, it is possible for two different file handles to refer to the same file.
In summary, this approach did not win the day either.
Take 3: mount options
Brown's third attempt approached the problem from a different direction, making all of the changes explicitly opt-in. Specifically, he added two new mount options for Btrfs filesystems that would change their behavior with regard to inode and device numbers.
The first option, inumbits=, changes how inode numbers are presented; the default value of zero causes the internal object ID to be used (as is currently the case for Btrfs). A non-zero value tells Btrfs to generate inode numbers that are "mostly unique" and that fit into the given number of bits. Specifically, to generate the inode number for a given object within a subvolume, Btrfs will:
- Generate an "overlay" value from the subvolume number; this is done by byte-swapping the number so that the low-order bits (which vary the most between subvolumes) are in the most-significant bit positions.
- The overlay is right-shifted to fit within the number of bits specified by inumbits=. If that number is 64, no shift need be done.
- That overlay value is then XORed with the object number to produce the inode number presented to user space.
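For illustration, the three steps above amount to only a few lines of arithmetic; this sketch uses made-up names and is not taken from the patch itself:

```c
#include <stdint.h>

/*
 * Illustration of the inumbits= scheme described above; the names here
 * are made up and the real patch is more involved.
 */
static uint64_t subvol_overlay(uint64_t subvol_id, unsigned int inumbits)
{
    /* Byte-swap so the fast-changing low bits become the high bits... */
    uint64_t overlay = __builtin_bswap64(subvol_id);

    /* ...then shift right so the overlay fits into inumbits bits. */
    if (inumbits < 64)
        overlay >>= 64 - inumbits;
    return overlay;
}

static uint64_t presented_ino(uint64_t object_id, uint64_t subvol_id,
                              unsigned int inumbits)
{
    if (inumbits == 0)      /* default: the raw object ID, as today */
        return object_id;
    /* XOR the overlay into the object ID to form the visible inode number. */
    return object_id ^ subvol_overlay(subvol_id, inumbits);
}
```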
The resulting inode numbers will still be unique within any given subvolume; collisions within a large Btrfs filesystem can still happen, but they are less likely than before. Setting inumbits=64 minimizes the chances of duplicate inode numbers, but a lower number (such as 56) may make sense in situations (such as when overlayfs is in use) where the top bits are used by other subsystems.
The second mount option is numdevs=; it controls how many device numbers are used to represent subvolumes within the filesystem. The default value, numdevs=many, preserves the existing behavior of allocating a separate device number for every subvolume. Setting numdevs=1, instead, causes a single device number to be used for all subvolumes. When a filesystem is mounted with this option, tools like find and du will not be able to detect the crossing of a subvolume boundary, so their options to stay within a single filesystem may not work as expected. It is also possible to specify numdevs=2, which causes two device numbers to be used in an alternating manner when moving from one subvolume to the next; this makes tools like find work as expected.
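The reason that two alternating device numbers suffice is that these tools only compare a child's st_dev against its parent's; a simplified version of the check behind options like find -xdev (not the actual find source) looks like this:

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Simplified form of the test behind "find -xdev" and "du -x": a child
 * is treated as being on another filesystem when its st_dev differs from
 * its parent's. With numdevs=2, adjacent subvolumes always differ, so
 * every boundary remains visible. This is not the actual find source.
 */
static bool crossed_filesystem_boundary(const struct stat *parent,
                                        const struct stat *child)
{
    return parent->st_dev != child->st_dev;
}
```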
Finally, this patch set also added the concept of a "tree ID" that can be fetched with the statx() system call. Btrfs would respond to that query with the subvolume ID, which applications could then use to reliably determine whether two files are contained within the same subvolume or not.
Btrfs developer Josef Bacik described this work as "a step in the right direction", but said that he wants to see a solution that does not require special mount options. "Mount options are messy, and are just going to lead to distros turning them on without understanding what's going on and then we have to support them forever". A proper solution, he said, does not present the possibility for users to make bad decisions. He suggested just using the new tree ID within nfsd to solve the NFS-specific problems, generating new inode numbers itself if need be.
Brown countered with a suggestion that, rather than adding mount options, he could just create a new filesystem type ("btrfs2 or maybe betrfs") that would use the new semantics. Bacik didn't like that idea either, though. Brown added that he would prefer not to do "magic transformations" of Btrfs inode numbers in nfsd; if a filesystem requires such operations, they should be done in the filesystem itself, he said. He then asked that the Btrfs developers make a decision on their preferred way to solve this problem, but did not get an answer.
Take 4: the uniquifier
On August 13, Brown returned with a minimal patch aimed at solving the NFS problems that started this whole discussion. It enables a filesystem to provide a "uniquifier" value associated with a file; this value, the name of which is arguably better suited to a professional wrestler, is only available within the kernel. The NFS server can then XOR that value with the file's inode number to obtain a number that is more likely to be unique. Btrfs provides the overlay value described above as this value; nfsd uses it, and the problem (mostly) goes away.
Bacik said that this approach was "reasonable" and acked it for the Btrfs filesystem. It thus looks like it could finally be a solution for the problem at hand. Or, at least, it's closer; Brown later realized that the changed inode numbers would create the dreaded "stale file handle" errors on existing clients when the transition happens. An updated version of the patch set adds a new flag in an unused byte of the NFS file handle to mark "new-style" inode numbers and prevent this error from occurring.
The second revision of the fourth attempt may indeed be the charm that makes some NFS-specific problems go away for Btrfs users. It is hard not to see this change (an internal process involving magic numbers that still is not guaranteed to create unique inode numbers) as a bit of a hack, though. Indeed, even Brown referred to "hacks to work around misfeatures in filesystems" when talking about this work. Hacks, though, can be part of life when dealing with production systems and large user bases; a cleaner and more correct solution may not be possible without breaking user systems. So the uniquifier may be as good as things can get until some other problem is severe enough to force the acceptance of a more disruptive solution.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/Btrfs |
Posted Aug 23, 2021 16:48 UTC (Mon)
by jkingweb (subscriber, #113039)
[Link]
Posted Aug 23, 2021 18:15 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (6 responses)
Posted Aug 23, 2021 18:23 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
Posted Aug 23, 2021 19:43 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
Sure, you can throw more bits at the problem, but you're just treating the symptoms. The real issue here is not "we don't have enough bits." It's "we can't agree on exactly how those bits should be allocated." One possibility: btrfs might decide to have unique inodes over the whole filesystem, and that would likely be challenging but technically possible (for example, when you create a subvolume, you allocate a new 32-bit inode prefix to that subvolume, and whenever any subvolume runs out of inode numbers, you give it another 32-bit prefix - since each prefix contains ~4 billion inode numbers, this allocation should happen rather infrequently, and since there are ~4 billion possible prefix values, the large size of these allocations will not easily cause a shortage).
But I doubt you can actually do that and still maintain on-disk compatibility with existing btrfs filesystems. Oh well.
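Purely as an illustration of the scheme being suggested here (nothing like this exists in btrfs), the allocation could be modeled as:

```c
#include <stdint.h>

/*
 * Toy model of the suggestion above; nothing like this exists in btrfs.
 * Prefixes come from one filesystem-wide counter; a subvolume gets its
 * first prefix at creation time and another one only when it exhausts
 * the ~4 billion numbers under its current prefix.
 */
struct toy_subvol {
    uint32_t prefix;       /* prefix currently assigned to this subvolume */
    uint32_t next_local;   /* next unused value in the low 32 bits */
};

static uint32_t next_free_prefix;  /* filesystem-wide prefix allocator */

static uint64_t toy_alloc_ino(struct toy_subvol *sv)
{
    if (sv->next_local == UINT32_MAX) {
        /* This subvolume used up its range; hand it a fresh prefix. */
        sv->prefix = ++next_free_prefix;
        sv->next_local = 0;
    }
    return ((uint64_t)sv->prefix << 32) | sv->next_local++;
}
```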
Posted Aug 23, 2021 22:43 UTC (Mon)
by willy (subscriber, #9762)
[Link]
Posted Aug 23, 2021 22:50 UTC (Mon)
by zblaxell (subscriber, #26385)
[Link]
If we know the highest-numbered subvol on the filesystem (which is a trivial tree lookup at mount time) and we use bit-swap instead of byte-swap, then we know which bits are subvol ID and which are inode (all bits that are not subvol ID are inode ID), so we have a nice pair of O(1) bidirectional conversion functions. We can also know when subvol and inode might potentially collide (it's not possible as long as the number of bits needed for the highest subvol ID and the highest inode do not total more than 64, but you probably want warnings around 56 or so).
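A sketch of those O(1) conversions (illustrative only, with made-up names), assuming s_bits are reserved for the subvolume ID and the remaining bits for the inode, with 0 < s_bits < 64:

```c
#include <stdint.h>

/* Reverse the bits of a 64-bit value (simple loop, illustration only). */
static uint64_t bitrev64(uint64_t v)
{
    uint64_t r = 0;

    for (int i = 0; i < 64; i++) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

/*
 * With s_bits = bits needed for the highest subvolume ID (known at mount
 * time), the reversed subvolume ID occupies the top s_bits of the result
 * and the inode the rest, so both directions are O(1).
 */
static uint64_t encode(uint64_t subvol, uint64_t ino)
{
    return bitrev64(subvol) | ino;
}

static void decode(uint64_t combined, unsigned int s_bits,
                   uint64_t *subvol, uint64_t *ino)
{
    *subvol = bitrev64(combined) & ((1ULL << s_bits) - 1);
    *ino = combined & (~0ULL >> s_bits);
}
```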
If you ran out of inode bits in a subvol then you'd need a lookup table to map a discontiguous range of inodes to subvols. That table would require a disk format change, but most users will never occupy enough bits to need it (it will take decades, creating thousands of inodes every second and thousands of snapshots per day, to make the numbers bump). It could be created lazily when the free bits run out, but if that takes 20 years to happen then that code isn't going to be very well tested.
Alternatively btrfs could in the future do garbage collection to free up old object ID numbers, i.e. start at the highest inodes and pack them into the lowest-numbered available inode slots, and stop when it had freed up enough top bits. That wouldn't require an on-disk format change, it would just be a maintenance task to run at regular intervals, say, once every 15 years. This is roughly equivalent to creating an empty subvol and using 'cp -a --reflink' to move the data into files with smaller inode numbers, so if you are in really dire straits you don't need to wait for a special tool.
Posted Sep 12, 2021 19:41 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Another way of putting it: if you insist on stacking new bits on the front of an inode number for every new sort of thing that must be unique within a mount point (new btrfs subvolumes, new mount points within an NFS export, new this, new that), we can *never* have enough bits, because you can always add another layer of overlayfs or nfs exporting or whatever, and require more: and since most filesystems are using 64-bit inode numbers already, 64 bits is *never* enough to maintain guaranteed uniqueness in that space while adding more spaces as well on top.
(What saves us and lets us use kludges like the one in this article without disaster is that 64-bit spaces are, indeed, so large that we can just assume it is almost entirely empty and we can just pick more numbers at random, as long as they're not mostly-bits-zero or mostly-bits-1, and probably work nearly all the time, despite the birthday paradox. This is gross but probably good enough. I for one do not want a 128-bit ino_t flag day any time soon thankyouverymuch!)
Posted Aug 24, 2021 16:51 UTC (Tue)
by flussence (guest, #85566)
[Link]
It'd work in theory, but it'd be an amount of churn comparable to replacing 32-bit time_t, or IPv4 (even on a closed internal network that's a Sisyphean task).
Posted Aug 23, 2021 18:38 UTC (Mon)
by martin.langhoff (guest, #61417)
[Link] (3 responses)
A tough tradeoff it seems. Questions...
What's the fallout if the inodes are not unique? Given that large modern systems can be really large, inode collisions might be just a fact of life.
and... the solution is an intermediate "let's limit the repercussions on other software" solution. Sure. So then... _is there a clear correct way to check for unique inode that is sane, clear of collisions and portable (across filesystems)?
In other words, if I was a maintainer of a deduplicator utility, or developing the next version of NFS, and I'm alert enough to be reading this article, is there a clear way to DTRT?
While today we want to not break the world, we're also building tomorrow...
Posted Aug 23, 2021 22:04 UTC (Mon)
by neilbrown (subscriber, #359)
[Link] (2 responses)
This is unknowable in general - it depends on exactly what assumptions various applications make.
We know some specific problems.
There are probably others. However most code would never notice.
> So then... _is there a clear correct way to check for unique inode that is sane, clear of collisions and portable
Probably not. Even the current best-case behaviour of file-systems like ext4 does not provide the guarantees that I have described tar as requiring (it is possible I've misrepresented 'tar' - I haven't checked the code).
Tracking the identity of filesystems (to detect these mounts) is not well supported. st_dev is, as I say, transient for some filesystems. The statfs() system call reports an "fsid", but this is poorly specified. The man page for statfs() says "Nobody knows what f_fsid is supposed to contain". Some filesystems (btrfs, xfs, ext4 and others) provide good values. Other filesystems do less useful things. Some just provide st_dev in a different encoding.
Posted Aug 24, 2021 5:40 UTC (Tue)
by ibukanov (subscriber, #3942)
[Link] (1 responses)
Posted Aug 24, 2021 16:33 UTC (Tue)
by jonesmz (subscriber, #130234)
[Link]
Posted Aug 23, 2021 22:25 UTC (Mon)
by poc (subscriber, #47038)
[Link] (7 responses)
Posted Aug 24, 2021 4:17 UTC (Tue)
by Conan_Kudo (subscriber, #103240)
[Link] (6 responses)
Posted Aug 24, 2021 9:28 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (5 responses)
ZFS is a substantially different filesystem to btrfs and is not directly comparable. It doesn't have anything with comparable flexibility to btrfs subvols (which I want to call "subtrees").
Posted Aug 24, 2021 10:29 UTC (Tue)
by mtu (guest, #144375)
[Link] (3 responses)
Where btrfs has a flat list of "subvolumes", ZFS has a hierarchical tree of "datasets" (filesystems), each of which can have any number of read-only "snapshots"*, which can in turn be the basis of sparse copy-on-write "clones" that behave like datasets. Dataset properties (like mountpoints, compression and more advanced fs stuff like recordsize) are hereditary throughout the tree. Most dataset operations (like snapshotting, 'sending' into a flat bytestream or modifying properties) work recursively for any subtree.
In contrast, I feel that all that btrfs has to offer is: "Here's a flat list of a few hundred 'subvolumes', good luck keeping track and managing their properties and mountpoints, and try not to write to the ones you meant to keep as read-only snapshots."
* Concerning the matter at hand: Yes, snapshots are accessible through a dataset's ".zfs/" subdirectory (unless the feature is disabled for a given dataset). But they are usually only ever exposed by explicit "cd" or path addressing from a shell, and never pop up to confuse find, NFS, samba, du or any other userspace application working recursively—at least that's my experience on FreeBSD, which has always had intimate and seamless integration of ZFS. On Linux, that seems to be a different story: https://github.com/openzfs/zfs/issues/6154
Posted Aug 24, 2021 11:56 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (1 responses)
btrfs subvolumes are not a flat list; they can be created at any point in a normal directory hierarchy, including as children of other subvolumes. And all the other features you describe of ZFS snapshots work in btrfs, too, just with different tooling.
It sounds like you've encountered one set of tooling to manage btrfs subvolumes, and assumed that the limits of that tooling are the limits of btrfs.
Posted Aug 24, 2021 14:42 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link]
They were in 2008. Things have changed a little since then.
Posted Aug 24, 2021 16:17 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link]
btrfs snapshots are a lazy version of 'cp -a --reflink'. Users can drop a subvol anywhere in the filesystem and snapshot it anywhere else. This is part of the current problem--there isn't a single administrator-managed tree of subvols or snapshots, because ordinary applications can create and use subvols the same way they make directories (*). An existing NFS export can wake up one morning after an application software upgrade and suddenly find itself hosting a lot of subvols it didn't plan for. This proliferation of subvols is why the obvious solution (create distinct mount points for each and every subvol) isn't very popular (nor is the other obvious solution, lock down subvols so they aren't as trivial to use).
Unlike other popular snapshot systems, btrfs has no distinction between "base" and "clone" subvols. There is a notion of an "original" subvol and a "snapshot" subvol, but it's not part of the implementation, it's only a hint for administrators to label before-and-after snapshots for incremental send/receive. After a snapshot, both subvols are fully writable equal peers sharing ownership of their POSIX tree and data blocks, the same as if you had done cp -a --reflink atomically. Snapshots have a read-only bit that can be turned on or off (but turning it off means the subvol is no longer synchronized with copies on other filesystems, so it can't be used as a basis for incremental send/receive any more). You can chain snapshots (snap A to B, snap B to C, snap C to D...), with equal cost to write any subvol in the chain, and you can delete any of the snapshots in the chain with equal cost and without disrupting any other snapshot (other snapshot systems will have up to O(n) extra cost if there are n snapshots, or may not be able to delete the original subvol before deleting all snapshots). These properties greatly improve the usability of snapshots for applications since they can freely switch between treating them as subvol units or as individual files.
(*) If that seems weird, observe that a long time ago 'mkdir' required root privileges (**).
(**) OK there were different reasons for that. Still, ideas about what is "normal" for a filesystem and what is "privileged" do change over time.
Posted Aug 24, 2021 19:26 UTC (Tue)
by zev (subscriber, #88455)
[Link]
Posted Aug 24, 2021 6:21 UTC (Tue)
by eru (subscriber, #2753)
[Link] (3 responses)
Posted Aug 29, 2021 4:55 UTC (Sun)
by patrakov (subscriber, #97174)
[Link] (2 responses)
Posted Aug 31, 2021 13:44 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
The *obvious* fix is to make i-nodes unique at the root file-system level. It's not a problem to think of snapshots sharing i-nodes if they're sharing the same file ...
BUT. As soon as you break the link by modifying the file in one snapshot, that means you need to change the i-node number. No problem? Until the user has been using hardlinks to avoid having multiple identical copies of large files. You're now forcing "create temp files, copy over original" behaviour onto user space and that breaks hard links ...
I'm guessing there's plenty more problems where that came from, so where do we go from here? It ain't simple ...
Cheers,
Wol
Posted Aug 31, 2021 14:02 UTC (Tue)
by patrakov (subscriber, #97174)
[Link]
Posted Aug 24, 2021 7:01 UTC (Tue)
by mezcalero (subscriber, #45103)
[Link] (6 responses)
Note that Windows is way ahead there, and exposes such pretty-much-UUIDs-but-not-really for NTFS already: https://devblogs.microsoft.com/oldnewthing/20110228-00/?p... — Maybe it's time for Linux to just acknowledge that having such a universal 128bit ID is actually a really useful thing.
Note that btrfs documented that dirs that are subvolumes are recognizable by their inode nr 256. If they change that they'll break a good part of userspace (including systemd). But if they give me truly universally valid 128 bit IDs as replacement I'd be more than happy to fix the fallout – at least in my codebases – quickly.
(I think it would really make sense to add a flag returned by statx() that marks the subvolume dirs explicitly as subvolumes. Right now userspace is supposed to make the check "has this file BTRFS_SUPER_MAGIC and inode nr 256" which is just messy and requires two syscalls. i.e. STATX_ATTR_SUBVOLUME or FS_SUBVOLUME_FL would be great to have)
Lennart
Posted Aug 25, 2021 11:26 UTC (Wed)
by taladar (subscriber, #68407)
[Link] (5 responses)
I also don't really see the use case outside of filesystems like btrfs which do everything differently mainly to be different. It is not as if they couldn't have split up the 64bit available to them into two numbers that would be more than enough for the numbers of files you find on a filesystem times the number of subvolumes.
Posted Aug 25, 2021 21:02 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Aug 26, 2021 1:21 UTC (Thu)
by zblaxell (subscriber, #26385)
[Link] (2 responses)
It might be harder than it looks? So far btrfs, xfs, bcachefs, zfs, and overlayfs have all not done this.
bcachefs seems to have painted itself into the same corner as btrfs: 32-bit subvol ID, 64-bit inode, making a snapshot duplicates all existing inode numbers in the subvol. XFS experimented with subvols, gave up, and now recommends bcachefs instead. ZFS duplicates inode numbers--despite using only 48 bits of ino_t--and apologizes to no one.
Overlayfs takes up at least one bit of its own, which can interfere with any other filesystem's attempt to use all 64 bits of ino_t (indeed the btrfs NFS support patch reserves some bits for that). Overlayfs only does that sometimes--the rest of the time, it lets inode numbers from different lowerdirs collide freely.
Posted Aug 26, 2021 1:52 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (1 responses)
One reason that it is harder for btrfs is that btrfs never reuses inode numbers (well ... almost never).
So if you create a snapshot every minute you'll use 24bits of subvolume numbers in 31 years - even if you only keep a few around.
How long do we expect a filesystem to last for? 348 years is probably unrealistic - is 31?
64bits allows you to create one every microsecond and still survive for half a million years. That is much easier for a filesystem developer to live with.
I would like btrfs to re-use the numbers and impose these limits. This is far from straight forward. It is almost certainly technically possible without excessive cost (though with a non-zero cost). But it can be hard to motivate efforts to protect against uncertain future problems (.... I'm sure there is a well known example I could point to...).
Posted Sep 7, 2021 14:11 UTC (Tue)
by nye (subscriber, #51576)
[Link]
It's hard to make a direct comparison given the fundamental differences in the model of subvolumes vs ZFS' various dataset types, but FWIW, I have a running system - at home, so not exactly enterprise scale - where the total number of ZFS snapshots that have been made across filesystems/volumes in the pool over the last decade is probably around 15 million. Getting pretty close to 24 bits.
I don't know enough about btrfs to know if the equivalent setup to those filesystems and volumes would be based on some shared root there and competing for inodes, or entirely separate. I guess what that boils down to is that I don't know if the rough equivalent to a btrfs filesystem is a ZFS filesystem or a ZFS *pool*. Either way, once you're used to nearly-free snapshots, you can find yourself using a *lot* of them.
Posted Aug 27, 2021 6:10 UTC (Fri)
by mezcalero (subscriber, #45103)
[Link]
Lennart
Posted Aug 24, 2021 16:16 UTC (Tue)
by wazoox (subscriber, #69624)
[Link] (9 responses)
"As such, if you want a performant, scalable, robust snapshotting
Posted Aug 24, 2021 18:00 UTC (Tue)
by sub2LWN (subscriber, #134200)
[Link]
Posted Aug 24, 2021 18:22 UTC (Tue)
by mbunkus (subscriber, #87248)
[Link] (1 responses)
Do you have any insight into their plans, timeframes, goals for getting it into mainline? I see that an unsuccessful attempt was made in December 2020, but after that… not easy to find more information for an outsider like me.
Posted Aug 25, 2021 14:52 UTC (Wed)
by wazoox (subscriber, #69624)
[Link]
There was some problem, then no news... OTOH at the time snapshots weren't even functional.
Posted Aug 24, 2021 23:07 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link]
Quote from bcachefs.org:
I find all the worst-case O(N) searching for N snapshots in the design doc concerning.
This is what the bcachefs 'snapshot' branch does today:
Posted Aug 26, 2021 5:47 UTC (Thu)
by dgc (subscriber, #6611)
[Link] (4 responses)
https://lore.kernel.org/linux-btrfs/20210121222051.GB4626...
From a filesystem design perspective, COW metadata creates really nasty write amplification and memory footprint problems for pure metadata updates such as updating object reference counts during a snapshot. First the tree has to be stabilised (run all pending metadata COW and write it back), then a reference count update has to be run which then COWs every metadata block in the currently referenced metadata root tree. The metadata amplification is such that with enough previous snapshots, a new snapshot with just a few tens of MB of changed user data can amplify into 10s of GB of internal metadata COW.....
That explained why user data write COW performance on btrfs degraded quickly as snapshot count increases on this specific stress workload (1000 snapshots w/ 10,000 random 4kB overwrites per snapshot, so 40GB of total user data written)
In comparison, dm-snapshot performance on this workload is deterministic and constant as snapshot count increases, same as bcachefs. Bcachefs performed small COW 5x faster than dm-snapshot (largely due to dm-snapshot write amplification due to 64kB minimum COW block size). At 1 snapshot, btrfs COW is about 80% the speed of bcachefs. At 10 snapshots, bcachefs and dm-snapshot performance is unchanged and btrfs has degraded to about the same speed as dm-snapshot. At 100 snapshots, btrfs is bouncing between 1-5% the sustained user data write speed of bcachefs, and less than a quarter of the speed of dm-snapshot, and it doesn't regain any of the original performance as the snapshot count increases further.
That can be seen in workload runtimes - it ran in 20 minutes on bcachefs with each snapshot taking less than 30ms. It ran in about 40 minutes on dm-snapshot, with each snapshot taking less than 20ms. It took 5 hours for XFS+loopback+reflink to run (basically the XFS subvol architecture as a 10 line shell hack) because reflink on an image file with 2 million extents takes ~15s. It took about 9 hours for btrfs to run - a combination of slow user IO (sometimes only getting only *200* 4kB write IOPS from fio for minutes at a time) and the time to run the btrfs snapshot command increasing linearly with snapshot count, taking ~70s to run by the 1000th snapshot.
Sustained IO rates under that workload: bcachefs ~200 write IOPS, 100MB/s. XFS+reflink: ~15k write IOPS, 60MB/s. dm-snapshot: ~10k/10k read/write IOPS, 650/650 read/write MB/s. btrfs: 10-150k write IOPS, 5-10k read IOPS, 0.5-3.2GB/s write, 50MB/s read (9 hours averaging over 1GB/s write will make a serious dent in the production lifetime of most SSDs)
Write amplification as a factor of storage capacity used by that workload: bcachefs: 1.02 xfs+loop+reflink: 1.1 btrfs: ~4.5 dm-snapshot: 17 (because 64kB/4KB = minimum 16x write amplification for every random 4kB IO)
memory footprint: bcachefs: ~2GB. XFS+reflink: ~2.5GB. dm-snapshot: ~2.5GB. btrfs: Used all of the 16GB of RAM and was swapping, writeback throttling on both the root device (swap) and the target device (btrfs IO), userspace was getting blocked for tens of seconds at a time waiting on memory reclaim, swap, IO throttling, etc.
Sure, it's a worst case workload, but the point of running "worst case" workloads is finding out how the implementation handles those situations. It's the "worst case" workloads that generate all the customer support and escalation pain for engineering teams that have to make those subsystems work for their customers. Given that btrfs falls completely apart and makes the machine barely usable in scenarios that bcachefs does not even blink at, it's a fair indication of which filesystem architecture handles stress and adverse conditions/workloads better.
bcachefs also scales better than btrfs. btrfs *still* has major problems with btree lock contention. Even when you separate the namespace btrees by directing threads to different subvolumes, it just moves the lock contention to the next btree in the stack - which IIRC is the global chunk allocation btree. I first reported these scalability problems with btrfs over a decade ago, and it's never been addressed. IOWs, btrfs still generally shows the same negative scaling at concurrency levels as low as 4 threads (i.e. 4 threads is slower than 1 thread, despite burning 4 CPUs trying to do work) as it did a decade ago. In comparison, bcachefs concurrency under the same workloads and without using any subvolume tricks ends up scaling similarly to ext4 (i.e. limited by VFS inode cache hash locking at ~8 threads and 4-6x the performance of a single thread).
I can go on, but I've got lots of numbers from many different workloads that basically say the same thing - if you have a sustained IO and/or concurrency in your workload, btrfs ends up at the bottom of the pack for many important metrics - IO behaviour, filesystem memory footprint, CPU efficiency, scalability, average latency, long tail latency, etc. In some cases, btrfs is a *long* way behind the pack. And the comparison only gets worse for btrfs if you start to throw fsync() operations into the workload mix....
I'm definitely not saying that bcachefs is perfect - far from it - but I am using bcachefs as a baseline to demonstrate that the poor performance and scalability of btrfs isn't "just what you get from COW filesystems". Competition is good - bcachefs shows that a properly designed and architected COW filesystem can perform extremely well under what are typically called "adverse workload conditions" for COW filesystems. As such, my testing really only serves to highlight the deficiencies in existing upstream snapshot solutions, and so...
"As such, if you want a performant, scalable, robust snapshotting
-Dave.
Posted Aug 30, 2021 9:25 UTC (Mon)
by jezuch (subscriber, #52988)
[Link] (3 responses)
Posted Aug 31, 2021 1:27 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link]
Posted Sep 3, 2021 2:06 UTC (Fri)
by flussence (guest, #85566)
[Link] (1 responses)
Posted Sep 7, 2021 14:31 UTC (Tue)
by nye (subscriber, #51576)
[Link]
> I can't say very active though, we are working on spare time. Recently, we are working for snapshot prototype, and inode container improvement
I'd say the last time it looked even vaguely healthy was 2014, and even that was after a couple of very light years, so I think it is probably never going to see the light of day, sadly.
Posted Aug 26, 2021 15:10 UTC (Thu)
by josefbacik (subscriber, #90083)
[Link] (5 responses)
The st_dev thing is unfortunate, but again is the result of a lack of interfaces. Very early on we had problems with rsync wandering into snapshots and copying loads of stuff. Find as well would get tripped up. The way these tools figure out if they've wandered into another file system is if the st_dev is different. So we made st_dev different for every subvolume. Is this a great solution? Absolutely not. Is there another option? Not really, this is how userspace interacts with the kernel, so we compromised in order to make userspace work well.
The next problem is that every subvolume has the same start inode number. The subvolume id is clearly different, but that '.' stat is going to be the same value for any subvolume. Again we need a way to tell an application that this is the subvolume root. Is this the best way forward? Absolutely not. Is there another option? Not really.
What we need is an interface to give userspace more information. I've suggested exporting UUID's via statx. We have a file system wide UUID and then we have per-subvolume UUID's. This is relatively straightforward and gives userspace a whole lot of information. They can tell if two subvolumes are on the same file system, and they can tell that they are in two different subvolumes.
Another solution would be to simply export the subvolume ID via statx, as that's just another u64. That is how we deal with NFS file handles, we build them with the subvolume id + the inode number. That combination is completely unique and is everything you need to find the inode. Now this doesn't solve the problem of figuring out if two different subvolumes are the same file system. This is why I think the UUID is more valuable. We could add yet another st_sbdev or something to export the device for the file system itself if we wanted to stick with the device number scheme, and then we would have everything we need to get all the information we would want out of the file system.
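Conceptually, the handle-based identity check looks like the sketch below; the real Btrfs NFS file handle has a different (and larger) layout, so this is only a model of the idea described here:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Conceptual model only: the real Btrfs NFS file handle has a different
 * layout, but the idea is the same -- the subvolume (tree) ID plus the
 * inode number is enough to pin down a unique object.
 */
struct example_btrfs_handle {
    uint64_t subvol_id;   /* which b-tree (subvolume) holds the inode */
    uint64_t ino;         /* inode number within that subvolume */
};

static bool same_object(const struct example_btrfs_handle *a,
                        const struct example_btrfs_handle *b)
{
    return a->subvol_id == b->subvol_id && a->ino == b->ino;
}
```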
Btrfs was the first file system to do this, and we did it in a system that didn't envision this type of architecture. Because of that we had to make certain interface decisions to get the best outcome possible for userspace, as that is the _ONLY_ thing that matters. It doesn't matter how we organize ourselves in the kernel, because the kernel doesn't operate in a vacuum. It provides userspace the ability to do actual work, and as such we are confined to the interfaces that exist. Extending those interfaces is the only sane way forward, because we cannot un-ring the bell of the choices we've already made.
Posted Aug 26, 2021 23:22 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (3 responses)
Indeed, we cannot. I wonder if we can learn anything about future choices. And whether we can repair the current situation.
Modifying the st_dev was, in retrospect, an unfortunate choice. But it was also an easy choice - quickly providing a solution.
Another approach - which would have been more work and taken longer - would have been to work with rsync, find, etc to find a solution. One might have been to change them to check statfs().f_fsid instead of st_dev. This has the advantage that f_fsid is already available, but poorly specified and not widely used. That means it is less brittle and using it is less likely to break things.
This is, I think, an instance of the much broader "platform problem". It always seems easier to work-around weaknesses in the platform, rather than push for changes in the platform. But the long-term benefits come from doing the early work (painful though it may be) and improve the platform.
But note that there are two distinct problems here:
1 - the platform provides no way to identify a subtree within a filesystem (project-id is close, but not quite the same)
2 - the platform limits inode numbers to 64 bits (in any given filesystem)
Adding subtree information in statx addresses '1' and could be used by find and rsync as needed. But that doesn't address '2'.
While I like the use of uuids for filesystems (and wish f_fsid was 128bits instead of 64), I don't think they are such a good idea for files within a filesystem. tree-id + file-id + generation should always be enough and while 128bit might be a good size for that, forcing them into UUID format doesn't seem to add value.
I seem to recall that when 'statx' was being proposed, lots of people had lots of ideas about extra things to add. The decision was to not add anything new at first. So if we want to add things now we need to make a strong case, and demonstrate at least one application that will immediately use the information. That would be a lot easier if we had a concrete problem to fix. The NFS issue is a concrete problem, but it doesn't actually require a user-space API change, so it is hard to use it as a lever to extend statx()....
Posted Aug 29, 2021 9:22 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Why is it necessary to use something that has the potential for collisions at all? Why not just hand out arbitrary or sequential numbers in a centralized fashion (like every other filesystem that isn't FAT)? Is there some rule that says you're not allowed to look at subvolume X when you make a new file in subvolume Y? Why would such a rule be necessary?
Posted Aug 29, 2021 12:31 UTC (Sun)
by foom (subscriber, #14868)
[Link]
Posted Aug 30, 2021 15:05 UTC (Mon)
by zblaxell (subscriber, #26385)
[Link]
To get globally unique and stable inode numbers without a separate subvol ID, the filesystem would have to dynamically remap duplicate inode numbers from subvol-local values to globally-unique values every time a readdir() or stat() happened. This adds some overhead to all read operations that filesystem maintainers are reluctant to implement. They would prefer some more efficient way to tell an application "this is a distinct inode number namespace but not a distinct filesystem" so that applications that rely on the uniqueness feature can bear some of the costs (including opportunity costs) of implementing it, while not imposing new costs (such as new O(log(N)) search costs on every stat(), or exploding /proc/mounts and `df` output size) on applications that don't care about inode uniqueness.
The NFS server could maintain its own persistent unique inode numbers in a mapping table outside of the filesystem, and not send the filesystem's inode numbers to clients at all, but that has obvious and onerous runtime costs (the NFS server would have to maintain persistent state proportional to filesystem size).
Posted Sep 12, 2021 9:23 UTC (Sun)
by walex (guest, #69836)
[Link]
«This is a problem of interfaces. Btrfs has subvolumes, which are just their own b-tree, and the inode numbers are just a value inside of that tree. Since they are different trees you can share inode numbers across multiple trees. However each tree has their own unique ID.» This seems a weak defense of the current Btrfs situation, and it is based on a misrepresentation of UNIX filesystem semantics, regardless of the details of the API: in particular there is no obligation for mounted root directories of filesystem instances to be registered in '/etc/fstab' or anywhere else. Each Btrfs or ZFS subvolume or snapshot is just a different root directory in a filesystem instance, so Btrfs (and ZFS, and soon 'bcachefs') respect all these properties, well written NFS servers have no problems with them, and therefore there is simply no issue with them and NFS, and this whole discussion is pointless.
Posted Sep 12, 2021 19:36 UTC (Sun)
by nix (subscriber, #2304)
[Link]
That's not the only place that term is used! This is probably the only time in history that any component of Emacs has ever been compared to a professional wrestler (except for its weight and sumo wrestlers in particular): https://www.emacswiki.org/emacs/uniquify
The Btrfs inode-number epic (part 2: solutions)
- if a directory has the same inode number as an ancestor, find/du etc will refuse to enter that directory.
- if a 'tar' archive is being created of a tree, and two *different* files both have multiple links and both have the same inode number, then the second one found will not be included in the archive (I *think* tar doesn't track inode numbers for dirs or for objects with only one link).
- Other tools that collect files, like rsync and cpio, will have similar problems.
- various tools probably cache a dev/ino against a name, and if a subsequent stat shows that same dev/ino, they assume it is the same object. So if a given name referred to two different inodes over time, which happen to have the same inode number, such tools would behave incorrectly. (all these are unlikely with my overlay scheme - this one more so than most).
The "compare st_dev and st_ino" approach is only completely reliable when you have both files open. If you don't, it is possible for the first file to be deleted after you 'stat' it, and then for the second file to be created with the same inode number.
Use of "ctime" or even "btime" where supported, would help here.
So comparing dev, ino, and btime should be sufficient providing btime is supported. Almost.
Another possible (though unlikely) problem is that these objects might be on auto-mounted filesystems.
If you stat a file, get busy with something else and the filesystem gets unmounted, then some other filesystem gets mounted, the second filesystem *might* get the same st_dev as the first filesystem. So if you then stat a file on the new filesystem, it could be a completely different file on a different filesystem, but might have the same st_dev, st_ino, and st_btime.
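A sketch of that dev+ino+btime comparison using statx() (illustrative only; as noted above, it narrows but does not close the races):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Sketch of the dev+ino+btime comparison described above, using statx()
 * so the birth time is available where the filesystem supports it.
 */
static bool probably_same_file(const char *path_a, const char *path_b)
{
    struct statx a, b;

    if (statx(AT_FDCWD, path_a, 0, STATX_BASIC_STATS | STATX_BTIME, &a) ||
        statx(AT_FDCWD, path_b, 0, STATX_BASIC_STATS | STATX_BTIME, &b))
        return false;
    if (a.stx_dev_major != b.stx_dev_major ||
        a.stx_dev_minor != b.stx_dev_minor ||
        a.stx_ino != b.stx_ino)
        return false;
    /* Only trust btime if both files actually reported one. */
    if ((a.stx_mask & STATX_BTIME) && (b.stx_mask & STATX_BTIME))
        return a.stx_btime.tv_sec == b.stx_btime.tv_sec &&
               a.stx_btime.tv_nsec == b.stx_btime.tv_nsec;
    return true;    /* dev+ino matched; no btime to cross-check */
}
```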
The Btrfs inode-number epic (part 2: solutions)
I'm not exactly certain if ZFS has this issue, but the ZFS software suite has its own SMB and NFS server implementations, which means that the issues that the Linux NFS server has do not matter for ZFS users, who typically use the ZFS NFS implementation instead.
The Btrfs inode-number epic (part 2: solutions)
ZFS doesn't have an NFS server. It has a bunch of user-space tools which provide a standard 'zfs' interface to the NFS support in the host kernel. On Linux it uses the Linux kernel NFS server.
There is a "main" filesystem, which uses 48-bit inode numbers, and uses a fairly traditional NFS filehandle with the inode number and generation number.
Then there are "snapshots" under ".zfs/snapshot". The filehandle for objects in a snapshot have another 48bit number, presumably to identify which snapshot.
I don't *know" what inode number is presented to stat(), but I wouldn't be at all surprised to find that objects in .zfs/snapshot have the SAME inode number as the corresponding object in the main filesystem.
If you ask some tool like tar to look at the main filesystem as well as a snapshot, it might get confused. But them, I suspect it is really quite easy to avoid doing that.
With btrfs, subtree *can* be used as snapshots, but they can be used for other purposes too, and they can appear anywhere in the filesystem. With that extra flexibility comes extra responsibility....
The Btrfs inode-number epic (part 2: solutions)
I don't *know" what inode number is presented to stat(), but I wouldn't be at all surprised to find that objects in .zfs/snapshot have the SAME inode number as the corresponding object in the main filesystem.
If you ask some tool like tar to look at the main filesystem as well as a snapshot, it might get confused. But them, I suspect it is really quite easy to avoid doing that.
From a quick check, it appears that zfs does indeed present the same inode numbers in snapshots as it does in the main filesystem (on FreeBSD, anyway, though presumably on other kernels as well). Though yes, as hinted at elsewhere, it omits the magic .zfs directories from directory listings (getdents, etc.), so you'll only ever end up referencing a path in a snapshot if you really ask for it; basic directory recursion by find, tar, etc. will skip right over it without ever knowing it's there.
The Btrfs inode-number epic (part 2: solutions)
Nor does it reuse subvolume numbers.
If you create 100 new files per second, you'll use 40 bits of inode numbers in 348 years - no matter how many you keep.
These creation rates are high. Are they unrealistically high? Maybe.
If you were a filesystem developer, would you feel comfortable limiting subvolumes to 24bits and inodes to 40 bits?
The Btrfs inode-number epic (part 2: solutions)
https://lkml.org/lkml/2020/10/27/3684
The Btrfs inode-number epic (part 2: solutions)
> The functionality and userspace interface for snapshots and subvolumes are roughly modelled after btrfs...
I wouldn't expect anything different. For over a decade, btrfs has had the only viable implementation of this interface to build on in Linux. Even if other filesystems implement subvols and snapshots, they'll be strongly compelled to follow whatever trail btrfs blazes for them now.
# bcachefs subvolume create foo
# date > foo/bar
# bcachefs subvolume snapshot foo quux
# find -ls
4096 0 drwxr-xr-x 3 root root 0 Aug 24 18:40 .
4098 0 drwxr-xr-x 2 root root 0 Aug 24 18:40 ./foo
4099 1 -rw-r--r-- 1 root root 29 Aug 24 18:40 ./foo/bar
4098 0 drwxr-xr-x 2 root root 0 Aug 24 18:40 ./quux
4099 1 -rw-r--r-- 1 root root 29 Aug 24 18:40 ./quux/bar
4097 0 drwx------ 2 root root 0 Aug 24 18:40 ./lost+found
# stat foo/bar quux/bar
File: foo/bar
Size: 29 Blocks: 1 IO Block: 512 regular file
Device: fd04h/64772d Inode: 4099 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-08-24 18:40:37.278101823 -0400
Modify: 2021-08-24 18:40:37.290101816 -0400
Change: 2021-08-24 18:40:37.290101816 -0400
Birth: -
File: quux/bar
Size: 29 Blocks: 1 IO Block: 512 regular file
Device: fd04h/64772d Inode: 4099 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-08-24 18:40:37.278101823 -0400
Modify: 2021-08-24 18:40:37.290101816 -0400
Change: 2021-08-24 18:40:37.290101816 -0400
Birth: -
Duplicate st_dev and st_ino, it's worse than btrfs. On the other hand:
# date > foo/second1
# date > quux/second2
# ls -li */second*
4100 -rw-r--r-- 1 root root 29 Aug 24 18:44 foo/second1
4101 -rw-r--r-- 1 root root 29 Aug 24 18:44 quux/second2
bcachefs will always give new files unique inode numbers, even in different subvols, because the code for creating a new file obtains a globally unique inode number. Possible point for bcachefs here--in this situation, btrfs uses a per-subvol inode number allocator, which would have given both new files inode 4100.
The Btrfs inode-number epic (part 2: solutions)
ZFS (and others I think) address a similar problem by hiding things from readdir(). This might work adequately with a fixed name like ".zfs". It wouldn't work for btrfs which allows any name to be used for a subvolume.
Addressing '2' requires a realistic assessment of how many bits are really needed to identify all possible objects. I think 64 is actually enough for the foreseeable future, providing they are used wisely. Setting i_ino to a strong hash of whatever value the filesystem uses internally to find a file is a tempting idea. My last proposed solution for the NFS problem is to use a weak hash (xor with bit shift). Maybe we should use a strong hash instead.