The Btrfs inode-number epic (part 2: solutions)

By Jonathan Corbet
August 23, 2021

The first installment in this two-part series looked at the difficulties that arise when Btrfs filesystems containing subvolumes are exported via NFS. Btrfs has a couple of quirks that complicate life in this situation: the use of separate device numbers for subvolumes and the lack of unique inode numbers across the filesystem as a whole. Recently, Neil Brown set off on an effort to try to solve these problems, only to discover that the situation was even more difficult than expected and that many attempts would be required.

Take 1: internal mounts

Brown's first patch set attempted to resolve these problems by creating the concept of "internal mounts"; these would be automatically created by Btrfs for each visible subvolume. The automount mechanism is used to make those mounts appear. With a bit of tweaking, the kernel's NFS server daemon (nfsd) can recognize these special mounts and allow them to be crossed without an explicit mount on the remote side. With this setup, the device numbers shown by the system are as expected, and inode numbers are once again unique within a given mount.

At a first glance, this patch set seemed like a good solution to the problem. When presented with a description of this approach back in July, filesystem developer Christoph Hellwig responded: "This is what I've been asking for for years". With these changes, Btrfs appears to be a bit less weird and some longstanding problems are finally resolved.

This patch set quickly ran into trouble, though. Al Viro pointed out that the mechanism for querying device numbers could generate I/O while holding a lock that does not allow for such actions, thus deadlocking the system; without that query, though, the scheme for getting the device number from the filesystem will not work. One potential alternative, providing a separate superblock for each internal mount that would contain the needed information, is even worse. Many operations in the kernel's virtual filesystem layer involve iterating through the full list of mounted superblocks; adding thousands of them for Btrfs subvolumes would create a number of new performance problems that would take massive changes to fix.

Additionally, Amir Goldstein noted that the new mount structure could create trouble for overlayfs; it would also break some of his user-space tools. There is also the little issue of how all those internal mounts would show up in /proc/mounts; on systems with large numbers of subvolumes, that would turn /proc/mounts into a huge, unwieldy mess that could also expose the names of otherwise private subvolumes.

Take 2: file handles

Brown concluded that "with its current framing the problem is unsolvable". Specifically, the problem is the 64 bits set aside for the inode number, which are not enough for Btrfs even now. The problem gets worse with overlayfs, which must combine inode numbers from multiple filesystems, yielding something that is necessarily larger than any one filesystem's numbers. Brown described the current solution in overlayfs as "it over-loads the high bits and hopes the filesystem doesn't use them", which seems less than fully ideal. But, as long as inode numbers are limited to any fixed size, there is no way around the problem, he said.

It would be better, he continued, to use the file handle provided by many filesystems, primarily for use with NFS; a file's handle can be obtained with name_to_handle_at(). The handle is of arbitrary length, and it includes a generation number, which handily gets around the problems of inode-number reuse when a file is deleted. If user space were to use handles rather than inode numbers to check whether two files are the same, a lot of problems would go away.

Of course, some new problems would also materialize, mostly in the form of the need to make a lot of changes to user-space interfaces and programs. No files exported by the kernel (/proc files, for example) use handles now, so a set of new files that included the handles would have to be created. Any program that looks at inode numbers would have to be updated. The result would be a lot of broken user-space tools. Brown has repeatedly insisted that breaking things may be possible (and necessary):

If you refuse to risk breaking anything, then you cannot make progress. Providing people can choose when things break, and have advanced warning, they often cope remarkable well.

Incompatible changes remain a hard sell, though. Beyond that, to get the full benefit from the change, Btrfs would have to be changed to stop using artificial device numbers for subvolumes, which is not a small change either. And, as Viro pointed out, it is possible for two different file handles to refer to the same file.

In summary, this approach did not win the day either.

Take 3: mount options

Brown's third attempt approached the problem from a different direction, making all of the changes explicitly opt-in. Specifically, he added two new mount options for Btrfs filesystems that would change their behavior with regard to inode and device numbers.

The first option, inumbits=, changes how inode numbers are presented; the default value of zero causes the internal object ID to be used (as is currently the case for Btrfs). A non-zero value tells Btrfs to generate inode numbers that are "mostly unique" and that fit into the given number of bits. Specifically, to generate the inode number for a given object within a subvolume, Btrfs will:

Generate an "overlay" value from the subvolume number; this is done by byte-swapping the number so that the low-order bits (which vary the most between subvolumes) are in the most-significant bit positions.
The overlay is right-shifted to fit within the number of bits specified by inumbits=. If that number is 64, no shift need be done.
That overlay value is then XORed with the object number to produce the inode number presented to user space.

The resulting inode numbers will still be unique within any given subvolume; collisions within a large Btrfs filesystem can still happen, but they are less likely than before. Setting inumbits=64 minimizes the chances of duplicate inode numbers, but a lower number (such as 56) may make sense in situations (such as when overlayfs is in use) where the top bits are used by other subsystems.

The second mount option is numdevs=; it controls how many device numbers are used to represent subvolumes within the filesystem. The default value, numdevs=many, preserves the existing behavior of allocating a separate device number for every subvolume. Setting numdevs=1, instead, causes a single device number to be used for all subvolumes. When a filesystem is mounted with this option, tools like find and du will not be able to detect the crossing of a subvolume boundary, so their options to stay within a single filesystem may not work as expected. It is also possible to specify numdevs=2, which causes two device numbers to be used in an alternating manner when moving from one subvolume to the next; this makes tools like find work as expected.

Finally, this patch set also added the concept of a "tree ID" that can be fetched with the statx() system call. Btrfs would respond to that query with the subvolume ID, which applications could then use to reliably determine whether two files are contained within the same subvolume or not.

Btrfs developer Josef Bacik described this work as "a step in the right direction", but said that he wants to see a solution that does not require special mount options. "Mount options are messy, and are just going to lead to distros turning them on without understanding what's going on and then we have to support them forever". A proper solution, he said, does not present the possibility for users to make bad decisions. He suggested just using the new tree ID within nfsd to solve the NFS-specific problems, generating new inode numbers itself if need be.

Brown countered with a suggestion that, rather than adding mount options, he could just create a new filesystem type ("btrfs2 or maybe betrfs") that would use the new semantics. Bacik didn't like that idea either, though. Brown added that he would prefer not to do "magic transformations" of Btrfs inode numbers in nfsd; if a filesystem requires such operations, they should be done in the filesystem itself, he said. He then asked that the Btrfs developers make a decision on their preferred way to solve this problem, but did not get an answer.

Take 4: the uniquifier

On August 13, Brown returned with a minimal patch aimed at solving the NFS problems that started this whole discussion. It enables a filesystem to provide a "uniquifier" value associated with a file; this value, the name of which is arguably better suited to a professional wrestler, is only available within the kernel. The NFS server can then XOR that value with the file's inode number to obtain a number that is more likely to be unique. Btrfs provides the overlay value described above as this value; nfsd uses it, and the problem (mostly) goes away.

Bacik said that this approach was "reasonable" and acked it for the Btrfs filesystem. It thus looks like it could finally be a solution for the problem at hand. Or, at least, it's closer; Brown later realized that the changed inode numbers would create the dreaded "stale file handle" errors on existing clients when the transition happens. An updated version of the patch set adds a new flag in an unused byte of the NFS file handle to mark "new-style" inode numbers and prevent this error from occurring.

The second revision of the fourth attempt may indeed be the charm that makes some NFS-specific problems go away for Btrfs users. It is hard not to see this change — an internal process involving magic numbers that still is not guaranteed to create unique inode numbers — as a bit of a hack, though. Indeed, even Brown referred to "hacks to work around misfeatures in filesystems" when talking about this work. Hacks, though, can be part of life when dealing with production systems and large user bases; a cleaner and more correct solution may not be possible without breaking user systems. So the uniquifier may be as good as things can get until some other problem is severe enough to force the acceptance of a more disruptive solution.

Index entries for this article
Kernel	Filesystems/Btrfs

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 23, 2021 16:48 UTC (Mon) by jkingweb (subscriber, #113039) [Link]

A fascinating series, indeed. Thank you for this, intrepid editor!

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 23, 2021 18:15 UTC (Mon) by ibukanov (subscriber, #3942) [Link] (6 responses)

Why not to switch inode numbers to 128 bits? It is significantly less drastic than using handles while in practice is enough to use various random generators and hashes that can produce unique numbers on demand.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 23, 2021 18:23 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

Wouldn't that need *another* `_LARGE_FILE_SUPPORT` macro detection round during software configuration to move? Or would it be a `statx`-only change? If so, how would `stat` be expected to expose this information via `ino_t`?

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 23, 2021 19:43 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (3 responses)

64-bit inodes already give you room for ~18 quintillion inodes per filesystem. You do not actually need that many, or if you do, something has gone terribly wrong.

Sure, you can throw more bits at the problem, but you're just treating the symptoms. The real issue here is not "we don't have enough bits." It's "we can't agree on exactly how those bits should be allocated." One possibility: btrfs might decide to have unique inodes over the whole filesystem, and that would likely be challenging but technically possible (for example, when you create a subvolume, you allocate a new 32-bit inode prefix to that subvolume, and whenever any subvolume runs out of inode numbers, you give it another 32-bit prefix - since each prefix contains ~4 billion inode numbers, this allocation should happen rather infrequently, and since there are ~4 billion possible prefix values, the large size of these allocations will not easily cause a shortage).

But I doubt you can actually do that and still maintain on-disk compatibility with existing btrfs filesystems. Oh well.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 23, 2021 22:43 UTC (Mon) by willy (subscriber, #9762) [Link]

I would go further. Allocate in groups of 2^16. That way we run out of space in a group frequently and test the "allocate new prefix" path every day instead of once every dozen years.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 23, 2021 22:50 UTC (Mon) by zblaxell (subscriber, #26385) [Link]

This is a variation on the swab64() strategy. Unless I've missed something, it doesn't require any on-disk format changes in most cases. The NFS server can already do it now, so there's no reason why btrfs couldn't do it itself. You'd use a mount option that says "crush all my inodes into one 64-bit namespace" and there would be a corresponding loss of maximum filesystem size/age. Internally the filesystem would still use separate subvol and inode, so you could remove the mount option and get the old behavior again.

If we know the highest-numbered subvol on the filesystem (which is a trivial tree lookup at mount time) and we use bit-swap instead of byte-swap, then we know which bits are subvol ID and which are inode (all bits that are not subvol ID are inode ID), so we have a nice pair of O(1) bidirectional conversion functions. We can also know when subvol and inode might potentially collide (it's not possible as long as the number of bits needed for the highest subvol ID and the highest inode do not total more than 64, but you probably want warnings around 56 or so).

If you ran out of inode bits in a subvol then you'd need a lookup table to map a discontiguous range of inodes to subvols. That table would require a disk format change, but most users will never occupy enough bits to need it (it will take decades, creating thousands of inodes every second and thousands of snapshots per day, to make the numbers bump). It could be created lazily when the free bits run out, but if that takes 20 years to happen then that code isn't going to be very well tested.

Alternatively btrfs could in the future do garbage collection to free up old object ID numbers, i.e. start at the highest inodes and pack them into the lowest-numbered available inode slots, and stop when it had freed up enough top bits. That wouldn't require an on-disk format change, it would just be a maintenance task to run at regular intervals, say, once every 15 years. This is roughly equivalent to creating an empty subvol and using 'cp -a --reflink' to move the data into files with smaller inode numbers, so if you are in really dire straits you don't need to wait for a special tool.

The Btrfs inode-number epic (part 2: solutions)

Posted Sep 12, 2021 19:41 UTC (Sun) by nix (subscriber, #2304) [Link]

> The real issue here is not "we don't have enough bits." It's "we can't agree on exactly how those bits should be allocated."

Another way of putting it: if you insist on stacking new bits on the front of an inode number for every new sort of thing that must be unique within a mount point (new btrfs subvolumes, new mount points within an NFS export, new this, new that), we can *never* have enough bits, because you can always add another layer of overlayfs or nfs exporting or whatever, and require more: and since most filesystems are using 64-bit inode numbers already, 64 bits is *never* enough to maintain guaranteed uniqueness in that space while adding more spaces as well on top.

(What saves us and lets us use kludges like the one in this article without disaster is that 64-bit spaces are, indeed, so large that we can just assume it is almost entirely empty and we can just pick more numbers at random, as long as they're not mostly-bits-zero or mostly-bits-1, and probably work nearly all the time, despite the birthday paradox. This is gross but probably good enough. I for one do not want a 128-bit ino_t flag day any time soon thankyouverymuch!)

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 16:51 UTC (Tue) by flussence (guest, #85566) [Link]

Everything seems to over time adopt UUIDs...

It'd work in theory, but it'd be an amount of churn comparable to replacing 32-bit time_t, or IPv4 (even on a closed internal network that's a Sisyphean task).

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 23, 2021 18:38 UTC (Mon) by martin.langhoff (guest, #61417) [Link] (3 responses)

> an internal process involving magic numbers that still is not guaranteed to create unique inode numbers

A tough tradeoff it seems. Questions...

What's the fallout if the inodes are not unique? Given that large modern systems can be really large, inode collisions might be just a fact of life.

and... the solution is an intermediate "let's limit the repercussions on other software" solution. Sure. So then... _is there a clear correct way to check for unique inode that is sane, clear of collisions and portable (across filesystems)?

In other words, if I was a maintainer of a deduplicator utility, or developing the next version of NFS, and I'm alert enough to be reading this article, is there a clear way to DTRT?

While today we want to not break the world, we're also building tomorrow...

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 23, 2021 22:04 UTC (Mon) by neilbrown (subscriber, #359) [Link] (2 responses)

> What's the fallout if the inodes are not unique?

This is unknowable in general - it depends on exactly what assumptions various applications make.

We know some specific problems.
- if a directory has the same inode number as an ancestor, find/du etc will refuse to enter that directory.
- if a 'tar' archive is being created of a tree, and two *different* files both have multiple links and both have the same inode number, then the second one found will not be included in the archive (I *think* tar doesn't track inode numbers for dirs or for objects with only one link).
- Other tools that collect files, like rsync and cpio, will have similar problems.
- various tools probably cache a dev/ino against a name, and if a subsequent stat shows that same dev/ino, they assume it is the same object. So if a given name referred to two different inodes over time, which happen to have the same inode number, such tools would behave incorrectly. (all these are unlikely with my overlay scheme - this one more so than most).

There are probably others. However most code would never notice.

> So then... _is there a clear correct way to check for unique inode that is sane, clear of collisions and portable

Probably not. Even the current best-case behaviour of file-systems like ext4 does not provide the guarantees that I have described tar as requiring (it is possible I've misrepresented 'tar' - I haven't checked the code).
The "compare st_dev and st_ino" approach is only completely reliable when you have both files open. If you don't, it is possible for the first file to be deleted after you 'stat' it, and then for the second file to be created with the same inode number.
Use of "ctime" or even "btime" where supported, would help here.
So comparting dev, ino, and btime should be sufficient providing btime is supported. Almost.
Another possible (though unlikely) problem is that these objects might be on auto-mounted filesystems.
If you stat a file, get busy with something else and the filesystem gets unmounted, then some other filesystem gets mounted, the second filesystem *might* get the same st_dev as the first filesystem. So if you then stat a file on the new filesystem, it could be a completely different file on a different filesystem, but might have the same st_dev_ st_ino, and st_btime.

Tracking the identity of filesystems (to detect these mounts) is not well supported. st_dev is, as I say, transient for some filesystems. The statfs() systemcall reports an "fsid", but this is poorly specified. The man page for statfs() says "Nobody knows what f_fsid is supposed to contain". Some filesystems (btrfs, xfs, ext4 and others) provide good values. Other filesystems do less useful things. Some just provide st_dev in a different encoding.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 5:40 UTC (Tue) by ibukanov (subscriber, #3942) [Link] (1 responses)

With 128-bit inode and device numbers one can guarantee that for any practical purposes they are unique and never reused.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 16:33 UTC (Tue) by jonesmz (subscriber, #130234) [Link]

Would you propose using UUIDs specifically? Or just an arbitrary / implementation defined 128bit value?

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 23, 2021 22:25 UTC (Mon) by poc (subscriber, #47038) [Link] (7 responses)

Just as a matter of interest, does ZFS suffer from this problem, and if not, why not?

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 4:17 UTC (Tue) by Conan_Kudo (subscriber, #103240) [Link] (6 responses)

I'm not exactly certain if ZFS has this issue, but the ZFS software suite has its own SMB and NFS server implementations, which means that the issues that the Linux NFS server has do not matter for ZFS users, who typically use the ZFS NFS implementation instead.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 9:28 UTC (Tue) by neilbrown (subscriber, #359) [Link] (5 responses)

I just checked https://github.com/openzfs/zfs.git
ZFS doesn't have an NFS server. It has a bunch of user-space tools which provide a standard 'zfs' interface to the NFS support in the host kernel. On Linux it uses the Linux kerrnel NFS server.

ZFS is a substantially different filesystem to btrfs and is not directly comparable. It doesn't have anything with comparable flexibilty to btrfs subvols (which I want to call "subtrees").
There is a "main" filesystem, which uses 48bit inode numbers. and uses a fairly traditional NFS filehandle with the inode number and generation number.
Then there are "snapshots" under ".zfs/snapshot". The filehandle for objects in a snapshot have another 48bit number, presumably to identify which snapshot.
I don't *know" what inode number is presented to stat(), but I wouldn't be at all surprised to find that objects in .zfs/snapshot have the SAME inode number as the corresponding object in the main filesystem.
If you ask some tool like tar to look at the main filesystem as well as a snapshot, it might get confused. But them, I suspect it is really quite easy to avoid doing that.
With btrfs, subtree *can* be used as snapshots, but they can be used for other purposes too, and they can appear anywhere in the filesystem. With that extra flexibility comes extra responsibility....

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 10:29 UTC (Tue) by mtu (guest, #144375) [Link] (3 responses)

Interesting you should say that btrfs is more flexible than ZFS—I feel the other way around. Coming from ZFS on FreeBSD, I think btrfs' concept of subvolumes is annoyingly primitive.

Where btrfs has a flat list of "subvolumes", ZFS has a hierarchical tree of "datasets" (filesystems), each of which can have any number of read-only "snapshots"*, which can in turn be the basis of sparse copy-on-write "clones" that behave like datasets. Dataset properties (like mountpoints, compression and more advanced fs stuff like recordsize) are hereditary throughout the tree. Most dataset operations (like snapshotting, 'sending' into a flat bytestream or modifying properties) work recursively for any subtree.

In contrast, I feel that all that btrfs has to offer is: "Here's a flat list of a few hundred 'subvolumes', good luck keeping track and managing their properties and mounpoints, and try to not write to the ones you meant to keep as read-only snapshots."

* Concerning the matter at hand: Yes, snapshots are accessible through a dataset's ".zfs/" subdirectory (unless the feature is disabled for a given dataset). But they are usually only every exposed by explicit "cd" or path addressing from a shell, and never pop up to confuse find, NFS, samba, du or any other userspace application working recursively—at least that's my experience on FreeBSD, that's always had intimate and seamless integration of ZFS. On Linux, that seems to be a different story: https://github.com/openzfs/zfs/issues/6154

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 11:56 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

btrfs subvolumes are not a flat list; they can be created at any point in a normal directory hierarchy, including as children of other subvolumes. And all the other features you describe of ZFS snapshosts work in btrfs, too, just with different tooling.

It sounds like you've encountered one set of tooling to manage btrfs subvolumes, and assumed that the limits of that tooling are the limits of btrfs.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 14:42 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

> btrfs subvolumes are not a flat list

They were in 2008. Things have changed a little since then.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 16:17 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

This description of ZFS snapshots sounds so much less flexible than the btrfs version that I wonder if it's even an accurate description of ZFS.

btrfs snapshots are a lazy version of 'cp -a --reflink'. Users can drop a subvol anywhere in the filesystem and snapshot it anywhere else. This is part of the current problem--there isn't a single administrator-managed tree of subvols or snapshots, because ordinary applications can create and use subvols the same way they make directories (*). An existing NFS export can wake up one morning after an application software upgrade and suddenly find itself hosting a lot of subvols it didn't plan for. This proliferation of subvols is why the obvious solution (create distinct mount points for each and every subvol) isn't very popular (nor is the other obvious solution, lock down subvols so they aren't as trivial to use).

Unlike other popular snapshot systems, btrfs has no distinction between "base" and "clone" subvols. There is a notion of an "original" subvol and a "snapshot" subvol, but it's not part of the implementation, it's only a hint for administrators to label before-and-after snapshots for incremental send/receive. After a snapshot, both subvols are fully writable equal peers sharing ownership of their POSIX tree and data blocks, the same as if you had done cp -a --reflink atomically. Snapshots have a read-only bit that can be turned on or off (but turning it off means the subvol is no longer synchronized with copies on other filesystems, so it can't be used as a basis for incremental send/receive any more). You can chain snapshots (snap A to B, snap B to C, snap C to D...), with equal cost to write any subvol in the chain, and you can delete any of the snapshots in the chain with equal cost and without disrupting any other snapshot (other snapshot systems will have up to O(n) extra cost if there are n snapshots, or may not be able to delete the original subvol before deleting all snapshots). These properties greatly improve the usability of snapshots for applications since they can freely switch between treating them as subvol units or as individual files.

(*) If that seems weird, observe that a long time ago 'mkdir' required root privileges (**).

(**) OK there were different reasons for that. Still, ideas about what is "normal" for a filesystem and what is "privileged" do change over time.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 19:26 UTC (Tue) by zev (subscriber, #88455) [Link]

I don't *know" what inode number is presented to stat(), but I wouldn't be at all surprised to find that objects in .zfs/snapshot have the SAME inode number as the corresponding object in the main filesystem. If you ask some tool like tar to look at the main filesystem as well as a snapshot, it might get confused. But them, I suspect it is really quite easy to avoid doing that.

From a quick check, it appears that zfs does indeed present the same inode numbers in snapshots as it does in the main filesystem (on FreeBSD, anyway, though presumably on other kernels as well). Though yes, as hinted at elsewhere, it omits the magic .zfs directories from directory listings (getdents, etc.), so you'll only ever end up referencing a path in a snapshot if you really ask for it; basic directory recursion by find, tar, etc. will skip right over it without ever knowing it's there.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 6:21 UTC (Tue) by eru (subscriber, #2753) [Link] (3 responses)

A warning example of the trouble you get when you break a very old assumption. Or should it be even be called an invariant: Every mounted file system shall have inode numbers that are unique within it. In hindsight, at least, the correct solution would have been to not merge btrfs until it obeys this rule.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 29, 2021 4:55 UTC (Sun) by patrakov (subscriber, #97174) [Link] (2 responses)

This requirement is not really useful, as it applies to the whole filesystem and assumes that there is no concurrent mutation of the whole filesystem. It would be necessary to make a stronger requirement, that can at least apply to subtrees and tolerate mutations outside of the subtrees. And ideally, specify what happens with inode numbers now and in the past, if the subtree being walked through is also being mutated concurrently.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 31, 2021 13:44 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

Yup. I've been puzzled by this and I think I've hit on part of the problem ...

The *obvious* fix is to make i-nodes unique at the root file-system level. It's not a problem to think of a snapshots sharing i-nodes if they're sharing the same file ...

BUT. As soon as you break the link by modifying the file in one snapshot, that means you need to change the i-node number. No problem? Until the user has been using hardlinks to avoid having multiple identical copies of large files. You're now forcing "create temp files, copy over original" behaviour onto user space and that breaks hard links ...

I'm guessing there's plenty more problems where that came from, so where do we go from here? It ain't simple ...

Cheers,
Wol

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 31, 2021 14:02 UTC (Tue) by patrakov (subscriber, #97174) [Link]

Correct. And this is even mentioned in other comments: "btrfs snapshots are a lazy version of 'cp -a --reflink'". Same inode numberss indicate hardlinks, and reflinks are a different beast. So it looks like a solution would involve generating inode numbers lazily when someone stats a file in the snapshot.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 7:01 UTC (Tue) by mezcalero (subscriber, #45103) [Link] (6 responses)

I very much like the idea to expose a second 64 bit identifier via statx(). This value together with the inode would be 128 bit then, which has the same size as a UUID, i.e. should be large enough to be effectively collision-free IRL by all established standards – at least if allocated ramdomly or via a hash. I mean, how cool would be that: having a truly universal identifier for an inode, that is not only valid on some specific file system but *world-wide*. (They could generate the exposed pair of inode ID and this new 2nd inode ID via a single block AES encryption, if they want something reversible, with a randomized — but public — key stored in the fs header somewhere). If files had these 128bit ids that are effectively uuids then the overlayfs problem, the nfs problem would all be so much simpler. After all all these issues just stem from the fact that 64bit is just too little to properly generate them randomized/hashed, and thus you need to manually avoid conflicts instead of just relying on the fact that the number space is sufficiently large that if you generate things randomized/hashed collisions are sufficiently unlikely.

Note that Windows is way ahead there, and exposes such pretty-much-UUIDs-but-not-really for NTFS already: https://devblogs.microsoft.com/oldnewthing/20110228-00/?p... — Maybe it's time for Linux just acknowledge that having such a universal 128bit ID is actually a really useful thing.

Note that btrfs documented that dirs that are subvolumes are recognizable by their inode nr 256. If they change that they'll break a good part of userspace (including systemd). But if they give my truly universally valid 128 bit IDs as replacement I'd be more than happy to fix the fallout – at least in my codebases – quickly.

(I think it would really make sense to add a flag returned by statx() that marks the subvolume dirs explicitly as subvolumes. Right now userspace is supposed to make the check "has this file BTRFS_SUPER_MAGIC and inode nr 256" which is just messy and requires two syscalls. i.e. STATX_ATTR_SUBVOLUME or FS_SUBVOLUME_FL would be great to have)

Lennart

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 25, 2021 11:26 UTC (Wed) by taladar (subscriber, #68407) [Link] (5 responses)

To me having globally unique inode numbers just sounds like another way to uniquely fingerprint a system that we need to protect from anyone who might want to do that.

I also don't really see the use case outside of filesystems like btrfs which do everything differently mainly to be different. It is not as if they couldn't have split up the 64bit available to them into two numbers that would be more than enough for the numbers of files you find on a filesystem times the number of subvolumes.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 25, 2021 21:02 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

If you have access to read arbitrary inode numbers, then you can in practice uniquely identify the system by sampling a bunch of files in home directory. And even a single-file 64-bit inode is already likely to be reasonably unique.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 26, 2021 1:21 UTC (Thu) by zblaxell (subscriber, #26385) [Link] (2 responses)

> It is not as if they couldn't have split up the 64bit available to them into two numbers

It might be harder than it looks? So far btrfs, xfs, bcachefs, zfs, and overlayfs have all not done this.

bcachefs seems to have painted itself into the same corner as btrfs: 32-bit subvol ID, 64-bit inode, making a snapshot duplicates all existing inode numbers in the subvol. XFS experimented with subvols, gave up, and now recommends bcachefs instead. ZFS duplicates inode numbers--despite using only 48 bits of ino_t--and apologizes to no one.

Overlayfs takes up at least one bit of its own, which can interfere with any other filesystem's attempt to use all 64 bits of ino_t (indeed the btrfs NFS support patch reserves some bits for that). Overlayfs only does that sometimes--the rest of the time, it lets inode numbers from different lowerdirs collide freely.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 26, 2021 1:52 UTC (Thu) by neilbrown (subscriber, #359) [Link] (1 responses)

> It might be harder than it looks?

One reason that it is harder for btrfs is that btrfs never reuses inode numbers (well ... almost never).
Nor does it reuse subvolume numbers.

So if you create a snapshot every minute you'll use 24bits of subvolume numbers in 31 years - even if you only keep a few around.
If you create 100 new files per second, you'll use 40 bits of inode numbers in 348 years - no matter how many you keep.

How long do we expect a filesystem to last for? 348 years is probably unrealistic - is 31?
These creation rates are high. Are they unrealistically high? Maybe.
If you were a filesystem developer, would you feel comfortable limiting subvolumes to 24bits and inodes to 40 bits?

64bits allows you to create one every microsecond and still survive for half a million years. That is much easier for a filesystem developer to live with.

I would like btrfs to re-use the numbers and impose these limits. This is far from straight forward. It is almost certainly technically possible without excessive cost (though with a non-zero cost). But it can be hard to motivate efforts to protect against uncertain future problems (.... I'm sure there is a well known example I could point to...).

The Btrfs inode-number epic (part 2: solutions)

Posted Sep 7, 2021 14:11 UTC (Tue) by nye (subscriber, #51576) [Link]

> These creation rates are high. Are they unrealistically high? Maybe

It's hard to make a direct comparison given the fundamental differences in the model of subvolumes vs ZFS' various dataset types, but FWIW, I have a running system - at home, so not exactly enterprise scale - where the total number of ZFS snapshots that have been made across filesystems/volumes in the pool over the last decade is probably around 15 million. Getting pretty close to 24 bits.

I don't know enough about btrfs to know if the equivalent setup to those filesystems and volumes would be based on some shared root there and competing for inodes, or entirely separate. I guess what that boils down to is that I don't know if the rough equivalent to a btrfs filesystem is a ZFS filesystem or a ZFS *pool*. Either way, once you're used to nearly-free snapshots, you can find yourself using a *lot* of them.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 27, 2021 6:10 UTC (Fri) by mezcalero (subscriber, #45103) [Link]

You do realize all major Linux file system implementatios export a UUID identifying the file system universally? See blkid. If you are concerned about globally unique identifiers for the system: those existing ones are a lot better suited for that. Per-inode UUIDs are much less interesting for that.

Lennart

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 16:16 UTC (Tue) by wazoox (subscriber, #69624) [Link] (9 responses)

Dave Chinner in a discussion about snapshots just wrote that on the xfs ML:

"As such, if you want a performant, scalable, robust snapshotting
subvolume capable filesystem, bcachefs is the direction you should
be looking. All of the benefits of integrated subvolume snapshots,
yet none of the fundamental architectural deficiencies and design
flaws that limit the practical usability of btrfs for many important
workloads."

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 18:00 UTC (Tue) by sub2LWN (subscriber, #134200) [Link]

Here's a link to the quoted linux-xfs mail: https://lwn.net/ml/linux-xfs/20210823231235.GK3657114%40d...

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 18:22 UTC (Tue) by mbunkus (subscriber, #87248) [Link] (1 responses)

I've looked into bcachefs a couple of times over the years now, and I'd really like to give it a serious go — but not at the cost of having to build my own kernel (as none of the distros relevant to me seem to carry it) and rescue ISO.

Do you have any insight into their plans, timeframes, goals for getting it into mainline? I see that an unsuccessful attempt was made in December 2020, but after that… not easy to find more information for an outsider like me.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 25, 2021 14:52 UTC (Wed) by wazoox (subscriber, #69624) [Link]

Here is the latest request from Kent for mainlining:
https://lkml.org/lkml/2020/10/27/3684

There was some problem, then no news... OTOH at the time snapshots weren't even functional.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 24, 2021 23:07 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

Quote from bcachefs.org:

The functionality and userspace interface for snapshots and subvolumes are roughly modelled after btrfs...

I wouldn't expect anything different. For over a decade, btrfs has had the only viable implementation of this interface to build on in Linux. Even if other filesystems implement subvols and snapshots, they'll be strongly compelled to follow whatever trail btrfs blazes for them now.

I find all the worst-case O(N) searching for N snapshots in the design doc concerning.

This is what the bcachefs 'snapshot' branch does today:

# bcachefs subvolume create foo
# date > foo/bar
# bcachefs subvolume snapshot foo quux
# find -ls
     4096      0 drwxr-xr-x   3 root     root            0 Aug 24 18:40 .
     4098      0 drwxr-xr-x   2 root     root            0 Aug 24 18:40 ./foo
     4099      1 -rw-r--r--   1 root     root           29 Aug 24 18:40 ./foo/bar
     4098      0 drwxr-xr-x   2 root     root            0 Aug 24 18:40 ./quux
     4099      1 -rw-r--r--   1 root     root           29 Aug 24 18:40 ./quux/bar
     4097      0 drwx------   2 root     root            0 Aug 24 18:40 ./lost+found
# stat foo/bar quux/bar
  File: foo/bar
  Size: 29              Blocks: 1          IO Block: 512    regular file
Device: fd04h/64772d    Inode: 4099        Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-08-24 18:40:37.278101823 -0400
Modify: 2021-08-24 18:40:37.290101816 -0400
Change: 2021-08-24 18:40:37.290101816 -0400
 Birth: -
  File: quux/bar
  Size: 29              Blocks: 1          IO Block: 512    regular file
Device: fd04h/64772d    Inode: 4099        Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-08-24 18:40:37.278101823 -0400
Modify: 2021-08-24 18:40:37.290101816 -0400
Change: 2021-08-24 18:40:37.290101816 -0400
 Birth: -

Duplicate st_dev and st_ino, it's worse than btrfs. On the other hand:

# date > foo/second1
# date > quux/second2
# ls -li */second*
4100 -rw-r--r-- 1 root root 29 Aug 24 18:44 foo/second1
4101 -rw-r--r-- 1 root root 29 Aug 24 18:44 quux/second2

bcachefs will always give new files unique inode numbers, even in different subvols, because the code for creating a new file obtains a globally unique inode number. Possible point for bcachefs here--in this situation, btrfs uses a per-subvol inode number allocator, which would have given both new files inode 4100.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 26, 2021 5:47 UTC (Thu) by dgc (subscriber, #6611) [Link] (4 responses)

Yup. I stand by that statement. I've run tests, measured numbers and observed behaviours. I also have a fair understanding of differences in architetures and implementations. This thread is probably enlightening about the architectural deficiencies within btrfs, which would appear to be unfixable:

https://lore.kernel.org/linux-btrfs/20210121222051.GB4626...

From a filesystem design perspective, COW metadata creates really nasty write amplification and memory footprint problems for pure metadata updates such as updating object reference counts during a snapshot. First the tree has to be stabilised (run all pending metadata COW and write it back), then a reference count update has to be run which then COWs every metadata block in the currently referenced metadata root tree. The metadata amplification is such that with enough previous snapshots, a new snapshot with just a few tens of MB of changed user data can amplify into 10s of GB of internal metadata COW.....

That explained why user data write COW performance on btrfs degraded quickly as snapshot count increases on this specific stress workload (1000 snapshots w/ 10,000 random 4kB overwrites per snapshot, so 40GB of total user data written)

In comparison, dm-snapshot performance on this workload is deterministic and constant as snapshot count increases, same as bcachefs. Bcachefs performed small COW 5x faster than dm-snapshot (largely due to dm-snapshot write amplification due to 64kB minimum COW block size). At 1 snapshot, btrfs COW is about 80% the speed of bcachefs. At 10 snapshots, bcachefs and dm-snapshot performance is unchanged and btrfs has degraded to about the same speed as dm-snapshot. At 100 snapshots, btrfs is bouncing between 1-5% the sustained user data write speed of bcachefs, and less than a quarter of the speed of dm-snapshot, and it doesn't regain any of the original performance as the snapshot count increases further.

That can be seen in workload runtimes - it ran in 20 minutes on bcachefs with each snapshot taking less than 30ms. It ran in about 40 minutes on dm-snapshot, with each snapshot taking less than 20ms. It took 5 hours for XFS+loopback+reflink to run (basically the XFS subvol architecture as a 10 line shell hack) because reflink on an image file with 2 million extents takes ~15s. It took about 9 hours for btrfs to run - a combination of slow user IO (sometimes only getting only *200* 4kB write IOPS from fio for minutes at a time) and the time to run the btrfs snapshot command increasing linearly with snapshot count, taking ~70s to run by the 1000th snapshot.

Sustained IO rates under that workload: bcachefs ~200 write IOPS, 100MB/s. XFS+reflink: ~15k write IOPS, 60MB/s. dm-snapshot: ~10k/10k read/write IOPS, 650/650 read/write MB/s. btrfs: 10-150k write IOPS, 5-10k read IOPS, 0.5-3.2GB/s write, 50MB/s read (9 hours averaging over 1GB/s write will make a serious dent in the production lifetime of most SSDs)

Write amplification as a factor of storage capacity used by that workload: bcachefs: 1.02 xfs+loop+reflink: 1.1 btrfs: ~4.5 dm-snapshot: 17 (because 64kB/4KB = minimum 16x write amplification for every random 4kB IO)

memory footprint: bcachefs: ~2GB. XFS+reflink: ~2.5GB. dm-snapshot: ~2.5GB. btrfs: Used all of the 16GB of RAM and was swapping, writeback throttling on both the root device (swap) and the target device (btrfs IO), userspace was getting blocked for tens of seconds at a time waiting on memory reclaim, swap, IO throttling, etc.

Sure, it's a worst case workload, but the point of running "worst case" workloads is finding out how the implementation handles those situations. It's the "worst case" workloads that generate all the customer support and escalation pain for engineering teams that have to make those subsystems work for their customers. Given that btrfs falls completely apart and makes the machine barely usable in scenarios that bcachefs does not even blink at, it's a fair indication of which filesystem architecture handles stress and adverse conditions/workloads better.

bcachefs also scales better than btrfs. btrfs *still* has major problems with btree lock contention. Even when you separate the namespace btrees by directing threads to different subvolumes, it just moves the lock contention to next btree in teh stack - which IIRC is the global chunk allocation btree. I first reported these scalability problems with btrfs over a decade ago, and it's never been addressed. IOWs, btrfs still generally shows the same negative scaling at concurrency levels as low as 4 threads (i.e. 4 threads is slower than 1 thread, despite burning 4 CPUs trying to do work) as it did a decade ago. In comparison, bcachefs concurrency under the same workloads and without using any subvolume tricks ends up scaling similarly to ext4 (i.e. limited by VFS inode cache hash locking at ~8 threads and 4-6x the performance of a single thread).

I can go on, but I've got lots of numbers from many different workloads that basically say the same thing - if you have a sustained IO and/or concurrency in your workload, btrfs ends up at the bottom of the pack for many important metrics - IO behaviour, filesystem memory footprint, CPU efficiency, scalability, average latency, long tail latency, etc. In some cases, btrfs is a *long* way behind the pack. And the comparison only gets worse for btrfs if you start to throw fsync() operations into the workload mix....

I'm definitely not saying that bcachefs is perfect - far from it - but I am using bcachefs as a baseline to demonstrate that it the poor performance and scalability of btrfs isn't "just what you get from COW filesystems". Competition is good - bcachefs shows that a properly designed and architected COW filesystem can perform extremely well under what are typically called "adverse workload conditions" for COW filesystems. As such, my testing really only serves to highlight the deficiencies in existing upstream snapshot solutions, and so...

-Dave.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 30, 2021 9:25 UTC (Mon) by jezuch (subscriber, #52988) [Link] (3 responses)

I think I've seen a couple of mentions of bcachefs and it seemed interesting at that time, but I completely missed the point at which it transitioned from "a cool prototype" to "production-ready, stable and mature". Is it already? Perhaps it needs a PR department of sorts, because it sounds really amazing ;)

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 31, 2021 1:27 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

You haven't missed it--that point is still in the future. Subvols on bcachefs are only a few months old, and much closer to the "cool prototype" end of the spectrum than the other end. My last test run of bcachefs ended after 29 minutes with a readonly splat from the 'cp' command.

The Btrfs inode-number epic (part 2: solutions)

Posted Sep 3, 2021 2:06 UTC (Fri) by flussence (guest, #85566) [Link] (1 responses)

Speaking of cool-but-unreleased filesystems, has anyone heard from Tux3 lately? It was showing some mythical benchmark numbers where an in-memory loopback was outperforming tmpfs, then it fell off the face of the earth.

The Btrfs inode-number epic (part 2: solutions)

Posted Sep 7, 2021 14:31 UTC (Tue) by nye (subscriber, #51576) [Link]

The mailing list has had nothing but "is this project alive?" type messages for the last couple of years. The latest one of those to get an answer was last year:

> I can't say very active though, we are working on spare time. Recently,
> we are working for snapshot prototype, and inode container improvement

I'd say the last time it looked even vaguely healthy was 2014, and even that was after a couple of very light years, so I think it is probably never going to see the light of day, sadly.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 26, 2021 15:10 UTC (Thu) by josefbacik (subscriber, #90083) [Link] (5 responses)

This is a problem of interfaces. Btrfs has subvolumes, which are just their own b-tree, and the inode numbers are just a value inside of that tree. Since they are different trees you can share inode numbers across multiple trees. However each tree has their own unique ID.

The st_dev thing is unfortunate, but again is the result of a lack of interfaces. Very early on we had problems with rsync wandering into snapshots and copying loads of stuff. Find as well would get tripped up. The way these tools figure out if they've wandered into another file system is if the st_dev is different. So we made st_dev different for every subvolume. Is this a great solution? Absolutely not. Is there another option? Not really, this is how userspace interacts with the kernel, so we compromised in order to make userspace work well.

The next problem is that every subvolume has the same start inode number. The subvolume id is clearly different, but that '.' stat is going to be the same value for any subvolume. Again we need a way to tell an application that this is the subvolume root. Is this the best way forward? Absolutely not. Is there another option? Not really.

What we need is an interface to give userspace more information. I've suggested exporting UUID's via statx. We have a file system wide UUID and then we have per-subvolume UUID's. This is relatively straightforward and gives userspace a whole lot of information. They can tell if two subvolumes are on the same file system, and they can tell that they are in two different subvolumes.

Another solution would be to simply export the subvolume ID via statx, as that's just another u64. That is how we deal with NFS file handles, we build them with the subvolume id + the inode number. That combination is completely unique and is everything you need to find the inode. Now this doesn't solve the problem of figuring out if two different subvolumes are the same file system. This is why I think the UUID is more valuable. We could add yet another st_sbdev or something to export the device for the file system itself if we wanted to stick with the device number scheme, and then we would have everything we need to get all the information we would want out of the file system.

Btrfs was the first file system to do this, and we did it in a system that didn't envision this type of architecture. Because of that we had to make certain interface decisions to get the best outcome possible for userspace, as that is the _ONLY_ thing that matters. It doesn't matter how we organize ourselves in the kernel, because the kernel doesn't operate in a vacuum. It provides userspace the ability to do actual work, and as such we are confined to the interfaces that exist. Extending those interfaces is the only sane way forward, because we cannot un-ring the bell of the choices we've already made.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 26, 2021 23:22 UTC (Thu) by neilbrown (subscriber, #359) [Link] (3 responses)

> we cannot un-ring the bell of the choices we've already made.

Indeed, we cannot. I wonder if we can learn anything about future choices. And whether we can repair the current situation.

Modifying the st_dev was, in retrospect, an unfortunately choice. But it was also an easy choice - quickly providing a solution.
ZFS (and others I think) address a similar problem by hiding things from readdir(). This might work adequately with a fixed name like ".zfs". It wouldn't work for btrfs which allows any name to be used for a subvolume.

Another approach - which would have been more work and taken longer - would have been to work with rsync, find, etc to find a solution. One might have been to change them to check statfs().f_fsid instead of st_dev. This has the advantage that f_fsid is already available, but poorly specified and not widely used. That means it is less brittle and using it is less likely to break things.

This is, I think, an instance of the much broader "platform problem". It always seems easier to work-around weaknesses in the platform, rather than push for changes in the platform. But the long-term benefits come from doing the early work (painful though it may be) and improve the platform.

But note that there are two distinct problems here:
1 - the platform provides no way to identify a subtree within a filesystem (project-id is close, but not quite the same)
2 - the platform limits inode numbers to 64 bits (in any given filesystem)

Adding subtree information in statx addresses '1' and could be used by find and rsync as needed. But that doesn't address '2'.
Addressing '2' requires a realistic assessment of how many bits are really needed to identify all possible objects. I think 64 is actually enough for the forseeable future, providing they are use wisely. Setting i_ino to a strong hash of whatever value the filesytem uses internally to find a file is a tempting idea. My last proposed solution for the NFS problem is to use a week hash (xor with bit shift). Maybe we should use a strong hash instead.

While I like the use of uuids for filesystems (and wish f_fsid was 128bits instead of 64), I don't think they are such a good idea for files within a filesystem. tree-id + file-id + generation should always be enough and while 128bit might be a good size for that, forcing them into UUID format doesn't seem to add value.

I seem to recall that when 'statx' was being proposed, lots of people had lots of ideas about extra things to add. The decision was to not add anything new at first. So if we want to add things now we need to make a strong case, and demonstrate at least one application that will immediately use the information. That would be a lot easier if we had a concrete problem to fix. The NFS issue is a concrete problem, but it doesn't actually require a user-space API change, so it is hard to use it as a lever to extend statx()....

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 29, 2021 9:22 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (2 responses)

> Setting i_ino to a strong hash of whatever value the filesytem uses internally to find a file is a tempting idea. My last proposed solution for the NFS problem is to use a week hash (xor with bit shift). Maybe we should use a strong hash instead.

Why is it necessary to use something that has the potential for collisions at all? Why not just hand out arbitrary or sequential numbers in a centralized fashion (like every other filesystem that isn't FAT)? Is there some rule that says you're not allowed to look at subvolume X when you make a new file in subvolume Y? Why would such a rule be necessary?

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 29, 2021 12:31 UTC (Sun) by foom (subscriber, #14868) [Link]

Subvolume Y can be a writeable snapshot clone of subvolume X. And you want creating snapshots to be fast, and not use excess disk space. Needing to iterate over every file in X to assign each one a new centrally-assigned-unique inode in Y would be unfortunate both in terms of time and space it would take.

The Btrfs inode-number epic (part 2: solutions)

Posted Aug 30, 2021 15:05 UTC (Mon) by zblaxell (subscriber, #26385) [Link]

Creating a subvol snapshot instantly duplicates every inode number that already exists in the snapshot source subvol to the target subvol. "Instantly" is a core requirement of the feature set--if a filesystem cannot create a subvol snapshot instantly, there is no performance or consistency advantage over userspace emulation with 'cp -a --reflink=always', and it is better not to implement the subvol feature set at all (which is roughly why XFS doesn't do it).

To get globally unique and stable inode numbers without a separate subvol ID, the filesystem would have to dynamically remap duplicate inode numbers from subvol-local values to globally-unique values every time a readdir() or stat() happened. This adds some overhead to all read operations that filesystem maintainers are reluctant to implement. They would prefer some more efficient way to tell an application "this is a distinct inode number namespace but not a distinct filesystem" so that applications that rely on the uniqueness feature can bear some of the costs (including opportunity costs) of implementing it, while not imposing new costs (such as new O(log(N)) search costs on every stat(), or exploding /proc/mounts and `df` output size) on applications that don't care about inode uniqueness.

The NFS server could maintain its own persistent unique inode numbers in a mapping table outside of the filesystem, and not send the filesystem's inode numbers to clients at all, but that has obvious and onerous runtime costs (the NFS server would have to maintain persistent state proportional to filesystem size).

The Btrfs inode-number epic (part 2: solutions)

Posted Sep 12, 2021 9:23 UTC (Sun) by walex (guest, #69836) [Link]

«This is a problem of interfaces. Btrfs has subvolumes, which are just their own b-tree, and the inode numbers are just a value inside of that tree. Since they are different trees you can share inode numbers across multiple trees. However each tree has their own unique ID.»

This seems a weak defense of the current Btrfs situation, and it is based on a mispresentation of UNIX filesystem semantics, regardless of the details of the API:

In UNIX each filesystem instance has at least one root directory, and each root directory and its descendant i-nodes have a unique "device-id", which was once upon a time the device-id of the containing block device, but need not be (e.g. filesystem instances not contained in any block device).
In UNIX there is no restriction against having multiple filesystem instances in a block device, or having multiple root directories in a filesystem instance (even if both are not traditionally possible).
The only restriction is that (optionally) a filesystem type can deny hard links between a directory and an i-node with different device-ids.

In particular there is no obligation for mounted root directories of filesystem instances to be registered in '/etc/fstab' or anywhere else.

Each Btrfs pr ZFS subvolume or snapshot is just a different root directory in a filesystem instance, so Btrfs (and ZFS, and soon 'bcachefs') respect all these properties, well written NFS servers have no problems with them, and therefore there is simply no issue with them and NFS, and this whole discussion is pointless.

The Btrfs inode-number epic (part 2: solutions)

Posted Sep 12, 2021 19:36 UTC (Sun) by nix (subscriber, #2304) [Link]

> It enables a filesystem to provide a "uniquifier" value associated with a file; this value, the name of which is arguably better suited to a professional wrestler, is only available within the kernel.

That's not the only place that term is used! This is probably the only time in history that any component of Emacs has ever been compared to a professional wrestler (except for its weight and sumo wrestlers in particular): https://www.emacswiki.org/emacs/uniquify