
Snapshots, inodes, and filesystem identifiers

By Jake Edge
May 18, 2022

LSFMM

A longstanding problem with Btrfs subvolumes and duplicate inode numbers was the topic of a late-breaking filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). The problem had cropped up in the bcachefs session but Josef Bacik deferred that discussion to this just-created session, which he led. The problem is not limited to Btrfs, though, since filesystem snapshots for other filesystems can have similar kinds of problems.

Background

Bacik started with an overview of the problem, in part because he has to re-explain it every few years when it is "discovered" again. Btrfs has subvolumes that contain their own unique inode-number space. Subvolumes can be used for snapshots, so a common use case is to have a subvolume for a home directory so that it can be snapshotted. A snapshot is just a metadata block with a pointer to an existing block and a reference count. That means it has the same files, the same data, and the same inode numbers as the subvolume at the time of the snapshot.

[Josef Bacik]

That situation confuses tools like rsync, so Chris Mason came up with a way to make separate subvolumes appear to be on different filesystems, which meant that the tools would do the right thing. rsync (or find) will use the st_dev value returned by stat() to decide if they have traversed into a different filesystem; otherwise, the duplicate inode numbers cause the tools to think they have already seen the files. So Btrfs assigns an anonymous block device to each subvolume, which is what it reports via stat().
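The boundary check those tools rely on can be sketched in a few lines of Python; this is a simplified illustration of the logic behind find's -xdev and rsync's -x options, not the tools' actual code, and the function name is invented for this example:

```python
import os

def walk_one_filesystem(top):
    """Yield file paths under top without crossing into a different
    st_dev -- the same check that find -xdev and rsync -x rely on."""
    top_dev = os.lstat(top).st_dev
    for dirpath, dirnames, filenames in os.walk(top):
        # Prune subdirectories whose st_dev differs.  On Btrfs each
        # subvolume reports its own anonymous device number, so a
        # subvolume boundary looks just like a mount-point crossing.
        dirnames[:] = [d for d in dirnames
                       if os.lstat(os.path.join(dirpath, d)).st_dev == top_dev]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Hardlink detection in these tools similarly keys on the (st_dev, st_ino) pair, which is why duplicate inode numbers under a single device number confuse them.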

That was an easy way to solve the problem, but every time it comes up, "people yell and complain about how terrible and broken it is". There is no other filesystem that does this, he said, so it may not be a great solution, but it did resolve the problem at hand. Internally, Btrfs has a subvolume ID that distinguishes the different inode-number spaces; it is used when Btrfs is being exported via NFS or Ceph to create the unique ID (or filehandle) needed, which works well, he said.

On the client side, though, the fact that those identical inode numbers are on different subvolumes gets lost, at least for NFS. So if a directory containing a subvolume and its snapshots gets exported, the fact that they are separate subvolumes is not available to the client, so find and rsync get confused by the duplicate inode numbers. Periodically, someone encounters this problem and "then tells me all the ways that it is easy to fix"; they quickly realize that it is not that easy to fix, Bacik said. The most recent attempt to do so was by Neil Brown, who tried multiple approaches, but the problem still is not resolved.

Possible solution

What Bacik would like to do is to extend statx() to report the subvolume ID, which is something that Bruce Fields said would probably work for NFS. The current st_dev behavior is still sometimes problematic when tools want to know if two files are on the same filesystem, so he would like statx() to report two things: the universally unique ID (UUID) of the containing filesystem and some way to identify the subvolume. Btrfs has two candidates for the latter: the object ID of the subvolume root, which is a 64-bit value, and the subvolume UUID, which is a 128-bit value. Either of those could be used by NFS (and others) to determine if the inode numbers are in their own space. But the subvolume UUID is Btrfs-specific, while the root object ID may apply more widely.
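To see why the extra identifier matters, consider a client that keys its file-identity checks on the full tuple rather than on the inode number alone. This is a toy model; the field names are invented here, since the statx() extension under discussion had not been merged at the time:

```python
from collections import namedtuple

# Hypothetical per-file identity as an extended statx() might report
# it: filesystem UUID, subvolume identifier, inode number.
FileId = namedtuple("FileId", ["fs_uuid", "subvol_id", "ino"])

def same_file(a, b):
    # Inode number alone is ambiguous across subvolumes: a snapshot
    # shares inode numbers with its origin.  The full tuple is not.
    return a == b

original = FileId("fs-uuid-example", 256, 1042)  # file in the subvolume
snapshot = FileId("fs-uuid-example", 257, 1042)  # same ino, new subvol
```

With only the inode number (and a shared st_dev, as on an NFS export), the two files above would be indistinguishable; the tuple separates them.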

Amir Goldstein asked how the situation was different for ext4 snapshots. Bacik said that the problem was the same for any filesystem that does snapshots. It is only different for snapshots at the block layer, for example using the Logical Volume Manager (LVM).

On the Zoom chat, Jeff Layton said that Bacik's idea would be formalizing the idea of filesystem and subvolume IDs, which might be a good thing, but other filesystems need to be considered. Bacik agreed, but said that all of the local filesystems he is aware of have a UUID; others wondered about filesystems like FAT. Ted Ts'o said that some FAT filesystems have a 32- or 64-bit ID, but not a UUID. That has come up before in the context of adding a generic mechanism to set the UUID on a filesystem, since some do not have that concept.

Ts'o also wondered what it meant when Bacik said that a file was in the same filesystem but in a different subvolume. One definition of "the same filesystem" might be that files can be renamed or hard linked within it, but he did not think that was true for Btrfs subvolumes, which Bacik confirmed. Ts'o said it will be important to clearly define what it means for two files to be in the same filesystem, since there may be different expectations among user-space tools. The main use for whether two files are on the same filesystem, Bacik said, is for maintenance tasks to determine which filesystem to mount or unmount, for example.

Not perfect

In general, this mechanism does not have to be perfect, Bacik said, it just needs to give NFS and others some additional information so that they can do whatever it is they need to do. NFS itself works fine, he said, because it uses the unique ID, but find and such have problems in those exported directories, so he wants to provide a standard way that network filesystem clients can differentiate those files with the same inode numbers.

David Howells wondered if statx() was the right place for this kind of information; it might make more sense in the statfs() information. While Bacik thought that might be a reasonable place to report the UUID for the filesystem, there is still a need to specify which filesystem a given file belongs to, which means statx(), he thinks. But, at some level, that is a "nice to have" feature; the real crux of the problem is being able to differentiate the inode-number spaces, which requires a way to identify the subvolume.

Ts'o pointed out that POSIX-following tools (e.g. rsync, find) are not going to change to start calling statx(); beyond that, those tools are already baked into various enterprise distributions and will need to be supported for a long time. That means the problem will still exist on exported filesystems, unless the NFS client does something different.

Bacik said that Btrfs has various unique IDs that can be used to recognize and handle the problem, somehow; he just wants to know which IDs are desired and how he should deliver them. Historically, his attitude has been "play stupid games, win stupid prizes"; he suggests not combining the local subvolume and the snapshot in the same export. "Problem solved."

Bacik said that Christoph Hellwig always suggests that each subvolume have its own VFS mount, but that is a non-starter, because each VFS mount needs its own superblock. That could potentially change, but the problem remains because there are often thousands of subvolumes on a given filesystem. Goldwyn Rodrigues pointed out that each mount gets its own thread, which is "another nightmare to take care of". He said there had been some work on "views" a few years back that had a lightweight superblock for each sub-mount, though he was not sure how far that work progressed.

Bacik said that he vaguely remembered that work, but, overall, he is tired of talking about this problem. His solution is to extend statx() to give NFS and others a way to figure things out. The st_dev solution will stay forever, he said, since it works for local filesystems. But for network filesystems, he suggests exporting the UUID of the filesystem and the UUID of the subvolume or the 64-bit object ID of the root, either of which would work. No one present really objected to that plan, so patches should presumably be forthcoming.


Index entries for this article
KernelFilesystems/Btrfs
KernelNetwork filesystems
ConferenceStorage, Filesystem, Memory-Management and BPF Summit/2022



Snapshots, inodes, and filesystem identifiers

Posted May 18, 2022 15:57 UTC (Wed) by SLi (subscriber, #53131)

Let me try to fix it ;)

I think the question of what it means for two files to be on the same filesystem hints that we are trying to build one tool that does a number of different, incompatible or at least very vaguely defined things.

Now, I get it that there are legacy tools using this notion of "same filesystem" as a proxy of various other things in a way that used to work when the world was simpler, and mostly still works in equally simple settings. It is of course necessary to try to break those tools as little as possible. It seems unlikely that they can be made to behave well in all possible use cases of today and the future.

Would it then not make more sense to think of the questions that the notion is used as a proxy for—things like "what do I need to unmount to make this file go away", "can I rename() this file to this directory", etc? Hopefully these simpler notions could then be used to define a least unreasonable heuristic for the inode/device number API, while tools that need to care would be able to ask for the precise information they need... without building into it more assumptions about the impossibility of cross-filesystem renames, or how many filesystems a file belongs to.

Snapshots, inodes, and filesystem identifiers

Posted May 18, 2022 19:21 UTC (Wed) by zblaxell (subscriber, #26385)

You missed the point, maybe, or I got there the long way around...

The question some tools are asking is "are these two names aliases for the same file" or "is this file now the same as the file whose metadata I cached earlier?" Both questions can be efficiently answered in part by assigning a persistent and reliably unique identifier to de-alias the file, i.e. the inode number.

The problem is that at least two Linux filesystems (not counting ZFS) can now instantly create a writable copy of the entire inode number namespace populated with identical sets of existing files, i.e. a snapshot or subvol. Inode number isn't unique within a filesystem any more. Now at least two numbers are needed: one for the file's inode number, and the other for the inode's namespace.

The workaround for the above problem was to make distinct files that have the same inode number _appear_ to be on distinct filesystems, i.e. move the subvol ID into the filesystem device ID field, because the existing tools were already looking for filesystem boundaries to tell them when they entered a different inode number namespace. That works quite nicely for things like find and rsync.

The workaround messes up _other_ tools that needed a unique identifier to de-alias filesystems (i.e. to pass to other kernel API to figure out which block devices the filesystem occupies and what mount points need to be umounted). A filesystem can be mounted multiple times and mounted anywhere, and a lot of tools need robust information about where mounts have happened, so this is a serious need. Tools like findmnt have appeared to try to solve this problem, and 'find' and 'rm' have grown their own code to try to detect things like mount loops.

The solution that didn't work was to say "OK we'll treat every subvol as a distinct filesystem", because it just highlights how broken and unscalable the mountinfo interface is. findmnt is OK with 100 or so mount points, but starts behaving badly when you have tens of thousands of them, and some users want millions.

I've looked at /proc/*/mountinfo and findmnt and...I don't want to save them, I just want to set all of them on fire and start over. Like:
* statx tells you the filesystem instance ID for each file. Look up that instance ID in /sys/blah/blah/ID/blah/blah to find a driver (filesystem) and block device list (or remote address or whatever) or UUID or whatever else you wanted to know or the filesystem could tell you.
* Instance IDs would be unique over the lifetime of the kernel, 64 bits long, incrementing by 1 each time something new is mounted, never repeated.
* Bind mounts to the same filesystem get the same instance ID, but if the filesystem is completely umounted and mounted again it gets a new instance ID.
* If you look at two files and they have the same instance ID, you can expect things like reflinks and dedup to work sometimes, if they're different instance IDs then no chance.
* Note this instance ID concept is intentionally different from the filesystem UUID because UUIDs aren't unique--if you mount an identical ext3 image 20 times, you have 20 mount points with the same UUID, but 20 distinct instance IDs.
* st_dev is unique for every bind mount _and_ every inode namespace transition (more or less as it is now).
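The rules above can be modeled in a few lines; this is a toy model of the semantics the comment proposes, not an existing kernel interface:

```python
import itertools

class MountTable:
    """Toy model of the proposed instance IDs: a monotonic counter
    that is never reused; bind mounts share the existing mount's ID;
    a fresh mount of the same filesystem gets a new ID."""
    def __init__(self):
        self._next = itertools.count(1)
        self._mounted = {}  # filesystem identity (e.g. UUID) -> instance ID

    def mount(self, fs_uuid):
        iid = next(self._next)  # new ID even for a repeated UUID
        self._mounted[fs_uuid] = iid
        return iid

    def bind_mount(self, fs_uuid):
        return self._mounted[fs_uuid]  # same filesystem, same instance ID

    def umount(self, fs_uuid):
        del self._mounted[fs_uuid]  # the ID is retired, never reissued
```

Note how this differs from the filesystem UUID: mounting the same image twice in sequence yields two distinct instance IDs, while a bind mount does not.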

Snapshots, inodes, and filesystem identifiers

Posted May 19, 2022 2:40 UTC (Thu) by developer122 (guest, #152928)

It sounds like this information will be made available, but nobody (no tools) will actually make use of it for a long time.

Snapshots, inodes, and filesystem identifiers

Posted May 19, 2022 4:27 UTC (Thu) by neilbrown (subscriber, #359)

> What Bacik would like to do is to extend statx() to report the subvolume ID, which is something that Bruce Fields said would probably work for NFS

I wonder what Bruce might have meant by that. NFS doesn't export the statx system call, so it couldn't make the subvolume ID available to the client.
NFSd could mix the subvolume ID in with the filesystem ID, but that would cause subvolume names - which are sometimes considered to be private - to be publicly visible on any client used to access them.
NFSd could mix it in to the inode number. That only provides a 99.999+% solution, which isn't perfect enough for some. However, it is the *only* credible solution for those "POSIX-following tools".

Really, we don't need to export any new information. We already export the filehandle to user-space, and that is even exported over NFS.
We just need to change applications to access and depend on the filehandle (when available) instead of depending on the inode number.

(I almost wrote a patch for find, but I lost interest.....)
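The filehandle mentioned here is already reachable from user space via name_to_handle_at(2). A minimal ctypes sketch, with error handling reduced to returning None (e.g. for filesystems that do not support handles):

```python
import ctypes
import os

AT_FDCWD = -100
MAX_HANDLE_SZ = 128  # from <fcntl.h>

class FileHandle(ctypes.Structure):
    # Mirrors struct file_handle from <fcntl.h>.
    _fields_ = [("handle_bytes", ctypes.c_uint),
                ("handle_type", ctypes.c_int),
                ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ)]

libc = ctypes.CDLL(None, use_errno=True)

def file_handle(path):
    """Return (handle_type, handle bytes) for path, or None on error
    (e.g. ENOENT, or EOPNOTSUPP on filesystems without handles)."""
    fh = FileHandle(handle_bytes=MAX_HANDLE_SZ)
    mount_id = ctypes.c_int()
    ret = libc.name_to_handle_at(AT_FDCWD, os.fsencode(path),
                                 ctypes.byref(fh),
                                 ctypes.byref(mount_id), 0)
    if ret != 0:
        return None
    return fh.handle_type, bytes(fh.f_handle[:fh.handle_bytes])
```

Unlike the (st_dev, st_ino) pair, the handle is unique across the whole exported filesystem, snapshots included, which is why a tool comparing handles would not be confused by duplicate inode numbers.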

Snapshots, inodes, and filesystem identifiers

Posted May 22, 2022 2:03 UTC (Sun) by bfields (subscriber, #19510)

> > What Bacik would like to do is to extend statx() to report the subvolume ID, which is something that Bruce Fields said would probably work for NFS

> I wonder what Bruce might have meant by that.

Yeah, FWIW, I took a couple minutes to look through old email and I can't figure out what that's referring to.

Snapshots, inodes, and filesystem identifiers

Posted May 22, 2022 13:16 UTC (Sun) by jake (editor, #205)

> Yeah, FWIW, I took a couple minutes to look through old email and I can't figure out what that's referring to.

It's certainly possible that I misunderstood Josef, so maybe that's where the disconnect is ...

jake

Snapshots, inodes, and filesystem identifiers

Posted May 23, 2022 21:07 UTC (Mon) by bfields (subscriber, #19510)

It is a bit of a game of "telephone", I wouldn't be too quick to rule out the possibility that I said something dumb....

Snapshots, inodes, and filesystem identifiers

Posted Jul 18, 2022 13:19 UTC (Mon) by walex (guest, #69836)

This is a bit late, but just in case someone reads this sometime:

* Most of the article and previous comments are based on confusion about UNIX filesystem semantics, however simple they are; in particular, confusion about block devices, filesystem instances, roots, and namespaces.

* UNIX filesystem semantics don't define a filesystem, and in particular do not require that a filesystem instance be associated with a block device at all, or that there be a single filesystem instance in a block device, or that a filesystem instance be contained in a single block device, or that a filesystem instance have a single "root" directory.

* The essential requirement for filesystem instances is that each must have a locally unique "device id", and cross-linking across filesystems with different "device ids" *may* be forbidden.

* Btrfs and ZFS do not violate any of the UNIX filesystem semantics: they are designs in which a set of multiple block devices can contain multiple filesystem instances (called "subvolumes"), each of which can span more than one of those devices, and each filesystem instance has a locally unique device-id and a root directory.

* The main problem with that is that a lot of people commenting on Btrfs and ZFS think that, because most UNIX filesystem types do have a single filesystem instance in a single block device, that is a requirement.

* The second problem is that NFS exports are good for exporting only a single filesystem, and people who ignore that ZFS and Btrfs have multiple filesystem instances per storage area expect NFS servers to recursively export all filesystem instances in the storage area, which is just wrong (some NFS servers support recursive exports by doing a bit of automagic work).

* If there is something wrong that Btrfs and ZFS do, it is that by default all filesystem instances in a storage area are mounted automatically, which gives the impression that they are directories to people who ignore that they are separate filesystem instances.

Some people may argue that if a storage area has tens of thousands of filesystem instances (subvolumes) it is crazy to export each one separately.

Unfortunately that violates UNIX filesystem semantics: each client NFS mounted filesystem instance must have a separate unique local device-id on the client, which should not be shared by multiple server exported instances.

It is not difficult (but subtle) to write an NFS client and server that handle recursive submounts transparently, exporting them from the server and mounting them on the client transparently and dynamically.

PS I used to argue that Btrfs and ZFS subvolumes are actually independent root directories in a single filesystem instance, where "independent" means with a distinct device-id, but that may confuse people with the case where a single filesystem instance has multiple non-independent root directories, e.g. where it is possible to choose at mount time a different i-number to mount than the default one.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds