Snapshots, inodes, and filesystem identifiers

Posted May 18, 2022 19:21 UTC (Wed) by zblaxell (subscriber, #26385)
In reply to: Snapshots, inodes, and filesystem identifiers by SLi
Parent article: Snapshots, inodes, and filesystem identifiers

You missed the point, maybe, or I got there the long way around...

The question some tools are asking is "are these two names aliases for the same file" or "is this file now the same as the file whose metadata I cached earlier?" Both questions can be efficiently answered in part by assigning a persistent and reliably unique identifier to de-alias the file, i.e. the inode number.
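As a rough illustration of the de-aliasing check described above, here is a minimal Python sketch. The helper names `file_key` and `is_alias` are made up for this example; the only real interfaces used are `os.stat` and its `st_dev` and `st_ino` fields.

```python
import os

def file_key(path):
    """Return the (device, inode) pair that tools commonly use as a file's identity."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

def is_alias(path_a, path_b):
    """True if both names currently resolve to the same (st_dev, st_ino) pair."""
    return file_key(path_a) == file_key(path_b)
```

A tool that cached `file_key(path)` earlier can compare it with a fresh call to answer the second question the same way. The check is only as good as the uniqueness of the pair, which is exactly what snapshots put under pressure.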

The problem is that at least two Linux filesystems (not counting ZFS) can now instantly create a writable copy of the entire inode number namespace populated with identical sets of existing files, i.e. a snapshot or subvol. The inode number isn't unique within a filesystem any more. Now at least two numbers are needed: one for the file's inode number, and the other for the inode's namespace.

The workaround for the above problem was to make distinct files that have the same inode number _appear_ to be on distinct filesystems, i.e. move the subvol ID into the filesystem device ID field, because the existing tools were already looking for filesystem boundaries to tell them when they entered a different inode number namespace. That works quite nicely for things like find and rsync.
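For illustration, this is roughly the kind of boundary check such tools rely on, sketched in Python. It is not the code of find or rsync; `device_boundaries` is a hypothetical helper that prunes a walk wherever `st_dev` changes, in the spirit of `find -xdev`.

```python
import os

def device_boundaries(root):
    """Walk root, staying on its st_dev; return directories where st_dev changes."""
    root_dev = os.lstat(root).st_dev
    crossings = []
    for dirpath, dirnames, _filenames in os.walk(root):
        keep = []
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if os.lstat(full).st_dev == root_dev:
                keep.append(name)
            else:
                # Different st_dev: a new inode number namespace starts here.
                crossings.append(full)
        # Pruning dirnames in place stops os.walk from descending further.
        dirnames[:] = keep
    return crossings
```

With the workaround in place, a subvol shows up in `crossings` just like a separately mounted filesystem would, which is why the existing tools keep working without changes.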

The workaround messes up _other_ tools that needed a unique identifier to de-alias filesystems (i.e. to pass to other kernel APIs to figure out which block devices the filesystem occupies and which mount points need to be umounted). A filesystem can be mounted multiple times and mounted anywhere, and a lot of tools need robust information about where mounts have happened, so this is a serious need. Tools like findmnt have appeared to try to solve this problem, and 'find' and 'rm' have grown their own code to try to detect things like mount loops.

The solution that didn't work was to say "OK we'll treat every subvol as a distinct filesystem", because it just highlights how broken and unscalable the mountinfo interface is. findmnt is OK with 100 or so mount points, but starts behaving badly when you have tens of thousands of them, and some users want millions.

I've looked at /proc/*/mountinfo and findmnt and...I don't want to save them, I just want to set all of them on fire and start over. Like:
* statx tells you the filesystem instance ID for each file. Look up that instance ID in /sys/blah/blah/ID/blah/blah to find a driver (filesystem) and block device list (or remote address or whatever) or UUID or whatever else you wanted to know or the filesystem could tell you.
* Instance IDs would be unique over the lifetime of the kernel, 64 bits long, increment by 1 each time something new is mounted, never repeated.
* Bind mounts to the same filesystem get the same instance ID, but if the filesystem is completely umounted and mounted again it gets a new instance ID.
* If you look at two files and they have the same instance ID, you can expect things like reflinks and dedup to work sometimes; if they have different instance IDs, then no chance.
* Note this instance ID concept is intentionally different from the filesystem UUID because UUIDs aren't unique--if you mount an identical ext3 image 20 times, you have 20 mount points with the same UUID, but 20 distinct instance IDs.
* st_dev is unique for every bind mount _and_ every inode namespace transition (more or less as it is now).
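
To make the allocation rules in the list above concrete, here is a toy Python model. Everything in it is hypothetical: no such kernel interface exists, and `InstanceTable`, `mount`, `bind`, `umount`, `uuid_of` and `may_share_extents` are names invented for this sketch. It only models how IDs would be handed out, not statx or anything under /sys.

```python
import itertools

class InstanceTable:
    """Toy model of the proposed instance IDs (hypothetical, not a kernel API)."""

    def __init__(self):
        # Monotonic counter, never reused. A kernel would use a 64-bit
        # counter; Python integers are unbounded, so no wraparound is modelled.
        self._next_id = itertools.count(1)
        self._live = {}  # instance ID -> [filesystem UUID, number of mounts]

    def mount(self, uuid):
        """Mount a filesystem that is not currently mounted: allocate a new ID."""
        inst = next(self._next_id)
        self._live[inst] = [uuid, 1]
        return inst

    def bind(self, inst):
        """Bind-mount an already mounted filesystem: same ID, one more mount."""
        self._live[inst][1] += 1
        return inst

    def umount(self, inst):
        """Drop one mount; the ID is retired for good once none remain."""
        entry = self._live[inst]
        entry[1] -= 1
        if entry[1] == 0:
            del self._live[inst]

    def uuid_of(self, inst):
        """Look up the UUID of a live instance; UUIDs may repeat across instances."""
        return self._live[inst][0]

def may_share_extents(inst_a, inst_b):
    """Equal IDs are necessary, not sufficient, for reflink or dedup to work."""
    return inst_a == inst_b
```

Mounting the same image twice through `mount` yields two IDs with the same UUID, while `bind` keeps the ID; fully unmounting and calling `mount` again yields an ID larger than any seen before.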