The Btrfs inode-number epic (part 2: solutions)

Posted Aug 29, 2021 9:22 UTC (Sun) by NYKevin (subscriber, #129325)
In reply to: The Btrfs inode-number epic (part 2: solutions) by neilbrown
Parent article: The Btrfs inode-number epic (part 2: solutions)

> Setting i_ino to a strong hash of whatever value the filesystem uses internally to find a file is a tempting idea. My last proposed solution for the NFS problem is to use a weak hash (xor with bit shift). Maybe we should use a strong hash instead.

Why is it necessary to use something that has the potential for collisions at all? Why not just hand out arbitrary or sequential numbers in a centralized fashion (like every other filesystem that isn't FAT)? Is there some rule that says you're not allowed to look at subvolume X when you make a new file in subvolume Y? Why would such a rule be necessary?
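For reference, here is a rough sketch of what "weak hash" versus "strong hash" could mean when folding a (subvolume ID, inode number) pair into one 64-bit value. The function names, the shift of 40 bits, and the choice of BLAKE2b are all assumptions made for illustration; this is not the code from the proposed patches.

```python
import hashlib

MASK64 = (1 << 64) - 1

def weak_ino(subvol_id, ino, shift=40):
    """Hypothetical weak hash: XOR the inode number with the shifted subvol ID."""
    return (ino ^ (subvol_id << shift)) & MASK64

def strong_ino(subvol_id, ino):
    """Hypothetical strong hash: 64 bits of BLAKE2b over both 64-bit values."""
    data = subvol_id.to_bytes(8, "little") + ino.to_bytes(8, "little")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")

# Two different (subvol, inode) pairs that the weak hash cannot tell apart.
print(weak_ino(1, 0) == weak_ino(0, 1 << 40))
```

Either way, 128 bits of input are squeezed into 64 bits of output, so collisions cannot be ruled out. A strong hash only makes them harder to hit by accident or on purpose.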

Posted Aug 29, 2021 12:31 UTC (Sun) by foom (subscriber, #14868)

Subvolume Y can be a writable snapshot clone of subvolume X. You want creating snapshots to be fast and not to use excess disk space. Iterating over every file in X to assign each one a new, centrally assigned unique inode number in Y would be costly in both time and space.

Posted Aug 30, 2021 15:05 UTC (Mon) by zblaxell (subscriber, #26385)

Creating a subvol snapshot instantly duplicates every inode number that already exists in the snapshot source subvol to the target subvol. "Instantly" is a core requirement of the feature set--if a filesystem cannot create a subvol snapshot instantly, there is no performance or consistency advantage over userspace emulation with 'cp -a --reflink=always', and it is better not to implement the subvol feature set at all (which is roughly why XFS doesn't do it).
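A toy model may make the duplication easier to see. Everything here (the `ToyFS` class, the dict-based "tree", the starting IDs) is invented for illustration. A real copy-on-write B-tree copies only the modified path on write, whereas this sketch copies the whole dict.

```python
class ToyFS:
    """Toy model: each subvolume is a root pointing at an ino -> name tree."""

    def __init__(self):
        self.roots = {}            # subvol_id -> tree (treated as immutable)
        self.next_subvol_id = 256  # arbitrary starting ID for the sketch

    def _new_subvol(self, tree):
        sid = self.next_subvol_id
        self.next_subvol_id += 1
        self.roots[sid] = tree
        return sid

    def create_subvol(self):
        return self._new_subvol({})

    def snapshot(self, src):
        # O(1): share the source tree; no inode is visited or renumbered.
        return self._new_subvol(self.roots[src])

    def create_file(self, sid, name):
        old = self.roots[sid]
        ino = max(old, default=256) + 1  # allocator looks only at this subvol
        # Copy-on-write: build a new tree so the other subvol is unaffected.
        self.roots[sid] = {**old, ino: name}
        return ino

fs = ToyFS()
x = fs.create_subvol()
a = fs.create_file(x, "a")
y = fs.snapshot(x)
print(a in fs.roots[y])                                   # existing inode is in both
print(fs.create_file(x, "b") == fs.create_file(y, "c"))  # new files collide too
```

In this model only the (subvol ID, inode number) pair is unique. The inode number alone is duplicated by the snapshot, and the per-subvol allocators keep handing out the same numbers afterwards.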

To get globally unique and stable inode numbers without a separate subvol ID, the filesystem would have to dynamically remap duplicate inode numbers from subvol-local values to globally-unique values every time a readdir() or stat() happened. This adds overhead to every read operation, and filesystem maintainers are reluctant to implement it. They would prefer some more efficient way to tell an application "this is a distinct inode number namespace but not a distinct filesystem" so that applications that rely on the uniqueness feature can bear some of the costs (including opportunity costs) of implementing it, while not imposing new costs (such as new O(log(N)) search costs on every stat(), or exploding /proc/mounts and `df` output size) on applications that don't care about inode uniqueness.
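For readers wondering what "applications that rely on the uniqueness feature" look like in practice, here is a minimal sketch of the usual identity check. The helper names are made up; `os.stat()` and its `st_dev` and `st_ino` fields are the standard interface. Today btrfs keeps this check working by reporting a different `st_dev` for each subvolume, which acts as the separate subvol ID mentioned above.

```python
import os
import tempfile

def file_identity(path):
    """Return the (device, inode) pair most tools use to identify a file."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

def same_file(a, b):
    """True if both paths resolve to the same underlying file."""
    return file_identity(a) == file_identity(b)

with tempfile.TemporaryDirectory() as d:
    first = os.path.join(d, "first")
    second = os.path.join(d, "second")
    for p in (first, second):
        with open(p, "w") as f:
            f.write("x")
    print(same_file(first, first))   # same path, same identity
    print(same_file(first, second))  # distinct files, distinct identity
```

If two subvolumes shared one `st_dev` and the filesystem did not remap inode numbers, a file and its snapshot copy would compare equal here even though they can have different contents.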

The NFS server could maintain its own persistent unique inode numbers in a mapping table outside of the filesystem, and not send the filesystem's inode numbers to clients at all, but that has obvious and onerous runtime costs (the NFS server would have to maintain persistent state proportional to filesystem size).
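A sketch of that mapping-table idea shows where the cost comes from. The `InodeMapper` class is hypothetical, and it keeps the table in memory only. A real server would have to store it persistently so the numbers survive a restart.

```python
import itertools

class InodeMapper:
    """Hypothetical server-side table: (subvol_id, ino) -> exported number."""

    def __init__(self):
        self._table = {}
        self._next = itertools.count(1)

    def export_ino(self, subvol_id, ino):
        key = (subvol_id, ino)
        if key not in self._table:
            # First time this file is seen: hand out the next free number.
            self._table[key] = next(self._next)
        return self._table[key]

    def __len__(self):
        return len(self._table)

mapper = InodeMapper()
print(mapper.export_ino(256, 257) != mapper.export_ino(257, 257))  # no collision
print(mapper.export_ino(256, 257) == mapper.export_ino(256, 257))  # stable
print(len(mapper))  # one entry per file ever exported
```

The exported numbers never collide, but the table gains one entry for every file the server has ever exposed, which is the state proportional to filesystem size described above.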