The Btrfs inode-number epic (part 2: solutions)

Posted Aug 23, 2021 19:43 UTC (Mon) by NYKevin (subscriber, #129325)
In reply to: The Btrfs inode-number epic (part 2: solutions) by ibukanov
Parent article: The Btrfs inode-number epic (part 2: solutions)

64-bit inode numbers already give you room for ~18 quintillion (2^64, about 1.8 x 10^19) inodes per filesystem. You do not actually need that many, or if you do, something has gone terribly wrong.

Sure, you can throw more bits at the problem, but you're just treating the symptoms. The real issue here is not "we don't have enough bits." It's "we can't agree on exactly how those bits should be allocated." One possibility: btrfs might decide to have unique inode numbers over the whole filesystem. That would likely be challenging but technically possible. For example, when you create a subvolume, you allocate a new 32-bit inode prefix to that subvolume, and whenever any subvolume runs out of inode numbers, you give it another 32-bit prefix. Since each prefix contains ~4 billion inode numbers, this allocation should happen rather infrequently, and since there are ~4 billion possible prefix values, the large size of these allocations will not easily cause a shortage.
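To make the prefix idea concrete, here is a minimal sketch in Python. It is not how btrfs works today; the class and method names are invented for illustration, and the split between prefix bits and per-prefix counter bits is a parameter so the "allocate another prefix" path can be exercised with small numbers.

```python
class PrefixAllocator:
    """Hypothetical filesystem-wide allocator of unique 64-bit inode numbers."""

    def __init__(self, local_bits=32):
        self.local_bits = local_bits   # low bits: counter inside one prefix
        self.next_prefix = 0           # high bits: next unused prefix
        self.current = {}              # subvol -> (prefix, next local counter)

    def _new_prefix(self):
        if self.next_prefix >= 1 << (64 - self.local_bits):
            raise RuntimeError("out of inode prefixes")
        prefix = self.next_prefix
        self.next_prefix += 1
        return prefix

    def create_subvolume(self, subvol):
        # Every new subvolume starts with a fresh prefix of its own.
        self.current[subvol] = (self._new_prefix(), 0)

    def alloc_inode(self, subvol):
        prefix, local = self.current[subvol]
        if local >= 1 << self.local_bits:
            # This prefix is exhausted: hand the subvolume another one.
            prefix, local = self._new_prefix(), 0
        self.current[subvol] = (prefix, local + 1)
        return (prefix << self.local_bits) | local


alloc = PrefixAllocator(local_bits=2)   # tiny groups, for demonstration only
alloc.create_subvolume("a")
alloc.create_subvolume("b")
print([alloc.alloc_inode("a") for _ in range(5)])
print(alloc.alloc_inode("b"))
```

With `local_bits=32` this matches the 32/32 split described above. A real implementation would also have to persist the prefix assignments on disk, which this sketch ignores.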

But I doubt you can actually do that and still maintain on-disk compatibility with existing btrfs filesystems. Oh well.

Posted Aug 23, 2021 22:43 UTC (Mon) by willy (subscriber, #9762)

I would go further. Allocate in groups of 2^16. That way we run out of space in a group frequently and test the "allocate new prefix" path every day instead of once every dozen years.

Posted Aug 23, 2021 22:50 UTC (Mon) by zblaxell (subscriber, #26385)

This is a variation on the swab64() strategy. Unless I've missed something, it doesn't require any on-disk format changes in most cases. The NFS server can already do it now, so there's no reason why btrfs couldn't do it itself. You'd use a mount option that says "crush all my inodes into one 64-bit namespace" and there would be a corresponding loss of maximum filesystem size/age. Internally the filesystem would still use separate subvol and inode, so you could remove the mount option and get the old behavior again.

If we know the highest-numbered subvol on the filesystem (which is a trivial tree lookup at mount time) and we use bit-swap instead of byte-swap, then we know which bits are subvol ID and which are inode (all bits that are not subvol ID are inode ID), so we have a nice pair of O(1) bidirectional conversion functions. We can also tell when subvol and inode bits might collide: that is impossible as long as the number of bits needed for the highest subvol ID and the highest inode number total no more than 64, but you probably want warnings around 56 or so.
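A rough Python sketch of that pair of conversion functions follows. The function names are hypothetical, and it assumes the subvol ID is bit-reversed into the top of the 64-bit value while the inode number occupies the bottom; a kernel implementation would use a constant-time bit-reverse primitive rather than string formatting.

```python
MASK64 = (1 << 64) - 1


def reverse64(x):
    """Reverse the bit order of a 64-bit value (bit 0 swaps with bit 63)."""
    return int(format(x & MASK64, "064b")[::-1], 2)


def encode(subvol, inode, max_subvol):
    """Pack (subvol, inode) into one 64-bit number, or fail on overlap."""
    k = max_subvol.bit_length()          # bits reserved for subvol IDs
    if subvol > max_subvol or inode.bit_length() + k > 64:
        raise OverflowError("subvol and inode bits would collide")
    # Reversed subvol bits grow down from bit 63, inode bits grow up from bit 0.
    return reverse64(subvol) | inode


def decode(ino64, max_subvol):
    """Recover (subvol, inode) from a number produced by encode()."""
    k = max_subvol.bit_length()
    low_mask = (1 << (64 - k)) - 1       # everything that is not subvol ID
    return reverse64(ino64 & ~low_mask & MASK64), ino64 & low_mask


packed = encode(258, 257, max_subvol=300)
print(hex(packed), decode(packed, max_subvol=300))
```

Because the subvol bits are reversed, a filesystem with few subvolumes leaves almost the whole low range to inode numbers, and the boundary only moves when `max_subvol` needs another bit.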

If you ran out of inode bits in a subvol then you'd need a lookup table to map a discontiguous range of inodes to subvols. That table would require a disk format change, but most users will never occupy enough bits to need it (it will take decades, creating thousands of inodes every second and thousands of snapshots per day, before the subvol and inode bits bump into each other). It could be created lazily when the free bits run out, but if that takes 20 years to happen then that code isn't going to be very well tested.
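The "decades" estimate can be sanity-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the rates of 1000 inodes per second and 1000 snapshots per day are assumed stand-ins for "thousands", both counters are assumed to start near zero and never be reused, and the function name is made up.

```python
SECONDS_PER_DAY = 86_400


def bits_needed(years, inodes_per_second=1000, snapshots_per_day=1000):
    """Bits consumed by the highest inode number and the highest subvol ID."""
    days = years * 365
    inodes = days * SECONDS_PER_DAY * inodes_per_second
    subvols = days * snapshots_per_day
    return inodes.bit_length(), subvols.bit_length()


for years in (10, 20, 40):
    ino_bits, subvol_bits = bits_needed(years)
    print(years, "years:", ino_bits, "+", subvol_bits, "=", ino_bits + subvol_bits, "bits")
```

Under these assumptions the total is 63 bits after 20 years, so a sustained workload of that intensity takes a couple of decades to exhaust a shared 64-bit space.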

Alternatively btrfs could in the future do garbage collection to free up old object ID numbers, i.e. start at the highest inodes and pack them into the lowest-numbered available inode slots, and stop when it had freed up enough top bits. That wouldn't require an on-disk format change, it would just be a maintenance task to run at regular intervals, say, once every 15 years. This is roughly equivalent to creating an empty subvol and using 'cp -a --reflink' to move the data into files with smaller inode numbers, so if you are in really dire straits you don't need to wait for a special tool.
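As an illustration of the packing step only, here is a small Python sketch that computes a renumbering plan. It is purely hypothetical: `compact` is an invented name, it works on an in-memory set of used inode numbers, and it ignores reserved low object IDs and everything involved in actually rewriting metadata.

```python
def compact(used, target_bits):
    """Map every inode number that needs more than target_bits bits
    onto the lowest free number that fits, highest offenders first."""
    limit = 1 << target_bits
    used = set(used)
    too_big = sorted((i for i in used if i >= limit), reverse=True)
    # Lazily walk the free slots below the limit in ascending order.
    free = (n for n in range(limit) if n not in used)
    remap = {}
    for old in too_big:
        new = next(free, None)
        if new is None:
            raise ValueError("not enough free slots below the limit")
        remap[old] = new
    return remap


print(compact({0, 1, 5, 1000, 70000}, target_bits=4))
```

The returned dictionary maps old inode numbers to new ones; numbers that already fit in `target_bits` are left alone.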

Posted Sep 12, 2021 19:41 UTC (Sun) by nix (subscriber, #2304)

> The real issue here is not "we don't have enough bits." It's "we can't agree on exactly how those bits should be allocated."

Another way of putting it: if you insist on stacking new bits on the front of an inode number for every new sort of thing that must be unique within a mount point (new btrfs subvolumes, new mount points within an NFS export, new this, new that), you can *never* have enough bits, because you can always add another layer of overlayfs or NFS exporting or whatever, and require more. And since most filesystems are using 64-bit inode numbers already, 64 bits is *never* enough to maintain guaranteed uniqueness in that space while adding more spaces on top as well.

(What saves us and lets us use kludges like the one in this article without disaster is that 64-bit spaces are, indeed, so large that we can just assume they are almost entirely empty and pick more numbers at random, as long as they're not mostly-bits-0 or mostly-bits-1, and probably work nearly all the time, despite the birthday paradox. This is gross but probably good enough. I for one do not want a 128-bit ino_t flag day any time soon thankyouverymuch!)
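A rough birthday-bound calculation shows how far "probably work nearly all the time" stretches. The sketch below uses the standard approximation p = 1 - exp(-n(n-1)/(2N)) with N = 2^64; the function name is made up, and it assumes the numbers are picked uniformly at random, which real inode numbers are not.

```python
import math

SPACE = 1 << 64  # size of a 64-bit inode-number space


def collision_probability(n, space=SPACE):
    """Approximate chance that n uniformly random picks contain a duplicate."""
    # Birthday approximation: 1 - exp(-n(n-1) / (2 * space)).
    return -math.expm1(-n * (n - 1) / (2 * space))


for n in (10**6, 10**9, 1 << 32):
    print(f"{n:>12} objects: p = {collision_probability(n):.3g}")
```

At 2^32 (about 4 billion) random picks the chance of at least one collision is already roughly 39%, so the trick only holds up while the number of objects stays far below that.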