The Btrfs inode-number epic (part 1: the problem)
One of the key Btrfs features is subvolumes, which are essentially independent filesystems maintained within a single storage volume. Snapshots are one commonly used form of subvolume; they allow the storage of copies of the state of another subvolume at a given point in time, with the underlying data shared to the extent that it has not been changed since each snapshot was taken. There are other applications for subvolumes as well, and they tend to be heavily used; Btrfs filesystems can contain thousands of subvolumes.
Btrfs subvolumes bring some interesting quirks with them. They can be mounted independently, as if they were separate filesystems, but they also appear as a part of the filesystem hierarchy as seen from the root. So one can mount subvolumes, but a subvolume can be accessed without being mounted if a higher-level directory is mounted. Imagine, for example, that /dev/sda1 contains a Btrfs filesystem that has been mounted on /butter. One could create a pair of subvolumes with commands like:
    # cd /butter
    # btrfs subvolume create subv1
    # btrfs subvolume create subv2
The root of /butter will now appear to contain two directories (subv1 and subv2):
    # tree /butter
    /butter
    ├── subv1
    └── subv2

    2 directories, 0 files
They behave like directories most of the time but, since they are actually subvolumes, there are some differences; one cannot rename a file from one to the other, for example. A suitably privileged user can now mount either subv1 or subv2 (or both) as independent filesystems. But, as long as /butter remains mounted, both subvolumes are visible as if they were part of the same filesystem. There are some interesting consequences from this behavior, as will be seen.
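The rename restriction reaches programs as an EXDEV error from rename(), the same error seen when renaming across ordinary mount points. A minimal Python sketch of the usual fallback follows. The helper name move_file is hypothetical, and the behavior across Btrfs subvolumes is assumed from the description above rather than tested here.

```python
import errno
import os
import shutil


def move_file(src: str, dst: str) -> str:
    """Try a plain rename; fall back to copy-and-delete on EXDEV.

    Returns "renamed" or "copied" so the caller can tell which path ran.
    """
    try:
        os.rename(src, dst)
        return "renamed"
    except OSError as exc:
        # EXDEV means "cross-device link": source and destination are not
        # on the same filesystem as far as rename() is concerned.
        if exc.errno != errno.EXDEV:
            raise
    # shutil.move() copies the data and then removes the source.
    shutil.move(src, dst)
    return "copied"
```

Within a single subvolume the function should return "renamed"; between subv1 and subv2 one would expect "copied".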
Btrfs uses a subvolume ID number internally to identify subvolumes, but there is no way to make that number directly visible to user space. Instead, the filesystem allocates a separate device number (the usual major/minor pair) for each subvolume; that number can be seen with a system call like stat(). If the subvolumes are not explicitly mounted, though, those numbers do not show up in files like /proc/self/mountinfo, leading to inconsistent views of how the filesystem is put together. [Update: as Neil Brown pointed out to us privately, the numbers do not show up there even if the subvolumes are explicitly mounted.] A call to stat() on a file within a subvolume will return a device number that does not exist in files like mountinfo, a situation that occasionally confuses unaware applications.
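The mismatch can be checked from user space. The following Python sketch compares the device number reported by stat() against the major:minor pairs listed in mountinfo. The function names are hypothetical, and the parser assumes only that the third whitespace-separated field of each mountinfo line is major:minor, as documented in proc(5).

```python
import os
from typing import Set, Tuple


def mountinfo_devices(text: str) -> Set[Tuple[int, int]]:
    """Collect the (major, minor) pairs found in mountinfo-formatted text."""
    devices = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a well-formed mountinfo line
        major, _, minor = fields[2].partition(":")
        if major.isdigit() and minor.isdigit():
            devices.add((int(major), int(minor)))
    return devices


def device_is_listed(path: str,
                     mountinfo_path: str = "/proc/self/mountinfo") -> bool:
    """Report whether the st_dev of path appears in the mount table."""
    st = os.stat(path)
    with open(mountinfo_path) as f:
        devices = mountinfo_devices(f.read())
    return (os.major(st.st_dev), os.minor(st.st_dev)) in devices
```

Given the behavior described above, device_is_listed() would be expected to return False for a file inside a Btrfs subvolume such as /butter/subv1.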
It gets worse. Since Btrfs has a unique internal ID for each subvolume, it feels no particular need to keep inode numbers unique across those subvolumes. As a result, a process walking a Btrfs filesystem from the root may well encounter multiple files with the same inode number. Tools like find use inode numbers as a way of tracking which files they have already seen and detecting filesystem loops. For a locally mounted Btrfs filesystem, things mostly work as expected because, even though two files on different subvolumes may have the same inode number, they will have different device numbers and are thus distinct.
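The identity check that such tools rely on can be sketched in a few lines of Python. This is only an illustration of the (device, inode) idea, not the actual logic of find; the helper names are hypothetical.

```python
import os
from typing import Iterable, Tuple


def file_id(path: str) -> Tuple[int, int]:
    """Return the (st_dev, st_ino) pair that identifies a file."""
    st = os.lstat(path)  # do not follow symbolic links
    return (st.st_dev, st.st_ino)


def count_distinct_files(paths: Iterable[str]) -> int:
    """Count files, treating paths with the same (dev, ino) as one file."""
    seen = set()
    for path in paths:
        seen.add(file_id(path))
    return len(seen)
```

Two files with the same inode number on different subvolumes still yield different pairs locally, because st_dev differs; the comparison only breaks down when the device numbers are made identical.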
The kernel's NFS daemon, though, has a harder time of things. It cannot present all of those artificial device numbers to NFS clients, because that would require all of the subvolumes — again, possibly thousands of them — to show up as separate mounts on the client. So a Btrfs filesystem exported via NFS shows the same device number (the device number of the root) on all subvolumes. That works most of the time, but it can make it impossible to use a tool like find on an NFS-mounted Btrfs filesystem with subvolumes. The single device number makes it impossible to distinguish files with the same inode number on different subvolumes, causing find to abort with a message about filesystem loops. This leads to occasional complaints from users and a desire to somehow improve the situation.
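A small simulation shows why the single device number is fatal for loop detection. The sketch below is not how find is implemented; it simply assumes that a directory whose (device, inode) pair matches one of its ancestors is treated as a loop. The device numbers 40 and 41 are invented for illustration, while 256 is the inode number Btrfs uses for the root directory of each subvolume.

```python
from typing import List, Optional, Tuple

# (path, st_dev, st_ino) for each directory from the root downward
Entry = Tuple[str, int, int]


def first_false_loop(chain: List[Entry]) -> Optional[Tuple[str, str]]:
    """Return (directory, ancestor) for the first identity clash, if any."""
    ancestors = {}
    for path, dev, ino in chain:
        key = (dev, ino)
        if key in ancestors:
            return (path, ancestors[key])
        ancestors[key] = path
    return None


def as_seen_over_nfs(chain: List[Entry], export_dev: int) -> List[Entry]:
    """Replace every device number with that of the exported root."""
    return [(path, export_dev, ino) for path, _dev, ino in chain]


# Hypothetical local view: the subvolume has its own device number.
LOCAL_VIEW = [("/butter", 40, 256), ("/butter/subv1", 41, 256)]
```

With LOCAL_VIEW, first_false_loop() finds nothing; after as_seen_over_nfs(LOCAL_VIEW, 40), /butter/subv1 becomes indistinguishable from its ancestor /butter and is reported as a loop.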
These problems are not new; they have been known and understood for years.
The level of complaints seems to be rising, though, perhaps as a
consequence of increased use of Btrfs in production situations. In theory,
the way to solve these problems is understood as well — though not all
developers have the same understanding, as Neil Brown found out when he
took on the task of fixing Btrfs filesystems exported via NFS. The second
and last article in this series, published on August 23, explores
various attempted solutions to this problem and why it turns out to be so
hard to fix.
Index entries for this article | |
---|---|
Kernel | Filesystems/Btrfs |
Posted Aug 20, 2021 15:47 UTC (Fri)
by rvolgers (guest, #63218)
Unioning file systems: Architecture, features, and design choices (2009) - https://lwn.net/Articles/324291/
Overlayfs issues and experiences (2015) - https://lwn.net/Articles/636943/
Posted Aug 20, 2021 16:49 UTC (Fri)
by tux3 (subscriber, #101245)
This situation is made more confusing by the fact that some tools (like mv) handle this transparently (by recursive copy), while other tools appear to mysteriously work for some destination folders and fail for others.
Posted Aug 20, 2021 17:26 UTC (Fri)
by smfrench (subscriber, #124116)
Posted Aug 21, 2021 2:02 UTC (Sat)
by jra (subscriber, #55261)
https://www.samba.org/samba/docs/current/man-html/vfs_fil...
that allow fileid (inode in Windows) manipulation in user space.
Posted Aug 21, 2021 1:35 UTC (Sat)
by Fowl (subscriber, #65667)
I guess we’ll find out in part two! Cliffhanger
Posted Aug 22, 2021 11:48 UTC (Sun)
by rincebrain (subscriber, #69638)
* Btrfs subvols _can_ be mounted independently of their "parent", but do not show up as distinct mounted filesystems unless you do that
I look forward to seeing what expectations various solutions break, as I'd probably default to wanting the subvols to have to be explicit mounts, but if someone was relying on not needing crossmnt or similar, they might be surprised one day...
Posted Aug 22, 2021 22:49 UTC (Sun)
by neilbrown (subscriber, #359)
Posted Aug 23, 2021 8:33 UTC (Mon)
by taladar (subscriber, #68407)
The question is then, what does btrfs bring to the table that makes it useful enough for everyone to invest that much effort into it?
Posted Aug 23, 2021 14:21 UTC (Mon)
by Paf (subscriber, #91811)
Posted Aug 23, 2021 17:29 UTC (Mon)
by anselm (subscriber, #2796)
All those people running Docker containers would probably be intrigued to hear about that.
Posted Aug 23, 2021 17:49 UTC (Mon)
by niner (subscriber, #26151)
Posted Aug 23, 2021 13:17 UTC (Mon)
by zblaxell (subscriber, #26385)
There's no generic VFS (or NFS) way to do it, but there is an ioctl that returns the subvol ID number of an open fd on btrfs (INO_LOOKUP on the constant subvol root inode number).
Posted Aug 24, 2021 19:16 UTC (Tue)
by dullfire (guest, #111432)
Jonathan, sorry to have to point this out, but you always mount btrfs on /bread
Posted Aug 25, 2021 9:33 UTC (Wed)
by sdalley (subscriber, #18550)
Posted Aug 27, 2021 12:12 UTC (Fri)
by alonz (subscriber, #815)
Posted Aug 27, 2021 15:08 UTC (Fri)
by sdalley (subscriber, #18550)
Posted Aug 29, 2021 1:22 UTC (Sun)
by ghane (guest, #1805)
Posted Aug 31, 2021 14:21 UTC (Tue)
by Wol (subscriber, #4433)
Cheers,
Union filesystems and other fun sources of EXDEV
* They don't promise unique inode numbers between them, so this would break far more things, except dev_id is unique, so most things notice that
* Mounting over NFS doesn't expose the dev_id being unique any more, resulting in fire
- they cannot be independently synced with syncfs()
- statfs() does not provide different information for different subvols (except the f_fsid) - so they are identical in "df".
- they do NOT appear in /proc/mounts. The entries that appear in /proc/mounts are no different to bind-mounts for some arbitrary directory within the filesystem
btrfs "subvolumes" are much more like project-quota trees as found in xfs, ext4, and f2fs, which also prevent renames from one tree to another and so are, in some sense, independent subtrees.
The particular value-add of btrfs "subvolumes" is that you can effectively create a reflink to one.
Meanwhile we continue to enjoy btrfs detecting and fixing otherwise silent data corruption via its checksums, live migrations, almost instant snapshots locally and very fast offsite backups via send/receive and keep wondering what those correctness problems could be that keep you from doing the same.
Well, I mount it on /toast and make sure it contains files marmalade and melted-cheese ...
Both at the same time??? Man, you have weird tastes ;-)
You bet. And copying peanut-butter prior to the other two makes it even nicer...
I'd swap melted-cheese... for sausage (and toast for fried-bread).
Wol