
The Btrfs inode-number epic (part 1: the problem)

By Jonathan Corbet
August 20, 2021
Unix-like systems — and their users — tend to expect all filesystems to behave in the same way. But those users are also often interested in fancy new filesystems offering features that were never envisioned by the developers of the Unix filesystem model; that has led to a number of interesting incompatibilities over time. Btrfs is certainly one of those filesystems; it provides a long list of features that are found in few other systems, and some of those features interact poorly with the traditional view of how filesystems work. Recently, Neil Brown has been trying to resolve a specific source of confusion relating to how Btrfs handles inode numbers.

One of the key Btrfs features is subvolumes, which are essentially independent filesystems maintained within a single storage volume. Snapshots are one commonly used form of subvolume; they allow the storage of copies of the state of another subvolume at a given point in time, with the underlying data shared to the extent that it has not been changed since each snapshot was taken. There are other applications for subvolumes as well, and they tend to be heavily used; Btrfs filesystems can contain thousands of subvolumes.

Btrfs subvolumes bring some interesting quirks with them. They can be mounted independently, as if they were separate filesystems, but they also appear as a part of the filesystem hierarchy as seen from the root. So one can mount subvolumes, but a subvolume can be accessed without being mounted if a higher-level directory is mounted. Imagine, for example, that /dev/sda1 contains a Btrfs filesystem that has been mounted on /butter. One could create a pair of subvolumes with commands like:

    # cd /butter
    # btrfs subvolume create subv1
    # btrfs subvolume create subv2

The root of /butter will now appear to contain two directories (subv1 and subv2):

    # tree /butter
    /butter
    ├── subv1
    └── subv2

    2 directories, 0 files

They behave like directories most of the time but, since they are actually subvolumes, there are some differences; one cannot rename a file from one to the other, for example. A suitably privileged user can now mount either subv1 or subv2 (or both) as independent filesystems. But, as long as /butter remains mounted, both subvolumes are visible as if they were part of the same filesystem. There are some interesting consequences from this behavior, as will be seen.
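The rename restriction is worth a brief illustration. The sketch below assumes that a cross-subvolume rename() fails with EXDEV, the same error returned for a rename across ordinary mount points, and falls back to copy-and-delete for a single regular file. The function name move_file and its injectable rename parameter are hypothetical; the parameter exists only so that the fallback can be exercised without a real subvolume boundary.

```python
import errno
import os
import shutil


def move_file(src, dst, rename=os.rename):
    """Move one regular file, falling back to copy-and-delete on EXDEV.

    `rename` is a hypothetical injection point used for testing; real
    callers would leave it at the default, os.rename.
    """
    try:
        rename(src, dst)
        return "renamed"
    except OSError as exc:
        # Only the cross-device case is handled; anything else is re-raised.
        if exc.errno != errno.EXDEV:
            raise
    # The kernel refused to rename across the boundary, so copy the data
    # and metadata to the destination, then remove the source.
    shutil.copy2(src, dst)
    os.unlink(src)
    return "copied"
```

The fallback is not atomic, and it handles only regular files; a real tool would also need to deal with directories, special files, and partial failures.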

Btrfs uses a subvolume ID number internally to identify subvolumes, but there is no way to make that number directly visible to user space. Instead, the filesystem allocates a separate device number (the usual major/minor pair) for each subvolume; that number can be seen with a system call like stat(). If the subvolumes are not explicitly mounted, though, those device numbers do not show up in files like /proc/self/mountinfo, leading to inconsistent views of how the filesystem is put together. [Update: as Brown pointed out to us privately, the numbers do not show up there even if the subvolumes are explicitly mounted.] A call to stat() on a file within a subvolume will return a device number that does not exist in files like mountinfo, a situation that occasionally confuses unaware applications.
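One way to observe this mismatch is to compare the device number that stat() reports for a path against the major:minor field (the third column) of each mountinfo line. The following is a minimal sketch; the helper names device_of, mountinfo_devices, and is_listed are made up for illustration.

```python
import os


def device_of(path):
    """Return the (major, minor) device number that stat() reports for path."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)


def mountinfo_devices(mountinfo_text):
    """Collect the major:minor pairs (third field) from mountinfo-format text."""
    devices = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        major, minor = fields[2].split(":")
        devices.add((int(major), int(minor)))
    return devices


def is_listed(path, mountinfo_path="/proc/self/mountinfo"):
    """Report whether the device number of path appears in mountinfo."""
    with open(mountinfo_path) as f:
        return device_of(path) in mountinfo_devices(f.read())
```

Going by the description above, is_listed("/butter/subv1") would be expected to return False on the example filesystem, while most paths on conventional filesystems would return True.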

It gets worse. Since Btrfs has a unique internal ID for each subvolume, it feels no particular need to keep inode numbers unique across those subvolumes. As a result, a process walking a Btrfs filesystem from the root may well encounter multiple files with the same inode number. Tools like find use inode numbers as a way of tracking which files they have already seen and detecting filesystem loops. For a locally mounted Btrfs filesystem, things mostly work as expected because, even though two files on different subvolumes may have the same inode number, they will have different device numbers and are thus distinct.
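The identity check that such tools rely on can be sketched as follows: a file is identified by the pair (st_dev, st_ino) rather than by the inode number alone. This is an illustration of the general technique and not the code used by find; the names file_identity and group_by_identity are hypothetical.

```python
import os
from collections import defaultdict


def file_identity(path):
    """Return the (device, inode) pair that identifies the file behind path."""
    st = os.lstat(path)  # lstat: do not follow symbolic links
    return (st.st_dev, st.st_ino)


def group_by_identity(paths):
    """Map each (device, inode) pair to the list of paths that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_identity(path)].append(path)
    return dict(groups)
```

Two hard links to the same file land in the same group, while two files on different Btrfs subvolumes that happen to share an inode number stay separate because their st_dev values differ.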

The kernel's NFS daemon, though, has a harder time of things. It cannot present all of those artificial device numbers to NFS clients, because that would require all of the subvolumes (again, possibly thousands of them) to show up as separate mounts on the client. So a Btrfs filesystem exported via NFS shows the same device number (the device number of the root) on all subvolumes. That works most of the time, but it can break tools like find on an NFS-mounted Btrfs filesystem with subvolumes: with only a single device number, files with the same inode number on different subvolumes cannot be told apart, and find aborts with a message about filesystem loops. This leads to occasional complaints from users and a desire to somehow improve the situation.
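The failure can be reproduced in a small simulation. The sketch below models a directory tree in memory and flags any directory whose (device, inode) identity matches one of its ancestors, which is roughly the kind of check that produces a filesystem-loop complaint. The device numbers 40 through 42 are invented, and giving every subvolume root the inode number 256 is an assumption about Btrfs behavior made for illustration; the Dir class and find_loops function are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dir:
    name: str
    dev: int   # device number as stat() would report it
    ino: int   # inode number as stat() would report it
    children: list = field(default_factory=list)


def find_loops(root, collapse_dev=None):
    """Return paths of directories whose identity matches an ancestor's.

    If collapse_dev is given, every directory is treated as having that
    device number, mimicking an export that reports one device number.
    """
    loops = []

    def walk(node, path, ancestors):
        dev = node.dev if collapse_dev is None else collapse_dev
        ident = (dev, node.ino)
        if ident in ancestors:
            loops.append(path)  # looks like we have been here before
            return
        for child in node.children:
            walk(child, f"{path}/{child.name}", ancestors | {ident})

    walk(root, root.name, frozenset())
    return loops


# Illustrative tree: the top level and both subvolume roots share inode 256.
butter = Dir("/butter", dev=40, ino=256, children=[
    Dir("subv1", dev=41, ino=256),
    Dir("subv2", dev=42, ino=256),
])
```

With the per-subvolume device numbers intact, find_loops(butter) reports nothing; with find_loops(butter, collapse_dev=40), both subvolume roots become indistinguishable from /butter itself and are flagged as loops.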

These problems are not new; they have been known and understood for years. The level of complaints seems to be rising, though, perhaps as a consequence of increased use of Btrfs in production situations. In theory, the way to solve these problems is understood as well — though not all developers have the same understanding, as Neil Brown found out when he took on the task of fixing Btrfs filesystems exported via NFS. The second and last article in this series, published on August 23, explores various attempted solutions to this problem and why it turns out to be so hard to fix.

Index entries for this article
Kernel: Filesystems/Btrfs



The Btrfs inode-number epic (part 1: the problem)

Posted Aug 20, 2021 15:47 UTC (Fri) by rvolgers (guest, #63218) [Link] (1 responses)

The rename behavior issue and persistent inode numbering problems also come up in the various union filesystems, for example:

Unioning file systems: Architecture, features, and design choices (2009) - https://lwn.net/Articles/324291/

Overlayfs issues and experiences (2015) - https://lwn.net/Articles/636943/

Union filesystems and other fun sources of EXDEV

Posted Aug 20, 2021 16:49 UTC (Fri) by tux3 (subscriber, #101245) [Link]

This sometimes shows up inside docker containers. rename(2) will return an error if the source and destination folders happen to be on a different layer, which may be less than transparent to users unfamiliar with the specifics of docker's default storage driver.

This situation is made more confusing by the fact that some tools (like mv) handle this transparently (by recursive copy), while other tools appear to mysteriously work for some destination folders and fail for others.

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 20, 2021 17:26 UTC (Fri) by smfrench (subscriber, #124116) [Link] (1 responses)

This problem of how to expose subvolumes over a network file system is probably easier for SMB3 (e.g. Samba or ksmbd) to deal with because when you have a subvolume you can redirect the client transparently back to the subvolume with a DFS referral.

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 21, 2021 2:02 UTC (Sat) by jra (subscriber, #55261) [Link]

Well we (Samba) do have problems with inode duplication. However, because we do all access via a user-pluggable VFS API we have modules like vfs_fileid:

https://www.samba.org/samba/docs/current/man-html/vfs_fil...

that allow fileid (inode in Windows) manipulation in user space.

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 21, 2021 1:35 UTC (Sat) by Fowl (subscriber, #65667) [Link] (1 responses)

So the problem is that NFS doesn’t support nested mountpoints - and btrfs makes them more common? Or that those mountpoints don’t appear in the list of mounts and this confuses the NFS server?

I guess we’ll find out in part two! Cliffhanger

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 22, 2021 11:48 UTC (Sun) by rincebrain (subscriber, #69638) [Link]

AIUI what is being described is that:

* Btrfs subvols _can_ be mounted independently of their "parent", but do not show up as distinct mounted filesystems unless you do that
* They don't promise unique inode numbers between them, so this would break far more things, except dev_id is unique, so most things notice that
* Mounting over NFS doesn't expose the dev_id being unique any more, resulting in fire

I look forward to seeing what expectations various solutions break, as I'd probably default to wanting the subvols to have to be explicit mounts, but if someone was relying on not needing crossmnt or similar, they might be surprised one day...

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 22, 2021 22:49 UTC (Sun) by neilbrown (subscriber, #359) [Link]

I disagree with the assertion (commonly held) that btrfs subvols are "essentially independent filesystems".
- they cannot be independently synced with syncfs()
- statfs() does not provide different information for different subvols (except the f_fsid) - so they are identical in "df".
- they do NOT appear in /proc/mounts. The entries that appear in /proc/mounts are no different to bind-mounts for some arbitrary directory within the filesystem

btrfs "subvolumes" are much more like project-quota trees as found in xfs, ext4, and f2fs, which also prevent renames from one tree to another and so are, in some sense, independent subtrees.
The particular value-add of btrfs "subvolumes" is that you can effectively create a reflink to one.

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 23, 2021 8:33 UTC (Mon) by taladar (subscriber, #68407) [Link] (3 responses)

Sounds to me as if btrfs is pretty much unusable in any place where correctness matters (so 99% of use cases in practice) without giving it another 20 years for each and every tool to work around its idiosyncrasies.

The question is then, what does btrfs bring to the table that makes it useful enough for everyone to invest that much effort into it?

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 23, 2021 14:21 UTC (Mon) by Paf (subscriber, #91811) [Link] (1 responses)

As comments above indicate, this is also a problem with the various union file systems. Are they also unusable in production?

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 23, 2021 17:29 UTC (Mon) by anselm (subscriber, #2796) [Link]

All those people running Docker containers would probably be intrigued to hear about that.

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 23, 2021 17:49 UTC (Mon) by niner (subscriber, #26151) [Link]

You can of course wait for another 20 years before using btrfs.
Meanwhile, we continue to enjoy btrfs detecting and fixing otherwise silent data corruption via its checksums, live migrations, almost instant local snapshots, and very fast offsite backups via send/receive, and keep wondering what those correctness problems could be that keep you from doing the same.

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 23, 2021 13:17 UTC (Mon) by zblaxell (subscriber, #26385) [Link]

> Btrfs uses a subvolume ID number internally to identify subvolumes, but there is no way to make that number directly visible to user space

There's no generic VFS (or NFS) way to do it, but there is an ioctl that returns the subvol ID number of an open fd on btrfs (INO_LOOKUP on the constant subvol root inode number).

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 24, 2021 19:16 UTC (Tue) by dullfire (guest, #111432) [Link] (5 responses)

> Imagine, for example, that /dev/sda1 contains a Btrfs filesystem that has been mounted on /butter.

Jonathan, sorry to have to point this out, but you always mount btrfs on /bread

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 25, 2021 9:33 UTC (Wed) by sdalley (subscriber, #18550) [Link] (4 responses)

Well, I mount it on /toast and make sure it contains files marmalade and melted-cheese...

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 27, 2021 12:12 UTC (Fri) by alonz (subscriber, #815) [Link] (2 responses)

Both at the same time??? Man, you have weird tastes ;-)

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 27, 2021 15:08 UTC (Fri) by sdalley (subscriber, #18550) [Link]

You bet. And copying peanut-butter prior to the other two makes it even nicer...

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 29, 2021 1:22 UTC (Sun) by ghane (guest, #1805) [Link]

May I point out the flame-wars you get if you use the O_MARMITE flag? And there is a similar, but incompatible, O_VEGEMITE flag as well.

The Btrfs inode-number epic (part 1: the problem)

Posted Aug 31, 2021 14:21 UTC (Tue) by Wol (subscriber, #4433) [Link]

I'd swap melted-cheese... for sausage (and toast for fried-bread).

Cheers,
Wol


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds