LWN.net Weekly Edition for August 26, 2021
Welcome to the LWN.net Weekly Edition for August 26, 2021
This edition contains the following feature content:
- DVB, header files, and user-space regressions: does the kernel's no-regressions policy extend to moving a header file?
- Adding a "duress" password with PAM Duress: using PAM to detect and respond to under-duress logins.
- The shrinking role of ETXTBSY: an obscure, Unix-derived behavior may eventually go away entirely.
- The Btrfs inode-number epic (part 1: the problem): the Btrfs filesystem has some surprising behaviors that make it hard to serve via NFS.
- The Btrfs inode-number epic (part 2: solutions): the even more surprising series of attempts required to work around those Btrfs behaviors.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
DVB, header files, and user-space regressions
A regression that was recently reported for 5.14 in the media subsystem is a bit of a strange beast. The kernel's user-space binary interface (ABI) was not changed, which is the usual test for a patch to get reverted, but the report still led to a reversion. The change did lead to problems building a user-space application because it moved some header files to staging/ as part of a cleanup for a deprecated—though apparently still functioning—driver for a Digital Video Broadcasting (DVB) device. There are a few different issues tangled together here, but a reversion prompted by a regression in the user-space API (and not the ABI) is a new wrinkle.
Soeren Moch reported the regression in a lengthy message on August 11. Parts of the message were aimed at Linus Torvalds and requested the reversion of three patches committed by media subsystem maintainer Mauro Carvalho Chehab. Those patches moved the av7110 driver to staging and followed up by moving some documentation and header files there as well. But that broke the ability to build parts of the Video Disk Recorder (VDR) application, which is shipped with multiple Linux distributions, Moch said.
He pointed to a Red Hat bug that had been filed by another member of the VDR community as evidence of the problem. VDR uses header files from the uapi/linux/dvb directory; when those got moved, VDR would no longer build on the kernel used for the upcoming Fedora 35. Beyond that, Moch disagreed with moving "the long existing and working av7110 driver" to staging/.
Much of the email was addressed to Chehab; it complained about the deprecation of the existing DVB API [Update: As noted in the comments, this is the "full-featured" DVB API implemented by av7110 and used by VDR.], the removal of av7110 support, and Chehab's unwillingness to merge another driver that also implements the API. There are, Moch said, working devices and existing DVB streams that can be accessed using those drivers. He took issue with Chehab's commit message on the patch moving av7110:
Your commit message "the decoder supports only MPEG2, with is not compatible with several modern DVB streams" is at least misleading. The most popular satellite TV provider in Germany (Astra) still transmits most of the interesting programs MPEG-2 encoded, so also this is actively used and no reason to retire this card.
But Chehab sees things differently; the av7110 hardware stopped being produced 15 years ago and the DVB API that the driver implements is not useful for today's hardware:
The API that got removed was written to control the av7110 MPEG-2 decoder, and was never integrated at the DVB core: the av7110 had a driver-specific implementation inside its code.
Besides that, the API was never fully documented: there are several ioctls from the now removed API that never had any in-kernel implementation, nor had and descriptions at the specs. None of the current upstream maintainers have any [clue] about what such ioctls are supposed to do, nor do we have any av7110 hardware to test it.
There is a clear disagreement about some of that, and Moch repeatedly offered to maintain the av7110 driver going forward. He would also like to see his saa716x driver, which also implements the DVB API, get merged and said he would maintain that as well. According to Moch, there is an active community that is using the hardware with VDR and the DVB API fits its needs well. But Chehab said:
Keeping av7110 in-kernel has been a waste of limited upstream development resources. A couple of years ago, we needed to fix the av7110 API due to year-2038 issues. From time to time, we get bugs affecting it (even security ones), as the code has been bit-rotting for a long time. The most recent one probably broke the driver without nobody noticing it for a couple of Kernel reviews, as mentioned above.
Moch disputed that characterization and had some strongly worded complaints about Chehab's handling of the media subsystem, part of which he addressed to Torvalds. There is clearly a long-running dispute being aired in the thread; Torvalds stepped in to clarify a few things. He said that the regression policy is solely aimed at the ABI of the kernel, which did not change: "old programs continue to work". He also lamented that VDR was using the kernel headers directly; that is not how it is supposed to work:
We very much try to discourage user space applications from using the kernel header files directly - even projects like glibc etc are supposed to _copy_ them, not include the kernel headers.
Exactly because re-organization and changes to the kernel tree shouldn't be something that then causes random problems elsewhere that are so hard to test - and synchronize - from the kernel standpoint (or from the standpoint of the other end).
On the other hand, though, the header files in question were in the uapi directory: "while it is annoying how user space programs used them this way, I think it's also not entirely unreasonable".
So he reverted the header-file move, while leaving the other changes (moving av7110 and the DVB API documentation to staging/) intact. He suggested that VDR and any other application using the header files copy them into their tree. He agreed with Chehab that moving av7110 out of the main tree was the right move:
I'm not convinced that it makes sense to move the av7110 driver back from staging - it may continue to work, but it _is_ old and there is no maintenance - and I would certainly suggest that any other out-of-tree driver that uses these old interfaces that nothing else implements shouldn't do so, considering that nothing else implements them.
Both Torvalds and Chehab seem to ignore or discount Moch's offer to maintain the driver, though he presumably can do so in staging/. Keeping the driver working is a good way to ensure that it does not get removed altogether. Moch's questions about the eventual fate of av7110 remain unanswered, however:
How long can this driver stay in staging? Would you move the driver back from staging when I do proper maintenance for it? Is it normal linux policy to remove drivers after a certain period of time, even if a driver still has users and someone that volunteers to maintain it?
So, for now at least, users of VDR will be able to build it for recent kernels; moving forward, copying the header files should ensure that it continues to build. If the driver gets removed, though, users will also need to build it out of the mainline tree in order to make things work. Chehab said that he is open to discussing an API for DVB, "but such API should support modern embedded hardware, and should be designed to allow it to be extended to support to future needs". That could lead to a reworked saa716x driver that would be acceptable for inclusion in the mainline, but even that hardware is quite old and it is not clear that the VDR community is interested in making those kinds of changes.
It is somewhat surprising to see a working driver with an active community getting this treatment, but Chehab is adamant that the DVB API is deprecated as part of the media subsystem. In the thread, there is a strong undercurrent of displeasure with the treatment of the DVB API over the years and with Chehab's handling of it, but Linux maintainers have a wide remit over their subsystems. The two "sides" seem pretty dug in, at this point, so it may take some other interested parties stepping up to find some kind of workable solution—or the VDR community will need to continue using drivers that are either in staging/ or out of the mainline entirely.
This event does demonstrate a bit of an expansion of the regression policy for the kernel. Even though he was not happy about the use of the header files, Torvalds did not want to break an existing user-space application, thus the revert. One wonders how many other applications or libraries have not gotten the memo about not using the kernel user-space API header files directly, however. That could lead to problems with reorganizations and cleanups down the road.
Adding a "duress" password with PAM Duress
Users often store a lot of sensitive information on their computers—from credentials to banned texts to family photos—that they might normally expect to be protected by the login password of their account. Under some circumstances, though, users can be required to log into their system so that some third party (e.g. government agent) can examine and potentially copy said data. A new project, PAM Duress, provides a way to add other passwords to an account, each with its own behavior, which might be a way to avoid granting full access to the system, though the legality is in question.
As its name would imply, PAM Duress is a pluggable authentication module (PAM), which is the mechanism used on Linux and other Unix operating systems to easily allow adding different kinds of authentication methods. PAM is not exactly standardized, however, so there are multiple implementations of it, including Linux PAM that is used by Linux distributions. The Duress module allows administrators to configure the system to check for one or more extra passwords if the normal password associated with the user account does not match what is provided.
The project page gives a few examples of the kind of actions that could be triggered by an alternate password:
This functionality could be used to allow someone pressed to give a password under [coercion] to provide a password that grants access but in the background runs scripts to clean up sensitive data, close connections to other networks to limit lateral movement, and/or to send off a [notification] or alert (potentially one with detailed information like location, visible wifi hotspots, a picture from the camera, a link to a stream from the microphone, etc). You could even spawn a process to remove the pam_duress module so the threat actor won't be able to see if the duress module was available.
Scripts or binaries that are meant to be used when duress passwords are entered (i.e. duress programs) can be placed in either /etc/duress.d, for global duress actions, or ~/.duress for those specific to a particular user. In either case, the duress_sign script is used to process a file into something that PAM Duress can use. The script will prompt for a password, which it hashes with SHA-256, using the SHA-256 hash of the file's contents as the salt; the resulting hash value is written to duress-script-name.sha256. That stored hash protects against changes to the script and securely "stores" the password within a single hash value.
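The hashing scheme can be sketched in a few lines of Python. This is only a rough model of what duress_sign and the module do with the password, as described above; the sign() and check() helpers are hypothetical names, and the exact way the salt and the password are combined is an assumption rather than the project's actual code.

import hashlib

def sign(script_path, password):
    # The hash of the script's contents acts as the salt, as described above.
    with open(script_path, "rb") as f:
        salt = hashlib.sha256(f.read()).hexdigest()
    digest = hashlib.sha256((salt + password).encode()).hexdigest()
    with open(script_path + ".sha256", "w") as out:
        out.write(digest + "\n")

def check(script_path, password):
    # What the module would do at login time: recompute and compare.
    with open(script_path + ".sha256") as f:
        stored = f.read().strip()
    with open(script_path, "rb") as f:
        salt = hashlib.sha256(f.read()).hexdigest()
    return hashlib.sha256((salt + password).encode()).hexdigest() == stored

Because the script's own contents feed into the salt, any change to the duress program invalidates the stored hash, which is the integrity property the article describes.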
In order to use PAM Duress, the system's PAM authentication file (e.g. /etc/pam.d/common-auth) needs to be set up to try the user's password first (with the normal pam_unix.so module); if that fails, it should be configured to invoke pam_duress.so to check for any matching duress passwords. As shown in the documentation, that might look like the following:
# Example /etc/pam.d/common-auth
auth    [success=2 default=ignore]    pam_unix.so
auth    [success=1 default=ignore]    pam_duress.so
auth    requisite                     pam_deny.so
When invoked, PAM Duress will look at each file in the two duress program directories, hashing the provided password using the hash of the file's contents as salt, then comparing it to the value in the corresponding .sha256 file; if there is a match, the duress program gets invoked. If any file matches, success will be returned to PAM, which will result in access being granted and the user's login shell being invoked after the duress programs are run.
A simple example to demonstrate the module is provided in the repository. The pushover.sh script uses the API of the Pushover service to send a message reporting that the duress password has been used. The message includes the username of who logged in along with the local and external IP addresses of the system.
On the page describing the example, the script is signed with a password that is meant to be given to each person when it gets installed on their system. The example points out that the password should be changed if someone who knows it leaves the organization, but other strategies could be used (e.g. individual passwords). The integrity hash on the scripts seems to mostly be aimed at preventing inadvertent changes, since an attacker with access to the filesystem has lots of other ways to compromise the system. The demonstration script uses a configuration file for API keys that has no hash protection so it could be changed without detection—not that it really matters in the grand scheme of things.
Sending a message when the duress password has been used seems relatively innocuous from a legal perspective, but some of the other possible uses may not be. In a lengthy Hacker News discussion about PAM Duress, multiple commenters pointed out that deleting files in the background when requested to log in at a border crossing may well be an offense—if it is detected. It is, as with everything regarding security, a question of the type of threats being protected against. As "jeroenhd" put it:
If you're held under gunpoint, that script that wipes your entire hard drive will only make your day worse.
AFAIK if you actually get detained and questioned at airports, your drive will already get imaged before any password is even tried. You may be able to get away with this on a mobile device where this feature isn't generally expected (because who uses Linux on a smartphone in the first place).
I always wonder at what scenarios like these are supposed to be about. If saying no is not an option, pissing off your captors by giving them fake info probably isn't either.
I don't know what law enforcement would be looking for on my work drive, but if saying no is no longer an option, my encryption password isn't worth getting shot over.
"Spooky23" said
that the idea of handling the situation with a technical means like PAM
Duress is "silly nerd porn
". Someone who has data that the
authorities want access to should not be crossing a border with it on
their device: "The only way to win is not to play.
" But
others pointed out that the intent of a duress password is to "give
the appearance of compliance
" as "dredmorbius" said, which may be
enough to satisfy the investigator and head off further inquiry.
No discussion of this sort ever happens without a reference to the xkcd pipe wrench scenario. Technical means to avert the often arbitrary and capricious nature of border searches and the like are certainly attractive to those of a technical bent. But providing plausible deniability by way of hidden encrypted disk partitions, filesystems that look like gibberish, or other schemes of that nature can all fall prey to the wrench scenario. If the authorities (or criminals, though it is sometimes hard to tell the difference these days) want to access the data badly enough, they are going to find a—quite possibly non-technical—means to do so.
As might be guessed, there were suggestions of other ideas for technical measures, such as a mechanism to "nuke" an encrypted disk if a duress password is entered. There was also a pointer to an earlier project, pam_duress, which has much the same focus as its nearly identical name would imply. In addition, there were suggestions of storing the sensitive data elsewhere (e.g. encrypted in the cloud) for restoration once the destination has been reached.
In the final analysis, having no sensitive data when crossing a border or otherwise being in a situation where there is a potential for duress being applied is the safest solution. Storing said data "elsewhere" has its own set of risks, of course—the cloud is hardly beyond the reach of nation-state actors. Dangerous-to-possess data, which is often defined quite differently in various parts of the world, is difficult to handle; technical means can certainly help but they are no panacea.
The shrinking role of ETXTBSY
Unix-like systems abound with ways to confuse new users, many of which have been present since long before Linux entered the scene. One consistent source of befuddlement is the "text file is busy" (ETXTBSY) error message that is delivered in response to an attempt to overwrite an executable image file. Linux is far less likely to deliver ETXTBSY results than it once was, but they do still happen on occasion. Recent work to simplify the mechanism behind ETXTBSY has raised a more fundamental question: does this error check have any value at all?
The "text" that is busy in this case refers to a program's executable code — it's text that is read by the CPU rather than by humans. When a program is run, its executable text is mapped into the running process's address space. When this happens, Unix systems have traditionally prevented the file containing that text from being modified; the alternative is to allow the code being run to be changed arbitrarily, which rarely leads to happy outcomes. For extra fun, the changed code will only be read if it is faulted into RAM, meaning that said unhappy outcomes might not happen until hours (or days) after the file has been overwritten. Rather than repeatedly explain to users why their programs have crashed in mysterious ways, Unix kernel developers chose many years ago to freeze the underlying file while those programs run — leading to the need to explain ETXTBSY errors instead.
Perhaps the easiest way to generate such an error is to try to rebuild a program while some process is still running it. Developers (those working in compiled languages, anyway) tend to learn early on to respond to "text file busy" errors by killing off the program they are debugging and rerunning make.
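The error is easy to provoke deliberately. The following sketch (assuming a copy of /bin/sleep can be made in the current directory) copies a binary, runs it, and then tries to open the copy for writing; it is an illustration only, not part of any project.

import errno, os, shutil, subprocess, time

# Make a private copy of a small binary, run it, then try to open it for writing.
shutil.copy("/bin/sleep", "./mysleep")
os.chmod("./mysleep", 0o755)
proc = subprocess.Popen(["./mysleep", "30"])
time.sleep(0.2)                      # give execve() time to map the text
try:
    open("./mysleep", "r+b")
except OSError as e:
    print("open failed:", errno.errorcode[e.errno])   # expected: ETXTBSY
finally:
    proc.kill()
    proc.wait()
    os.unlink("./mysleep")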
How it works
Deep within the kernel, the inode structure is used to represent files; one field within that structure is an atomic_t called i_writecount. Normally, this field can be thought of as a count of the number of times that the file is held open for writing. If, however, i_writecount is less than zero, it is interpreted instead as a count of the number of times that the writing of this file is being blocked. If the file is an executable file, then each process that runs it will decrement i_writecount for the duration of that execution. This field thus functions as a sort of simple lock. If its value is negative, the file cannot be opened for write access; if, instead, its value is positive, attempts to block write access will fail. (Similarly, an attempt to execute a file that is currently open for writing will fail with ETXTBSY).
In current kernels, it is possible to attempt to block write access with a call to deny_write_access(), but the more common way is to create a memory mapping with the VM_DENYWRITE flag set. So, for example, the execve() system call will map the code sections of the executable file into memory with VM_DENYWRITE; that mapping causes i_writecount to be decremented (this will fail if the file is open for writing, of course). When the mapping goes away (the running program exits or calls execve()), i_writecount will be incremented again; if it reaches zero, the file will once again become writable.
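A small user-space model may make the counting logic easier to follow. This sketch borrows the kernel's function names for readability, but it is a simplification of the behavior described above, not the kernel's implementation (which uses atomic operations on the real i_writecount field).

ETXTBSY = -1   # stand-in for the kernel's error code

class Inode:
    def __init__(self):
        self.i_writecount = 0    # >0: writers, <0: deny-write holders, 0: neither

    def get_write_access(self):
        # open() for writing fails while an execution is blocking writes
        if self.i_writecount < 0:
            return ETXTBSY
        self.i_writecount += 1
        return 0

    def deny_write_access(self):
        # execve() fails if the file is already open for writing
        if self.i_writecount > 0:
            return ETXTBSY
        self.i_writecount -= 1
        return 0

    def put_write_access(self):       # a writer closes the file
        self.i_writecount -= 1

    def allow_write_access(self):     # an executing program exits
        self.i_writecount += 1

inode = Inode()
assert inode.deny_write_access() == 0        # a process execs the file
assert inode.get_write_access() == ETXTBSY   # opening for write now fails
inode.allow_write_access()                   # the program exits
assert inode.get_write_access() == 0         # the file is writable again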
Back in the early days of Linux, prior to the Git era, the mmap() system call supported a flag called MAP_DENYWRITE that would cause VM_DENYWRITE to be set within the kernel and thus block write access to the mapped file for the duration of the mapping. There was a problem with this option, though: any process that could open a file for read access could map it with MAP_DENYWRITE and prevent any other process on the system from writing that file. That is, at best, an invitation to denial-of-service attacks, so it was removed long ago. Calls to mmap() with that flag set will succeed, but the flag is simply ignored.
Shared libraries
The removal of MAP_DENYWRITE had an interesting, if obscure, side effect. One may think of a file, such as /usr/bin/cat, as containing an executable program. In truth, though, much of the code that will be executed when somebody runs cat is not found in that file; instead, it is in a vast number of shared libraries. Those files contain executable code just like the nominal executable file, so one would think that they, too, would be protected from writing while in use.
Once upon a time, that was indeed the case; the ancient uselib() system call will map libraries with writing blocked. It may well be, though, that there are no systems still using uselib(); instead, on current systems, shared libraries are mapped from user space with mmap(). The MAP_DENYWRITE flag was created for just this use case, so that shared libraries could not be written while in use. When MAP_DENYWRITE went away, so did that protection; current Linux systems will happily allow a suitably privileged user to overwrite in-use, shared libraries.
The end result of this history is that the memory-management subsystem has a bunch of leftover code, in the form of the support for MAP_DENYWRITE and VM_DENYWRITE, that no longer has any real purpose. So David Hildenbrand decided to take it out. With this patch set installed, execve() will simply call deny_write_access() directly, and mmap() no longer has to consider that case at all. This results in a user-space API change: uselib() no longer blocks write access to shared libraries. Nobody expects anybody to notice.
An idea whose time has passed?
In response to Hildenbrand's patch set, GNU C library developer Florian Weimer pointed out that the library has "a persistent issue with people using cp (or similar tools) to replace system libraries". He did not say that library developers have long since tired of explaining to those users why their applications crashed in mysterious ways, but there was no need to. It would be nice, he said, to provide a way to prevent this sort of error or, at least, a way to deterministically tell that a crash was caused by an overwritten library. There are a number of ways that could be established without bringing back MAP_DENYWRITE, he said.
The discussion wandered into other ways to protect shared libraries from being overwritten while in use; Eric Biederman suggested installing them with the immutable bit set, for example. But Linus Torvalds made it clear that he thought the problem was elsewhere:
The kernel ETXTBUSY thing is purely a courtesy feature, and as people have noticed it only really works for the main executable because of various reasons. It's not something user space should even rely on, it's more of a "ok, you're doing something incredibly stupid, and we'll help you avoid shooting yourself in the foot when we notice".
After Torvalds repeated that point a couple of times, Andy Lutomirski suggested just removing the write-blocking mechanism altogether:
It’s at best erratic — it only applies for static binaries, and it has never once saved me from a problem I care about. If the program I’m recompiling crashes, I don’t care — it’s probably already part way through dying from an unrelated fatal signal. What actually happens is that I see -ETXTBUSY, think “wait, this isn’t Windows, why are there file sharing rules,” then think “wait, Linux has *one* half baked file sharing rule,” and go on with my life.
Torvalds was amenable to the idea, though he worried that some application somewhere might depend on the ETXTBSY behavior. But he noted that it has been steadily weakened over time, and nobody has complained so far. Removing it could be tried, he continued: "Worst comes to worst, we'll have to put it back, but at least we'd know what crazy thing still wants it".
Al Viro worried, though, that some installation scripts might depend on this behavior; Christian Brauner added that allowing busy executable files to be written could make some security exploits easier. Hildenbrand said that his patch set already makes the write-blocking behavior much simpler, and that he would be in favor of leaving it in place for now. The second version of the patch set, posted on August 16, retains the ETXTBSY behavior for the main executable file.
Hildenbrand's simplification work seems sure to land during the 5.15 merge window; whether ETXTBSY will disappear entirely is rather less certain. Getting rid of it strikes some developers as a nice cleanup, but there is nothing forcing that removal to happen at this time. Meanwhile, the potential for user-space regressions always exists when behavior is changed in this way. The safe approach is thus to leave ETXTBSY in place for now.
[Postscript: Lutomirski pointed to mandatory locks as the one other place in the kernel that implements unwelcome file-sharing rules. That feature is indeed unpopular; the kernel document on mandatory locks starts with a section on why they should not be used. In 2015, a configuration option was added to make mandatory locks optional, and some distributors have duly disabled them. One potential outcome of the ETXTBSY discussion looks likely to be an effort to get other distributors to do the same until it becomes clear that mandatory locks can safely be removed. Stay tuned.]
The Btrfs inode-number epic (part 1: the problem)
Unix-like systems — and their users — tend to expect all filesystems to behave in the same way. But those users are also often interested in fancy new filesystems offering features that were never envisioned by the developers of the Unix filesystem model; that has led to a number of interesting incompatibilities over time. Btrfs is certainly one of those filesystems; it provides a long list of features that are found in few other systems, and some of those features interact poorly with the traditional view of how filesystems work. Recently, Neil Brown has been trying to resolve a specific source of confusion relating to how Btrfs handles inode numbers.
One of the key Btrfs features is subvolumes, which are essentially independent filesystems maintained within a single storage volume. Snapshots are one commonly used form of subvolume; they allow the storage of copies of the state of another subvolume at a given point in time, with the underlying data shared to the extent that it has not been changed since each snapshot was taken. There are other applications for subvolumes as well, and they tend to be heavily used; Btrfs filesystems can contain thousands of subvolumes.
Btrfs subvolumes bring some interesting quirks with them. They can be mounted independently, as if they were separate filesystems, but they also appear as a part of the filesystem hierarchy as seen from the root. So one can mount subvolumes, but a subvolume can be accessed without being mounted if a higher-level directory is mounted. Imagine, for example, that /dev/sda1 contains a Btrfs filesystem that has been mounted on /butter. One could create a pair of subvolumes with commands like:
# cd /butter
# btrfs subvolume create subv1
# btrfs subvolume create subv2
The root of /butter will now appear to contain two directories (subv1 and subv2):
# tree /butter
/butter
├── subv1
└── subv2

2 directories, 0 files
They behave like directories most of the time but, since they are actually subvolumes, there are some differences; one cannot rename a file from one to the other, for example. A suitably privileged user can now mount either subv1 or subv2 (or both) as independent filesystems. But, as long as /butter remains mounted, both subvolumes are visible as if they were part of the same filesystem. There are some interesting consequences from this behavior, as will be seen.
Btrfs uses a subvolume ID number internally to identify subvolumes, but there is no way to make that number directly visible to user space. Instead, the filesystem allocates a separate device number (the usual major/minor pair) for each subvolume; that number can be seen with a system call like stat(). If the subvolumes are not explicitly mounted, though, those numbers do not show up in files like /proc/self/mountinfo, leading to inconsistent views of how the filesystem is put together. [Update: as Brown pointed out to us privately, the numbers do not show up there even if the subvolumes are explicitly mounted.] A call to stat() on a file within a subvolume will return a device number that does not exist in files like mountinfo, a situation that occasionally confuses unaware applications.
It gets worse. Since Btrfs has a unique internal ID for each subvolume, it feels no particular need to keep inode numbers unique across those subvolumes. As a result, a process walking a Btrfs filesystem from the root may well encounter multiple files with the same inode number. Tools like find use inode numbers as a way of tracking which files they have already seen and detecting filesystem loops. For a locally mounted Btrfs filesystem, things mostly work as expected because, even though two files on different subvolumes may have the same inode number, they will have different device numbers and are thus distinct.
The kernel's NFS daemon, though, has a harder time of things. It cannot present all of those artificial device numbers to NFS clients, because that would require all of the subvolumes — again, possibly thousands of them — to show up as separate mounts on the client. So a Btrfs filesystem exported via NFS shows the same device number (the device number of the root) on all subvolumes. That works most of the time, but it can break a tool like find on an NFS-mounted Btrfs filesystem with subvolumes: the single device number makes it impossible to distinguish files with the same inode number on different subvolumes, causing find to abort with a message about filesystem loops. This leads to occasional complaints from users and a desire to somehow improve the situation.
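A short sketch shows why the collision matters to tools like find: they remember each file by its (st_dev, st_ino) pair. This is an illustration of the logic only, not code from find, and walk_checking_loops() is a hypothetical helper.

import os

def walk_checking_loops(root):
    seen = set()
    for dirpath, dirnames, filenames in os.walk(root):
        st = os.stat(dirpath)
        key = (st.st_dev, st.st_ino)        # what find-like tools remember
        if key in seen:
            print("possible filesystem loop at", dirpath)   # spurious over NFS
            dirnames.clear()                # stop descending, as find would
            continue
        seen.add(key)

On a locally mounted Btrfs filesystem the differing st_dev values keep the keys distinct; over NFS, with a single device number, two files on different subvolumes can collide and trigger the bogus loop warning.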
These problems are not new; they have been known and understood for years. The level of complaints seems to be rising, though, perhaps as a consequence of increased use of Btrfs in production situations. In theory, the way to solve these problems is understood as well — though not all developers have the same understanding, as Neil Brown found out when he took on the task of fixing Btrfs filesystems exported via NFS. The second and last article in this series, published on August 23, explores various attempted solutions to this problem and why it turns out to be so hard to fix.
The Btrfs inode-number epic (part 2: solutions)
The first installment in this two-part series looked at the difficulties that arise when Btrfs filesystems containing subvolumes are exported via NFS. Btrfs has a couple of quirks that complicate life in this situation: the use of separate device numbers for subvolumes and the lack of unique inode numbers across the filesystem as a whole. Recently, Neil Brown set off on an effort to try to solve these problems, only to discover that the situation was even more difficult than expected and that many attempts would be required.
Take 1: internal mounts
Brown's first patch set attempted to resolve these problems by creating the concept of "internal mounts"; these would be automatically created by Btrfs for each visible subvolume. The automount mechanism is used to make those mounts appear. With a bit of tweaking, the kernel's NFS server daemon (nfsd) can recognize these special mounts and allow them to be crossed without an explicit mount on the remote side. With this setup, the device numbers shown by the system are as expected, and inode numbers are once again unique within a given mount.
At first glance, this patch set seemed like a good solution to the problem. When presented with a description of this approach back in July, filesystem developer Christoph Hellwig responded: "This is what I've been asking for for years". With these changes, Btrfs appears to be a bit less weird and some longstanding problems are finally resolved.
This patch set quickly ran into trouble, though. Al Viro pointed out that the mechanism for querying device numbers could generate I/O while holding a lock that does not allow for such actions, thus deadlocking the system; without that query, though, the scheme for getting the device number from the filesystem will not work. One potential alternative, providing a separate superblock for each internal mount that would contain the needed information, is even worse. Many operations in the kernel's virtual filesystem layer involve iterating through the full list of mounted superblocks; adding thousands of them for Btrfs subvolumes would create a number of new performance problems that would take massive changes to fix.
Additionally, Amir Goldstein noted that the new mount structure could create trouble for overlayfs; it would also break some of his user-space tools. There is also the little issue of how all those internal mounts would show up in /proc/mounts; on systems with large numbers of subvolumes, that would turn /proc/mounts into a huge, unwieldy mess that could also expose the names of otherwise private subvolumes.
Take 2: file handles
Brown concluded that "with its current framing the problem is unsolvable". Specifically, the problem is the 64 bits set aside for the inode number, which are not enough for Btrfs even now. The problem gets worse with overlayfs, which must combine inode numbers from multiple filesystems, yielding something that is necessarily larger than any one filesystem's numbers. Brown described the current solution in overlayfs as "it over-loads the high bits and hopes the filesystem doesn't use them", which seems less than fully ideal. But, as long as inode numbers are limited to any fixed size, there is no way around the problem, he said.
It would be better, he continued, to use the file handle provided by many filesystems, primarily for use with NFS; a file's handle can be obtained with name_to_handle_at(). The handle is of arbitrary length, and it includes a generation number, which handily gets around the problems of inode-number reuse when a file is deleted. If user space were to use handles rather than inode numbers to check whether two files are the same, a lot of problems would go away.
Of course, some new problems would also materialize, mostly in the form of the need to make a lot of changes to user-space interfaces and programs. No files exported by the kernel (/proc files, for example) use handles now, so a set of new files that included the handles would have to be created. Any program that looks at inode numbers would have to be updated. The result would be a lot of broken user-space tools. Brown has repeatedly insisted that breaking things may be possible (and necessary):
If you refuse to risk breaking anything, then you cannot make progress. Providing people can choose when things break, and have advanced warning, they often cope remarkable well.
Incompatible changes remain a hard sell, though. Beyond that, to get the full benefit from the change, Btrfs would have to be changed to stop using artificial device numbers for subvolumes, which is not a small change either. And, as Viro pointed out, it is possible for two different file handles to refer to the same file.
In summary, this approach did not win the day either.
Take 3: mount options
Brown's third attempt approached the problem from a different direction, making all of the changes explicitly opt-in. Specifically, he added two new mount options for Btrfs filesystems that would change their behavior with regard to inode and device numbers.
The first option, inumbits=, changes how inode numbers are presented; the default value of zero causes the internal object ID to be used (as is currently the case for Btrfs). A non-zero value tells Btrfs to generate inode numbers that are "mostly unique" and that fit into the given number of bits. Specifically, to generate the inode number for a given object within a subvolume, Btrfs will:
- Generate an "overlay" value from the subvolume number; this is done by byte-swapping the number so that the low-order bits (which vary the most between subvolumes) are in the most-significant bit positions.
- The overlay is right-shifted to fit within the number of bits specified by inumbits=. If that number is 64, no shift need be done.
- That overlay value is then XORed with the object number to produce the inode number presented to user space.
The resulting inode numbers will still be unique within any given subvolume; collisions within a large Btrfs filesystem can still happen, but they are less likely than before. Setting inumbits=64 minimizes the chances of duplicate inode numbers, but a lower number (such as 56) may make sense in situations (such as when overlayfs is in use) where the top bits are used by other subsystems.
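A worked example may help. The following sketch follows the description above; the precise byte-swap and shift details, along with the overlay() and inode_number() helpers, are assumptions for illustration rather than code from the patch set.

def overlay(subvol_id, inumbits):
    # Byte-swap so the fast-changing low bits end up at the top, then shrink.
    swapped = int.from_bytes(subvol_id.to_bytes(8, "little"), "big")
    return swapped >> (64 - inumbits)

def inode_number(subvol_id, object_id, inumbits=64):
    return overlay(subvol_id, inumbits) ^ object_id

# Two subvolumes holding the same object ID now (probably) get distinct numbers:
print(hex(inode_number(257, 4242, 56)))   # 0x1010000001092
print(hex(inode_number(258, 4242, 56)))   # 0x2010000001092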
The second mount option is numdevs=; it controls how many device numbers are used to represent subvolumes within the filesystem. The default value, numdevs=many, preserves the existing behavior of allocating a separate device number for every subvolume. Setting numdevs=1, instead, causes a single device number to be used for all subvolumes. When a filesystem is mounted with this option, tools like find and du will not be able to detect the crossing of a subvolume boundary, so their options to stay within a single filesystem may not work as expected. It is also possible to specify numdevs=2, which causes two device numbers to be used in an alternating manner when moving from one subvolume to the next; this makes tools like find work as expected.
Finally, this patch set also added the concept of a "tree ID" that can be fetched with the statx() system call. Btrfs would respond to that query with the subvolume ID, which applications could then use to reliably determine whether two files are contained within the same subvolume or not.
Btrfs developer Josef Bacik described this work as "a step in the right direction", but said that he wants to see a solution that does not require special mount options. "Mount options are messy, and are just going to lead to distros turning them on without understanding what's going on and then we have to support them forever". A proper solution, he said, does not present the possibility for users to make bad decisions. He suggested just using the new tree ID within nfsd to solve the NFS-specific problems, generating new inode numbers itself if need be.
Brown countered with a suggestion that, rather than adding mount options, he could just create a new filesystem type ("btrfs2 or maybe betrfs") that would use the new semantics. Bacik didn't like that idea either, though. Brown added that he would prefer not to do "magic transformations" of Btrfs inode numbers in nfsd; if a filesystem requires such operations, they should be done in the filesystem itself, he said. He then asked that the Btrfs developers make a decision on their preferred way to solve this problem, but did not get an answer.
Take 4: the uniquifier
On August 13, Brown returned with a minimal patch aimed at solving the NFS problems that started this whole discussion. It enables a filesystem to provide a "uniquifier" value associated with a file; this value, the name of which is arguably better suited to a professional wrestler, is only available within the kernel. The NFS server can then XOR that value with the file's inode number to obtain a number that is more likely to be unique. Btrfs provides the overlay value described above as this value; nfsd uses it, and the problem (mostly) goes away.
Bacik said that this approach was "reasonable" and acked it for the Btrfs filesystem. It thus looks like it could finally be a solution for the problem at hand. Or, at least, it's closer; Brown later realized that the changed inode numbers would create the dreaded "stale file handle" errors on existing clients when the transition happens. An updated version of the patch set adds a new flag in an unused byte of the NFS file handle to mark "new-style" inode numbers and prevent this error from occurring.
The second revision of the fourth attempt may indeed be the charm that makes some NFS-specific problems go away for Btrfs users. It is hard not to see this change — an internal process involving magic numbers that still is not guaranteed to create unique inode numbers — as a bit of a hack, though. Indeed, even Brown referred to "hacks to work around misfeatures in filesystems" when talking about this work. Hacks, though, can be part of life when dealing with production systems and large user bases; a cleaner and more correct solution may not be possible without breaking user systems. So the uniquifier may be as good as things can get until some other problem is severe enough to force the acceptance of a more disruptive solution.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: OpenSSH 8.7; TAB election; LibreOffice 7.2 Community; Villa on FOSS maintainers; Quotes; ...
- Announcements: Newsletters; conferences; security updates; kernel patches; ...