
An operation for filesystem tucking

By Jonathan Corbet
March 31, 2023
As a general rule, the purpose behind mounting a filesystem is to make that filesystem's contents visible to the system, or at least to the mount namespace where that mount occurs. For similar reasons, it is unusual to mount one filesystem on top of another, since that would cause the contents of the over-mounted filesystem to be hidden. There are exceptions to everything, though, and that extends to mounted filesystems; a "tucking" mechanism proposed by Christian Brauner is designed to hide mounted filesystems underneath other mounts — temporarily, at least.

Brauner's cover letter describes the intended use case in great detail; the text below is an attempt to boil that discussion down a bit. His explanation leans heavily on the concepts of shared subtrees and mount propagation; a review of this article provides some context that may make the description easier to follow.

Filesystem tucking is aimed at systems running service and/or container workloads following the developing image-based model. Each service or container within such a system has a filesystem hierarchy that is an assembly of immutable base filesystems that can have overlay filesystems mounted on top. The base filesystems provide the core operating system, while the overlays supply task-specific components, updates, and configuration files. Each container can have its own filesystem hierarchy, but often most of them will share many of the components; each container assembles its hierarchy in its own mount namespace.

As an example, a container might mount its root filesystem from an immutable image on the host. Overlays might then supply some needed binaries or the necessary configuration files in /etc. Through the use of overlayfs, these overlays can add files to or replace specific files in the underlying image, with the result providing the needed files to the container.
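
As a rough sketch of how such an assembly might be put together with the new mount API (the image paths here are hypothetical, and the raw syscall() calls assume kernel headers recent enough to know about these system calls):

    /* Sketch: assemble a read-only overlay over /etc with the new mount
     * API.  The image paths are hypothetical and error handling is
     * reduced to abort-on-failure. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/mount.h>

    static void die(const char *msg) { perror(msg); exit(1); }

    int main(void)
    {
        /* Create a new overlayfs filesystem context. */
        int fsfd = syscall(SYS_fsopen, "overlay", FSOPEN_CLOEXEC);
        if (fsfd < 0)
            die("fsopen");

        /* Task-specific overlay layer stacked on the immutable base. */
        if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "lowerdir",
                    "/images/etc-overlay:/images/base/etc", 0) < 0)
            die("fsconfig lowerdir");
        if (syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
            die("fsconfig create");

        /* Turn the configured context into a detached mount object... */
        int mntfd = syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
        if (mntfd < 0)
            die("fsmount");

        /* ...and attach it at /etc. */
        if (syscall(SYS_move_mount, mntfd, "", AT_FDCWD, "/etc",
                    MOVE_MOUNT_F_EMPTY_PATH) < 0)
            die("move_mount");
        return 0;
    }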

This mechanism can work well, but there are potential problems when an overlay image needs to be updated — to reflect a configuration change or apply a security update, for example. One approach would be to send a message to every container instructing it to mount the new overlay, but that can be CPU-intensive and depends on each container responding appropriately. Instead, mount propagation can be used to unmount the old overlay and mount the new one on the host; those changes will then automatically propagate into every mount namespace using that overlay.
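
How that propagation is arranged is a property of the mounts themselves; as a rough sketch (using a hypothetical host-side path), turning on shared propagation for an existing mount tree is a single mount(2) call:

    /* Sketch: make an existing host mount a shared mount so that later
     * mount and unmount events underneath it propagate to the containers,
     * whose namespaces hold slave (or shared) copies of this tree.
     * The path is hypothetical. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* No filesystem is mounted here; only the propagation type of
         * the existing mount (and, with MS_REC, its children) changes. */
        if (mount(NULL, "/run/host/etc", NULL, MS_SHARED | MS_REC, NULL) < 0) {
            perror("mount MS_SHARED");
            exit(1);
        }
        return 0;
    }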

There is, however, a different problem with that approach: the need to unmount the old overlay before mounting the new one creates a window when the overlay is missing in the containers. That could lead to strange behavior or, possibly, an opening for a downgrade attack. It would be nice if there were a way to seamlessly move containers to the new overlay image automatically without opening that window, and without having to manage each container individually.

The answer is filesystem tucking. Consider, for example, a simple overlay mounted on /etc:

[Overlay on /etc]

The base image provides the contents of /etc, in immutable form, for all containers on the system; the overlay then provides whatever additions are needed for the type of container that is running. At some point, the need to make a change to the configuration arises, so the overlay needs to be replaced. Rather than unmounting the existing overlay immediately, the first step is to "tuck" the new overlay underneath the old one:

[Tucked overlay]

Mount propagation will cause this new overlay to be mounted in every container where it is needed, but the new overlay will not make any visible changes to the filesystem contents at this point, since the old overlay is mounted on top of it. As far as an application running inside the container is concerned, nothing has changed. But, then, the old overlay can be unmounted, yielding:

[New overlay active]

The old overlay is no longer masking the new one, so the new overlay becomes active. When this sequence is followed, the update to the new overlay is done atomically from the container's point of view; there is no time where the base image is directly exposed.

It is possible to do this kind of tucking now, but not easily; the system calls for mounting (even using the new mount API) do not allow mounting one filesystem underneath another in this way. The curious can read Brauner's cover letter for the details on how it can be done. To make life easier, Brauner proposed the addition of a new flag, MNT_TREE_TUCK (changed to MOVE_MOUNT_BENEATH in a later revision of the patch set), to the move_mount() system call; it would cause the new mount to be placed underneath the existing mount at the mount point, rather than on top of it. There are a number of restrictions on this operation, including a prohibition on tucking a mount underneath the root filesystem and a requirement that the caller have the privilege to unmount the filesystem under which the new one will be tucked.
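
As a hedged sketch of what the proposed sequence might look like from C: MOVE_MOUNT_BENEATH is defined locally in case the system headers predate the patch set (the value used here is the one that later appeared in the mainline UAPI header), and the placeholder open() stands in for a detached mount obtained with fsmount().

    /* Sketch of the tuck-then-unmount sequence: attach the new overlay
     * beneath whatever is currently mounted on /etc, then lazily detach
     * the old, topmost overlay.  Paths are hypothetical. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mount.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef MOVE_MOUNT_F_EMPTY_PATH
    #define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
    #endif
    #ifndef MOVE_MOUNT_BENEATH
    #define MOVE_MOUNT_BENEATH      0x00000200
    #endif

    static void die(const char *msg) { perror(msg); exit(1); }

    int main(void)
    {
        /* Stand-in for the fsmount() file descriptor of the new overlay. */
        int new_overlay = open("/run/host/new-etc-overlay", O_PATH | O_CLOEXEC);
        if (new_overlay < 0)
            die("open");

        /* Tuck the new mount underneath the mount currently on /etc. */
        if (syscall(SYS_move_mount, new_overlay, "", AT_FDCWD, "/etc",
                    MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH) < 0)
            die("move_mount beneath");

        /* Drop the old overlay; MNT_DETACH keeps already-open files usable
         * while new lookups land on the freshly tucked mount. */
        if (umount2("/etc", MNT_DETACH) < 0)
            die("umount2");
        return 0;
    }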

At the command-line level, Brauner describes a new option to the mount command to create a tucked mount. In the first posting, this option was called --tuck, but perhaps fearing that this would cause mount to join fsck on the list of carefully typed (and pronounced) filesystem commands, Brauner changed it to --beneath in the second revision.

As of this writing, there have been no responses at all to this proposal; perhaps potential reviewers are still working their way through the cover material. Filesystem tucking seems likely to come up at the LSFMM/BPF Summit in May. It does appear that there is a use case for this feature, though, and no immediate downsides to having it, so chances are that some form of this capability will eventually find its way into the mainline.

Index entries for this article
Kernel: Filesystems/Mounting



An operation for filesystem tucking

Posted Mar 31, 2023 14:53 UTC (Fri) by bluca (subscriber, #118303) [Link]

Friar Tuck's new favourite syscall!

An operation for filesystem tucking

Posted Mar 31, 2023 16:21 UTC (Fri) by jreiser (subscriber, #11027) [Link] (1 responses)

From the viewpoint of an application, there may be more races. The new overlay should be at least API (application programming interface) compatible with the old, and ABI (application binary interface) compatibility also may be required in some cases. For example, if the unmount of the old overlay occurs while rtld, the run-time dynamic linker, is loading a list of shared libraries into an address space, then it may be easy to trigger an incompatibility between libraries at the beginning of the list versus libraries at the end. Of course such a race is present already with any overlay, but now there will be more races, even (especially) during security updates that use such tucking. Checking the fsuid of loaded pieces may become even more important. Pausing for an overlay update may be the only general solution.

An operation for filesystem tucking

Posted Mar 31, 2023 20:24 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

> For example, if the unmount of the old overlay occurs while rtld, the run-time dynamic linker, is loading a list of shared libraries into an address space, then it may be easy to trigger an incompatibility between libraries at the beginning of the list versus libraries at the end.

As far as I can tell, this is simply illegal. umount(8) says this:

> Note that a filesystem cannot be unmounted when it is 'busy' - for example, when there are open files on it[...]

As long as the loader has at least one file in the mount open, it cannot be unmounted, at least according to this man page.

This is probably also why they want to tuck the new mount underneath the old mount, instead of putting it on top - umount(2) already provides flags for doing a "lazy" unmount that avoids busyness problems, but there is no such thing as a "lazy" mount(2).

An operation for filesystem tucking

Posted Mar 31, 2023 16:56 UTC (Fri) by jengelh (guest, #33263) [Link]

While the process is running, /usr/bin/whatever is in use (memory mappings), and /etc/passwd might be open (file descriptor), both blocking unmounting. Does overlayfs even support modifying the stack of mounts already? And how does that affect open fds?

An operation for filesystem tucking

Posted Mar 31, 2023 17:00 UTC (Fri) by Karellen (subscriber, #67644) [Link] (13 responses)

I wonder, why not mount the new overlay on top of the old overlay, and then unmount the old overlay from beneath it, instead?

That should provide the same atomic, no-window-without-an-overlay semantics, but without the extra flags and options, shouldn't it?

If unmounting covered filesystems is not currently allowed, then some new code may be needed to allow that, but it's not immediately obvious to me why that approach would take longer or be more complex than the proposed "tucking" mechanism?

An operation for filesystem tucking

Posted Mar 31, 2023 20:59 UTC (Fri) by epa (subscriber, #39769) [Link] (10 responses)

I guess not quite, if the old overlay provided a file which is not present in the new overlay.

An operation for filesystem tucking

Posted Mar 31, 2023 21:14 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

The new overlay could contain a whiteout, if necessary. AFAICT, the real problem is that "lazy" umount(2) exists, but "lazy" mount(2) does not exist.
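
(For reference, overlayfs represents a whiteout on disk as a character device with device number 0/0; a rough sketch of pre-creating one while building a layer, using a made-up path, might look like this.)

    /* Sketch: pre-create an overlayfs whiteout in a layer being built,
     * hiding /etc/aliases from the layers below it.  Overlayfs represents
     * whiteouts as character devices with device number 0/0; the image
     * path here is hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>

    int main(void)
    {
        if (mknod("/images/new-etc-overlay/aliases", S_IFCHR | 0000,
                  makedev(0, 0)) < 0) {
            perror("mknod");
            exit(1);
        }
        return 0;
    }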

An operation for filesystem tucking

Posted Apr 1, 2023 6:41 UTC (Sat) by epa (subscriber, #39769) [Link]

What if the old overlay contained a file for /etc/aliases but the new overlay does not, because you decided, after all, not to override /etc/aliases but to stay with the base system version?

An operation for filesystem tucking

Posted Apr 1, 2023 10:49 UTC (Sat) by Karellen (subscriber, #67644) [Link] (7 responses)

On the other hand, if the new overlay provides a file not in the old overlay, in the tuck scenario it could "peek through" before the old overlay is removed.

An operation for filesystem tucking

Posted Apr 1, 2023 20:03 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (6 responses)

IMHO this is less likely to cause problems, because it wouldn't be seen (by the processes inside the container) as an existing file getting abruptly replaced or deleted. While it is certainly possible for software to enumerate the contents of a directory and (maybe) get upset when a new file shows up unexpectedly, in practice this usually only happens at startup or on an explicit signal such as SIGHUP or SIGUSR1/2 (at least for server-like things that you would put in a container in the first place - obviously something such as GNOME is monitoring ~/Desktop all the time so that it can draw icons etc., but running GNOME in a container is probably not the target use case here).

An operation for filesystem tucking

Posted Apr 3, 2023 12:57 UTC (Mon) by mgedmin (subscriber, #34497) [Link] (1 responses)

The file in the new overlay might be overriding an existing file in the base overlay that was not being overridden by the old overlay.

An operation for filesystem tucking

Posted Apr 3, 2023 17:49 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

I think the simplest way to handle that is to make it userspace's problem: If you want to tuck a new overlay that overrides files in the base overlay, you first tuck a copy of the old overlay that also contains all of the files you are going to override. Any reasonable container system should be keeping track of what files are supposed to exist in each layer, and so it should be able to detect situations like this.

An operation for filesystem tucking

Posted Apr 3, 2023 22:13 UTC (Mon) by stefanha (subscriber, #55072) [Link] (3 responses)

It's still not atomic. A new file dropped into a .d directory can easily cause problems since the old mount is still in place with the majority of the set of files.

This seems fragile to me, but then, in-place updates are always tricky.

An operation for filesystem tucking

Posted Apr 4, 2023 7:38 UTC (Tue) by smurf (subscriber, #17840) [Link] (2 responses)

Of course it's still fragile, but it's less fragile than a mount-on-top with some way to unmount the lower-layered file system later.

An operation for filesystem tucking

Posted Apr 9, 2023 7:49 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

To be fair, the classic (pre-container) way to do this is to flip a symlink, and that really is perfectly atomic.

One wonders why userspace container systems don't just put a symlink inside of the filesystem image. You could have, for example, two mount points A and B, and a symlink to whichever mount point is currently in use. When you want to upgrade to a new version, you mount it on the other mount point, and then flip the symlink. Perhaps the container software doesn't want to dictate the use of such symlinks? But that seems like a rather... insubstantial reason to make this the kernel's problem, IMHO.
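
(For illustration, the flip is typically done by creating the new link under a temporary name and rename()ing it over the old one, since rename() replaces the target atomically; the paths in this sketch are made up.)

    /* Sketch of the classic symlink flip: /etc/current points at either
     * version A or version B of the configuration; switching versions is
     * a symlink() to a temporary name followed by an atomic rename().
     * Paths are hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Build the replacement link under a temporary name... */
        unlink("/etc/.current.new");
        if (symlink("/etc/versions/B", "/etc/.current.new") < 0) {
            perror("symlink");
            exit(1);
        }
        /* ...and move it into place; readers see either the old target
         * or the new one, never a missing link. */
        if (rename("/etc/.current.new", "/etc/current") < 0) {
            perror("rename");
            exit(1);
        }
        return 0;
    }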

An operation for filesystem tucking

Posted Apr 9, 2023 10:46 UTC (Sun) by smurf (subscriber, #17840) [Link]

If you flip that symlink while a program is starting up it'll get some libraries from A and others from B, which may cause it to crash. So in that sense it's definitely not atomic. The same problem applies to a mount-on-top operation.

On the other hand if you have a tuck-mount you can do an atomic unmount afterwards, but only if there's no current user. That guarantees that you don't run into inconsistent libraries et al. even when you can't control exactly when a program (re)starts.

An operation for filesystem tucking

Posted Apr 1, 2023 12:56 UTC (Sat) by smurf (subscriber, #17840) [Link]

With the "mount underneath" method you can simply unmount the top when there are no users. Thus you're guaranteed to have a consistent state at that point.

When mounting on top there probably are open files on the overlaid file system. You now have a mix of open old and unopened new content, which is a surefire recipe for eventual disaster.

An operation for filesystem tucking

Posted Apr 4, 2023 13:38 UTC (Tue) by nim-nim (subscriber, #34454) [Link]

> I wonder, why not mount the new overlay on top of the old overlay, and then unmount the old overlay from beneath it, instead?

IMHO the article does not state clearly that local changes (for example in /etc) are likely to live in a RW layer mounted over the old overlay, so you still need to tuck under it. Nobody has really succeeded in strictly separating RO and RW directories in complex setups; that's an ideal people are still chasing.

Also, since unmounting the old overlay is presumably faster than mounting the new one, you can wait for the new overlay to be mounted under the old one in all affected containers on the system, and then pop the old one in one go.

An operation for filesystem tucking

Posted Mar 31, 2023 21:43 UTC (Fri) by roc (subscriber, #30627) [Link]

To me, mount semantics seem like a complete nightmare of overlapping, piled-up complexity. If I were to go looking for critical kernel bugs that's where I'd be looking.

An operation for filesystem tucking

Posted Apr 1, 2023 7:17 UTC (Sat) by tchernobog (guest, #73595) [Link] (1 responses)

Does anybody know how this might tie in with the inotify mechanics? Is the inotify fd just invalidated upon unmounting of the topmost overlay, and the application has to reopen it?

An operation for filesystem tucking

Posted Apr 1, 2023 20:13 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

According to inotify(7): You get an IN_UNMOUNT event, followed by an IN_IGNORED (which indicates that the watch has been invalidated). You do not explicitly pass either of those events to inotify_add_watch(2) (the kernel just sends them to you automatically), so the only way to mess this up is to fail to read inotify(7) and miss a case.
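
A minimal sketch of that event handling, watching a made-up path and with error handling mostly omitted, might look like this:

    /* Sketch: watch a directory on the overlay and handle the IN_UNMOUNT
     * and IN_IGNORED events that the kernel delivers when the watched
     * filesystem is unmounted, re-adding the watch afterwards. */
    #include <stdio.h>
    #include <sys/inotify.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = inotify_init1(IN_CLOEXEC);
        inotify_add_watch(fd, "/etc", IN_MODIFY | IN_CREATE | IN_DELETE);

        char buf[4096]
            __attribute__((aligned(__alignof__(struct inotify_event))));
        for (;;) {
            ssize_t len = read(fd, buf, sizeof(buf));
            if (len <= 0)
                break;
            for (char *p = buf; p < buf + len; ) {
                struct inotify_event *ev = (struct inotify_event *)p;
                if (ev->mask & IN_UNMOUNT)
                    fprintf(stderr, "watched filesystem was unmounted\n");
                if (ev->mask & IN_IGNORED)   /* watch is gone; re-add it */
                    inotify_add_watch(fd, "/etc",
                                      IN_MODIFY | IN_CREATE | IN_DELETE);
                p += sizeof(*ev) + ev->len;
            }
        }
        return 0;
    }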

An operation for filesystem tucking

Posted Apr 7, 2023 18:25 UTC (Fri) by bbbush (subscriber, #17456) [Link]

in rpm-ostree one can replace a package with another. It sounds like the same is applied to filesystems, so that the overall status is good. When replacing the package, the status of the tree. When replacing the filesystems, the status of the tree as well? Then it does not matter if any file is open, because the operation is going to be followed by a "systemctl reboot"?

