An operation for filesystem tucking
Brauner's cover letter describes the intended use case in great detail; the text below is an attempt to boil that discussion down a bit. His explanation leans heavily on the concepts of shared subtrees and mount propagation; a review of this article provides some context that may make the description easier to follow.
Filesystem tucking is aimed at systems running service and/or container workloads following the developing image-based model. Each service or container within such a system has a filesystem hierarchy that is an assembly of immutable base filesystems that can have overlay filesystems mounted on top. The base filesystems provide the core operating system, while the overlays supply task-specific components, updates, and configuration files. Each container can have its own filesystem hierarchy, but often most of them will share many of the components; each container assembles its hierarchy in its own mount namespace.
As an example, a container might mount its root filesystem from an immutable image on the host. Overlays might then supply some needed binaries or the necessary configuration files in /etc. Through the use of overlayfs, these overlays can add files to or replace specific files in the underlying image, with the result providing the needed files to the container.
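The "add or replace" behavior described above can be pictured with a very rough analogy: a lookup that consults the overlay first and falls back to the base image. The sketch below uses Python's `ChainMap` for that purpose; the paths and file contents are invented for illustration, and real overlayfs handles far more (directories, whiteouts, copy-up) than this toy does.

```python
from collections import ChainMap

# Hypothetical file contents; paths and values are made up for illustration.
base_image = {
    "/etc/hostname": "generic-host",
    "/etc/os-release": "ExampleOS 1.0",
}
overlay = {
    "/etc/hostname": "web-frontend",      # replaces a file from the base image
    "/etc/myservice.conf": "workers=4",   # adds a file the base image lacks
}

# Lookups consult the overlay first and fall back to the base image.
merged = ChainMap(overlay, base_image)

for path in sorted(merged):
    print(path, "->", merged[path])
```

The base image is never modified here; the container simply sees the merged view, which is the property that lets many containers share one immutable image.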
This mechanism can work well, but there are potential problems when an overlay image needs to be updated — to reflect a configuration change or apply a security update, for example. One approach would be to send a message to every container instructing it to mount the new overlay, but that can be CPU-intensive and depends on each container responding appropriately. Instead, mount propagation can be used to unmount the old overlay and mount the new one on the host; those changes will then automatically propagate into every mount namespace using that overlay.
There is, however, a different problem with that approach: the need to unmount the old overlay before mounting the new one creates a window when the overlay is missing in the containers. That could lead to strange behavior or, possibly, an opening for a downgrade attack. It would be nice if there were a way to seamlessly move containers to the new overlay image automatically without opening that window, and without having to manage each container individually.
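To make that window concrete, here is a toy model (not kernel code) that treats a mount point as a stack in which only the topmost entry is visible. The class and the mount names are hypothetical; the sketch only records what a container would see at each step of an unmount-then-mount replacement.

```python
class MountPoint:
    """Toy model of one mount point: a stack whose top entry is what users see."""

    def __init__(self, base):
        self.stack = [base]

    def visible(self):
        return self.stack[-1]

    def mount(self, name):
        # A regular mount always lands on top of whatever is already there.
        self.stack.append(name)

    def umount(self):
        # Unmounting removes the topmost mount, exposing the one below it.
        return self.stack.pop()


def replace_naively(mp, new_overlay):
    """Unmount the old overlay, then mount the new one, recording each state."""
    seen = [mp.visible()]
    mp.umount()
    seen.append(mp.visible())  # the window: only the base image is visible
    mp.mount(new_overlay)
    seen.append(mp.visible())
    return seen


etc = MountPoint("base-image")
etc.mount("overlay-v1")
history = replace_naively(etc, "overlay-v2")
print(history)
```

The middle entry of the recorded history is the bare base image, which is exactly the state that the container should never observe.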
The answer is filesystem tucking. Consider, for example, a simple overlay mounted on /etc:
The base image provides the contents of /etc, in immutable form, for all containers on the system; the overlay then provides whatever additions are needed for the type of container that is running. At some point, the need to make a change to the configuration arises, so the overlay needs to be replaced. Rather than unmounting the existing overlay immediately, the first step is to "tuck" the new overlay underneath the old one:
Mount propagation will cause this new overlay to be mounted in every container where it is needed, but the new overlay will not make any visible changes to the filesystem contents at this point, since the old overlay is mounted on top of it. As far as an application running inside the container is concerned, nothing has changed. But, then, the old overlay can be unmounted, yielding:
The old overlay is no longer masking the new one, so the new overlay becomes active. When this sequence is followed, the update to the new overlay is done atomically from the container's point of view; there is no time where the base image is directly exposed.
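The tuck-then-unmount sequence can be sketched with a toy stack model of a single mount point. This is an illustration under simplifying assumptions, not the kernel's implementation: the `tuck()` method and the mount names are hypothetical, and mount propagation is left out entirely.

```python
class MountPoint:
    """Toy model of one mount point: a stack whose top entry is what users see."""

    def __init__(self, base):
        self.stack = [base]

    def visible(self):
        return self.stack[-1]

    def mount(self, name):
        # A regular mount lands on top of the stack.
        self.stack.append(name)

    def tuck(self, name):
        # A tucked mount is inserted directly beneath the topmost mount.
        if len(self.stack) < 2:
            raise ValueError("no mount to tuck beneath")
        self.stack.insert(len(self.stack) - 1, name)

    def umount(self):
        # Unmounting removes the topmost mount, exposing the one below it.
        return self.stack.pop()


etc = MountPoint("base-image")
etc.mount("overlay-v1")

history = [etc.visible()]
etc.tuck("overlay-v2")         # stack: base-image, overlay-v2, overlay-v1
history.append(etc.visible())  # still overlay-v1; nothing visible has changed
etc.umount()                   # stack: base-image, overlay-v2
history.append(etc.visible())

print(history)
```

At no point in the recorded history is the base image the visible entry; the switch from the old overlay to the new one happens in the single unmount step.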
It is possible to do this kind of tucking now, but not easily; the system calls for mounting (even using the new mount API) do not allow for mounting a filesystem underneath another in this way. The curious can read Brauner's cover letter for the details on how it can be done. To make life easier, Brauner proposed the addition of a new flag, MOVE_MOUNT_TUCK (renamed to MOVE_MOUNT_BENEATH in the second revision of the patch set), to the move_mount() system call; it would cause the new mount to be placed underneath the existing mount at the mount point, rather than on top of it. There are a number of restrictions on this operation, including a prohibition on tucking a mount underneath the root filesystem, and a requirement that the caller have the privilege to unmount the filesystem under which the new one will be tucked.
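For readers who want to see roughly what a call with the new flag might look like from user space, here is a hedged sketch using Python's `ctypes`. Several things in it are assumptions rather than anything stated in the proposal: the syscall number (429, the number used for move_mount() on x86-64 and most other architectures), the flag value 0x200 (the value MOVE_MOUNT_BENEATH carries in later mainline headers), and the helper names `move_mount_flags()` and `tuck_mount()`, which are invented here. The caller is assumed to already hold a file descriptor for a detached mount, obtained from something like open_tree() or fsmount().

```python
import ctypes
import os

# Assumed values; check <linux/mount.h> and the syscall table on the target system.
SYS_MOVE_MOUNT = 429                  # move_mount() on x86-64 and most architectures
AT_FDCWD = -100
MOVE_MOUNT_F_EMPTY_PATH = 0x00000004  # treat the source fd itself as the mount to move
MOVE_MOUNT_BENEATH = 0x00000200       # place the mount beneath the existing top mount


def move_mount_flags(beneath):
    """Build the flags argument, optionally requesting a tucked mount."""
    flags = MOVE_MOUNT_F_EMPTY_PATH
    if beneath:
        flags |= MOVE_MOUNT_BENEATH
    return flags


def tuck_mount(source_fd, target):
    """Hypothetical helper: tuck the detached mount source_fd beneath target."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    ret = libc.syscall(
        ctypes.c_long(SYS_MOVE_MOUNT),
        ctypes.c_int(source_fd),
        ctypes.c_char_p(b""),
        ctypes.c_int(AT_FDCWD),
        ctypes.c_char_p(os.fsencode(target)),
        ctypes.c_uint(move_mount_flags(True)),
    )
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)
```

Calling `tuck_mount()` requires suitable privileges and a kernel that understands the flag; on anything else the call is expected to fail with an error rather than fall back to an ordinary mount on top.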
At the command-line level, Brauner describes a new option to the mount command to create a tucked mount. In the first posting, this option was called --tuck, but perhaps fearing that this would cause mount to join fsck on the list of carefully typed (and pronounced) filesystem commands, Brauner changed it to --beneath in the second revision.
As of this writing, there have been no responses at all to this proposal; perhaps potential reviewers are still working their way through the cover letter. Filesystem tucking seems likely to come up at the LSFMM/BPF Summit in May. It does appear that there is a use case for this feature, though, and no immediate downsides to having it, so chances are that some form of this capability will eventually find its way into the mainline.
Index entries for this article | |
---|---|
Kernel | Filesystems/Mounting |
Posted Mar 31, 2023 14:53 UTC (Fri)
by bluca (subscriber, #118303)
[Link]
Posted Mar 31, 2023 16:21 UTC (Fri)
by jreiser (subscriber, #11027)
[Link] (1 responses)
Posted Mar 31, 2023 20:24 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link]
As far as I can tell, this is simply illegal. umount(8) says this:
> Note that a filesystem cannot be unmounted when it is 'busy' - for example, when there are open files on it[...]
As long as the loader has at least one file in the mount open, it cannot be unmounted, at least according to this man page.
This is probably also why they want to tuck the new mount underneath the old mount, instead of putting it on top - umount(2) already provides flags for doing a "lazy" unmount that avoids busyness problems, but there is no such thing as a "lazy" mount(2).
Posted Mar 31, 2023 16:56 UTC (Fri)
by jengelh (guest, #33263)
[Link]
Posted Mar 31, 2023 17:00 UTC (Fri)
by Karellen (subscriber, #67644)
[Link] (13 responses)
I wonder, why not mount the new overlay on top of the old overlay, and then unmount the old overlay from beneath it, instead? That should provide the same atomic, no possibility of missing overlay, semantics, but without the extra flags and options, shouldn't it? If unmounting covered filesystems is not currently allowed, then some new code may be needed to allow that, but it's not immediately obvious to me why that approach would be longer or more complex than the proposed "tucking" mechanism?
Posted Mar 31, 2023 20:59 UTC (Fri)
by epa (subscriber, #39769)
[Link] (10 responses)
Posted Mar 31, 2023 21:14 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Apr 1, 2023 6:41 UTC (Sat)
by epa (subscriber, #39769)
[Link]
Posted Apr 1, 2023 10:49 UTC (Sat)
by Karellen (subscriber, #67644)
[Link] (7 responses)
Posted Apr 1, 2023 20:03 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (6 responses)
Posted Apr 3, 2023 12:57 UTC (Mon)
by mgedmin (subscriber, #34497)
[Link] (1 responses)
Posted Apr 3, 2023 17:49 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link]
Posted Apr 3, 2023 22:13 UTC (Mon)
by stefanha (subscriber, #55072)
[Link] (3 responses)
This seems fragile to me, but then, in place updates are always tricky.
Posted Apr 4, 2023 7:38 UTC (Tue)
by smurf (subscriber, #17840)
[Link] (2 responses)
Posted Apr 9, 2023 7:49 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
One wonders why userspace container systems don't just put a symlink inside of the filesystem image. You could have, for example, two mount points A and B, and a symlink to whichever mount point is currently in use. When you want to upgrade to a new version, you mount it on the other mount point, and then flip the symlink. Perhaps the container software doesn't want to dictate the use of such symlinks? But that seems like a rather... insubstantial reason to make this the kernel's problem, IMHO.
Posted Apr 9, 2023 10:46 UTC (Sun)
by smurf (subscriber, #17840)
[Link]
On the other hand if you have a tuck-mount you can do an atomic unmount afterwards, but only if there's no current user. That guarantees that you don't run into inconsistent libraries et al. even when you can't control exactly when a program (re)starts.
Posted Apr 1, 2023 12:56 UTC (Sat)
by smurf (subscriber, #17840)
[Link]
When mounting on top there probably are open files on the overlaid file system. You now have a mix of open old and unopened new content, which is a surefire recipe for eventual disaster.
Posted Apr 4, 2023 13:38 UTC (Tue)
by nim-nim (subscriber, #34454)
[Link]
IMHO the article does not state clearly that local changes (for example /etc) are likely to exist in a RW layer mounted over the old overlay, so you still need to tuck under it. Nobody really succeeded in strict separation between RO and RW directories for complex setups, that's an ideal people are still chasing.
Also since I suppose unmounting the old overlay is faster than mounting the new one, that means you can wait for the new overlay to be mounted under the old one in all affected containers in the system, and then pop the old one in one go.
Posted Mar 31, 2023 21:43 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Posted Apr 1, 2023 7:17 UTC (Sat)
by tchernobog (guest, #73595)
[Link] (1 responses)
Posted Apr 1, 2023 20:13 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link]
Posted Apr 7, 2023 18:25 UTC (Fri)
by bbbush (subscriber, #17456)
[Link]