Mounting images inside a user namespace
There has long been a desire to enable users to mount filesystem images without requiring privileges, but the security implications of allowing it are seriously concerning. Few, if any, kernel filesystems are hardened against maliciously crafted images, after all. Lennart Poettering led a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit where he presented a possible path forward.
He started with an overview of the problem, noting that "everybody wants to be able to mount disk images that contain arbitrary filesystems" in user space, without needing to be root. Since malicious images could crash the kernel—or worse—the only way to do that is to establish some trust in the image before it gets mounted. He talked about some components that the systemd developers want to add that would allow container managers and other unprivileged user-space programs to accomplish this.
More specifically, code that is running in a user namespace can ask the host operating system to mount a filesystem stored in the contents of a particular file. It will require that containers have some limited access to an interprocess communication (IPC) mechanism to talk to the host OS. That is different than today's containers, which generally can only use the kernel API and, perhaps, communicate in a limited way with their container manager, he said.
There are multiple use cases for this feature, including unprivileged container managers that want to run containers from disk images, but also for tools that build container images. There are desktop application runtimes that want to be able to run apps from images as well. Essentially, any tool that wants to be able to work with disk images, but not have special privileges, could benefit.
There are a number of complexities for any solution. Some kind of trust needs to be established in the images before they get mounted; immutable images using dm-verity are easier in that regard, but there is a desire to also have limited support for writable images. Minimizing or eliminating the need for the host to enter the caller's namespace in order to perform the mount is also desirable. Recursion in the form of nested containers should be supported without needing to resort to special cases, as well, he said.
Poettering described how this all might work. An unprivileged process P, which might be a container manager, creates a user namespace U, but does not give U any user/group mappings. It then passes a file descriptor for U through an IPC mechanism to a service on the host, X, which could be a privileged process provided by systemd; X assigns a transient UID/GID range (64K of each, for example) to U. These transient ranges are a "key idea" of the feature; the transient ranges only last as long as the user namespace does and they are recycled when it goes away, unlike persistent UID/GID ranges. It is "dramatically different" to the way these ranges are handled today, he said.
X enforces a security policy on U that only allows a small subset of filesystem operations (open() for create, chmod(), and "a couple of other things") and only on mounts that appear in an allowlist, which is initially empty. So, initially, P cannot create any files. P can talk to Y, which is a different service, via IPC, passing it a file descriptor to U and another descriptor of an image file it would like to mount. It gets back a file descriptor, like one returned from fsmount() (in the new mount API), that corresponds to the mounted image with the ID-mapping from U already applied (using ID-mapped mounts). Y talks to X to get this new mount added to the allowlist and P can attach the mount file descriptor wherever it wants and join U if it has not already done so.
It looks like a lot of steps, he said, but for a client application it is fairly easy. The client simply makes an IPC call to get the user namespace set up and then a second one to get the mount. It can pass multiple images to Y to get multiple mounted filesystems and then it can attach them wherever makes sense in its directory hierarchy.
Instead of X and Y, he got more specific; he used the placeholders because the concept is entirely generic, so it could be implemented in other ways. For systemd, X would be systemd-userdbd and Y would be a new systemd-mntfsd service. The security policy he described for systemd-userdbd would be implemented using the BPF Linux security module (BPF-LSM). The images to be mounted by systemd-mntfsd would be in the discoverable disk image (DDI) format. More information about DDI (and other surrounding efforts) can be found in the report from last year's Image-Based Linux Summit.
These images have a GPT partition table and are separated into several partitions. One partition is for the filesystem, while another has the dm-verity information. There is a third partition with a signature for the root-level hash of the filesystem, which gets verified by the kernel using its internal keyring. If it passes, systemd-mntfsd will set up the filesystem and dm-verity, apply the user mapping, and return it to the requesting process. DDI makes it convenient to wrap each of those three parts together into a single image.
Another mechanism for trusting images would be to have a trusted directory on the host. Since only privileged processes should be able to write into that directory, systemd-mntfsd could be configured to allow requests to mount images from there. That provides a weaker level of trust but may be fine for some systems, he said.
Those two options (signed DDI and trusted directories) are already implemented and should appear in the next release of systemd. Another mechanism, which would allow mounting writable filesystems, is still being worked on. The idea would be that the requester (perhaps a tool building images) asks for a filesystem of a certain type and size that would be stored in a provided image file, which systemd-mntfsd would create (using mkfs) in the file; it would then add a dm-integrity sidecar file that tracks the changes to the filesystem image. Dm-integrity would use a hash with a key that is not accessible to the caller, so the sidecar file can only be (correctly) updated by the kernel. The caller can provide the image and the sidecar file at a later point and the mount service will be willing to mount it again. If the sidecar file is not present (or is corrupted), the image will not be mounted.
He was asked about using signed fs-verity files as well. He said that it is all being done in user space, so other mechanisms could be added if they make sense. His goal is generally to let the kernel make these trust decisions based on keys on its keyring, rather than "doing trust enforcement in user space", but others may want to do things differently.
Ted Ts'o suggested that systemd-mntfsd could copy an image file to a block device that is inaccessible to the requester, then run fsck on the filesystem image. If it passes that check, it could be mounted in a suitable fashion (e.g. nosuid, nodev) and handed off to the container without needing to use dm-verity. Poettering said that fsck is already being used in the writable case, "but it was news to me that this is the philosophy that filesystem engineers subscribe to". He noted that other filesystem developers were "shaking their heads", so he did not think that there was universal agreement that fsck was sufficient to detect malicious images.
Ts'o said that it would depend on the filesystem, so Poettering tried to get a commitment about ext4, but Ts'o hedged things a bit. He is "reasonably confident" that it is not possible to cause a "buffer overrun or privilege-escalation attack" with an ext4 filesystem that passes fsck. Denial-of-service due to an overly fragmented filesystem would be a possibility, though, so it "depends on what your threat model is", he said. Josef Bacik said that he just comes from a standpoint of being paranoid. He trusts that the Btrfs fsck does a good job to ensure that there is a valid filesystem, but it, like him, is imperfect. It sounds like a good solution, but he would be leery of trusting it in a high-security situation.
Jeff Layton asked about network filesystems. Poettering thought that might be less worrisome, but Layton assured him that it would not be. There is interest in being able to pass a directory file descriptor to systemd-mntfsd, which will bind-mount to that directory, apply the UID mapping, and return that to the requester, Poettering said. That is not particularly risky because the filesystem is already mounted in the system, which is perhaps analogous to the network-filesystem case. But it turns out that none of the network filesystems implement ID mapping, though Christian Brauner said that he had gotten it working for CephFS (with some caveats).
Layton said that a malicious server was just as bad or worse than a malicious image, but that NFS had recently added TLS support. One way to establish trust in that environment would be to only allow servers that can present a properly signed TLS certificate. David Howells raised the automounter as another thing to consider, while Steve French mentioned SMB. Poettering said that if there is a need to mount these kinds of things in containers, they can be added, "as long as there's some kind of sensible security story in place".
There is an unresolved problem that has cropped up, he said. LSMs cannot restrict manipulations of access-control lists (ACLs), so it is a way that the transient IDs in the user namespace (U above) could leak out into the rest of the system in a persistent fashion. Perhaps it is not a big problem, he said, but all of the other ways that these IDs can be persistently associated with filesystem objects (e.g. chown()) are being blocked. He is not too concerned, but it is a low-severity vulnerability.
He gave a demo at around 19:10 in the video of the session. He started systemd-userdbd in one window, systemd-mntfsd in another, and then handed a disk image to systemd-dissect, which mounted it using the new mechanism and then pulled it all apart. He ran it as an unprivileged user "and it just works". The user IDs are handled correctly and it is all "extremely simple". Furthermore, it is something of a showcase of recent kernel features, such as the new mount API (across namespaces) and BPF-LSM; they and a few others can be combined to provide this long-sought feature.
He is pleased with the result, because "it is tiny", is socket-activated so it is not running all of the time, and there is just a single socket for IPC that needs to be bind-mounted into container to make it all work. Brauner pointed out that the superblock is not owned by the user namespace where the mount is being done, "which means that all of the destructive ioctl()s" that exist for Btrfs or XFS are not available to the container. But the container does own the mount, which means it can unmount it. The ownership of the mount is separate from the ownership of the superblock, he said, which is a nice side effect.
An attendee asked whether the containers would have access to the image files after the mount had been done. If so, a container could modify the image, thus potentially crash or compromise the kernel that way. Poettering said that the containers may have access to those files, since they might own them, but that dm-verity is meant to prevent any changes; if the image file is changed, any read of that region will return an error. Other mechanisms, such as fs-verity and dm-integrity, would also provide that kind of protection. He noted that in the fsck scenario, Ts'o had said that the image would need to be copied to a location inaccessible to the container.
The session ended with a quick discussion of how a network filesystem might be mounted in a separate network namespace for the container. Poettering said that it was something to work out with the network-filesystem developers, since it would need to be a mount option of some sort. Howells said that it would straightforward to do that using the new mount API if it were deemed desirable.
| Index entries for this article | |
|---|---|
| Kernel | Containers |
| Kernel | Filesystems/Mounting |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2023 |
