Filesystem mounts in user namespaces — one year later

By Jonathan Corbet
August 17, 2016

User namespaces allow an ordinary user to set up an environment in which they appear to be running as root and have access to all operations normally reserved to privileged users — with exceptions where such access cannot be made safe for the system as a whole. One of those exceptions is mounting of filesystems with backing store (in other words, most real filesystems as opposed to in-memory filesystems like tmpfs). Work on enabling filesystem mounts in user namespaces has been underway for over a year. While these mounts are still disallowed in the 4.8 kernel, many of the infrastructural changes needed to eventually allow them have been merged.

There are numerous potential problems with allowing filesystem mounts in user namespaces. Most of these result from the fact that user and group ID values are mapped to different values inside user namespaces; this feature allows root access to be turned into something unprivileged outside of the namespace, but it also creates the potential for grave confusion if objects in the filesystem are accessed from multiple namespaces. Additionally, a great deal of privilege information is encoded in filesystems; this includes setuid executables, file capabilities, and security labels. Making filesystem mounts in user namespaces safe requires ensuring that there are no corner cases where privilege can leak across a namespace boundary.

Special and setuid files

As was the case a year ago, the patch set starts by adding a new s_user_ns field to the superblock structure used to represent a mounted filesystem. This field allows privilege-checking code to detect filesystems mounted inside a namespace other than the initial namespace and adjust its privilege checking accordingly.

One obvious problem area with unprivileged filesystem mounting is device files. A device special file gives direct access to a device on the system, subject to the protections applied to the file. A filesystem mounted in a user namespace may be fully under the control of an unprivileged user, who could happily populate it with device files that just happen to allow full access; that, in turn, would be a direct path to a total compromise of the system. Developers working on user namespaces wisely figured out some time ago that this particular feature might be seen as undesirable by the wider community and have put a number of defensive mechanisms in place.

The kernel has long had the ability to block the use of device special files on a given filesystem; it is a matter of setting the nodev flag at mount time. That flag is automatically set now in the few situations (such as tmpfs) where filesystem mounts are allowed inside user namespaces, but it has long been seen as a dangerous area, since there is always potential that the filesystem could be remounted in a way that removes that flag. In 4.8, the nodev flag has been moved internally from the vfsmount structure (representing a specific mount point) to the superblock structure, of which there is a single instance for all points where a given filesystem is mounted. That makes it impossible to strip the nodev flag away once it has been set and, hopefully, heads off attacks based on device special files.
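The effect of that move can be illustrated with a toy model. What follows is a sketch in Python, not kernel code; the class names, the string-valued mount flags, and the remount helper are all invented for illustration. It assumes that device files are usable only when neither the mount point nor the shared superblock forbids them.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Superblock:
    """One instance per mounted filesystem, shared by all of its mount points."""
    name: str
    nodev: bool = False  # hypothetical stand-in for a superblock-level flag


@dataclass
class Mount:
    """One instance per mount point; several may share one superblock."""
    sb: Superblock
    path: str
    flags: Set[str] = field(default_factory=set)  # per-mount flags


def device_files_usable(mnt: Mount) -> bool:
    """Device special files work only if neither level forbids them."""
    return not mnt.sb.nodev and "nodev" not in mnt.flags


def remount_without_nodev(mnt: Mount) -> None:
    """Model a remount that strips the per-mount flag."""
    mnt.flags.discard("nodev")


sb = Superblock("tmpfs-in-userns", nodev=True)
mnt = Mount(sb, "/mnt/a", {"nodev"})
remount_without_nodev(mnt)
print(device_files_usable(mnt))  # prints False
```

Stripping the per-mount flag changes nothing here, because the restriction recorded on the superblock is consulted on every path to the filesystem.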

Once a user mounts a filesystem within a user namespace, said user can create setuid files and files with capability masks. As long as those files are contained within that namespace, all is well; they will not give the user any privileges they do not already have. But perils await if those files can somehow be accessed from outside of that namespace and, unfortunately, there are paths (files under /proc, for example) where that might happen. To avoid excessive consumption of CVE numbers in this area, the 4.8 kernel will refuse to respect setuid flags, setgid flags, or file capabilities on a file if that file is accessed from anything other than the namespace in which it was initially mounted.
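A rough model of that rule might look like the following Python sketch. It is not the kernel's implementation: the UserNS class and both function names are hypothetical, and the sketch assumes that namespaces nested below the mounting namespace count as being "inside" it, a detail the description above does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # compare namespaces by identity, not by field values
class UserNS:
    name: str
    parent: Optional["UserNS"] = None


def in_userns(current: UserNS, target: UserNS) -> bool:
    """True if 'current' is 'target' or one of its descendants."""
    ns: Optional[UserNS] = current
    while ns is not None:
        if ns is target:
            return True
        ns = ns.parent
    return False


def honors_setuid(mount_ns: UserNS, accessor_ns: UserNS, nosuid: bool = False) -> bool:
    """Respect setuid/setgid bits and file capabilities only for accesses
    made from within the namespace that owns the mount."""
    return not nosuid and in_userns(accessor_ns, mount_ns)


init_ns = UserNS("init")
container = UserNS("container", parent=init_ns)
nested = UserNS("nested", parent=container)
```

Under this model a setuid file on a filesystem mounted by the container is honored inside the container, but not when a process in the initial namespace reaches it by some other path.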

The next problem area is security labels used by security modules like SELinux or Smack. Security labels are not mapped at namespace boundaries like user and group IDs are, so there is potential for trouble if a process within a user namespace is able to gain security labels that it would not otherwise have. Running executables can be a path by which these labels are acquired, so, once again, the system has to be careful about what it allows within a user namespace, where the files may be entirely under an attacker's control. It must also be impossible for labels created within a namespace to be used to elevate privilege outside that namespace.

Early work in this area simply disabled security labels entirely, but that had unfortunate effects of its own, often rendering filesystems entirely unusable. There seems to be agreement that the ideal solution would be to look at the label applied to the filesystem's backing store and use it for all files within the filesystem in a user namespace. That would give processes in the namespace the same access that they would have outside of it. Unfortunately, it seems to be difficult to obtain that information at the points in the kernel where labels need to be checked, so that approach cannot be used.

The second-best approach is to use the label of the process that mounted the filesystem. Once again, that will keep the process from exceeding the privilege it has anyway. If Smack is running, it will deny access to files whose label does not match the mounting process's label. For SELinux, at this point, the label applied to the mount point applies through the entire filesystem and the labels on individual files are simply ignored. There are suggestions that the SELinux approach may gain sophistication over time.

User and group IDs

There is yet another set of potential pitfalls around user and group IDs in general. When a user namespace is created, a set of ID mappings is set up along with it. Those mappings translate in-namespace user and group IDs to their equivalents outside of the namespace. A user with ID 1000 may start a user namespace where 1000 in the wider world is mapped to 0 (root) inside the namespace; other IDs can be mapped as well. One result of this mapping is a restriction in the range of valid user and group IDs. In the initial namespace, any ID is considered valid by the kernel; inside a user namespace, instead, only IDs that have been explicitly mapped are valid. That is a significant semantic change in how these IDs are handled.
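The translation can be sketched in a few lines of Python. The extent layout used here (first ID inside, first ID outside, count) follows the three columns of /proc/<pid>/uid_map; the function names and the second extent in the example map, a block of 65536 subordinate IDs starting at 100000, are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (first ID inside the namespace, first ID outside, number of IDs)
Extent = Tuple[int, int, int]


def to_outside(mapping: List[Extent], inside_id: int) -> Optional[int]:
    """Translate an in-namespace ID to its outside equivalent, if mapped."""
    for inside, outside, count in mapping:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None  # not a valid ID inside this namespace


def to_inside(mapping: List[Extent], outside_id: int) -> Optional[int]:
    """Translate an outside ID to its in-namespace equivalent, if mapped."""
    for inside, outside, count in mapping:
        if outside <= outside_id < outside + count:
            return inside + (outside_id - outside)
    return None  # this outside ID has no identity inside the namespace


# User 1000 appears as root inside; inside IDs 1..65536 are backed by a
# hypothetical range of otherwise unused outside IDs.
example_map: List[Extent] = [(0, 1000, 1), (1, 100000, 65536)]
```

With this map, outside ID 1000 becomes 0 inside the namespace, while the real root user (outside ID 0) has no in-namespace equivalent at all.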

When a process within a user namespace accesses a filesystem mounted outside that namespace, its user and group IDs will be mapped accordingly before any access decisions are made. If the filesystem has been mounted within the namespace, though, that mapping should not happen. But how should the kernel respond if that filesystem contains user and group IDs that are not considered valid within the namespace? The answer, in general, is that those values are mapped to the special INVALID_UID and INVALID_GID values, both of which are defined to be -1. That change prevents unknown IDs within the namespace from creating confusion outside the namespace, but it has a few hazards of its own.
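As a simplified illustration (it ignores the kernel's internal ID representation, and the function names are hypothetical), the handling of unmapped on-disk IDs could be modeled like this. The sketch assumes a 32-bit unsigned ID type, so that -1 is stored as the all-ones value 4294967295.

```python
from typing import List, Tuple

# (first ID inside the namespace, first ID outside, number of IDs)
Extent = Tuple[int, int, int]

# -1 stored in an unsigned 32-bit ID type (assumed width)
INVALID_UID = 0xFFFFFFFF


def valid_in_namespace(mapping: List[Extent], uid: int) -> bool:
    """An ID is valid inside the namespace only if some extent covers it."""
    return any(inside <= uid < inside + count for inside, _outside, count in mapping)


def load_owner(mapping: List[Extent], disk_uid: int) -> int:
    """Owner recorded for a file read from a filesystem mounted in the
    namespace: the on-disk ID if it is valid there, INVALID_UID otherwise."""
    return disk_uid if valid_in_namespace(mapping, disk_uid) else INVALID_UID


# Hypothetical namespace in which only IDs 0..999 are mapped.
small_map: List[Extent] = [(0, 100000, 1000)]
```

Here a file owned on disk by ID 500 keeps that owner, while files owned by 5000 and by 6000 both end up with INVALID_UID and can no longer be told apart by owner.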

For example, the "protected symlinks" mechanism, if enabled, restricts the following of symbolic links found in sticky, world-writable directories such as /tmp: a process may follow such a link only if it owns the link itself or if the link's owner matches the owner of the directory. This restriction (along with a couple of others) is meant to thwart temporary file vulnerabilities and related unpleasantness when a privileged process is fooled into opening a file it was not meant to open. But consider a situation where the link and containing-directory owners do not match, but where both are owned by IDs that are not valid in the namespace; they will both be mapped to INVALID_UID and will, within the namespace, appear to match. The function may_follow_link(), where these checks are made, has been duly changed to ensure that the IDs are valid.
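The hazard and its fix can be shown with a small Python model of the check. This is a sketch rather than a transcription of the kernel function: the parameter names and the check_valid switch are invented, and the sketch assumes the usual protected-symlinks rules (the restriction applies only in sticky, world-writable directories, and a process may always follow links it owns).

```python
import stat

INVALID_UID = 0xFFFFFFFF  # -1 as an unsigned 32-bit value (assumed width)


def may_follow_link(follower_uid: int, link_uid: int, dir_uid: int,
                    dir_mode: int, check_valid: bool = True) -> bool:
    """Toy model of the protected-symlinks decision."""
    # Allowed if the follower owns the link.
    if follower_uid == link_uid:
        return True
    # Allowed if the directory is not both sticky and world-writable.
    sticky_ww = stat.S_ISVTX | stat.S_IWOTH
    if (dir_mode & sticky_ww) != sticky_ww:
        return True
    # Allowed if directory and link owners match; with check_valid set,
    # two invalid owners are not accepted as a match.
    if link_uid == dir_uid and (not check_valid or dir_uid != INVALID_UID):
        return True
    return False


TMP_MODE = 0o1777  # sticky and world-writable, like /tmp
```

With check_valid disabled, a link and directory whose distinct on-disk owners have both collapsed to INVALID_UID pass the owner comparison; with it enabled, the link is refused.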

Creating files with invalid IDs can only lead to trouble later on, so that must be prohibited. A more interesting problem arises when attempting to modify a file that already has an invalid ID. Any changes to a file will cause its inode structure to be written back to persistent storage, but the inode will contain the INVALID_UID and INVALID_GID values, which should not be written back. The solution adopted here is to simply disallow modification of files with invalid IDs altogether, rather than try to fix all the ways that a corrupt inode might be written. So such files will be read-only on filesystems mounted within a user namespace, even if their permissions would allow modification. There is one exception planned, though it has not been implemented yet: changing the IDs to valid values will be allowed; doing so will potentially make the file writable.
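In rough terms, the policy amounts to two extra checks layered on top of the ordinary permission bits. The following Python sketch uses hypothetical names throughout and assumes the all-ones 32-bit value for both invalid IDs; it does not model the planned exception for changing IDs to valid values.

```python
from dataclasses import dataclass

INVALID_ID = 0xFFFFFFFF  # stands in for both INVALID_UID and INVALID_GID


@dataclass
class Inode:
    uid: int
    gid: int


def has_invalid_id(inode: Inode) -> bool:
    """True if either owner ID could not be represented in the namespace."""
    return inode.uid == INVALID_ID or inode.gid == INVALID_ID


def may_create(uid: int, gid: int) -> bool:
    """New files may never be created with an invalid owner or group."""
    return uid != INVALID_ID and gid != INVALID_ID


def may_modify(inode: Inode, mode_allows_write: bool) -> bool:
    """Files with invalid IDs are read-only regardless of their mode bits,
    so an inode holding an invalid ID is never written back to storage."""
    return mode_allows_write and not has_invalid_id(inode)
```

A world-writable file whose owner is INVALID_ID is thus still refused for writing, while an ordinary file falls through to the usual permission result.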

Not done yet

The list of potential problems does not end there. Disk quotas must be handled properly and not be confused by invalid IDs. The integrity management subsystem must be taught to not be fooled by filesystems mounted in user namespaces. Access-control lists must be checked to ensure that they do not contain invalid IDs. And so on, including, presumably, some special cases that nobody has thought about. For this reason, Eric Biederman, who took over the patch set from Seth Forshee and made it suitable for merging, said in the pull request: "I am not considering even obvious relaxations of permission checks until it is clear there are no more corner cases that need to be locked down and handled generically." Finding all of those cases may take a while yet, and it may take even longer for developers to convince themselves that no traps remain.

But there is another aspect to the problem that was extensively discussed in 2015, but which has not come up at all this time around. Most Linux filesystems are simply not designed to be robust in the face of deliberately hostile on-disk filesystem images. As long as mounting a filesystem has been a privileged operation, that has not normally been a problem; getting an administrator to mount a random ext4 or XFS filesystem image can take quite a bit of social engineering. Once filesystem mounts are possible within user namespaces, that situation changes: any attacker with a local account can mount malicious filesystems. They will also be able to change the underlying filesystem at any time, further widening the range of potential mischief. Said mischief undoubtedly includes a number of opportunities to fully compromise the kernel.

Hardening filesystems against attacks from below is not a simple matter; such hardening has never been a design goal for the filesystems in common use. Doing so in a way that preserves performance will be even more difficult. So it is not really clear when it might truly be safe to allow unprivileged users to mount arbitrary filesystems. Even after the kernel reaches a point where this capability might be enabled, one might well expect distributors and administrators to keep it disabled for a long time. There is clear value in this feature, but it may be some time before it's made widely available.

Posted Aug 18, 2016 6:13 UTC (Thu) by bronson (subscriber, #4806)

> To avoid excessive consumption of CVE numbers

Love Corbet's writing.

Posted Aug 22, 2016 15:29 UTC (Mon) by cavok (subscriber, #33216)

He clearly knows the right end of the Zap-O-CVE.

Posted Aug 18, 2016 8:57 UTC (Thu) by vegard (subscriber, #52330)

> Most Linux filesystems are simply not designed to be robust in the face of deliberately hostile on-disk filesystem images.

> Hardening filesystems against attacks from below is not a simple matter; such hardening has never been a design goal for the filesystems in common use.

We're still working on getting our filesystem fuzzing-with-AFL code out (http://lwn.net/Articles/685182/) but currently it does not take more than a few hours to find kernel-crashing bugs in any of the most widely used filesystems (ext4, xfs, btrfs, etc.). Several other filesystems crash in a matter of seconds under our AFL-based fuzzer.

I've looked more closely at ext4 myself and have been trying to fix the bugs that came up (http://lists.openwall.net/linux-ext4/2016/07/14/5, http://www.spinics.net/lists/linux-ext4/msg53166.html) and Bo Liu has been fixing quite a few bugs in btrfs. Other filesystems are on the roadmap but the sheer volume of issues means we don't have the manpower/bandwidth to fix everything at once.

Even so, these are only the bugs found by fuzzing. I can't find the link right now but I think there was a paper where somebody deliberately inserted bugs in the code and tried to discover them with coverage-guided fuzzing and it still only found some 10% of the bugs.

In light of all this, I would be extremely wary of enabling unprivileged mounts. Automatic mounting of external media (still enabled by default on most desktop distros) is already quite bad enough.

Posted Aug 20, 2016 23:48 UTC (Sat) by ebiederm (subscriber, #35028)

The primary target for now remains fuse.

We may be able to support other filesystems eventually but fuse should be supportable now.
Do you have any fuzz testing with respect to fuse?

The work of handling the odd cases was performed with respect to the vfs and not fuse as the necessary changes are cleaner and more obvious that way.

Fuse presents a very interesting case as it allows isolating the filesystem code in userspace while still allowing that code to be used by all programs through the vfs.

Fuse can support all of the popular filesystems today, and as such provides a safer alternative to mounting filesystems on usb sticks. The user namespace mount aspect of this just makes this all more usable.

Posted Aug 19, 2016 21:13 UTC (Fri) by flussence (guest, #85566)

> As long as mounting a filesystem has been a privileged operation, that has not normally been a problem; getting an administrator to mount a random ext4 or XFS filesystem image can take quite a bit of social engineering.

Step 1: insert the provided BadUSB device.
Step 2: click the /media/not-evil-I-promise icon that appears in the file manager.
:)

Posted Aug 20, 2016 5:46 UTC (Sat) by thomas.poulsen (subscriber, #22480)

This situation appears similar to that of network filesystems. If a filesystem is exported to the local network, anyone who is root on a box on the network can mount it and create many of the same problems described here. Perhaps some of the concepts from nfs like exports and usermappings can be used in the user namespace situation.

Posted Aug 25, 2016 8:51 UTC (Thu) by sourcejedi (guest, #45153)

I don't think INVALID_UID is actually -1. UIDs are unsigned 32-bit ints. So it's not the well-known `nobody` user ((u16) -1), but it's the equivalent for 32-bit UIDs.

Posted Sep 6, 2016 7:52 UTC (Tue) by mgedmin (subscriber, #34497)

Nitpick: nobody is uid (u16)-2, i.e. 65534.