User namespaces + overlayfs = root privileges

By Jake Edge
January 13, 2016

The user namespaces feature is conceptually fairly straightforward—allow users to run as root in their own space, while limiting their privileges on the system outside that space—but the implementation has, perhaps unsurprisingly, proven to be quite tricky. There are some assumptions about user IDs and how they operate that are deeply wired into the kernel in various subsystems; shaking those out has taken some time, which led to some hesitation about enabling the feature in distribution kernels. But that reluctance has largely passed at this point, which makes the recent discovery of a root-privilege escalation using user namespaces and the overlay filesystem (overlayfs) that much more dangerous.

The basic idea, as described by "halfdog" in a blog post, is that a regular user can create new mount and user namespaces, mount an overlayfs inside them, and exploit a hole in the overlayfs implementation to create a setuid-root binary that can be run from outside the namespace. Effectively, a regular user can create a root-privileged binary and do whatever they want with it—thus it is a complete system compromise.

The exploit uses another property of namespaces that has always seemed like something of a bug: the /proc filesystem provides a route for processes outside of a namespace to "see" inside it. In this case, the overlayfs mounted inside the namespace can be accessed from outside of it by using information on the mounts inside the namespace via /proc/PID/cwd. As halfdog put it:

For the above exploit to work, not only exposure within the namespace is required, a process from outside uses /proc to access the mounts which should be visible only to processes within the namespace. This is by itself already a risk but might be also a security vulnerability by itself, worth fixing.
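The /proc window can be seen without any namespaces at all. What follows is a minimal sketch, not taken from halfdog's exploit: a child process changes its working directory and the parent resolves /proc/PID/cwd to find out where it went. The function name peek_child_cwd() is made up for this illustration.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that changes into `dir`, then resolve
 * /proc/<child>/cwd from the parent. Returns 0 and fills `out` on success,
 * -1 on any error.
 */
int peek_child_cwd(const char *dir, char *out, size_t outlen)
{
    int ready[2];
    if (outlen == 0 || pipe(ready) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }
    if (child == 0) {
        close(ready[0]);
        if (chdir(dir) < 0)
            _exit(1);
        /* Tell the parent the chdir() is done, then wait to be killed. */
        if (write(ready[1], "x", 1) != 1)
            _exit(1);
        for (;;)
            pause();
    }

    close(ready[1]);
    char byte;
    int rc = -1;
    /* read() returns 0 (EOF) if the child exited before signalling. */
    if (read(ready[0], &byte, 1) == 1) {
        char link[64];
        snprintf(link, sizeof(link), "/proc/%ld/cwd", (long)child);
        ssize_t n = readlink(link, out, outlen - 1);
        if (n >= 0) {
            out[n] = '\0'; /* readlink() does not NUL-terminate */
            rc = 0;
        }
    }
    close(ready[0]);
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return rc;
}
```

Calling peek_child_cwd("/", buf, sizeof(buf)) should leave "/" in buf. In the exploit, the same path is used as a directory rather than just read as a link, which is what lets the outer process reach files on a mount that exists only inside the namespace.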

The exploit [C program] works like this:

1. New mount and user namespaces are created for the process.
2. That process then mounts an overlayfs atop /bin using temporary directories for the overlayfs "upperdir" and "workdir" directories. A writable overlayfs must have both of these directories; upperdir holds the files/directories that have been changed, while workdir is used as a work space to enable atomic overlayfs operations.
3. The process inside the namespaces changes its working directory to the overlayfs, thus making it visible outside of the namespaces by way of /proc/PID/cwd.
4. The process changes the su binary (in /bin) to be world-writable, but does not change the owner. That results in a new file being created in the upper overlay directory.
5. A process outside of the namespaces writes anything it wants to that file without changing the setuid bit (more on that coming).
6. The outer process then runs that su with root privileges.
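Only the first entry in that list, creating the namespaces, is sketched here, using the standard unshare() call and a UID map written to /proc/self/uid_map. This is an illustration under stated assumptions rather than the code from halfdog's program; the function names are hypothetical, and a kernel or distribution that restricts unprivileged user namespaces will simply refuse the request.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write a short string to a /proc file; returns 0 on success, -1 on error. */
static int write_proc_file(const char *path, const char *text, size_t len)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/*
 * Hypothetical helper: returns 1 if a forked child could enter new user and
 * mount namespaces and become UID 0 there, 0 if the kernel or its
 * configuration refused, and -1 if fork() or waitpid() failed.
 */
int try_user_and_mount_namespaces(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Must be read before unshare(): afterwards the UID is unmapped. */
        uid_t outer_uid = geteuid();

        if (unshare(CLONE_NEWUSER | CLONE_NEWNS) < 0)
            _exit(1);

        /* Map UID 0 inside the namespace to our ordinary UID outside. */
        char map[64];
        int len = snprintf(map, sizeof(map), "0 %lu 1\n",
                           (unsigned long)outer_uid);
        if (write_proc_file("/proc/self/uid_map", map, (size_t)len) < 0)
            _exit(1);

        /* "Root" here, but still the ordinary user outside. */
        _exit(getuid() == 0 ? 0 : 1);
    }

    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

The work is done in a forked child so that the caller's own namespaces are left alone. A return value of 1 means the child saw itself as UID 0, which is the starting point for the mount in the second entry.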

That seems reasonably straightforward, but there is one difficulty: writing to a root-owned, setuid-enabled file from a non-root process will remove the setuid bit, which defeats the whole thing. So a variant of another exploit (SetgidDirectoryPrivilegeEscalation) described by halfdog is used to trick the setuid-root mount program into writing an ELF binary to the file. Since mount is setuid-root and thus runs with root privileges, the write doesn't remove setuid, resulting in a setuid-root program with contents controlled by a regular user (attacker).
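The revocation rule is easy to observe on a scratch file. The following is a minimal sketch, assuming a writable /tmp; setuid_survives_write() is a name invented for this example. It sets the setuid bit on a file the caller owns, writes one byte, and reports whether the bit is still there.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a scratch file, set its setuid bit, write to
 * it, and report what happened. Returns 1 if the setuid bit survived the
 * write, 0 if the kernel cleared it, and -1 on error.
 */
int setuid_survives_write(void)
{
    char path[] = "/tmp/suid-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int result = -1;
    struct stat st;

    /* The file is ours, so we are allowed to set the setuid bit on it. */
    if (fchmod(fd, 04755) == 0 && fstat(fd, &st) == 0 &&
        (st.st_mode & S_ISUID)) {
        /* One byte is enough to exercise the check on the write path. */
        if (write(fd, "x", 1) == 1 && fstat(fd, &st) == 0)
            result = (st.st_mode & S_ISUID) ? 1 : 0;
    }

    close(fd);
    unlink(path);
    return result;
}
```

Run as an ordinary user on a typical Linux system, this would be expected to return 0: the bit is gone after the write. Run as root, it would normally return 1, which is the asymmetry that the mount trick relies on.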

Because the process inside the namespaces made the file world-writable, a regular user outside the namespaces can run mount with its stderr hooked to the file of interest (which, crucially, only requires an open() that doesn't revoke setuid). Then, when mount writes to the file, it is doing so as root, so it doesn't trigger the setuid revocation either. Writing the file from inside the namespaces would be easier, since it would not require the /proc/PID/cwd dance, but it would not work: the write would be done as root inside the user namespace, which is not the same as root outside of it, so the setuid revocation would still occur.
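The split between who opens the file and who writes to it is ordinary stderr redirection. Here is a minimal sketch of that mechanism, with a harmless shell command standing in for mount; run_with_stderr_to() is a hypothetical name, and nothing here is specific to setuid programs.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `sh -c command` with its stderr attached to
 * `path`. The open() happens here, in the caller; every write() to the
 * file happens later, in the child. Returns the child's exit status, or
 * -1 on error.
 */
int run_with_stderr_to(const char *path, const char *command)
{
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        close(fd);
        return -1;
    }
    if (child == 0) {
        /* Make the already-open file the child's stderr. */
        if (dup2(fd, STDERR_FILENO) < 0)
            _exit(127);
        if (fd != STDERR_FILENO)
            close(fd);
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127); /* only reached if execl() failed */
    }

    close(fd);
    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_with_stderr_to(path, "echo hello >&2") on an existing, writable file appends the message to it. The caller never calls write() on the descriptor; whatever the child prints to stderr is written with the child's credentials.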

Perhaps a final entry could be made to the list above: "Profit!".

The fix is fairly simple; it was committed by Miklos Szeredi in early December and was merged into 4.4-rc4 (without any mention of the security implications). According to Al Viro's explanation, quoted in the commit message, overlayfs was "too enthusiastic about optimizing ->setattr() away". It combined two operations that should have been done separately, which led to the creation of the setuid-root file in the upper overlay filesystem. After the fix it won't be possible to change su to be world-writable while still retaining root ownership and the setuid bit in the overlay.

The second exploit that was used to write the file appears to not yet have been fixed. Halfdog's description points to some email messages discussing the problem and possible solutions with several kernel developers, but no real resolution is evident. Obviously, if there were no way to write the overlay file, this particular avenue for exploiting the overlayfs bug would not have worked, but it is unknown if there are more ways to skin that particular cat.

All of the myriad interactions between various kernel subsystems (capabilities, namespaces, security modules, filesystems, /proc, and so on), especially given the "no regressions for user space" policy, make these kinds of bugs pretty much inevitable. One could imagine tossing out a bunch of currently expected behavior to reduce the complexity of those interactions, but that is not going to happen, so these types of problems will crop up from time to time.

This episode is also a reminder of the ingenuity that can go into an exploit of this kind. Other user namespace exploits and, indeed, exploits of bugs in other programs and kernel features have shown similar levels of cleverness. Often stringing together a few seemingly low-severity vulnerabilities results in something that almost appears as if it has exceeded the sum of its parts. Of course, white hats are not the only ones with the required level of skill, which makes efforts like the Kernel self-protection project, as well as other analysis and hardening projects, that much more important.


User namespaces + overlayfs + ubuntu = root privileges

Posted Jan 13, 2016 21:23 UTC (Wed) by ebiederm (subscriber, #35028)

Mainline kernels are not affected as they do not allow mounting overlayfs with only user namespace privilege. Only Ubuntu was affected.

In that context, it seems reasonable that the mainline commit messages do not talk about a problem that does not and did not exist in the kernel that was being modified.

User namespaces + overlayfs = root privileges

Posted Jan 13, 2016 21:55 UTC (Wed) by nybble41 (subscriber, #55106)

> The exploit uses another property of namespaces that has always seemed like something of a bug: the /proc filesystem provides a route for processes outside of a namespace to "see" inside it.

This actually seems to me like the normal and expected operation of a namespace: processes outside the namespace can see into it, but processes inside the namespace cannot see out. It wouldn't make sense, for example, for a process to be able to create a PID namespace to hide child processes from the original user. Running processes inside a namespace is about limiting those processes, not the ones outside the namespace. Of course, everything needs to be translated properly so that outside processes looking into a namespace see the correct user IDs and so forth.

As for the issue of tricking mount—or probably any number of other programs—into writing to an inherited file descriptor for a SUID file, wouldn't it make more sense to revoke the SUID bit when the file is first opened for write access by a non-root process, rather than waiting until data is actually written? The target program wouldn't even need to be SUID, if it can receive file descriptors from non-root processes some other way. Unix domain sockets (as used in DBUS) come to mind as a possible attack vector.

User namespaces + overlayfs = root privileges

Posted Jan 14, 2016 2:30 UTC (Thu) by clopez (guest, #66009)

OverlayFS was introduced in https://git.kernel.org/linus/e9be9d (v3.18-rc2)

So it doesn't affect any kernel < 3.18.
Debian stable has 3.16.

User namespaces + overlayfs = root privileges

Posted Jan 14, 2016 8:25 UTC (Thu) by alexl (subscriber, #19068)

Not to mention that Debian requires you to explicitly enable the kernel.unprivileged_userns_clone sysctl for non-privileged user namespace support, and the fact that even then you can't mount overlayfs (see Eric's comment above).

User namespaces + overlayfs = root privileges

Posted Jan 14, 2016 17:46 UTC (Thu) by iabervon (subscriber, #722)

The step that seems most surprising to me is that a process in namespace A is affected by the setuid bit on a file in namespace B; I'd expect the VFS to treat files from a namespace not in your ancestry as if they were on a nosuid,nodev mount.

On the other hand, the fact that it's possible to look into another namespace, but it's not obvious that you can, is a poor situation; it's hard to remember the security design when some things that are not actually prohibited are hard or awkward to do.

User namespaces + overlayfs = root privileges

Posted Jan 14, 2016 18:48 UTC (Thu) by nybble41 (subscriber, #55106)

> The step that seems most surprising to me is that a process in namespace A is affected by the setuid bit on a file in namespace B; I'd expect the VFS to treat files from a namespace not in your ancestry as if they were on a nosuid,nodev mount.

I would expect SUID/SGID files in subordinate namespaces to work normally, with the caveat that they are SUID/SGID to the corresponding unprivileged user and/or group outside the namespace and not the privileged user/group they appear to belong to inside the namespace. Note that normal users can create SUID/SGID binaries if they have write access to any non-nosuid filesystem; they just can't make them SUID to other users or groups of which they are not members, such as root. It is perfectly possible to create a binary which can be run by other users with your own UID and/or one of your groups (something to look out for if you're trying to revoke permissions).

Forcing nodev for filesystems mounted by users who are not privileged in the root namespace does make a lot of sense, however. I would go so far as to say that it ought to be the default, with root namespace privileges required to enable device support. In most systems there are only two filesystems which should contain device nodes: /dev and /dev/pts.

User namespaces + overlayfs = root privileges

Posted Jan 15, 2016 1:29 UTC (Fri) by butlerm (subscriber, #13312)

Wouldn't it be safer to ignore SUID / SGID bits if the file in question lacked a UUID in an extended attribute that matched a UUID assigned to the corresponding superuser or group, respectively?

Then presumably a user in one namespace could mount a filesystem created in a different namespace (possibly on a different system), and the security bits in question would be silently ignored, for failure to match the corresponding security identifiers?

And if you really wanted those bits to take effect, you would go change the extended attributes to match the UUIDs of the superuser and/or group in the appropriate name space?