
A container-confinement breakout

By Jake Edge
March 6, 2019

The recently announced container-confinement breakout for containers started with runc is interesting from a few different perspectives. For one, it affects more than just runc-based containers: privileged LXC-based containers (and likely others) are also affected, though the LXC-based variety is harder to compromise than the runc ones. It also shows, once again, that privileged containers are difficult, perhaps impossible, to create in a secure manner. Beyond that, it exploits some Linux kernel interfaces in novel ways, and the fixes use a perhaps lesser-known system call that was added to Linux less than five years back.

The runc tool implements the container runtime specification of the Open Container Initiative (OCI), so it is used by a number of different containerization solutions and orchestration systems, including Docker, Podman, Kubernetes, CRI-O, and containerd. The flaw, which uses the /proc/self/exe pseudo-file to gain control of the host operating system (thus anything else, including other containers, running on the host), has been assigned CVE-2019-5736. It is a massive hole for containers that run with access to the host root user ID (i.e. UID 0), which, sadly, covers most of the containers being run today.

There are a number of sources of information on the flaw, starting with the announcement from runc maintainer Aleksa Sarai linked above. The discoverers, Adam Iwaniuk and Borys Popławski, put out a blog post about how they found the hole, including some false steps along the way. In addition, one of the LXC maintainers who worked with Sarai on the runc fix, Christian Brauner, described the problems with privileged containers and how CVE-2019-5736 applies to LXC containers. There is a proof of concept (PoC) attached to Sarai's announcement, along with another more detailed PoC he posted the following day after the discoverers' blog post.

As with many exploits, this one takes a rather convoluted path, but it can reliably compromise the host. It does this by either getting a user to create a container using a compromised container image or by having a user attach to a container (e.g. using "docker exec") that an attacker has had write access to. That begins the complicated dance.

The attacker sets up the entry point to the container (which is invoked when the container is created or accessed) to be a symbolic link to /proc/self/exe. That link will resolve to the runc binary in the host filesystem when it is run from, say, a /bin/bash that contains:

    #!/proc/self/exe

The effect will be to re-execute the runc command inside the container but using the binary from the host.
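To make the role of /proc/self/exe more concrete, here is a minimal Python sketch (not taken from any of the PoCs) of how the kernel reports a process's executable through /proc. The helper names are made up for illustration, and sleep merely stands in for any program; the sketch assumes a Linux system with /proc mounted.

```python
import os
import shutil
import subprocess


def running_binary(pid="self"):
    """Return the path the kernel reports for the executable of `pid`."""
    return os.readlink(f"/proc/{pid}/exe")


def child_binary_matches(command="sleep"):
    """Start `command` and check that /proc/<pid>/exe refers to its binary."""
    path = shutil.which(command)
    child = subprocess.Popen([path, "5"])
    try:
        # samefile() compares device and inode numbers, so it does not
        # matter if the installed path is a symbolic link (e.g. busybox).
        return os.path.samefile(f"/proc/{child.pid}/exe", path)
    finally:
        child.kill()
        child.wait()


if __name__ == "__main__":
    print("this interpreter:", running_binary())
    print("child link matches its binary:", child_binary_matches())
```

The link refers to the executable's file itself rather than to a path that gets looked up again, which is why following it from inside a container can still reach a binary that lives on the host filesystem.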

The attacker also needs to subvert one of the shared libraries used by runc inside the container; when runc gets re-executed inside the container, it will use the shared libraries from within it. In Sarai's PoC, he chose libseccomp, but other libraries used by runc could be chosen instead. The shared library is changed to add a new global constructor function that will be called after the library is loaded. This function will open /proc/self/exe for read (as it would get an ETXTBSY error if it tried to open it for write). That open() results in a file descriptor that can be accessed via /proc (e.g. /proc/self/fd/3).

Containers based on runc are easier to exploit than LXC-based ones because runc exits after it does its job, while lxc-attach waits around for its command to finish. That means the runc binary is not busy (in the ETXTBSY sense) for most of the time that the container runs. While this non-busy window is nearly always open for runc, it is much smaller for LXC: after the container command completes and lxc-attach exits, there is a small window that can be exploited for privileged LXC containers.

Once it has opened the file, the constructor then executes another binary (using execve()), which terminates the execution of runc. That allows this new binary to open /proc/self/fd/3 for write, since ETXTBSY is no longer returned after the execve(). Now the attacker has a file descriptor that can be used to write anything they want to the runc binary on the host. Game over.
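The ETXTBSY behavior that the exploit has to work around can be observed without touching runc at all. The following Python sketch, with made-up helper names, copies sleep into a scratch directory, runs the copy, and tries to re-open a read-only descriptor for writing through /proc/self/fd, both while the copy is running and after it has exited. It assumes a Linux kernel that denies writes to running executables, as described above, and a scratch directory that allows execution.

```python
import errno
import os
import shutil
import subprocess
import tempfile


def try_open_for_write(path):
    """Return 0 if `path` can be opened for writing, else the errno."""
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as exc:
        return exc.errno
    os.close(fd)
    return 0


def etxtbsy_window(workdir):
    """Probe a throwaway executable while it runs and after it exits."""
    # Keep the name "sleep" so that multi-call binaries still behave.
    target = os.path.join(workdir, "sleep")
    shutil.copy(shutil.which("sleep"), target)  # harmless stand-in binary
    os.chmod(target, 0o700)

    child = subprocess.Popen([target, "30"])
    try:
        # Opening for read is allowed even while the file is executing.
        read_fd = os.open(target, os.O_RDONLY)
        via_proc = f"/proc/self/fd/{read_fd}"
        while_running = try_open_for_write(via_proc)
    finally:
        child.kill()
        child.wait()

    # Nothing executes the file any more, so the same descriptor can be
    # re-opened for writing through /proc.
    after_exit = try_open_for_write(via_proc)
    os.close(read_fd)
    return while_running, after_exit


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        busy, idle = etxtbsy_window(scratch)
    print("while running:", errno.errorcode.get(busy, busy))
    print("after exit:", errno.errorcode.get(idle, idle))
```

On such a system, the first attempt should be refused with ETXTBSY and the second should succeed, which is the same transition the attack relies on once runc is no longer executing.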

The fix, as implemented by runc and LXC, is to use an anonymous file created by memfd_create(), copy the contents of the runc binary into it, and then seal the file to protect it from being changed. For kernels prior to the introduction of memfd_create() in the 3.17 kernel that was released back in October 2014, there is a fallback fix that uses a file created with O_TMPFILE. The use of memfd_create() led Florian Weimer to wonder about the fix, however:

Is it really necessary to use a memfd_create here? Do you really need sealing? It's a bit odd to add a new system call dependency in a security update. The ability fexecve a memfd descriptor is also rather odd. I wouldn't have expected execute permissions on memfd descriptors, so this sounds like a kernel bug (which now can't be fixed).

But Sarai does not see a problem with the approach; sealing a file in memory ensures that the right binary is executed in all cases. In response to Steve Grubb's concern that the fix was "more of a workaround than a root cause fix", Sarai said:

As for it not being a root cause fix, I disagree (it protects against a variety of concerning attacks that aren't related to this CVE). Obviously if everyone used correctly-configured user namespaces then this wouldn't be a problem -- but here we are.

He did note that his patch set providing a number of different VFS path-resolution restrictions "would help fix this and could help fix a wide variety of other container runtime issues that have been bothering me for a couple of years". Those patches are based on earlier work, including the AT_NO_JUMPS patches from 2017; that functionality has been proposed in various forms over the years but has never made it into the mainline—at least so far.

There are a couple of obvious exploit scenarios. The easiest is to have someone use a malicious container image, a practice that is not unknown in the container world. In fact, lots of container users grab images (or other artifacts) from the internet without vetting their contents. A compromise of a running privileged container (which is most of them) could also lead down this path. The attacker could set up the exploit on likely suspects (e.g. /bin/bash) and wait for the container owner to use "docker exec", or do something to the container to draw attention to it. Once the host has been compromised by one of its running containers, all of the other running containers, and the operating system itself, are compromised.

All in all, it is a nasty little hole. Updating runc, LXC, and others affected would seem mandatory; one hopes it has already long been done. Finding better ways to further isolate containers going forward is needed, but the main tool, user namespaces, has been around for quite a while now. As Brauner noted in his blog post, privileged containers are simply never going to be truly secure.





A container-confinement breakout

Posted Mar 6, 2019 17:45 UTC (Wed) by gebi (guest, #59940) [Link] (1 responses)

wouldn't just setting the runc binary immutable (chattr +i) be enough to fix the vulnerability?

A container-confinement breakout

Posted Mar 6, 2019 19:28 UTC (Wed) by cyphar (subscriber, #110703) [Link]

Yes, unless someone starts a container with CAP_LINUX_IMMUTABLE -- which was previously a "safe" thing to do and wouldn't be any more. One fairly easy way of emulating CAP_LINUX_IMMUTABLE would be to create your own sealed memfd and then bind-mount it over /usr/bin/runc -- the mitigation won't make an additional copy under those circumstances and there isn't a capability available to un-seal a memfd.

I'm hoping that the updated mitigation patch I've been working on[1] will help work around the concerns about memfd_create(2).

[1]: https://github.com/opencontainers/runc/pull/1984

A container-confinement breakout

Posted Mar 6, 2019 19:14 UTC (Wed) by cyphar (subscriber, #110703) [Link] (1 responses)

There has been an update on this from the runc side of things. Quite a few folks have complained about the new memory overhead in memfd_create(2) -- since it's done after cgroup attachment it gets accounted as container usage (since we are a Go binary this ends up being 10MB which is fairly significant). There is also the issue of kernel support -- aside from old kernels we also found out that RHEL's kernel appears to have issues with their memfd_create(2) backport.

All of this has resulted in a PR I've posted[1] which solves the problem using a temporary (MNT_DETACH) read-only bind-mount of /proc/self/exe in an attempt to avoid the memory overhead -- and in principle because we create the read-only bind-mount ourselves and detached it's very unlikely a host process will accidentally make it read-write during the "runc init" window. If that fails, we then try memfd_create(2) followed by O_TMPFILE (or failing that, mkostemp(3)) inside the runc state directory -- which we need to have read-write access to in normal operation.

I've also re-sent v5 of the AT_THIS_ROOT patchset (which includes the ability to scope #! resolution) to the ML today.

[1]: https://github.com/opencontainers/runc/pull/1984

A container-confinement breakout

Posted Mar 6, 2019 19:25 UTC (Wed) by cyphar (subscriber, #110703) [Link]

> and in principle because we create the read-only bind-mount ourselves and detached it's very unlikely a host process will accidentally make it read-write during the "runc init" window.

Scratch this, I just remembered as I was writing this comment that you could just remount it as read-write (not sure why I didn't consider that before). In that case, I will go with the overlayfs solution (where we create an overlayfs and open the upper-layer version of the binary). It will be quite ugly because you can't overlayfs on top of /proc/self -- and bind-mounts won't help either because overlayfs doesn't see them.

I have had a looming feeling for the past month or so it would've just been easier to go the LXC route and tell everyone that running containers as root is a really bad idea, and that they shouldn't expect reasonable security from such a configuration.

A container-confinement breakout

Posted Mar 7, 2019 4:31 UTC (Thu) by patrakov (subscriber, #97174) [Link] (2 responses)

With LXC, there is a bigger elephant in this fragile room. The official Ubuntu 18.10 LXC template is so evil that it escapes the memory limits by default, without any user action. Bug filed: https://github.com/lxc/lxc/issues/2845.

Also, for privileged LXC, two much simpler breakout techniques are available that work on Ubuntu: https://seclists.org/oss-sec/2019/q1/125. The upstream position is that privileged containers cannot be made secure, so "not a security bug". I have no idea if they also work with other container runtimes, would appreciate if someone tests the ideas on Ubuntu and other distributions.

A container-confinement breakout

Posted Mar 7, 2019 4:36 UTC (Thu) by patrakov (subscriber, #97174) [Link] (1 responses)

Also, other LXC users have claimed an automatic escape from the devices cgroup as a result of installing legitimate software inside the container: https://github.com/lxc/lxc/issues/2762.

A container-confinement breakout

Posted Mar 7, 2019 11:01 UTC (Thu) by brauner (subscriber, #109349) [Link]

For *privileged* containers we only drop mac_admin mac_override sys_time sys_module sys_rawio by default and not CAP_SYS_ADMIN.
We run full systems containers and as such dropping CAP_SYS_ADMIN even in *privileged* containers by default is unwanted since the init system will not be able to perform any mounts it wants.
So you would need to know in advance what mounts to perform. Fwiw, we do support this but it doesn't make a lot of sense for us.
Our security page also refers to such breakouts:

"We are aware of a number of exploits which will let you escape such containers and get full root privileges on the host. Some of those exploits can be trivially blocked and so we do update our different policies once made aware of them. Some others aren't blockable as they would require blocking so many core features that the average container would become completely unusable."

A container-confinement breakout

Posted Mar 10, 2019 9:38 UTC (Sun) by smadu2 (guest, #54943) [Link]

Is rkt vulnerable ?


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds