A container-confinement breakout
The recently announced container-confinement breakout for containers started with runc is interesting from a few different perspectives. For one, it affects more than just runc-based containers: privileged LXC-based containers (and likely others) are also affected, though the LXC-based variety is harder to compromise than the runc ones. It also shows, once again, that privileged containers are difficult, perhaps impossible, to create in a secure manner. Beyond that, it exploits some Linux kernel interfaces in novel ways, and the fixes use a perhaps lesser-known system call that was added to Linux less than five years back.
The runc tool implements the container runtime specification of the Open Container Initiative (OCI), so it is used by a number of different containerization solutions and orchestration systems, including Docker, Podman, Kubernetes, CRI-O, and containerd. The flaw, which uses the /proc/self/exe pseudo-file to gain control of the host operating system (thus anything else, including other containers, running on the host), has been assigned CVE-2019-5736. It is a massive hole for containers that run with access to the host root user ID (i.e. UID 0), which, sadly, covers most of the containers being run today.
There are a number of sources of information on the flaw, starting with the announcement from runc maintainer Aleksa Sarai linked above. The discoverers, Adam Iwaniuk and Borys Popławski, put out a blog post about how they found the hole, including some false steps along the way. In addition, one of the LXC maintainers who worked with Sarai on the runc fix, Christian Brauner, described the problems with privileged containers and how CVE-2019-5736 applies to LXC containers. There is a proof of concept (PoC) attached to Sarai's announcement, along with another more detailed PoC he posted the following day after the discoverers' blog post.
As with many exploits, this one takes a rather convoluted path, but it can reliably compromise the host. It does this by either getting a user to create a container using a compromised container image or by having a user attach to a container (e.g. using "docker exec") that an attacker has had write access to. That begins the complicated dance.
The attacker sets up the entry point to the container (which is invoked when the container is created or accessed) to be a symbolic link to /proc/self/exe. That link will resolve to the runc binary in the host filesystem when it is run from, say, a /bin/bash that contains:
#!/proc/self/exe
The effect will be to re-execute the runc command inside the container but using the binary from the host.
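The behavior of /proc/self/exe that the exploit relies on can be observed from any process. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function name running_binary() is invented for illustration. It only reads the link and shows that it names the executable of whichever process looks at it, which is why the same path means "runc on the host" when runc is the one following it.

```python
import os


def running_binary():
    """Return the path that /proc/self/exe points to for the calling process."""
    # /proc/self/exe is a kernel-provided link to the executable image of the
    # process that resolves it. Run from this script, that is the Python
    # interpreter; resolved by runc, it would be the runc binary.
    return os.readlink("/proc/self/exe")


if __name__ == "__main__":
    print(running_binary())
```

Because the link is resolved by the kernel rather than by a path lookup inside the container's filesystem, it reaches the host's binary even though that file is not visible anywhere in the container's mount tree.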
The attacker also needs to subvert one of the shared libraries used by runc inside the container; when runc gets re-executed inside the container, it will use the shared libraries from within it. In Sarai's PoC, he chose libseccomp, but other libraries used by runc could be chosen instead. The shared library is changed to add a new global constructor function that will be called after the library is loaded. This function will open /proc/self/exe for read (as it would get an ETXTBSY error if it tried to open it for write). That open() results in a file descriptor that can be accessed via /proc (e.g. /proc/self/fd/3).
Containers based on runc are easier to exploit than LXC-based ones because runc exits after it does its job, while lxc-attach waits around for its command to finish. That means the runc binary is not busy (in the ETXTBSY sense) for most of the time that the container runs. While this non-busy window is nearly always open for runc, it is much smaller for LXC: only after the container command completes and lxc-attach exits is there a small window that can be exploited for privileged LXC containers.
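The "busy" window can be illustrated without touching any real system binary. This is a rough sketch, assuming a Linux system with a sleep program on the PATH and a temporary directory that allows execution; the helper names are made up for this example. It copies sleep to a scratch file, runs the copy, and tries to open the file for writing both while it is running and after it has exited.

```python
import errno
import os
import shutil
import subprocess
import tempfile


def write_open_errno(path):
    """Try to open path for writing; return None on success, else the errno."""
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as exc:
        return exc.errno
    os.close(fd)
    return None


def busy_window_demo():
    """Return (errno while the binary is running, errno after it has exited)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Work on a private copy of an ordinary program so nothing real is modified.
        victim = os.path.join(tmp, "victim")
        shutil.copy(shutil.which("sleep"), victim)
        # argv[0] stays "sleep" so multi-call binaries still pick the right applet.
        proc = subprocess.Popen(["sleep", "30"], executable=victim)
        try:
            while_running = write_open_errno(victim)
        finally:
            proc.kill()
            proc.wait()
        after_exit = write_open_errno(victim)
    return while_running, after_exit


if __name__ == "__main__":
    busy, idle = busy_window_demo()
    print("while running:", errno.errorcode.get(busy, busy))
    print("after exit:   ", errno.errorcode.get(idle, idle))
```

On a typical kernel the first attempt is refused with ETXTBSY and the second succeeds, which mirrors the difference between a long-lived lxc-attach and a runc that has already exited.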
Once it has opened the file, the constructor then executes another binary (using execve()), which terminates the execution of runc. That allows this new binary to open /proc/self/fd/3 for write, since ETXTBSY is no longer returned after the execve(). Now the attacker has a file descriptor that can be used to write anything they want to the runc binary on the host. Game over.
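The step of turning a read-only descriptor into a writable one goes through /proc/self/fd. The sketch below shows that mechanism on a harmless temporary file rather than on an executable; it assumes Linux with /proc mounted, and the function names are hypothetical. It is not the published PoC, only an illustration of why holding a read-only descriptor is enough once the file is no longer busy.

```python
import os
import tempfile


def reopen_for_write(read_fd, payload):
    """Reopen an existing descriptor for writing via /proc/self/fd and overwrite it."""
    # Opening /proc/self/fd/N performs a fresh open() of the underlying file,
    # so the new descriptor gets its own access mode after a normal
    # permission check; the mode of the original descriptor does not matter.
    write_fd = os.open(f"/proc/self/fd/{read_fd}", os.O_WRONLY | os.O_TRUNC)
    try:
        os.write(write_fd, payload)
    finally:
        os.close(write_fd)


def demo():
    """Overwrite a scratch file through a descriptor that was opened read-only."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"original contents\n")
        path = tmp.name
    read_fd = os.open(path, os.O_RDONLY)
    try:
        reopen_for_write(read_fd, b"overwritten\n")
        return os.pread(read_fd, 64, 0)
    finally:
        os.close(read_fd)
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

In the attack, the permission check on the reopen passes because the process inside a privileged container is running as host UID 0.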
The fix, as implemented by runc and LXC, is to use an anonymous file created by memfd_create(), copy the contents of the runc binary into it, and then seal the file to protect it from being changed. For kernels prior to the introduction of memfd_create() in the 3.17 kernel that was released back in October 2014, there is a fallback fix that uses a file created with O_TMPFILE. The use of memfd_create() led Florian Weimer to wonder about the fix, however:
But Sarai does not see a problem with the approach; sealing a file in memory ensures that the right binary is executed in all cases. In response to Steve Grubb's concern that the fix was "more of a workaround than a root cause fix", Sarai said:
He did note that his patch set providing a number of different VFS path-resolution restrictions "would help fix this and could help fix a wide variety of other container runtime issues that have been bothering me for a couple of years". Those patches are based on earlier work, including the AT_NO_JUMPS patches from 2017; that functionality has been proposed in various forms over the years but has never made it into the mainline, at least so far.
There are a couple of obvious exploit scenarios. The easiest is to have someone use a malicious container image, a practice that is not unknown in the container world. In fact, lots of container users grab images (or other artifacts) from the internet without vetting their contents. A compromise of a running privileged container (which is most of them) could also lead down this path. The attacker could set up the exploit on likely suspects (e.g. /bin/bash) and wait for the container owner to use docker exec, or do something to the container to draw attention to it. Once the host has been compromised by one of its running containers, all of the other running containers, and the operating system itself, are compromised.
All in all, it is a nasty little hole. Updating runc, LXC, and others affected would seem mandatory; one hopes it has already long been done. Finding better ways to further isolate containers going forward is needed, but the main tool, user namespaces, has been around for quite a while now. As Brauner noted in his blog post, privileged containers are simply never going to be truly secure.
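Whether a container is "privileged" in the sense used here comes down to whether root inside it is also root on the host, which the kernel exposes in /proc/self/uid_map. The following is a simplified sketch with a hypothetical helper name; it assumes the documented three-column format (ID inside the namespace, ID outside, range length) and does not try to resolve nested user namespaces, where the "outside" ID is relative to the parent namespace rather than the host.

```python
def root_maps_to_host_root(uid_map_text):
    """Return True if UID 0 in this namespace maps to UID 0 outside it."""
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, count = (int(field) for field in fields)
        # Each line maps the range starting at `inside` onto the range
        # starting at `outside`; only the range containing UID 0 matters here.
        if inside == 0 and count > 0:
            return outside == 0
    # No mapping for UID 0 at all: root inside is not root outside.
    return False


if __name__ == "__main__":
    with open("/proc/self/uid_map") as uid_map:
        print(root_maps_to_host_root(uid_map.read()))
```

A map such as "0 0 4294967295" is what a process sees without a separate user namespace, while something like "0 100000 65536" indicates that container root is an unprivileged UID outside the namespace.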
