Another runc container breakout

By Joe Brockmeier
February 12, 2024

Once again, runc—a tool for spawning and running OCI containers—is drawing attention due to a high severity container breakout attack. This vulnerability is interesting for several reasons: its potential for widespread impact, the continued difficulty in actually containing containers, the dangers of running containers as a privileged user, and the fact that this vulnerability is made possible in part by a response to a previous container breakout flaw in runc.

The runc utility is the most popular (but not only) implementation of the Open Container Initiative (OCI) container runtime specification and can be used to handle the low-level operations of running containers for container management and orchestration tools like Docker, Podman, Rancher, containerd, Kubernetes, and many others. Runc also has had its share of vulnerabilities, the most recent of which is CVE-2024-21626 — a flaw that could allow the attacker to completely escape the container and take control of the host system. The fix is shipped with runc 1.1.12, with versions 1.0.0-rc93 through 1.1.11 affected by the vulnerability. Distributors are backporting the fix and shipping updates as necessary.

Anatomy

The initial vulnerability was reported by Rory McNamara to Docker, with further research done by runc maintainers Li Fu Bang and Aleksa Sarai, who discovered other potential attacks. Since version 1.0.0-rc93, runc has left open ("leaked") a file descriptor to /sys/fs/cgroup in the host namespace that then shows up in /proc/self/fd within the container namespace. The reference can be kept alive, even after the host namespace file descriptors would usually be closed, by setting the container's working directory to the leaked file descriptor. This allows the container to have a working directory outside of the container itself, and gives full access to the host filesystem. The runc security advisory for the CVE describes several types of attacks that could take advantage of the vulnerability, all of which could allow an attacker to gain full control of the host system.

Mohit Gupta has an analysis of the flaw and reported it took about four attempts to successfully start a container with its working directory set to "/proc/self/fd/7/" in a scenario using "docker run" with a malicious Dockerfile. Because runc did not verify that the working directory is inside the container, processes inside the container can then access the host filesystem. For example, if an attacker is successful in guessing the right file descriptor and sets the working directory to "/proc/self/fd/7" they can run a command like "cat ../../../../etc/passwd" and get a list of all users on the system—more damaging options are also possible.

Sarai noted in the advisory that the exploit can be considered critical if using runtimes like Docker or Kubernetes "as it can be done remotely by anyone with the rights to start a container image (and can be exploited from within Dockerfiles using ONBUILD in the case of Docker)". On the Snyk blog, McNamara demonstrated how to exploit this using "docker build" or "docker run", and Gupta's analysis demonstrated how one might deploy an image into a Kubernetes cluster or use a malicious Dockerfile to compromise a CI/CD system via "docker build" or similar tools that depend on runc. What's worse, as McNamara wrote, is that "access privileges will be the same as that of the in-use containerization solution, such as Docker Engine or Kubernetes. Generally, this is the root user, and it is therefore, possible to escalate from disk access to achieve full host root command execution."

This isn't the first time runc has had this type of vulnerability, it also suffered a container breakout flaw (CVE-2019-5736) in February 2019. That issue led Sarai to work on openat2(), which was designed to place restrictions on the way that programs open a path that might be under the control of an attacker. It's somewhat amusing then, for some values of "amusing" anyway, that Sarai mentioned that "the switch to openat2(2) in runc was the cause of one of the fd leaks that made this issue exploitable in runc."

While this vulnerability has been addressed, Sarai also wrote that there's more work to be done to stop tactics related to keeping a handle to a magic link (such as /proc/self/exe) on the host open and then writing to it after the container has started. He noted that he has reworked the design in runc, but is still "playing around with prototypes at the moment".

Other OCI runtimes

While the vulnerability is confirmed for runc, the advisory contains a warning for other runtimes as well. Sarai writes, "several other container runtimes are either potentially vulnerable to similar attacks, or do not have sufficient protection against attacks of this nature." Sarai specifically talks about crun (a container runtime implemented in C, used by Podman by default), youki (a container runtime implemented in Rust), and LXC.

According to Sarai, none of the other runtimes leak useful file descriptors to their "runc init" counterparts—but he finds crun and youki wanting in terms of ensuring that all non-stdio files are closed and that the working directory for a container is inside the container. LXC also does not ensure that a working directory is inside the container. He suggests that maintainers of other runtimes peruse the patches for runc's leaked file descriptors. While these implementations are not exploitable ("as far as we can tell"), they have issues that could be addressed to prevent similar attacks in the future.

The good news is that, despite many container vulnerabilities over the years, so far we haven't seen reports of a widespread exploit of these flaws. Whether the credit belongs to the vigilance of container-runtime maintainers and security reporters, disciplined use of trusted container images, use of technologies like AppArmor and SELinux, the difficulty in using these flaws to conduct real-world attacks, or a combination of those is unclear. This shouldn't be taken to to suggest that those threats are not real, of course.

Index entries for this article
Security	Containers

Another runc container breakout

Posted Feb 12, 2024 16:57 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (17 responses)

> The good news is that, despite many container vulnerabilities over the years, so far we haven't seen reports of a widespread exploit of these flaws.

I consulted at several companies, and none of them use containers as a means of privilege separation. It's commonly assumed that containers in Linux are not particularly secure.

Another runc container breakout

Posted Feb 12, 2024 17:35 UTC (Mon) by ms (subscriber, #41272) [Link] (11 responses)

For someone who's mildly interested and extremely ignorant of this area: what are the more secure solutions? Do you have to go full-hog virtualisation with qemu?

Another runc container breakout

Posted Feb 12, 2024 17:50 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

If you want true privilege separation where violations can cost you a lot of money (banking, HIPAA, etc.) then you go with dedicated hosts. So each workload gets its own physical host, possibly running multiple VMs/containers but at the same privilege level.

If you're slightly less paranoid, then VMs that don't share resources (including CPUs in the same socket) are fine. One more step, and you can use VMs on oversubscribed hardware or VMs with dynamic resources. All these VMs can also run multiple containers, typically for the same application.

I don't think containers are good enough for _any_ privilege separation.

Another runc container breakout

Posted Feb 12, 2024 18:49 UTC (Mon) by raven667 (subscriber, #5198) [Link] (3 responses)

> each workload gets its own physical host, possibly running multiple VMs/containers but at the same privilege level

I would add that while this may sound expensive, if you have more than a very small workload, partitioning the VM clusters by security level doesn't really cost much more in hardware, it's the same amount of resources you would have needed to provision anyway, as long as you don't get crazy about defining fine-grained boundaries at the VM level and constrain yourself to broad security risk zones based on the sensitivity of data being processed or the auditability/control of the devices connecting to it.

Another runc container breakout

Posted Feb 12, 2024 20:30 UTC (Mon) by rgmoore (✭ supporter ✭, #75) [Link] (2 responses)

A lot of this stuff sounds expensive only because we tend to underestimate the cost of a security breach. If you properly account for that cost, making the system more secure by design starts to look more sensible. Until you get to the extreme, security is generally cheaper than insecurity. It's just the cost of insecurity is sporadic, so bad managers pretend it doesn't exist.

Another runc container breakout

Posted Feb 13, 2024 4:51 UTC (Tue) by marcH (subscriber, #57642) [Link] (1 responses)

> It's just the cost of insecurity is sporadic, so bad managers pretend it doesn't exist.

Same as planes crashing or plane doors blowing off and quality in general. It's very rare so surely you can cut corners here and there and save a fair amount of money.

I don't agree that "a security bug is just another bug" in general but from that pure cost and management perspective a security bug is indeed just a quality defect like any other.

That's why managers are paid so much: because _in theory_ they're able to find the best quality trade-offs - and it's hard. In practice however...

Another runc container breakout

Posted Feb 13, 2024 11:31 UTC (Tue) by Wol (subscriber, #4433) [Link]

> That's why managers are paid so much: because _in theory_ they're able to find the best quality trade-offs - and it's hard. In practice however...

It also assumes the necessary data exists. My employer collects all the data imaginable, but as far as I can tell, the data I need to do my job doesn't exist ... namely "what are the discrepancies between what we've planned, and the resources we've got on the ground".

HOPEFULLY I'm going to succeed in doing something about it, but one lone voice crying "help" while drowning in a data lake can't achieve much...

Cheers,
Wol

Another runc container breakout

Posted Feb 14, 2024 20:22 UTC (Wed) by ringerc (subscriber, #3071) [Link]

They're good for casual isolation of cooperating services. They help manage failure domains.

They also have benefits for deterring casual attackers. "To evade the bear you don't have to outrun the bear, just run faster than the next guy".

When combined with strict seccomp rules, dropping all capabilities, nonzero uid and a restrictive apparmor config they can be quite strong for suitable workloads. But in reality it's rarely feasible to lock a workload down enough that an attacker couldn't potentially exploit their way out; most things need general purpose file I/O and/or network I/O after all, and that gives it the ability to touch system interfaces that could be exploitable. So they're not fit for purpose for isolating general purpose untrusted workloads where there are strong incentives to escape the container and high impacts if an attacker succeeds.

I have significantly more confidence in lightweight VM/hypervisor models like Firecracker. My work is moving in that direction for some of our shared tenancy stuff.

But I agree you can't beat physical node isolation for workloads where the bill for a breach is in the many many millions.

Another runc container breakout

Posted Feb 12, 2024 18:17 UTC (Mon) by atnot (subscriber, #124910) [Link]

In general, you still use containers, but use some layer in between to provide each customer with a separate kernel. There's a large variety of ways to do this. On the simpler end, you can just statically provision normal dedicated hardware or VMs per customer that you operate with a shared control plane. This is kind of wasteful because you need to run a whole intermediate OS and size it for worst case memory use. You can be a bit smarter than this though, so there's a number of microvm based solution like Kata Containers that run each container as a lightweight KVM-based virtual machine instead, with impressively small overhead. But if even that is too much for you, you can also use something like gVisor, which implements the kernel interface entirely in userspace. Kind of like wine, but for running Linux applications on Linux.

There's a huge variety of solutions here, but what they have in common is that nobody trusts multiple customers with one Kernel.

Another runc container breakout

Posted Feb 12, 2024 19:47 UTC (Mon) by geofft (subscriber, #59789) [Link] (1 responses)

There's a couple of options. At my day job we've been running a traditional UNIX environment for about 20 years for our HPC use case, and so we have relied on the normal UNIX privilege separation mechanism, namely UIDs. We mostly kept this when we started moving to containers: user-provided code only runs as their own user even inside the container. Users have historically had non-root shell access to our systems via plain SSH, and so as long as they remain non-root, a containerization breakout like this doesn't impact our security model. (Of course, keeping a Linux environment secure against local privilege escalation is challenging on its own, and if our untrusted users were anonymous people on the internet instead of employees who have gone through background checks, we would likely be more paranoid.)

We benefited from the fact that our workloads were already designed for non-containerized environments (and so we already had non-root ways of installing software etc.) and we could just containerize that model. If you're running container-native workloads that expect to have root, or even multi-tenant workloads that all use the same non-root UID (e.g. everyone runs as UID 1000), user namespaces are a great option: the code in the container feels like it's running as whatever UID, and if it's root it can do things like mount a tmpfs or reconfigure its own networking. But from outside the container, these are separate, unprivileged UIDs. So the attack in the article about reading /etc/passwd is still possible, but a container running as "root" cannot write to it. Kubernetes has user namespace support in alpha.

And then yes, you can use virtualization (of any level of hog). If you're on a public cloud platform and you're on VMs anyway, setting up your VMs to be per workload is pretty easy. There's also stuff like Kata Containers or gVisor (commercially offered by Google as GKE Sandbox) for having a "container" runtime that actually uses hardware virtualization.

Another runc container breakout

Posted Feb 17, 2024 15:12 UTC (Sat) by raven667 (subscriber, #5198) [Link]

> Of course, keeping a Linux environment secure against local privilege escalation is challenging on its own, and if our untrusted users were anonymous people on the internet instead of employees who have gone through background checks, we would likely be more paranoid

This is something I bang on about all the time, one of the corollarys if risk is lack of accountability, in places where you have accountability you have a less likelihood for issues, and the driver of security process should be more driven by impact. Making a reasonable and effective risk calculation for where to spend your limited time instituting security is hard to do, it's mostly vibes and tribal embarrassment which drives people to apply all the "best practices" they've heard of without regard to which ones actually help in their situation and which are wastes of time.

Another runc container breakout

Posted Feb 15, 2024 9:14 UTC (Thu) by cortana (subscriber, #24596) [Link] (1 responses)

I think about it like this: containers are a technology for distributing software artefacts, packaged up and ready to run on a host. Think shipping containers. They aren't designed to confine an executing workload within its container's namespaces; you still need other technologies, which can be used in conjunction with containers, for that.

So, for instance, looking at a platform like Red Hat OpenShift, which runs on top of RHEL CoreOS, a typical container:

* Must run with a UID, GID and supplemental groups from the range assigned to its project
* Must run within a restrictive domain (container_t) and a unique MCS category (e.g., c5,c67)
* Must run with the runtime/default seccomp profile applied

So for one workload to read the files of another, it would have to:

* Use an exploit to escape out of its namespaces
* Use an exploit to override DAC (i.e., read a file owned by another user; made more difficult by the seccomp profile which prevents the use of less common system calls)
* Use an exploit to override the SELinux policy which prevents, for instance, a process running as container_t:c5,c67 from accessing objects labelled with container_file_t:c78,c200

It's interesting you mention qemu, because the idea of using a restrictive label & MCS category to protect container workloads from one another came from the very same setup that RHEL uses to protect VMs from one other: The container_t label is (or was at some point) an alias to svirt_lxc_net_t, the label that libvirt applies to all the qemu processes that it launches (again, with each one being given a unique pair of MCS categories so that VMs can't attack each other).

Another runc container breakout

Posted Feb 17, 2024 15:17 UTC (Sat) by raven667 (subscriber, #5198) [Link]

Thanks for the summary of how this works, those looks like all good things to do, however there are an awful lot of local kernel exploits in Linux that can make any permission system moot, so that's where having different hardware for different security/risk workloads can help, remote exploit is much harder todo/easier to defend/detect.

Another runc container breakout

Posted Feb 12, 2024 19:45 UTC (Mon) by carlosrodfern (subscriber, #166486) [Link] (2 responses)

I think the main concern is the other large amount of companies that could be affected by this that implement internet-based services with a system of micro-services, which are backed with containers on orchestration platforms like kubernetes (the only sane way to do microservices). The attack would look like breaking into a container with some remote code execution vulnerability, and then taking advantages of runc vulnerabilities like this one to expand the attack.

Another runc container breakout

Posted Feb 13, 2024 11:08 UTC (Tue) by anguslees (subscriber, #7131) [Link] (1 responses)

If I understand correctly, that's not how this attack works, however. As I understand it, this attack requires an attacker to already have the ability to start new malicious containers (or container config).

Getting control over an existing container, can not be escalated to a host exploit through this issue.

Another runc container breakout

Posted Feb 13, 2024 16:15 UTC (Tue) by carlosrodfern (subscriber, #166486) [Link]

Yes, you are right. You'll need some extra privileges in k8s to start a malicious container in this case. The only way I can think of exploiting it thru the front door would be you break into a container, use some vuln to escape into another one that does have such privileges on the k8s cluster, and then use this vuln to get into the host fs. Complex attack to get there, indeed. This is where techs like SELinux and specifically MCS for containers, would really help, as an additional layer of security. Unfortunately, many Kubernetes clusters have their node OSs with SELinux in permissive mode, or enforcing but without leveraging MCS.

Another runc container breakout

Posted Feb 17, 2024 21:14 UTC (Sat) by bsdnet (guest, #169714) [Link] (1 responses)

In my understanding, container is more for resource efficiency rather than security.

Another runc container breakout

Posted Feb 18, 2024 15:08 UTC (Sun) by raven667 (subscriber, #5198) [Link]

I would say (as an exercise to think about this succinctly and write it out) that containers are more efficient than VMs because one kernel can optimize hardware access/sharing better than nested opaque kernels can, and both VMs/containers exist to solve a software packaging/dependency problem on how to get multiple applications to share the same hardware, for cost/power efficiency, without conflicting with one another due to shared library dependency conflicts. VMs are sometimes used for sharing multiple incompatible OS on the hardware as well, but a huge number of Linux VM servers are running Linux guest VMs, that the kernel is perfectly comparable with, hence using containers instead.

Another runc container breakout

Posted Feb 12, 2024 18:43 UTC (Mon) by champtar (subscriber, #128673) [Link]

crun has already released a version with the recommended hardening https://github.com/containers/crun/releases/tag/1.14.1