Another runc container breakout
Once again, runc—a tool for spawning and running OCI containers—is drawing attention due to a high-severity container-breakout vulnerability. This vulnerability is interesting for several reasons: its potential for widespread impact, the continued difficulty in actually containing containers, the dangers of running containers as a privileged user, and the fact that it is made possible in part by a response to a previous container breakout flaw in runc.
The runc utility is the most popular (but not only) implementation of the Open Container Initiative (OCI) container runtime specification and can be used to handle the low-level operations of running containers for container management and orchestration tools like Docker, Podman, Rancher, containerd, Kubernetes, and many others. Runc has also had its share of vulnerabilities, the most recent of which is CVE-2024-21626 — a flaw that could allow an attacker to completely escape the container and take control of the host system. The fix is shipped with runc 1.1.12, with versions 1.0.0-rc93 through 1.1.11 affected by the vulnerability. Distributors are backporting the fix and shipping updates as necessary.
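For administrators wondering whether a given host is affected, the runc binary reports its own version; the commands below are only a rough sketch (output formats vary between engines and distribution builds, and distributions may backport the fix to older version strings):

    # Report the version of the runc binary on the host's PATH; upstream 1.1.12
    # or later contains the fix.
    runc --version

    # Docker bundles its own copy of runc; its version appears in "docker info".
    docker info | grep -i runc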
Anatomy
The initial vulnerability was reported by Rory McNamara to Docker, with further research done by runc maintainers Li Fu Bang and Aleksa Sarai, who discovered other potential attacks. Since version 1.0.0-rc93, runc has left open ("leaked") a file descriptor to /sys/fs/cgroup in the host namespace that then shows up in /proc/self/fd within the container namespace. The reference can be kept alive, even after the host namespace file descriptors would usually be closed, by setting the container's working directory to the leaked file descriptor. This allows the container to have a working directory outside of the container itself, and gives full access to the host filesystem. The runc security advisory for the CVE describes several types of attacks that could take advantage of the vulnerability, all of which could allow an attacker to gain full control of the host system.
Mohit Gupta has an analysis of the flaw and reported that it took about four attempts to successfully start a container with its working directory set to "/proc/self/fd/7/" in a scenario using "docker run" with a malicious Dockerfile. Because runc did not verify that the working directory is inside the container, processes inside the container could then access the host filesystem. For example, if an attacker is successful in guessing the right file descriptor and sets the working directory to "/proc/self/fd/7", they can run a command like "cat ../../../../etc/passwd" and get a list of all users on the system—more damaging options are also possible.
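For illustration, here is a sketch of the shape of that attack, not a working exploit: the descriptor number is a guess that may take several attempts, the image name is a placeholder, and on a patched runc the container should simply fail to start because the working directory is now verified to be inside the container.

    # With a vulnerable runc, -w/--workdir pointed at a leaked host descriptor
    # lets the contained process resolve paths relative to the host filesystem.
    docker run --rm -w /proc/self/fd/7 some-image cat ../../../../etc/passwd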
Sarai noted in the advisory that the exploit can be considered critical when runc is driven by higher-level tools like Docker or Kubernetes, "as it can be done remotely by anyone with the rights to start a container image (and can be exploited from within Dockerfiles using ONBUILD in the case of Docker)". On the Snyk blog, McNamara demonstrated how to exploit this using "docker build" or "docker run", and Gupta's analysis showed how one might deploy an image into a Kubernetes cluster or use a malicious Dockerfile to compromise a CI/CD system via "docker build" or similar tools that depend on runc. What's worse, as McNamara wrote, is that "access privileges will be the same as that of the in-use containerization solution, such as Docker Engine or Kubernetes. Generally, this is the root user, and it is therefore, possible to escalate from disk access to achieve full host root command execution."
This isn't the first time runc has had this type of vulnerability; it also suffered a container breakout flaw (CVE-2019-5736) in February 2019. That issue led Sarai to work on openat2(), which was designed to place restrictions on the way that programs open a path that might be under the control of an attacker. It's somewhat amusing, then, for some values of "amusing" anyway, that Sarai mentioned that "the switch to openat2(2) in runc was the cause of one of the fd leaks that made this issue exploitable in runc."
While this vulnerability has been addressed, Sarai also wrote that there's more work to be done to stop tactics related to keeping a handle to a magic link (such as /proc/self/exe) on the host open and then writing to it after the container has started. He noted that he has reworked the design in runc, but is still "playing around with prototypes at the moment".
Other OCI runtimes
While the vulnerability is confirmed for runc, the advisory contains a warning for other runtimes as well. Sarai writes, "several other container runtimes are either potentially vulnerable to similar attacks, or do not have sufficient protection against attacks of this nature." He specifically discusses crun (a container runtime implemented in C, used by Podman by default), youki (a container runtime implemented in Rust), and LXC.
According to Sarai, none of the other runtimes leak useful file descriptors to their "runc init" counterparts—but he finds crun and youki wanting in terms of ensuring that all non-stdio files are closed and that the working directory for a container is inside the container. LXC also does not ensure that a working directory is inside the container. He suggests that maintainers of other runtimes peruse the patches for runc's leaked file descriptors. While these implementations are not exploitable ("as far as we can tell"), they have issues that could be addressed to prevent similar attacks in the future.
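Since the exposure differs between runtimes, it can be worth checking which OCI runtime a host's container engines actually invoke. The exact output format varies by version, so treat these commands as a starting point rather than a definitive check:

    # Docker/containerd report the configured runtimes and the default one.
    docker info | grep -iA2 runtime

    # Podman reports its OCI runtime (crun by default on many distributions).
    podman info | grep -iA3 ociRuntime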
The good news is that, despite many container vulnerabilities over the years, so far we haven't seen reports of widespread exploitation of these flaws. Whether the credit belongs to the vigilance of container-runtime maintainers and security reporters, disciplined use of trusted container images, use of technologies like AppArmor and SELinux, the difficulty in using these flaws to conduct real-world attacks, or a combination of those is unclear. This shouldn't be taken to suggest that those threats are not real, of course.
Index entries for this article
Security: Containers
Posted Feb 12, 2024 16:57 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (17 responses)
I consulted at several companies, and none of them use containers as a means of privilege separation. It's commonly assumed that containers in Linux are not particularly secure.
Posted Feb 12, 2024 17:35 UTC (Mon)
by ms (subscriber, #41272)
[Link] (11 responses)
Posted Feb 12, 2024 17:50 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
If you're slightly less paranoid, then VMs that don't share resources (including CPUs in the same socket) are fine. One more step, and you can use VMs on oversubscribed hardware or VMs with dynamic resources. All these VMs can also run multiple containers, typically for the same application.
I don't think containers are good enough for _any_ privilege separation.
Posted Feb 12, 2024 18:49 UTC (Mon)
by raven667 (subscriber, #5198)
[Link] (3 responses)
I would add that while this may sound expensive, if you have more than a very small workload, partitioning the VM clusters by security level doesn't really cost much more in hardware; it's the same amount of resources you would have needed to provision anyway, as long as you don't get crazy about defining fine-grained boundaries at the VM level and constrain yourself to broad security risk zones based on the sensitivity of the data being processed or the auditability/control of the devices connecting to it.
Posted Feb 12, 2024 20:30 UTC (Mon)
by rgmoore (✭ supporter ✭, #75)
[Link] (2 responses)
A lot of this stuff sounds expensive only because we tend to underestimate the cost of a security breach. If you properly account for that cost, making the system more secure by design starts to look more sensible. Until you get to the extreme, security is generally cheaper than insecurity; it's just that the cost of insecurity is sporadic, so bad managers pretend it doesn't exist.
Posted Feb 13, 2024 4:51 UTC (Tue)
by marcH (subscriber, #57642)
[Link] (1 responses)
Same as planes crashing or plane doors blowing off and quality in general. It's very rare so surely you can cut corners here and there and save a fair amount of money.
I don't agree that "a security bug is just another bug" in general but from that pure cost and management perspective a security bug is indeed just a quality defect like any other.
That's why managers are paid so much: because _in theory_ they're able to find the best quality trade-offs - and it's hard. In practice however...
Posted Feb 13, 2024 11:31 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
It also assumes the necessary data exists. My employer collects all the data imaginable, but as far as I can tell, the data I need to do my job doesn't exist ... namely "what are the discrepancies between what we've planned, and the resources we've got on the ground".
HOPEFULLY I'm going to succeed in doing something about it, but one lone voice crying "help" while drowning in a data lake can't achieve much...
Cheers,
Wol
Posted Feb 14, 2024 20:22 UTC (Wed)
by ringerc (subscriber, #3071)
[Link]
They also have benefits for deterring casual attackers. "To evade the bear you don't have to outrun the bear, just run faster than the next guy".
When combined with strict seccomp rules, dropping all capabilities, nonzero uid and a restrictive apparmor config they can be quite strong for suitable workloads. But in reality it's rarely feasible to lock a workload down enough that an attacker couldn't potentially exploit their way out; most things need general purpose file I/O and/or network I/O after all, and that gives it the ability to touch system interfaces that could be exploitable. So they're not fit for purpose for isolating general purpose untrusted workloads where there are strong incentives to escape the container and high impacts if an attacker succeeds.
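As a concrete, hedged example of the kind of lockdown being described, a docker run invocation might combine these restrictions; the image name, UID, and seccomp profile path are placeholders, and equivalent knobs exist in Kubernetes pod security contexts:

    # Run as a non-root uid, drop all capabilities, forbid privilege gain, and
    # apply restrictive seccomp and AppArmor profiles plus a read-only rootfs.
    docker run --rm \
      --user 10001:10001 \
      --cap-drop=ALL \
      --security-opt no-new-privileges \
      --security-opt seccomp=/path/to/profile.json \
      --security-opt apparmor=docker-default \
      --read-only \
      some-image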
I have significantly more confidence in lightweight VM/hypervisor models like Firecracker. My work is moving in that direction for some of our shared tenancy stuff.
But I agree you can't beat physical node isolation for workloads where the bill for a breach is in the many many millions.
Posted Feb 12, 2024 18:17 UTC (Mon)
by atnot (subscriber, #124910)
[Link]
There's a huge variety of solutions here, but what they have in common is that nobody trusts multiple customers with one kernel.
Posted Feb 12, 2024 19:47 UTC (Mon)
by geofft (subscriber, #59789)
[Link] (1 responses)
We benefited from the fact that our workloads were already designed for non-containerized environments (and so we already had non-root ways of installing software etc.) and we could just containerize that model. If you're running container-native workloads that expect to have root, or even multi-tenant workloads that all use the same non-root UID (e.g. everyone runs as UID 1000), user namespaces are a great option: the code in the container feels like it's running as whatever UID, and if it's root it can do things like mount a tmpfs or reconfigure its own networking. But from outside the container, these are separate, unprivileged UIDs. So the attack in the article about reading /etc/passwd is still possible, but a container running as "root" cannot write to it. Kubernetes has user namespace support in alpha.
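A quick way to see the effect being described, outside of any container engine, is util-linux's unshare: the process believes it is root, but the kernel's uid_map shows it is really the caller's unprivileged uid.

    # --map-root-user maps the caller's uid to 0 inside a new user namespace.
    unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
    # prints 0, then a mapping like "0 <your-uid> 1"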
And then yes, you can use virtualization (of any level of hog). If you're on a public cloud platform and you're on VMs anyway, setting up your VMs to be per workload is pretty easy. There's also stuff like Kata Containers or gVisor (commercially offered by Google as GKE Sandbox) for having a "container" runtime that actually uses hardware virtualization.
Posted Feb 17, 2024 15:12 UTC (Sat)
by raven667 (subscriber, #5198)
[Link]
This is something I bang on about all the time; one of the corollaries of risk is lack of accountability: in places where you have accountability there is less likelihood of issues, and the security process should be driven more by impact. Making a reasonable and effective risk calculation for where to spend your limited time instituting security is hard to do; it's mostly vibes and tribal embarrassment that drive people to apply all the "best practices" they've heard of without regard to which ones actually help in their situation and which are wastes of time.
Posted Feb 15, 2024 9:14 UTC (Thu)
by cortana (subscriber, #24596)
[Link] (1 responses)
So, for instance, looking at a platform like Red Hat OpenShift, which runs on top of RHEL CoreOS, a typical container:

* Must run with a UID, GID and supplemental groups from the range assigned to its project
* Must run within a restrictive domain (container_t) and a unique MCS category (e.g., c5,c67)
* Must run with the runtime/default seccomp profile applied

So for one workload to read the files of another, it would have to:

* Use an exploit to escape out of its namespaces
* Use an exploit to override DAC (i.e., read a file owned by another user; made more difficult by the seccomp profile which prevents the use of less common system calls)
* Use an exploit to override the SELinux policy which prevents, for instance, a process running as container_t:c5,c67 from accessing objects labelled with container_file_t:c78,c200

It's interesting you mention qemu, because the idea of using a restrictive label & MCS category to protect container workloads from one another came from the very same setup that RHEL uses to protect VMs from one another: the container_t label is (or was at some point) an alias to svirt_lxc_net_t, the label that libvirt applies to all the qemu processes that it launches (again, with each one being given a unique pair of MCS categories so that VMs can't attack each other).
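A hedged way to see that separation on an SELinux-enabled host; label details, tools, and category numbers will differ per system:

    # Each container process runs as container_t with its own MCS category pair...
    ps -eZ | grep container_t

    # ...and the label is also visible from inside the container.
    podman run --rm fedora cat /proc/self/attr/current
    # e.g. system_u:system_r:container_t:s0:c5,c67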
Posted Feb 17, 2024 15:17 UTC (Sat)
by raven667 (subscriber, #5198)
[Link]
Posted Feb 12, 2024 19:45 UTC (Mon)
by carlosrodfern (subscriber, #166486)
[Link] (2 responses)
Posted Feb 13, 2024 11:08 UTC (Tue)
by anguslees (subscriber, #7131)
[Link] (1 responses)
Getting control over an existing container cannot be escalated to a host exploit through this issue.
Posted Feb 13, 2024 16:15 UTC (Tue)
by carlosrodfern (subscriber, #166486)
[Link]
Posted Feb 17, 2024 21:14 UTC (Sat)
by bsdnet (guest, #169714)
[Link] (1 responses)
Posted Feb 18, 2024 15:08 UTC (Sun)
by raven667 (subscriber, #5198)
[Link]
Posted Feb 12, 2024 18:43 UTC (Mon)
by champtar (subscriber, #128673)
[Link]