
CVE-2019-5736: runc container breakout

Anybody running containerized workloads with runc (used by Docker, cri-o, containerd, and Kubernetes, among others) will want to make note of a newly disclosed vulnerability known as CVE-2019-5736. "The vulnerability allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host." LXC is also evidently vulnerable to a variant of the exploit.


From:  Aleksa Sarai <cyphar-AT-cyphar.com>
To:  oss-security-AT-lists.openwall.com
Subject:  [oss-security] CVE-2019-5736: runc container breakout (all versions)
Date:  Tue, 12 Feb 2019 00:05:20 +1100
Message-ID:  <20190211130520.xwi6vpay3sc56pza@yavin>
Cc:  dev-AT-opencontainers.org, security-announce-AT-opencontainers.org

[[        Patch CRD: 2019-02-11 15:00 CET ]]
[[ Exploit Code CRD: 2019-02-18 15:00 CET ]]

Hello,

I am one of the maintainers of runc (the container runtime
underneath Docker, cri-o, containerd, Kubernetes, and so on). We
recently had a vulnerability reported, which we have verified and have a
patch for.

The researchers who found this vulnerability are:
  * Adam Iwaniuk
  * Borys Popławski

In addition, Aleksa Sarai (me) discovered that LXC was also vulnerable
to a more convoluted version of this flaw.

== OVERVIEW ==

The vulnerability allows a malicious container to (with minimal user
interaction) overwrite the host runc binary and thus gain root-level
code execution on the host. The user interaction required is being able
to run any command (it doesn't matter if the command is not
attacker-controlled) as root within a container in either of these
contexts:

  * Creating a new container using an attacker-controlled image.
  * Attaching (docker exec) into an existing container which the
    attacker had previous write access to.

This vulnerability is *not* blocked by the default AppArmor policy, nor
by the default SELinux policy on Fedora[++] (because container processes
appear to be running as container_runtime_t). However, it *is* blocked
through correct use of user namespaces (where the host root is not
mapped into the container's user namespace).

Our CVSSv3 vector is (with a score of 7.2):

  AV:L/AC:H/PR:L/UI:R/S:C/C:N/I:H/A:H

The assigned CVE for this issue is CVE-2019-5736.

[++]: This is only the case for the "moby-engine" package on Fedora. The
	  "docker" package as well as podman are protected against this
	  exploit because they run container processes as container_t.

== PATCHES ==

I have attached the relevant patch which fixes this issue. This patch is
based on HEAD, but the code in libcontainer/nsenter/ changes so
infrequently that it should apply cleanly to any old version of the runc
codebase you are dealing with.

Please note that the patch I have pushed to runc master[1] is a modified
version of this patch. It is functionally identical, but we would
recommend using the upstream one if you haven't already patched using
the attached one.

== NON-ESSENTIAL EXPLOIT CODE ==

Several vendors have asked for exploit code to ensure that the patches
actually solve the issue. Due to the severity of the issue (especially
for public cloud vendors), we decided to provide the attached exploit
code. This exploit code was written by me, and is more generic than the
original exploit code provided by the researchers and works against LXC
(it could likely be used on other vulnerable runtimes with no
significant modification). Details on how to use the exploit code are
provided in the README.

As per OpenWall rules, this exploit code will be published *publicly* 7
days after the CRD (which is 2019-02-18). *If you have a container
runtime, please verify that you are not vulnerable to this issue
beforehand.*

== IMPACT ON OTHER PROJECTS ==

It should be noted that upon further investigation I've discovered that
LXC has a similar vulnerability, and they have also pushed a similar
patch[2] which we co-developed. LXC is a bit harder to exploit, but the
same fundamental flaw exists.

After some discussion with the systemd-nspawn folks, it appears that
they aren't vulnerable (because they attach to a container using a
different method than LXC and runc).

I have been contacted by folks from Apache Mesos who said they were also
vulnerable (I believe just using the exploit code that will be
provided). It is quite likely that most container runtimes are
vulnerable to this flaw, unless they took very strange mitigations
before-hand.

== OTHER NEWS ==

We have set up an announcement list for future security vulnerabilities,
and you can see the process for joining here[3] (it's based on the
Kubernetes security-announce mailing list). Please join if you
distribute any container runtimes that depend on runc (or other OCI
projects).

[1]: https://github.com/opencontainers/runc/commit/0a8e4117e7f...
[2]: https://github.com/lxc/lxc/commit/6400238d08cdf1ca20d49ba...
[3]: https://github.com/opencontainers/org/blob/master/securit...

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
From 604a8f8120ef128c0a5bc778e71909eeb0906842 Mon Sep 17 00:00:00 2001
From: Aleksa Sarai <asarai@suse.de>
Date: Wed, 9 Jan 2019 13:40:01 +1100
Subject: [PATCH] nsenter: clone /proc/self/exe to avoid exposing host binary
 to container

There are quite a few circumstances where /proc/self/exe pointing to a
pretty important container binary is a _bad_ thing, so to avoid this we
have to make a copy (preferably doing self-clean-up and not being
writeable).

As a hotfix we require memfd_create(2), but we can always extend this to
use a scratch MNT_DETACH overlayfs or tmpfs. The main downside to this
approach is no page-cache sharing for the runc binary (which overlayfs
would give us) but this is far less complicated.

This is only done during nsenter so that it happens transparently to the
Go code, and any libcontainer users benefit from it. This also makes
ExtraFiles and --preserve-fds handling trivial (because we don't need to
worry about it).

Fixes: CVE-2019-5736
Signed-off-by: Aleksa Sarai <asarai@suse.de>
---
 libcontainer/nsenter/cloned_binary.c | 236 +++++++++++++++++++++++++++
 libcontainer/nsenter/nsexec.c        |  11 ++
 2 files changed, 247 insertions(+)
 create mode 100644 libcontainer/nsenter/cloned_binary.c

diff --git a/libcontainer/nsenter/cloned_binary.c b/libcontainer/nsenter/cloned_binary.c
new file mode 100644
index 000000000000..ec383c173dd2
--- /dev/null
+++ b/libcontainer/nsenter/cloned_binary.c
@@ -0,0 +1,236 @@
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <limits.h>
+#include <fcntl.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/vfs.h>
+#include <sys/mman.h>
+#include <sys/sendfile.h>
+#include <sys/syscall.h>
+
+#include <linux/magic.h>
+#include <linux/memfd.h>
+
+#define MEMFD_COMMENT "runc_cloned:/proc/self/exe"
+#define MEMFD_LNKNAME "/memfd:" MEMFD_COMMENT " (deleted)"
+
+/* Use our own wrapper for memfd_create. */
+#if !defined(SYS_memfd_create) && defined(__NR_memfd_create)
+#  define SYS_memfd_create __NR_memfd_create
+#endif
+#ifndef SYS_memfd_create
+#  error "memfd_create(2) syscall not supported by this glibc version"
+#endif
+int memfd_create(const char *name, unsigned int flags)
+{
+	return syscall(SYS_memfd_create, name, flags);
+}
+
+/* This comes directly from <linux/fcntl.h>. */
+#ifndef F_LINUX_SPECIFIC_BASE
+# define F_LINUX_SPECIFIC_BASE 1024
+#endif
+#ifndef F_ADD_SEALS
+# define F_ADD_SEALS (F_LINUX_SPECIFIC_BASE + 9)
+# define F_GET_SEALS (F_LINUX_SPECIFIC_BASE + 10)
+#endif
+#ifndef F_SEAL_SEAL
+# define F_SEAL_SEAL   0x0001	/* prevent further seals from being set */
+# define F_SEAL_SHRINK 0x0002	/* prevent file from shrinking */
+# define F_SEAL_GROW   0x0004	/* prevent file from growing */
+# define F_SEAL_WRITE  0x0008	/* prevent writes */
+#endif
+
+/*
+ * Verify whether we are currently in a self-cloned program. It's not really
+ * possible to trivially identify a memfd compared to a regular tmpfs file, so
+ * the best we can do is to check whether the readlink(2) looks okay and that
+ * it is on a tmpfs.
+ */
+static int is_self_cloned(void)
+{
+	struct statfs statfsbuf = {0};
+	char linkname[PATH_MAX + 1] = {0};
+
+	if (statfs("/proc/self/exe", &statfsbuf) < 0)
+		return -1;
+	if (readlink("/proc/self/exe", linkname, PATH_MAX) < 0)
+		return -1;
+
+	return statfsbuf.f_type == TMPFS_MAGIC &&
+		!strncmp(linkname, MEMFD_LNKNAME, PATH_MAX);
+}
+
+/*
+ * Reads the whole file into a heap buffer with read(2) and reports its length
+ * so you can treat it as an ordinary buffer. The caller must free() the result.
+ */
+static char *read_file(char *path, size_t *length)
+{
+	int fd;
+	char buf[4096], *copy = NULL;
+
+	if (!length)
+		goto err;
+	*length = 0;
+
+	fd = open(path, O_RDONLY|O_CLOEXEC);
+	if (fd < 0)
+		goto err_free;
+
+	for (;;) {
+		int n;
+		char *old = copy;
+
+		n = read(fd, buf, sizeof(buf));
+		if (n < 0)
+			goto err_fd;
+		if (!n)
+			break;
+
+		do {
+			copy = realloc(old, (*length + n) * sizeof(*old));
+		} while(!copy);
+
+		memcpy(copy + *length, buf, n);
+		*length += n;
+	}
+	close(fd);
+	return copy;
+
+err_fd:
+	close(fd);
+err_free:
+	free(copy);
+err:
+	return NULL;
+}
+
+/*
+ * A poor-man's version of "xargs -0". Basically parses a given block of
+ * NUL-delimited data, within the given length and adds a pointer to each entry
+ * to the array of pointers.
+ */
+static int parse_xargs(char *data, int data_length, char ***output)
+{
+	int num = 0;
+	char *cur = data;
+
+	if (!data || *output)
+		return -1;
+
+	do {
+		*output = malloc(sizeof(**output));
+	} while (!*output);
+
+	while (cur < data + data_length) {
+		char **old = *output;
+
+		num++;
+		do {
+			*output = realloc(old, (num + 1) * sizeof(*old));
+		} while (!*output);
+
+		(*output)[num - 1] = cur;
+		cur += strlen(cur) + 1;
+	}
+	(*output)[num] = NULL;
+	return num;
+}
+
+/*
+ * "Parse" out argv and envp from /proc/self/cmdline and /proc/self/environ.
+ * This is necessary because we are running in a context where we don't have a
+ * main() that we can just get the arguments from.
+ */
+static int fetchve(char ***argv, char ***envp)
+{
+	char *cmdline, *environ;
+	size_t cmdline_size, environ_size;
+
+	cmdline = read_file("/proc/self/cmdline", &cmdline_size);
+	if (!cmdline)
+		goto err;
+	environ = read_file("/proc/self/environ", &environ_size);
+	if (!environ)
+		goto err_free;
+
+	if (parse_xargs(cmdline, cmdline_size, argv) <= 0)
+		goto err_free_both;
+	if (parse_xargs(environ, environ_size, envp) <= 0)
+		goto err_free_both;
+
+	return 0;
+
+err_free_both:
+	free(environ);
+err_free:
+	free(cmdline);
+err:
+	return -1;
+}
+
+static int clone_binary(void)
+{
+	int binfd, memfd, err;
+	ssize_t sent = 0;
+	struct stat statbuf = {0};
+
+	binfd = open("/proc/self/exe", O_RDONLY|O_CLOEXEC);
+	if (binfd < 0)
+		goto err;
+	if (fstat(binfd, &statbuf) < 0)
+		goto err_binfd;
+
+	memfd = memfd_create(MEMFD_COMMENT, MFD_CLOEXEC|MFD_ALLOW_SEALING);
+	if (memfd < 0)
+		goto err_binfd;
+
+	while (sent < statbuf.st_size) {
+		ssize_t n = sendfile(memfd, binfd, NULL, statbuf.st_size - sent);
+		if (n < 0)
+			goto err_memfd;
+		sent += n;
+	}
+
+	err = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE|F_SEAL_SEAL);
+	if (err < 0)
+		goto err_memfd;
+
+	close(binfd);
+	return memfd;
+
+err_memfd:
+	close(memfd);
+err_binfd:
+	close(binfd);
+err:
+	return -1;
+}
+
+int ensure_cloned_binary(void)
+{
+	int execfd;
+	char **argv = NULL, **envp = NULL;
+
+	/* Check that we're not self-cloned, and if we are then bail. */
+	int cloned = is_self_cloned();
+	if (cloned != 0)
+		return cloned;
+
+	if (fetchve(&argv, &envp) < 0)
+		return -1;
+
+	execfd = clone_binary();
+	if (execfd < 0)
+		return -1;
+
+	fexecve(execfd, argv, envp);
+	return -1;
+}
diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index 28269dfc027f..4fdfec1b7b89 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -534,6 +534,9 @@ void join_namespaces(char *nslist)
 	free(namespaces);
 }
 
+/* Defined in cloned_binary.c. */
+int ensure_cloned_binary(void);
+
 void nsexec(void)
 {
 	int pipenum;
@@ -549,6 +552,14 @@ void nsexec(void)
 	if (pipenum == -1)
 		return;
 
+	/*
+	 * We need to re-exec if we are not in a cloned binary. This is necessary
+	 * to ensure that containers won't be able to access the host binary
+	 * through /proc/self/exe. See CVE-2019-5736.
+	 */
+	if (ensure_cloned_binary() < 0)
+		bail("could not ensure we are a cloned binary");
+
 	/* Parse all of the netlink configuration. */
 	nl_parse(pipenum, &config);
 
-- 
2.20.1
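
The property the patch relies on is that a memfd sealed with F_SEAL_WRITE rejects later writes. The following is a minimal sketch of that behaviour in isolation, not code from runc: the helper names make_sealed_memfd() and try_overwrite() and the memfd name "sealed_demo" are made up for illustration, and it assumes a Linux kernel and headers with memfd_create(2) and file sealing support.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/memfd.h>

/* Fallback values for older libc headers, the same ones the patch defines. */
#ifndef F_ADD_SEALS
# define F_ADD_SEALS (1024 + 9)
#endif
#ifndef F_SEAL_SEAL
# define F_SEAL_SEAL   0x0001
# define F_SEAL_SHRINK 0x0002
# define F_SEAL_GROW   0x0004
# define F_SEAL_WRITE  0x0008
#endif

/*
 * Hypothetical helper: copy len bytes into a fresh memfd and seal it the
 * same way clone_binary() does. Returns the descriptor, or -1 on error.
 */
static int make_sealed_memfd(const char *data, size_t len)
{
	size_t off = 0;
	int fd;

	assert(data != NULL || len == 0);
	fd = (int)syscall(SYS_memfd_create, "sealed_demo",
			  MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	while (off < len) {
		ssize_t n = write(fd, data + off, len - off);
		if (n < 0)
			goto err;
		off += (size_t)n;
	}

	if (fcntl(fd, F_ADD_SEALS,
		  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL) < 0)
		goto err;
	return fd;

err:
	close(fd);
	return -1;
}

/*
 * Hypothetical helper: try to overwrite the first byte. Returns 0 if the
 * write went through, otherwise the errno reported by pwrite(2).
 */
static int try_overwrite(int fd)
{
	if (pwrite(fd, "X", 1, 0) == 1)
		return 0;
	return errno;
}
```

Calling try_overwrite() on a descriptor returned by make_sealed_memfd() should fail with EPERM while the original bytes stay readable, which is what keeps a container process from modifying the copy it sees through /proc/self/exe.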



CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 15:56 UTC (Tue) by ColinIanKing (guest, #57499) [Link] (7 responses)

If those realloc() calls in the fix fail don't we end up with segfaults?

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 16:39 UTC (Tue) by ibukanov (subscriber, #3942) [Link]

The code loops until realloc succeeds.

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 16:41 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (5 responses)

No, because according to realloc(3):

> If realloc() fails the original block is left untouched; it is not freed or moved.

So instead of segfaulting you would drop into an infinite loop and/or trigger the OOM killer. That's bad, but it's (probably) not a vuln.
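
The realloc(3) guarantee quoted above can be sketched in a few lines. This is an illustration only: try_grow() is a hypothetical helper, not something in the patch, and it assumes the allocator returns NULL for an impossibly large request, as glibc does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: try to resize *buf to new_size bytes.
 * Returns 0 and updates *buf on success. Returns -1 on failure, in which
 * case *buf still points at the original, untouched block.
 */
static int try_grow(char **buf, size_t new_size)
{
	char *tmp;

	/* realloc(p, 0) has its own corner-case semantics, so rule it out. */
	assert(buf != NULL && new_size > 0);

	tmp = realloc(*buf, new_size);
	if (!tmp)
		return -1;
	*buf = tmp;
	return 0;
}
```

Because the result goes into a temporary first, a failed call neither leaks nor invalidates the old pointer. The patch's loops get the same effect by keeping the previous pointer in `old` and passing it to realloc() again.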

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 19:03 UTC (Tue) by ibukanov (subscriber, #3942) [Link] (4 responses)

Looping as opposed to calling abort() is a useful option in a complex application. It allows one to attach a debugger and investigate the live state.

Plus in the container world it is quite likely that realloc returns null not when the system is out of memory, but rather when the container hits its allocation limits. An administrator may raise the limits and let the application continue and reach some stable state where it can be properly closed.

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 19:28 UTC (Tue) by sorokin (guest, #88478) [Link] (3 responses)

> An administrator may raise the limits and let the application continue and reach some stable state where it can be properly closed.

Should the same logic be applied to, say, the open() function? The file is not there, but perhaps it will be soon. An administrator may attach to the application, see that a necessary file is missing, and put it in the right place.

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 21:03 UTC (Tue) by ibukanov (subscriber, #3942) [Link] (2 responses)

Memory allocation is much more widespread than open calls. So most applications do not even bother with allocation errors and assume that new/alloc/realloc never fail, as dealing with those is too painful. This is true both for manual memory management and for GC languages like Java or Go. This leads to the question of what to do if an allocation does report an error and one cannot propagate the error to the caller. In that case, looping is not particularly worse than calling abort().

CVE-2019-5736: runc container breakout

Posted Feb 14, 2019 0:38 UTC (Thu) by wahern (subscriber, #37304) [Link] (1 responses)

Which caller? The immediately preceding function or the caller that invoked the application? Being able to attach a debugger certainly has value, but in production environments, and particularly in cloud infrastructure environments, that's not the most important characteristic. What you want is for things to fail fast so that the system can adjust in a timely manner; and usually fail fast to the nearest caller (at least nearest by API or contract boundary) as the nearest caller will usually be in the best position to react optimally (and if not it can simply bubble up). It's *exactly* analogous to buffer bloat--by trying to be too helpful these hacks are blunting or dropping feedback pressure, with the macro result being crappy behavior that is extremely costly to remedy, if it can be remedied at all.

The distinction between being "truly" out of memory and only being out of memory because of policy seems contrived and irrelevant, particularly in the era of nested containers and VMs.

In the context of a language like Go or Perl or simple C utilities or similar languages, aborting on allocation failure (rather than trying to implement the runtime in a way that makes it recoverable) is excused as a reasonable cost-benefit tradeoff, as these programs are typically executed in circumscribed contexts (microservices, CGI scripts) where the possibility of meaningful recovery by higher layers up the stack is retained. But locking up the application? Sure, being able to attach a debugger has some utility, just as being able to attach a debugger to a car's braking system which decided to fail in an open state would be a convenience for the engineers. But that's hardly the most reasonable or desirable behavior in the vast majority of contexts.

CVE-2019-5736: runc container breakout

Posted Feb 14, 2019 10:59 UTC (Thu) by ibukanov (subscriber, #3942) [Link]

The code in question is a part of a long-running application that has non-trivial persistent state, not a microservice or CGI script.

Realistically, when realloc returns NULL on Linux, it is due to limits imposed on the application, not because the system is out of memory. Linux happily overcommits, and when the system does run out of memory, a process will be killed at an arbitrary point by the OOM killer. In a production system this should not happen, as it is extremely hard to write application code that does not corrupt its state if the app can be killed at any point.

But if an application hits a memory policy limit, then waiting for the limit to be lifted is reasonable behavior. First, if the limit is imposed not on a single thread but on many threads/processes, then it makes sense to wait until the memory hog finishes. Second, many applications may still leave things in an inconsistent state if any memory allocation can call abort.

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 16:52 UTC (Tue) by brauner (subscriber, #109349) [Link] (3 responses)

@jcorbet, could you please correct this to include the information that this attack only affects privileged containers.
For a few more details I've written about this a little: https://brauner.github.io/2019/02/12/privileged-container...
Thank you!
Christian

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 16:53 UTC (Tue) by brauner (subscriber, #109349) [Link]

s/jcorbet/corbet. Sorry for the typo. :)

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 18:32 UTC (Tue) by Lennie (guest, #49641) [Link]

Very good, you tell 'em about (un)privileged containers.

It's very good security hygiene.

I wish it was common for people running Docker/Kubernetes.

CVE-2019-5736: runc container breakout

Posted Feb 13, 2019 22:12 UTC (Wed) by geuder (subscriber, #62854) [Link]

> that this attack only affects privileged containers.

Which I would guess is > 99% of all docker installations.

If I understood it correctly you need to enable user namespaces in your docker installation before creating the first container.

After you have done that, containers will be created as unprivileged by default. But by using an option you can still create privileged and "super-privileged" containers.

I fear that having to start with a fresh installation is quite a high hurdle for many existing installations. And because it is not the default, 99% of the new installations will enter the same dead end.

How many of the existing docker images would work in an unprivileged container? I have no experience with that. But I have used unprivileged containers under lxc before and it required nearly endless fiddling to get any sharing with the host working as desired without opening all gates. So I would not have high hopes that existing docker images would just run, unless they really don't share anything that makes uids visible.

I agree unprivileged containers should be used in many cases. But I predict it will not happen any time soon, because of the complications and extra work involved.
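
Whether a container falls in the unprivileged category discussed here can be read from its UID map. Below is a rough sketch, assuming the text comes from /proc/self/uid_map inside the container and that the parent user namespace is the host's initial one; host_root_is_mapped() is a hypothetical helper name. Each line of that file has the form "inside-id outside-id count".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan uid_map text ("inside outside count" per line)
 * and return 1 if any range starts at outside UID 0, meaning host root is
 * mapped into the namespace. Returns 0 otherwise.
 */
static int host_root_is_mapped(const char *uid_map)
{
	const char *line = uid_map;

	assert(uid_map != NULL);
	while (line && *line) {
		unsigned long inside, outside, count;

		/* UIDs are unsigned, so a range covers 0 only if it starts at 0. */
		if (sscanf(line, "%lu %lu %lu", &inside, &outside, &count) == 3 &&
		    outside == 0 && count > 0)
			return 1;

		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}
```

A process in the initial namespace typically sees a single line such as "0 0 4294967295", for which this returns 1. A map like "0 100000 65536" returns 0, which is the kind of setup the advisory describes as blocking the attack.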

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 19:17 UTC (Tue) by NightMonkey (subscriber, #23051) [Link] (2 responses)

Would anyone care to speak about this vulnerability with regard to major cloud provider container implementations? I'm thinking AWS ECS (Fargate or otherwise), Google Cloud, etc. Thanks in advance.

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 21:44 UTC (Tue) by NightMonkey (subscriber, #23051) [Link]

Well, to start to answer my own question:

https://aws.amazon.com/security/security-bulletins/AWS-20...

CVE-2019-5736: runc container breakout

Posted Feb 13, 2019 22:35 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Amazon doesn't use Docker containers to isolate different tenants. Even Fargate and Lambda use full-scale virtualization for isolation, so that your data won't leak into other users' accounts.

If you run untrusted containers then your account is vulnerable. But why would you do this?

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 19:23 UTC (Tue) by sorokin (guest, #88478) [Link] (3 responses)

> do {
> *output = malloc(sizeof(**output));
> } while (!*output);

Am I the only person who considers these infinite loops bad practice? I would say that the code should report the error. It also must preserve the state unchanged so the operation can be retried.

Having infinite loops like this, especially in a function that reads a whole file into memory, looks very strange.

Apparently the authors are not consistent. They have asprintf() without looping in the same file.
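
For comparison, here is one way a read-the-whole-file helper could report allocation failure instead of looping. It is a sketch of the general idea raised in this comment, not the code that was merged into runc; read_all() and its signature are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical alternative to the patch's read_file(): read everything
 * from fd into a heap buffer. Returns 0 on success, storing the buffer
 * (NULL for empty input) in *out and its size in *length. Returns -1 with
 * errno set on failure, leaving *out and *length untouched and leaking
 * nothing, so the caller can report the error or retry.
 */
static int read_all(int fd, char **out, size_t *length)
{
	char chunk[4096], *copy = NULL;
	size_t used = 0;
	int saved;

	assert(out != NULL && length != NULL);
	for (;;) {
		char *tmp;
		ssize_t n = read(fd, chunk, sizeof(chunk));

		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		if (n == 0)
			break;

		/* Grow through a temporary so a failure keeps copy valid. */
		tmp = realloc(copy, used + (size_t)n);
		if (!tmp) {
			errno = ENOMEM;
			goto fail;
		}
		copy = tmp;
		memcpy(copy + used, chunk, (size_t)n);
		used += (size_t)n;
	}
	*out = copy;
	*length = used;
	return 0;

fail:
	saved = errno;	/* free() may clobber errno on some libcs */
	free(copy);
	errno = saved;
	return -1;
}
```

Returning a status code separately from the buffer also removes the ambiguity between "empty file" and "error", both of which produce a NULL pointer in the original read_file().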

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 19:34 UTC (Tue) by excors (subscriber, #95769) [Link] (1 responses)

Reporting errors when out of memory seems difficult, unless you're extremely careful to implement the entire error-reporting path with no memory allocation at all (including in any third-party libraries you call into). More practical to just fail in a secure and obvious way (like abort() or (if you can expect the user to notice and debug/kill the process) loop forever).

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 20:26 UTC (Tue) by sorokin (guest, #88478) [Link]

> Reporting errors when out of memory seems difficult, unless you're extremely careful to implement the entire error-reporting path with no memory allocation at all (including in any third-party libraries you call into).

I completely agree. It is understandably difficult to be 100% exception-safe. That is why by "report the error" I meant reporting in general. "fprintf(stderr, ...); abort();" is fine. "errno = ENOMEM; return -1;" is fine. "throw std::bad_alloc();" is fine. I would say that any kind of error reporting is OK as long as no data is lost and the operation can be retried.

> if you can expect the user to notice and debug/kill the process

That is what I think is not reasonable to expect. When some KDE application hangs using 100% CPU time, the last idea that would come to my mind is that it lacks memory. Also most users don't know how to use gdb.

CVE-2019-5736: runc container breakout

Posted Feb 12, 2019 23:57 UTC (Tue) by cyphar (subscriber, #110703) [Link]

The patch actually pushed is substantially cleaner[1], but the looping is mostly because the patch was co-developed with the LXC folks and they have must_realloc and family that do the exact same thing. "reporting an error" here would be a crash, by the way (in the context of "runc init" we currently don't have a way to report errors other than through the exit code). The first few iterations of the patch just aborted each time, but mimicking LXC's must_realloc seemed nicer.

> They have asprintf() without looping in the same file.

Yup, that is a mistake -- I will fix that when I update the fix to work on pre-3.11 kernels.

[1]: https://github.com/opencontainers/runc/commit/6635b4f0c6a...

CVE-2019-5736: runc container breakout

Posted Feb 14, 2019 8:12 UTC (Thu) by smcv (subscriber, #53363) [Link]

Flatpak versions older than 1.2.3 and 1.0.7 are thought to be vulnerable to a similar attack (CVE-2019-8308), although only in narrow circumstances: when an app or runtime with an `apply_extra` script is installed system-wide, the `apply_extra` script runs as root in a container, and could escape the container by using a similar technique.


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds