CVE-2019-5736: runc container breakout
Anybody running containerized workloads with runc (used by Docker,
cri-o, containerd, and Kubernetes, among others) will want to make note of
a newly disclosed vulnerability known as CVE-2019-5736. "The vulnerability
allows a malicious container to (with minimal user interaction) overwrite
the host runc binary and thus gain root-level code execution on the host."
LXC is also evidently vulnerable to a variant of the exploit.
From: Aleksa Sarai <cyphar-AT-cyphar.com>
To: oss-security-AT-lists.openwall.com
Subject: [oss-security] CVE-2019-5736: runc container breakout (all versions)
Date: Tue, 12 Feb 2019 00:05:20 +1100
Message-ID: <20190211130520.xwi6vpay3sc56pza@yavin>
Cc: dev-AT-opencontainers.org, security-announce-AT-opencontainers.org
[[ Patch CRD: 2019-02-11 15:00 CET ]]
[[ Exploit Code CRD: 2019-02-18 15:00 CET ]]
Hello,
I am one of the maintainers of runc (the container runtime underneath
Docker, cri-o, containerd, Kubernetes, and so on). We recently received
a vulnerability report, which we have verified and for which we have a
patch.
The researchers who found this vulnerability are:
* Adam Iwaniuk
* Borys Popławski
In addition, Aleksa Sarai (me) discovered that LXC was also vulnerable
to a more convoluted version of this flaw.
== OVERVIEW ==
The vulnerability allows a malicious container to (with minimal user
interaction) overwrite the host runc binary and thus gain root-level
code execution on the host. The user interaction required is running any
command (it doesn't matter if the command is not attacker-controlled) as
root within a container in either of these contexts:
* Creating a new container using an attacker-controlled image.
* Attaching (docker exec) into an existing container which the
  attacker previously had write access to.
This vulnerability is *not* blocked by the default AppArmor policy, nor
by the default SELinux policy on Fedora[++] (because container processes
appear to be running as container_runtime_t). However, it *is* blocked
through correct use of user namespaces (where the host root is not
mapped into the container's user namespace).
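The user-namespace condition above can be checked from the host by looking at a container process's uid_map. What follows is only a minimal sketch and not part of runc: host_root_is_mapped is a hypothetical helper that takes the text of /proc/<pid>/uid_map and reports whether outside ID 0 is mapped into the namespace. It assumes the file was read from the initial user namespace, since the "outside" column is expressed relative to the reader's (or the parent) namespace.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from runc). Each uid_map line has the form
 * "inside-id outside-id count". Returns 1 if some line maps outside ID 0
 * into the namespace, 0 if none does, and -1 if a line cannot be parsed.
 */
int host_root_is_mapped(const char *uid_map_text)
{
	const char *line = uid_map_text;

	assert(uid_map_text != NULL);
	while (*line != '\0') {
		unsigned long inside, outside, count;
		const char *eol = strchr(line, '\n');

		if (sscanf(line, "%lu %lu %lu", &inside, &outside, &count) != 3)
			return -1;
		/* The mapped outside range is [outside, outside + count). */
		if (outside == 0 && count > 0)
			return 1;
		if (!eol)
			break;
		line = eol + 1;
	}
	return 0;
}
```

An identity mapping such as "0 0 4294967295" (no effective user namespace) yields 1, while a mapping such as "0 100000 65536" yields 0, which is the configuration the advisory describes as blocking the attack.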
Our CVSSv3 vector is (with a score of 7.2):
AV:L/AC:H/PR:L/UI:R/S:C/C:N/I:H/A:H
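For readers who want to see how the vector turns into the quoted score, here is a rough sketch of the CVSS v3.0 base-score arithmetic for the scope-changed case. The function names are made up for illustration, and the numeric weights are the ones I understand the CVSS v3.0 specification to assign to each metric value; treat them as assumptions and check them against the specification.

```c
#include <assert.h>

/* x^15 by repeated multiplication, so no math library is needed. */
static double pow15(double x)
{
	double r = 1.0;
	int i;

	for (i = 0; i < 15; i++)
		r *= x;
	return r;
}

/* Round a non-negative value up to one decimal place. */
static double roundup1(double x)
{
	long tenths = (long)(x * 10.0);

	if ((double)tenths < x * 10.0)
		tenths++;
	return (double)tenths / 10.0;
}

/* Base score when Scope is Changed (S:C); arguments are metric weights. */
double cvss3_base_scope_changed(double av, double ac, double pr, double ui,
				double c, double i, double a)
{
	double iss = 1.0 - (1.0 - c) * (1.0 - i) * (1.0 - a);
	double impact = 7.52 * (iss - 0.029) - 3.25 * pow15(iss - 0.02);
	double exploitability = 8.22 * av * ac * pr * ui;
	double total;

	assert(iss >= 0.0 && iss <= 1.0);
	if (impact <= 0.0)
		return 0.0;
	total = 1.08 * (impact + exploitability);
	return roundup1(total > 10.0 ? 10.0 : total);
}

/* AV:L/AC:H/PR:L/UI:R/S:C/C:N/I:H/A:H with the assumed v3.0 weights. */
double cve_2019_5736_score(void)
{
	/* AV:L=0.55 AC:H=0.44 PR:L(S:C)=0.68 UI:R=0.62 C:N=0 I:H=0.56 A:H=0.56 */
	return cvss3_base_scope_changed(0.55, 0.44, 0.68, 0.62, 0.0, 0.56, 0.56);
}
```

With these weights the unrounded value is about 7.12, which rounds up to the 7.2 given above.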
The assigned CVE for this issue is CVE-2019-5736.
[++]: This is only the case for the "moby-engine" package on Fedora. The
"docker" package as well as podman are protected against this
exploit because they run container processes as container_t.
== PATCHES ==
I have attached the relevant patch which fixes this issue. This patch is
based on HEAD, but the code in libcontainer/nsenter/ changes so
infrequently that it should apply cleanly to any old version of the runc
codebase you are dealing with.
Please note that the patch I have pushed to runc master[1] is a modified
version of this patch, though it is functionally identical. We recommend
using the upstream one if you haven't already patched using the attached
one.
== NON-ESSENTIAL EXPLOIT CODE ==
Several vendors have asked for exploit code to ensure that the patches
actually solve the issue. Due to the severity of the issue (especially
for public cloud vendors), we decided to provide the attached exploit
code. This exploit code was written by me. It is more generic than the
original exploit code provided by the researchers and also works against
LXC (it could likely be used on other vulnerable runtimes with no
significant modification). Details on how to use the exploit code are
provided in the README.
As per OpenWall rules, this exploit code will be published *publicly* on
2019-02-18, 7 days after the CRD. *If you have a container runtime,
please verify that you are not vulnerable to this issue beforehand.*
== IMPACT ON OTHER PROJECTS ==
It should be noted that upon further investigation I've discovered that
LXC has a similar vulnerability, and they have also pushed a similar
patch[2] which we co-developed. LXC is a bit harder to exploit, but the
same fundamental flaw exists.
After some discussion with the systemd-nspawn folks, it appears that
they aren't vulnerable (because they attach to a container using a
different method from LXC and runc).
I have been contacted by folks from Apache Mesos who said they were also
vulnerable (I believe just using the exploit code that will be
provided). It is quite likely that most container runtimes are
vulnerable to this flaw, unless they took very strange mitigations
beforehand.
== OTHER NEWS ==
We have set up an announcement list for future security vulnerabilities,
and you can see the process for joining here[3] (it's based on the
Kubernetes security-announce mailing list). Please join if you
distribute any container runtimes that depend on runc (or other OCI
projects).
[1]: https://github.com/opencontainers/runc/commit/0a8e4117e7f...
[2]: https://github.com/lxc/lxc/commit/6400238d08cdf1ca20d49ba...
[3]: https://github.com/opencontainers/org/blob/master/securit...
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
From 604a8f8120ef128c0a5bc778e71909eeb0906842 Mon Sep 17 00:00:00 2001
From: Aleksa Sarai <asarai@suse.de>
Date: Wed, 9 Jan 2019 13:40:01 +1100
Subject: [PATCH] nsenter: clone /proc/self/exe to avoid exposing host binary
to container
There are quite a few circumstances where /proc/self/exe pointing to a
pretty important container binary is a _bad_ thing, so to avoid this we
have to make a copy (preferably doing self-clean-up and not being
writeable).
As a hotfix we require memfd_create(2), but we can always extend this to
use a scratch MNT_DETACH overlayfs or tmpfs. The main downside to this
approach is no page-cache sharing for the runc binary (which overlayfs
would give us) but this is far less complicated.
This is only done during nsenter so that it happens transparently to the
Go code, and any libcontainer users benefit from it. This also makes
ExtraFiles and --preserve-fds handling trivial (because we don't need to
worry about it).
Fixes: CVE-2019-5736
Signed-off-by: Aleksa Sarai <asarai@suse.de>
---
libcontainer/nsenter/cloned_binary.c | 236 +++++++++++++++++++++++++++
libcontainer/nsenter/nsexec.c | 11 ++
2 files changed, 247 insertions(+)
create mode 100644 libcontainer/nsenter/cloned_binary.c
diff --git a/libcontainer/nsenter/cloned_binary.c b/libcontainer/nsenter/cloned_binary.c
new file mode 100644
index 000000000000..ec383c173dd2
--- /dev/null
+++ b/libcontainer/nsenter/cloned_binary.c
@@ -0,0 +1,236 @@
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <limits.h>
+#include <fcntl.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/vfs.h>
+#include <sys/mman.h>
+#include <sys/sendfile.h>
+#include <sys/syscall.h>
+
+#include <linux/magic.h>
+#include <linux/memfd.h>
+
+#define MEMFD_COMMENT "runc_cloned:/proc/self/exe"
+#define MEMFD_LNKNAME "/memfd:" MEMFD_COMMENT " (deleted)"
+
+/* Use our own wrapper for memfd_create. */
+#if !defined(SYS_memfd_create) && defined(__NR_memfd_create)
+# define SYS_memfd_create __NR_memfd_create
+#endif
+#ifndef SYS_memfd_create
+# error "memfd_create(2) syscall not supported by this glibc version"
+#endif
+int memfd_create(const char *name, unsigned int flags)
+{
+	return syscall(SYS_memfd_create, name, flags);
+}
+
+/* This comes directly from <linux/fcntl.h>. */
+#ifndef F_LINUX_SPECIFIC_BASE
+# define F_LINUX_SPECIFIC_BASE 1024
+#endif
+#ifndef F_ADD_SEALS
+# define F_ADD_SEALS (F_LINUX_SPECIFIC_BASE + 9)
+# define F_GET_SEALS (F_LINUX_SPECIFIC_BASE + 10)
+#endif
+#ifndef F_SEAL_SEAL
+# define F_SEAL_SEAL 0x0001 /* prevent further seals from being set */
+# define F_SEAL_SHRINK 0x0002 /* prevent file from shrinking */
+# define F_SEAL_GROW 0x0004 /* prevent file from growing */
+# define F_SEAL_WRITE 0x0008 /* prevent writes */
+#endif
+
+/*
+ * Verify whether we are currently in a self-cloned program. It's not really
+ * possible to trivially identify a memfd compared to a regular tmpfs file, so
+ * the best we can do is to check whether the readlink(2) looks okay and that
+ * it is on a tmpfs.
+ */
+static int is_self_cloned(void)
+{
+	struct statfs statfsbuf = {0};
+	char linkname[PATH_MAX + 1] = {0};
+
+	if (statfs("/proc/self/exe", &statfsbuf) < 0)
+		return -1;
+	if (readlink("/proc/self/exe", linkname, PATH_MAX) < 0)
+		return -1;
+
+	return statfsbuf.f_type == TMPFS_MAGIC &&
+		!strncmp(linkname, MEMFD_LNKNAME, PATH_MAX);
+}
+
+/*
+ * Basic wrapper that reads a whole file into a heap buffer with read(2) and
+ * gives you the file length so you can safely treat it as an ordinary buffer.
+ */
+static char *read_file(char *path, size_t *length)
+{
+	int fd;
+	char buf[4096], *copy = NULL;
+
+	if (!length)
+		goto err;
+	*length = 0;
+
+	fd = open(path, O_RDONLY|O_CLOEXEC);
+	if (fd < 0)
+		goto err_free;
+
+	for (;;) {
+		int n;
+		char *old = copy;
+
+		n = read(fd, buf, sizeof(buf));
+		if (n < 0)
+			goto err_fd;
+		if (!n)
+			break;
+
+		do {
+			copy = realloc(old, (*length + n) * sizeof(*old));
+		} while(!copy);
+
+		memcpy(copy + *length, buf, n);
+		*length += n;
+	}
+	close(fd);
+	return copy;
+
+err_fd:
+	close(fd);
+err_free:
+	free(copy);
+err:
+	return NULL;
+}
+
+/*
+ * A poor-man's version of "xargs -0". Basically parses a given block of
+ * NUL-delimited data, within the given length and adds a pointer to each entry
+ * to the array of pointers.
+ */
+static int parse_xargs(char *data, int data_length, char ***output)
+{
+	int num = 0;
+	char *cur = data;
+
+	if (!data || *output)
+		return -1;
+
+	do {
+		*output = malloc(sizeof(**output));
+	} while (!*output);
+
+	while (cur < data + data_length) {
+		char **old = *output;
+
+		num++;
+		do {
+			*output = realloc(old, (num + 1) * sizeof(*old));
+		} while (!*output);
+
+		(*output)[num - 1] = cur;
+		cur += strlen(cur) + 1;
+	}
+	(*output)[num] = NULL;
+	return num;
+}
+
+/*
+ * "Parse" out argv and envp from /proc/self/cmdline and /proc/self/environ.
+ * This is necessary because we are running in a context where we don't have a
+ * main() that we can just get the arguments from.
+ */
+static int fetchve(char ***argv, char ***envp)
+{
+	char *cmdline, *environ;
+	size_t cmdline_size, environ_size;
+
+	cmdline = read_file("/proc/self/cmdline", &cmdline_size);
+	if (!cmdline)
+		goto err;
+	environ = read_file("/proc/self/environ", &environ_size);
+	if (!environ)
+		goto err_free;
+
+	if (parse_xargs(cmdline, cmdline_size, argv) <= 0)
+		goto err_free_both;
+	if (parse_xargs(environ, environ_size, envp) <= 0)
+		goto err_free_both;
+
+	return 0;
+
+err_free_both:
+	free(environ);
+err_free:
+	free(cmdline);
+err:
+	return -1;
+}
+
+static int clone_binary(void)
+{
+	int binfd, memfd, err;
+	ssize_t sent = 0;
+	struct stat statbuf = {0};
+
+	binfd = open("/proc/self/exe", O_RDONLY|O_CLOEXEC);
+	if (binfd < 0)
+		goto err;
+	if (fstat(binfd, &statbuf) < 0)
+		goto err_binfd;
+
+	memfd = memfd_create(MEMFD_COMMENT, MFD_CLOEXEC|MFD_ALLOW_SEALING);
+	if (memfd < 0)
+		goto err_binfd;
+
+	while (sent < statbuf.st_size) {
+		ssize_t n = sendfile(memfd, binfd, NULL, statbuf.st_size - sent);
+		if (n < 0)
+			goto err_memfd;
+		sent += n;
+	}
+
+	err = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE|F_SEAL_SEAL);
+	if (err < 0)
+		goto err_memfd;
+
+	close(binfd);
+	return memfd;
+
+err_memfd:
+	close(memfd);
+err_binfd:
+	close(binfd);
+err:
+	return -1;
+}
+
+int ensure_cloned_binary(void)
+{
+	int execfd;
+	char **argv = NULL, **envp = NULL;
+
+	/* Check that we're not self-cloned, and if we are then bail. */
+	int cloned = is_self_cloned();
+	if (cloned != 0)
+		return cloned;
+
+	if (fetchve(&argv, &envp) < 0)
+		return -1;
+
+	execfd = clone_binary();
+	if (execfd < 0)
+		return -1;
+
+	fexecve(execfd, argv, envp);
+	return -1;
+}
diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index 28269dfc027f..4fdfec1b7b89 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -534,6 +534,9 @@ void join_namespaces(char *nslist)
 	free(namespaces);
 }
 
+/* Defined in cloned_binary.c. */
+int ensure_cloned_binary(void);
+
 void nsexec(void)
 {
 	int pipenum;
@@ -549,6 +552,14 @@ void nsexec(void)
 	if (pipenum == -1)
 		return;
 
+	/*
+	 * We need to re-exec if we are not in a cloned binary. This is necessary
+	 * to ensure that containers won't be able to access the host binary
+	 * through /proc/self/exe. See CVE-2019-5736.
+	 */
+	if (ensure_cloned_binary() < 0)
+		bail("could not ensure we are a cloned binary");
+
 	/* Parse all of the netlink configuration. */
 	nl_parse(pipenum, &config);
 
--
2.20.1
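
The sealing step in clone_binary() above is what keeps the in-memory copy from being modified once it has been made. The sketch below is a separate illustration, not code from the patch. It assumes Linux 3.17 or later and a glibc new enough to provide its own memfd_create() wrapper (2.27 or later, as far as I know), and sealed_memfd_rejects_write is a hypothetical name. It fills a memfd with placeholder data, applies the same four seals, and checks that a later write is refused.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo helper. Returns 1 if writing to a sealed memfd fails
 * with EPERM, 0 if the write succeeds or fails for another reason, and
 * -1 if the memfd could not be created, filled, or sealed.
 */
int sealed_memfd_rejects_write(void)
{
	static const char payload[] = "stand-in for the runc binary";
	int result = -1;
	int seals;
	int fd = memfd_create("sealed_demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	if (write(fd, payload, sizeof(payload)) != (ssize_t)sizeof(payload))
		goto out;
	if (fcntl(fd, F_ADD_SEALS,
		  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL) < 0)
		goto out;

	/* Sanity check: the kernel should now report the write seal. */
	seals = fcntl(fd, F_GET_SEALS);
	assert(seals >= 0 && (seals & F_SEAL_WRITE));

	/* Overwrite in place at offset 0, so only F_SEAL_WRITE is exercised. */
	if (pwrite(fd, "x", 1, 0) < 0 && errno == EPERM)
		result = 1;
	else
		result = 0;
out:
	close(fd);
	return result;
}
```

On a kernel with memfd sealing support the function is expected to return 1, which is the property the patch relies on before calling fexecve() on the copy.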
