Handling the Kubernetes symbolic link vulnerability

By Jake Edge
December 19, 2018

A year-old bug in Kubernetes was the topic of a talk given by Michelle Au and Jan Šafránek at KubeCon + CloudNativeCon North America, which was held mid-December in Seattle. In the talk, they looked at the details of the bug and the response from the Kubernetes product security team (PST). While the bug was fairly straightforward, it was surprisingly hard to fix. The whole process also provided experience that will help improve vulnerability handling in the future.

The Kubernetes project first became aware of the problem from a GitHub issue that was created on November 30, 2017. It gave full detail of the bug and was posted publicly. That is not the proper channel for reporting Kubernetes security bugs, Au stressed. Luckily, a security team member saw the bug report and cleared out all of the details, moving it to a private issue tracker. There is a documented disclosure process for the project that anyone finding a security problem should follow, she said.

Background

In order to understand the bug, some background on volumes in Kubernetes is needed, much of which can also be found in a blog post by Au and Šafránek. They put up an example pod spec (which can be seen in the slides [PDF]) that was using a volume. When that pod gets scheduled to a node, the volume will be mounted so that containers in the pod can access it. Each container can specify where the volume will be mounted inside the container. A directory is created on the host filesystem where the volume is mounted.

When the container starts, the kubelet node manager needs to tell the container runtime about the volume; both the host filesystem path and the location in the container where it should be bind mounted are needed. In addition, a container can specify a subdirectory of the location where the volume is mounted in the container using the subPath directive. But subPath was subject to a classic symbolic-link race condition vulnerability.

A container could symbolically link anything to a subPath name in its view of the volume. A subsequent container that used the subPath name would be using a link controlled by the owner of the pod. Those links are resolved on the host, so linking the subPath name to / would provide access to the root directory on the host; game over.
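The effect can be reproduced in miniature without Kubernetes. The following Python sketch is only an illustration: the directory names, the resolve_subpath_naively() helper, and the temporary directory standing in for the host are all assumptions made for the example, not Kubernetes code. It shows a link planted in the volume being followed when the host side joins the volume path with the subPath name.

```python
import os
import tempfile


def resolve_subpath_naively(volume_dir, sub_path):
    """Join the volume directory and subPath with no symlink checks (hypothetical helper)."""
    return os.path.join(volume_dir, sub_path)


def demo():
    # Stand-ins for the host: a "volume" directory and a secret that lives outside it.
    host = tempfile.mkdtemp(prefix="host-")
    volume_dir = os.path.join(host, "volume")
    secret_dir = os.path.join(host, "secrets")
    os.mkdir(volume_dir)
    os.mkdir(secret_dir)
    with open(os.path.join(secret_dir, "token"), "w") as f:
        f.write("host-only")

    # A first container plants a symbolic link whose name matches the subPath.
    os.symlink(secret_dir, os.path.join(volume_dir, "data"))

    # The host later resolves the subPath; the link is followed on the host side.
    mount_source = resolve_subpath_naively(volume_dir, "data")
    resolved = os.path.realpath(mount_source)
    with open(os.path.join(mount_source, "token")) as f:
        return resolved, f.read()
```

Calling demo() returns the resolved location, which is the secrets directory outside the volume, along with the contents of the file that was supposed to be out of reach.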

Šafránek demonstrated the flaw. He showed that it allows access to the root directory of the host, which means that the whole node is compromised. For example, it gives access to the Docker socket so an attacker can run anything in the containers, access any secrets used by the containers, and more. All of that comes about because Kubernetes does not check for symbolic links.

Working out a solution

The problem was reported just before KubeCon Austin, so the PST brainstormed on solutions at that gathering. The first, "naive", idea was simply to resolve the full path, then validate that it still points inside the volume. But there is a time of check to time of use (TOCTTOU) race condition in that scheme. The user can modify the directories after the check but before they are handed to the container runtime.
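A rough Python sketch of that naive approach makes the window visible. The helper names is_inside() and check_then_use(), and the between callback that stands in for an attacker who wins the race, are hypothetical; the real kubelet code is written in Go and is not reproduced here.

```python
import os
import tempfile


def is_inside(volume_dir, path):
    """Check: does the fully resolved path still live under the volume?"""
    real_volume = os.path.realpath(volume_dir)
    real_path = os.path.realpath(path)
    return os.path.commonpath([real_volume, real_path]) == real_volume


def check_then_use(volume_dir, sub_path, between=None):
    """Validate the path, then use it; `between` simulates the attacker's window."""
    target = os.path.join(volume_dir, sub_path)
    if not is_inside(volume_dir, target):       # time of check
        raise PermissionError("subPath escapes the volume")
    if between is not None:
        between()                               # the attacker acts here
    return os.path.realpath(target)             # time of use (stand-in for the mount)


def demo():
    host = tempfile.mkdtemp(prefix="host-")
    volume_dir = os.path.join(host, "volume")
    outside = os.path.join(host, "outside")
    data = os.path.join(volume_dir, "data")
    os.makedirs(data)
    os.mkdir(outside)

    def attacker():
        # Swap the already-validated directory for a link pointing outside the volume.
        os.rmdir(data)
        os.symlink(outside, data)

    used = check_then_use(volume_dir, "data", between=attacker)
    return used, is_inside(volume_dir, used)
```

The check passes because the path is an ordinary directory when it is examined, yet the path that ends up being used resolves outside the volume. In a real attack the swap would be timed rather than invoked through a callback.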

The next idea was to freeze the filesystem while validating the path and handing it to the container runtime. For Windows, CreateFile() can be used to lock each component of the path until after the runtime mounts it, but something different was needed for Linux. Bind mounting the volume to some Kubernetes directory, outside of user control, and then handing that off is a safe way to get it to the runtime, but it still leaves race conditions. There are two windows in which a user could switch a path component to a symbolic link: after any symbolic links have been resolved, and after the path has been validated to remain inside the volume.

The /proc filesystem provided the clue that led to the solution that was actually implemented. The links under /proc/PID/fd can reliably be used to bind mount the file descriptor corresponding to the final component of the subPath. The volume directory is opened, then each component of the subPath is opened in turn using openat(), with symbolic links not followed, while validating that the path is still within the volume. The /proc/PID/fd entry for the final component is then bind mounted to a safe location and handed off to the container runtime. That eliminates the races and does not depend on the underlying filesystem type.
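The traversal can be sketched in Python, where os.open() with the dir_fd argument maps to openat(). This is a minimal illustration under stated assumptions, not the Kubernetes fix itself: the function names open_subpath_nofollow() and proc_fd_path() are invented for the example, rejecting ".." outright is a simplification, and the bind mount (which requires privileges) is not performed. The additional containment validation that the article mentions is also left out.

```python
import os


def open_subpath_nofollow(volume_dir, sub_path):
    """Open sub_path below volume_dir one component at a time, refusing symlinks.

    Returns a directory file descriptor for the final component; the caller
    owns it and must close it. (Hypothetical helper for illustration.)
    """
    parts = [p for p in sub_path.split("/") if p not in ("", ".")]
    if ".." in parts:
        raise ValueError("subPath must not contain '..'")

    flags = os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW
    # Assumption: the volume directory itself is created by a trusted party.
    current = os.open(volume_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        for name in parts:
            # openat(): the lookup is relative to the already-open parent, and
            # O_NOFOLLOW makes it fail if this component is a symbolic link.
            nxt = os.open(name, flags, dir_fd=current)
            os.close(current)
            current = nxt
    except BaseException:
        os.close(current)
        raise
    return current


def proc_fd_path(fd):
    """Path that a bind mount could use as its source (Linux-specific)."""
    return "/proc/%d/fd/%d" % (os.getpid(), fd)
```

Because each lookup starts from a descriptor that is already open, swapping a directory for a link after it has been opened no longer changes what the descriptor refers to; the path returned by proc_fd_path() names that same open directory.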

Making the fix

It took a fair amount of time to get the fix out there; there were lots of end-to-end tests that needed to be developed and run on both Linux and Windows. But, since Kubernetes is developed in the open, how could this fix be developed in secret? The answer is that a separate repository, kubernetes-security, was used. Only the PST can normally access it, but the PST can give temporary access to those working on the fix. Au and Šafránek lost their access after the fix was released; "we have no idea what's going on there now", Šafránek said.

The development and testing process is similar to that of the open kubernetes repository, but the logs of tests and such for kubernetes-security go to a private bucket that only Google employees can access. Šafránek works for Red Hat in Europe, so sometimes he had to wait for Au, who works for Google in the US, to wake up so that he could find out what went wrong for a test run.

The flaw was disclosed to third-party Kubernetes vendors on the closed kubernetes-distributors-announce mailing list one week before it was publicly disclosed. On March 12, 2018, CVE-2017-1002101 was announced, roughly four months after the bug was reported. Updated releases of Kubernetes 1.7, 1.8, and 1.9 were made that day. The timeline and more can be found in the public post-mortem document.

Au went over some "best practices" for avoiding these kinds of problems. To start with, do not run containers as the root user; containers running as another user only have the same access as that user. That can be enforced by using PodSecurityPolicy, though containers will still run in the root group; the upcoming RunAsGroup feature will address that shortcoming. The security policy can also be used to restrict volume access, though that would not have helped for this particular vulnerability.

Using sandboxed containers is something that is being investigated for future Kubernetes releases. Using gVisor or Kata Containers will provide another security boundary. That is in keeping with a core principle that there should be at least two security barriers around untrusted code. For this vulnerability, a sandbox could have prevented access to the host filesystem. Au said she expects to see some movement on container sandboxes over the next year or so.

She started her talk summary with a reminder to follow the project's security disclosure process. She also suggested that Kubernetes and other projects be "extra cautious" when handling untrusted paths. Symbolic-link races and TOCTTOU are well-known dangers in path handling. In addition, she recommended setting restrictive security policies and using multiple security boundaries.

In answer to a question, Au said that most of the four months was taken up by development and testing, some of which was slowed down by the end-of-year holiday break. About two weeks were taken up with the actual release process. When asked about what could improve for the next CVE, Šafránek said that getting access to the private logs is important; Au said that it is being worked on. She also pointed to the post-mortem document as a good source for improvement ideas.

[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Seattle for KubeCon NA.]

Posted Dec 19, 2018 21:33 UTC (Wed) by jra (subscriber, #55261)

Over many years I've come to the conclusion that symlinks are a security nightmare whose utility is outweighed by the security problems they cause.

At the very least Linux needs a flag bit that allows applications to tell syscalls to completely ignore symlinks and fail pathnames containing them. O_NOFOLLOW doesn't do what anyone wants it to do, unfortunately.
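For readers unfamiliar with the limitation: O_NOFOLLOW only applies to the final component of a pathname, so links earlier in the path are still followed. A small Python sketch (the nofollow_demo() name and the directory layout are made up for the illustration) shows both cases:

```python
import os
import tempfile


def nofollow_demo():
    """Return (final_component_blocked, intermediate_component_followed)."""
    base = tempfile.mkdtemp()
    real_dir = os.path.join(base, "real")
    os.mkdir(real_dir)
    with open(os.path.join(real_dir, "file"), "w") as f:
        f.write("data")
    os.symlink(real_dir, os.path.join(base, "linkdir"))
    os.symlink(os.path.join(real_dir, "file"), os.path.join(base, "linkfile"))

    flags = os.O_RDONLY | os.O_NOFOLLOW

    # Symlink as the final component: open() fails.
    try:
        os.close(os.open(os.path.join(base, "linkfile"), flags))
        final_blocked = False
    except OSError:
        final_blocked = True

    # Symlink in the middle of the path: it is followed without complaint.
    try:
        os.close(os.open(os.path.join(base, "linkdir", "file"), flags))
        intermediate_followed = True
    except OSError:
        intermediate_followed = False

    return final_blocked, intermediate_followed
```

The first open fails while the second succeeds, which is why O_NOFOLLOW alone cannot reject a pathname that merely passes through a symbolic link.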

Posted Dec 19, 2018 22:02 UTC (Wed) by nopsled (guest, #129072)

Now you know why Plan 9 never had them.

Posted Dec 20, 2018 11:58 UTC (Thu) by domenpk (guest, #12382)

A shout out to

/proc/sys/fs/protected_symlinks
and
/proc/sys/fs/protected_hardlinks

I wonder if any distro has this enabled by default.

Posted Dec 20, 2018 12:17 UTC (Thu) by Trou.fr (subscriber, #26289)

Debian sid has '1' in both files.

Posted Dec 20, 2018 12:20 UTC (Thu) by zdzichu (subscriber, #17118)

Every modern distro has it enabled, except fringe ones and the ones disabling it on purpose:
https://github.com/systemd/systemd/blob/master/sysctl.d/5...

Posted Dec 20, 2018 18:15 UTC (Thu) by rweikusat2 (subscriber, #117920)

Disabling useful system features because application authors can't be bothered^W"trusted" to avoid simple coding mistakes which have been publicly known for an eternity isn't particularly "modern". That's nothing but "the BSD UNIX security gold standard" since some time in the late 1980s.

Posted Dec 20, 2018 19:49 UTC (Thu) by jerojasro (guest, #98169)

It might be useful, but if it is so easily misused, time and time and again, maybe it's not worth keeping.

Your approach of demanding "complete avoidance of mistakes" from application authors does not work in the real world, as recurring security issues stemming from the same "feature" show. It's way easier to fix the environment and eradicate the issue in a single place, instead of educating tons of developers and expecting them to remember and never make the same mistake again...

(FWIW, I'm mostly paraphrasing/matching what I read at: https://rachelbythebay.com/w/2018/05/13/dates/ )

Posted Dec 20, 2018 19:59 UTC (Thu) by rweikusat2 (subscriber, #117920)

This is not about "misuse" of anything, the underlying problem is called "a TOCTOU race", the pattern of that being

application A verifies that condition X holds
application B changes the state of the system such that X isn't true anymore
application A acts on the presumption that X is still true

And I would indeed "demand" complete avoidance of that if I was in the position to demand anything here, as there's nothing particularly difficult about understanding that this approach is broken.

Posted Dec 20, 2018 22:23 UTC (Thu) by jerojasro (guest, #98169)

You are correct, we should not let TOCTOU races slip past. Still, some of us do, for whatever reason, and it's better to just make that error possibility disappear, whenever possible, than struggle with every new person who doesn't yet have the instinct/skill to identify TOCTOU issues, wait to be bitten by yet another instance of the same security issue, and say: "why did this happen, this is a TOCTOU race, there is nothing particularly difficult about understanding it".

Yes, we should pay attention, learn our craft, be mindful. But even so, better to, as Rachel (the author of the article I posted) says, design the potential for error out of the system.

Posted Dec 21, 2018 21:22 UTC (Fri) by rweikusat2 (subscriber, #117920)

There's no possibility of such a thing "slipping past": The broken code has to be intentionally written in the broken way. And please note that I do mean intentionally here. More often than not, the reason why the broken code got written is probably "It'll work almost all of the time!".

Posted Dec 21, 2018 2:52 UTC (Fri) by dvdeug (guest, #10998)

Functional programming goes so far as to make changing the state of the system impossible to avoid such problems in multithreaded programs. In an environment without memory protection, every single if statement offers a chance for such a race. In a multiprocess environment, every single use of fstat offers a chance for such a race. If you want to demand complete avoidance for that, show me how you'd write du so that it accurately offers the size of a directory's contents at some snapshot in time. The only way I know how to write du in a way that would avoid that problem would be to lock the system while du is running, which is not something that an ordinary program can do.

Posted Dec 21, 2018 23:09 UTC (Fri) by rweikusat2 (subscriber, #117920)

du does not suffer from this problem: It provides approximate disk usage information[*]. But it doesn't act on this information.

BTW, the actual Kubernetes problem was even sillier than a TOCTOU race as the faulty code didn't even attempt to verify that the untrustworthy input it acted upon was valid.

[*] Assuming the whole system was locked while it ran, the information would still only be approximate as it could (and likely would) change as soon as the system was again unlocked. Hence, there's no point in even trying as the result is unreliable either way.

Posted Dec 22, 2018 6:32 UTC (Sat) by dvdeug (guest, #10998)

In the context with which we are talking, du does not provide approximate disk usage information. An attacker could get it to return almost any value by shuffling files around behind its back. At least a locked version would not allow an attacker to move a file around behind du's back.

If one doesn't act on provided information, it was useless to provide it. If one does act on it, in any way, there's a potential for a race condition. You said "And I would indeed "demand" complete avoidance of that if I was in the position to demand anything here, as there's nothing particularly difficult about understanding that this approach is broken." If the approach is broken and should be completely avoided, show me how to avoid it here, or argue that du, as well as virtually every other file tool* for any multiprocessing system, is inherently broken. Otherwise you've got to accept that complete avoidance is impractical.

* For example, rm; an attacker could potentially watch for rm to be run, move the file out of the way, and move it back after rm ran. The timing would be hard, but it's at least theoretically possible. Should rm be avoided?

Posted Dec 27, 2018 19:41 UTC (Thu) by rweikusat2 (subscriber, #117920)

du does provide approximate disk usage information. It (the program) doesn't act on this information, hence, it's not susceptible to a TOCTOU-based attack. What some kind of operator might or might not do based on approximate information provided by du is a different conversation but as operators aren't usually programmed, worrying about how to program them securely is ... eh ... a bit off (aka "a red herring"). Ditto for rm.

Posted Dec 27, 2018 23:32 UTC (Thu) by dvdeug (guest, #10998)

Saying a filesystem is empty except for a few directories when it in fact had several GB of data in it before and during du's run is not "approximate".

Operators are often programs, frequently written in shell. If they use rm or du, they are subject to these types of attack. Even when operators are human, the fact that they literally cannot operate these programs without getting into these race conditions is a concern if this approach is clearly broken.

Posted Dec 20, 2018 21:50 UTC (Thu) by joncb (guest, #128491)

But by that logic it's even more important that we remove malloc/free as soon as possible. One of the biggest categories of security vulnerabilities is variants on buffer overruns and we've been unsuccessfully trying to get programmers to deal with them correctly for well over 30 years.

Symlinks are such an important part of operating system functionality that I'm not sure how much Linux software would still run if you removed them... including systemd.

Having said that, improving the ergonomics of certain calls to turn them into pits of success instead of pits of failure seems like a valuable endeavor, but I don't entirely know what that means from a practical standpoint.

Posted Dec 20, 2018 22:07 UTC (Thu) by sheepgoesbaaa (guest, #98005)

"pits of success" is a keeper

Posted Dec 21, 2018 0:56 UTC (Fri) by jra (subscriber, #55261)

Problem is that application developers usually assume that pathnames they didn't create or modify themselves don't change on the fly underneath.

And you know what - they're mostly right in normal use cases :-). Most "normal" programmers don't expect a resource to get altered outside of their own program.

This is the core problem with people getting MT-programming right. Most people's minds just don't work that way (paranoia about resource modification between every possible access).

This is why the handle-based approach got retrofitted into POSIX via the XXXat() interfaces to get around the problem with the normal interfaces. The problem was the old interfaces weren't removed, and so everyone mostly kept using them (they work, right!).

Handle-based pathname traversal (or chdir() into a parent directory to pin the inode) is the only way to make things safe on a filesystem with arbitrary symlinks, but almost no application code does that correctly (Samba didn't for the longest time, does now).

If you have an interface that's really hard to use safely, I'm going to argue it's the interface that's the problem, not the application programmers.

Posted Dec 28, 2018 6:23 UTC (Fri) by cyphar (subscriber, #110703)

> This is why the handle-based approach got retrofitted into POSIX via the XXXat() interfaces to get around the problem with the normal interfaces.

Unfortunately just using *at(2) isn't really sufficient. You would need to do full path lookups (as in at least one openat(2) for each component of your path) in userspace with some pretty ugly checking (fstatat(2) or potentially readlink("/proc/self/fd/$foo")) in order to verify you haven't been thrown out of where you expect.

This is why I am working on adding O_BENEATH and similar openat(2) flags so the kernel will do the checks for you[1] (since the kernel can actually do checks somewhat-atomically within VFS). Hopefully it'll remove some of these really frustrating hurdles.

[1]: https://lwn.net/Articles/767547/

Posted Dec 20, 2018 17:41 UTC (Thu) by cyphar (subscriber, #110703)

> At the very least Linux needs a flag bit that allows applications to tell syscalls to completely ignore symlinks and fail pathnames containing them. O_NOFOLLOW doesn't do what anyone wants it to do, unfortunately.

This is precisely what I'm trying to do with the O_THISROOT patchset (which includes things like O_NOSYMLINKS and O_BENEATH) -- in fact these Kubernetes CVEs (and the fairly dodgy way they were fixed) inspired me to work on it (among other reasons).

Posted Jan 7, 2019 21:13 UTC (Mon) by zoobab (guest, #9945)

"The Kubernetes project first became aware of the problem from a GitHub issue that was created on November 30, 2017. It gave full detail of the bug and was posted publicly. That is not the proper channel for reporting Kubernetes security bugs, Au stressed. Luckily, a security team member saw the bug report and cleared out all of the details, moving it to a private issue tracker. There is a documented disclosure process for the project that anyone finding a security problem should follow, she said."

At least next time, post it publicly on a server where nobody else can intervene to censor it.