The difficulty of safe path traversal
Aleksa Sarai, as the maintainer of the runc container runtime, faces a constant battle against security problems. Recently, runc has seen another instance of a security vulnerability that can be traced back to the difficulty of handling file paths on Linux. Sarai spoke at the 2025 Linux Plumbers Conference (slides; video) about some of the problems runc has had with path-traversal vulnerabilities, and asked people to please use libpathrs, the library that he has been developing for safe path traversal.
Sarai began by defining what he meant by path safety. There are two kinds of path safety that are relevant to runc, he said. The first is "regular" path safety, which applies to any application working with files: when operating on a path, one of the components might change unexpectedly. For example, a program could be reading several files from a directory using absolute paths, only for a directory in the middle of the path to be changed, causing the program to see a mixture of files from different directories. That kind of time-of-check-to-time-of-use (TOCTTOU) error comes up all of the time in path-handling code. He shared a slide showing 14 different CVEs in runc since 2017, all of which involved this kind of problem. LWN covered one in 2024 and one in 2019 that were particularly noteworthy. The second kind of path safety deals with the peculiarities of virtual filesystems, and needs to be handled separately, Sarai added.
There are some partial solutions to regular path safety, Sarai said. The openat2() system call, for example, lets programmers specify to the kernel how it should handle certain kinds of ambiguities. On the other hand, openat2() is often blocked by seccomp(), since one of its arguments is a pointer to a structure that seccomp filters cannot inspect. So, the only real alternative is to implement path traversal in user space, by opening each directory along the path by hand. That is "quite finicky, but you can do it." The Go programming language recently added standard library support for doing that kind of traversal, for instance. Sarai's recommendation is to use libpathrs, the MPL-and-LGPL-licensed Rust library (with C bindings) that he has written to do this delicate dance correctly.
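To give a feel for what "opening each directory along the path by hand" involves, here is a minimal Python sketch. It is not how libpathrs or Go's standard library implement traversal; the function name open_beneath() and its policy (reject absolute paths, "..", and every symbolic link) are assumptions made for illustration, and it ignores special files, mount points, and the many other cases a real library has to handle.

```python
import os


def open_beneath(root_fd, path):
    """Open `path` below `root_fd`, one component at a time.

    Hypothetical helper: refuses absolute paths, ".." components, and any
    component that is a symbolic link. Returns a new file descriptor.
    """
    if path.startswith("/"):
        raise ValueError("absolute paths are not allowed")
    current = os.dup(root_fd)
    try:
        for part in path.split("/"):
            if part in ("", "."):
                continue
            if part == "..":
                raise ValueError('".." components are not allowed')
            # O_NOFOLLOW makes the open fail with ELOOP if `part` is a
            # symbolic link, so a link can never redirect the walk.
            nxt = os.open(part, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=current)
            os.close(current)
            current = nxt
        return current
    except BaseException:
        os.close(current)
        raise
```

Because every step is resolved relative to a descriptor that is already open, renaming or replacing a directory higher up the path cannot change which directory the next step is looked up in.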
The second kind of path safety that runc has to deal with, above and beyond what most applications deal with, is what Sarai calls "strict" path safety. Some virtual filesystems, such as procfs, require extra attention to detail to use safely. On a normal filesystem, one does not typically care which exact inode one opens, so long as it is for a file located at the right path. But on procfs, it's important to operate on the exact file the program was expecting: "overmounts or fake mounts can trick us into doing dangerous things." For example, a program that reexecutes itself using /proc/self/exe would be quite broken if a user bind-mounted a different file into that path.
This isn't a new problem. For regular files, one can use the RESOLVE_NO_XDEV option to openat2() to avoid crossing any filesystem boundaries. But that doesn't work for procfs's magic links (such as /proc/self/exe) "for some reason." [Sarai later wrote to me to clarify that the reason is actually straightforward: since /proc/self/exe refers to an executable file on another file system, of course RESOLVE_NO_XDEV blocks it. The trick is making sure that the magic link hasn't been overmounted with a reference to the wrong other filesystem.] The solution is to use the filesystem mount API to obtain a file descriptor for the root of procfs that no other process can interfere with.
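For readers who have not used openat2() directly, the following Python sketch shows roughly what passing resolve flags looks like. The wrapper function, the OpenHow class name, and the use of system call number 437 (correct on x86-64; assumed here for other architectures) are illustrative assumptions, not something shown in the talk; the RESOLVE_* values are the ones from the kernel's UAPI header. It does not demonstrate the mount-API approach for procfs, which needs privileges that an example cannot assume.

```python
import ctypes
import errno
import os
import sys

SYS_OPENAT2 = 437  # assumption: x86-64 and architectures sharing its numbering
RESOLVE_NO_XDEV = 0x01        # do not cross mount points
RESOLVE_NO_MAGICLINKS = 0x02  # do not follow procfs-style magic links
RESOLVE_NO_SYMLINKS = 0x04    # do not follow any symbolic links
RESOLVE_BENEATH = 0x08        # never resolve to something outside dir_fd


class OpenHow(ctypes.Structure):
    """Mirror of the kernel's struct open_how (three 64-bit fields)."""
    _fields_ = [
        ("flags", ctypes.c_uint64),
        ("mode", ctypes.c_uint64),
        ("resolve", ctypes.c_uint64),
    ]


def openat2(dir_fd, path, flags, resolve, mode=0):
    """Invoke openat2(2) via syscall(); raises OSError on failure."""
    if not sys.platform.startswith("linux"):
        raise OSError(errno.ENOSYS, "openat2 is Linux-only")
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    how = OpenHow(flags=flags, mode=mode, resolve=resolve)
    fd = libc.syscall(
        ctypes.c_long(SYS_OPENAT2),
        ctypes.c_int(dir_fd),
        ctypes.c_char_p(os.fsencode(path)),
        ctypes.byref(how),
        ctypes.c_size_t(ctypes.sizeof(how)),
    )
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return fd
```

A caller would write something like openat2(dir_fd, "some/file", os.O_RDONLY, RESOLVE_BENEATH | RESOLVE_NO_XDEV) and be prepared for ENOSYS or EPERM on systems where the kernel is too old or a seccomp filter blocks the call, which is exactly the situation that forces the user-space fallback.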
That solution doesn't work inside nested user namespaces, however. So, programs that need to work inside nested namespaces (such as runc, when people want to nest containers) have to resort to some extremely delicate checks that particular paths haven't been overmounted.
Sarai expected some people to be skeptical that this was a real problem, on the basis that only root can mount things. That's not really true in the context of runc, though. There are, of course, mount namespaces to deal with. But there is also the fact that runc is used as a backend by a large number of different high-level containerization programs. Those programs often give users a lot of flexibility to configure container mounts, which can result in unprivileged mounting into containers that runc is working with, sometimes in ways that permit a program to escape the container. Almost all of the vulnerabilities in runc have been misconfiguration bugs, he said. Sometimes, runc can prevent those by recognizing an obviously invalid configuration, and sometimes the higher-level programs need to be patched to avoid generating the problematic configuration in the first place.
He then went through a series of pop quizzes to illustrate the difficulties that come up in configuring a container. Two of runc's most recent vulnerabilities, CVE-2025-31133 and CVE-2025-52565, involved using a mount and a symbolic link to trick runc into giving a container access to /proc/sys/kernel/core_pattern, the procfs file used to configure the kernel's core-dump handler. Writing into that file could let processes escape from inside a container.
Sarai didn't go into detail on why setting the core-dump handler could enable an escape in the talk, but he was probably referring to the problem Christian Brauner highlighted in May. When the kernel launches the configured core-dump handler, it runs with full privileges in the root namespace; if the container can configure the core-dump handler to be a file that it controls, this allows it to effectively take over the system.
The solution in runc was to use much stricter validation of special inodes, to move to libpathrs for path traversal, and to use TIOCGPTPEER to validate that console files are really console files and not sneakily overmounted regular files.
But the work also brought to light some potential kernel changes that could make writing this kind of path-handling code much safer. Sarai suggested adding a RESOLVE_NO_DOTDOT option to openat2(), to ban the use of ".." in paths at all. He said that it would also help to block all overmounts of procfs magic links; most have been blocked since kernel version 6.12, but "most" and "all" are different prospects in the world of security.
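The difference between the existing RESOLVE_BENEATH flag and the proposed RESOLVE_NO_DOTDOT came up again in the comments below. The following Python sketch models it purely lexically; the two function names are hypothetical, RESOLVE_NO_DOTDOT does not exist in any released kernel, and the real RESOLVE_BENEATH check happens inside the kernel during resolution (so it also accounts for symbolic links, which this model ignores).

```python
def escapes_lexically(path):
    """Rough model of what RESOLVE_BENEATH rejects: paths that are
    absolute or whose ".." components climb above the starting directory."""
    if path.startswith("/"):
        return True
    depth = 0
    for part in path.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            depth -= 1
            if depth < 0:
                return True
        else:
            depth += 1
    return False


def contains_dotdot(path):
    """Rough model of the proposed RESOLVE_NO_DOTDOT: any ".." at all."""
    return ".." in path.split("/")
```

Under this model "../foo" is rejected by both checks, while "foo/../bar" is rejected only by contains_dotdot().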
For people writing user-space applications, Sarai's recommendation is to switch to a more file-descriptor-based design, rather than relying on paths. Ideally, use openat2() or libpathrs to handle path manipulation. Every system call that works with path names is potentially dangerous, he said.
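A small experiment illustrates why a descriptor-based design helps. In the Python sketch below (the directory layout, file contents, and function names are invented for the example), a directory is opened once, then swapped out from under the program the way an attacker might; lookups relative to the descriptor still reach the original directory, while a lookup by path reaches the replacement.

```python
import os
import tempfile


def read_via_dirfd(dir_fd, name):
    """Read a file by name relative to an already-open directory."""
    fd = os.open(name, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dir_fd)
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)


def demo():
    base = tempfile.mkdtemp()
    conf = os.path.join(base, "conf")
    os.mkdir(conf)
    with open(os.path.join(conf, "a"), "wb") as f:
        f.write(b"trusted")

    # The program pins the directory by opening it once.
    dir_fd = os.open(conf, os.O_RDONLY | os.O_DIRECTORY)

    # Someone else replaces the directory at the same path.
    os.rename(conf, os.path.join(base, "conf.old"))
    os.mkdir(conf)
    with open(os.path.join(conf, "a"), "wb") as f:
        f.write(b"attacker")

    by_fd = read_via_dirfd(dir_fd, "a")          # still the original directory
    with open(os.path.join(conf, "a"), "rb") as f:
        by_path = f.read()                       # the replacement directory
    os.close(dir_fd)
    return by_fd, by_path
```

Calling demo() returns (b"trusted", b"attacker"): the descriptor keeps referring to the directory that was opened, whatever happens to the path afterward.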
One member of the audience asked whether Sarai had any advice for safely dealing with paths in cgroupfs (the virtual filesystem for manipulating control groups). Sarai replied that the RESOLVE_NO_XDEV flag to openat2() was quite helpful for making sure that one stays within cgroupfs. Version 1 control groups are "annoying," and he didn't have much advice for dealing with them correctly. For version 2 control groups, however, he recommended opening a file descriptor for the root of the filesystem and checking the inode number. If that is correct, it's much more certain that the program is interacting with the real cgroupfs.
Another member of the audience asked how the CVEs that Sarai had highlighted were discovered. "Not through fuzzing or anything, just people looking," he answered. In 2018, he "provided a very general script" for creating scenarios that can result in path-traversal problems. Since then, people have been poking at runc and slowly evolving the attacks to reach increasingly obscure corner cases. Each year's CVEs tend to be an evolution of the previous year's, he said.
Someone else asked whether he thought that using virtual filesystems such as procfs and cgroupfs to present kernel interfaces was a mistake. "Well, there's several attacks that could never happen if it were [designed using system calls instead], and that makes you wonder, right?" Sarai replied. On the other hand, virtual filesystems do provide some nice benefits. The LXC container runtime has a fake procfs implementation to help control-group-unaware programs to identify process and memory limits, for example, he said.
That didn't satisfy some people, who pointed out that there are also ways to intercept system calls. At that point, however, the session ran out of time and the discussion spilled over into the hallway.
[ Thanks to the Linux Foundation, LWN's travel sponsor, for helping with travel to Tokyo to cover the Linux Plumbers Conference. ]
| Index entries for this article | |
|---|---|
| Conference | Linux Plumbers Conference/2025 |
Mounting could be a bit more regulated
Posted Jan 6, 2026 22:18 UTC (Tue)
by epa (subscriber, #39769)
[Link] (3 responses)
Mounting could be a bit more regulated
Posted Jan 7, 2026 20:22 UTC (Wed)
by cyphar (subscriber, #110703)
[Link]
I guess this is (or would eventually be) possible to implement with BPF-LSM but it would obviously break most real systems. And unfortunately, symlinks are still an issue (though this is basically solved with openat2(2)) -- though of course plan9 did away with those too (and they were probably right to).
Mounting could be a bit more regulated
Posted Jan 8, 2026 8:24 UTC (Thu)
by taladar (subscriber, #68407)
[Link] (1 responses)
That said, it is not really necessary to allow mounting over non-empty directories to achieve this, or indeed for mount points to just be regular directories that can be written into when the mount that was supposed to be there failed to mount properly. Something similar to the git model for submodules where "mounts" are essentially an entirely different type of file would work for mount points and might fix some other issues too.
Mounting could be a bit more regulated
Posted Jan 8, 2026 15:07 UTC (Thu)
by epa (subscriber, #39769)
[Link]
I agree that it's a strength to let you arrange your filesystem on physical storage as you wish; I just think it could be implemented in a more organized way. Like how device special files could, in principle, be created anywhere on the filesystem (at least in classical Unix) but in practice it costs nothing to make them always exist under /dev -- and if you did want to put a device file elsewhere in the namespace for whatever bizarre reason, you could always symlink to it.
Openat2 resolve_beneath
Posted Jan 7, 2026 1:14 UTC (Wed)
by MrWim (subscriber, #47432)
[Link] (3 responses)
How is that different to RESOLVE_BENEATH?
Openat2 resolve_beneath
Posted Jan 7, 2026 19:54 UTC (Wed)
by cyphar (subscriber, #110703)
[Link] (2 responses)
It seems this bit of my talk was a bit confusing, sorry about that -- I'm going to send a mail to the editor to suggest a slight clean-up, but in short:
[...] openat2(2) there was a push to have RESOLVE_BENEATH block "..". My argument at the time (which was quoted in LWN) was that this would break a lot of real use-cases because symlinks containing ".." are very widespread.
[...] /proc resolver we cannot support walking into ".." so for parity we should probably have this also be the case for openat(2) (this is why I brought it up in the talk)... but this is not really true in the general case (which is why I didn't do it for RESOLVE_BENEATH).
Openat2 resolve_beneath
Posted Jan 7, 2026 20:20 UTC (Wed)
by daroc (editor, #160859)
[Link] (1 responses)
Is the difference that RESOLVE_BENEATH permits using '..' as long as doing so doesn't go up past the parent directory, but RESOLVE_NO_DOTDOT would ban '..' at all? So they would both block '../foo', but only the latter would block 'foo/../bar'?
Openat2 resolve_beneath
Posted Jan 7, 2026 20:23 UTC (Wed)
by cyphar (subscriber, #110703)
[Link]
Security irony
Posted Jan 7, 2026 9:34 UTC (Wed)
by kleptog (subscriber, #1183)
[Link] (5 responses)
There is a certain irony in blocking, for security, a system call that would make it easier to write secure code.
Security irony
Posted Jan 7, 2026 14:06 UTC (Wed)
by chris_se (subscriber, #99706)
[Link] (3 responses)
I think the main issue here is probably inertia in older code / older configurations, rather than security. While it's true that you can't use seccomp to match against flags and mode of openat2() (in contrast to open()/openat() where that's possible), in a container context I don't really see why you'd restrict the flags and/or mode of the open()-family of system calls. And for applications that apply sandboxes for privilege separation themselves they'd probably just disallow open() entirely.
The only reason I could think for disallowing openat2() but not open() would also lead to disallowing openat() - namely that in specific circumstances they could be used to escape specific incomplete attempts at sandboxes. But in that case the reason that openat2() takes a pointer is not relevant here, because openat() would also be affected. I don't think this makes much sense in a container context though, because container runtimes actually perform proper sandboxing steps so that having openat()/openat2() inside the container is not an issue.
Security irony
Posted Jan 7, 2026 20:10 UTC (Wed)
by roc (subscriber, #30627)
[Link]
> There is a certain irony in blocking, for security, a system call that would make it easier to write secure code.
Security irony
Posted Jan 7, 2026 20:17 UTC (Wed)
by cyphar (subscriber, #110703)
[Link] (1 responses)
No, it is unfortunately an active decision (systemd actively does this). The main reason is that of paranoia -- currently you cannot restrict flag arguments at all with seccomp which makes security-conscious people very antsy. If we ever added some horrific security vulnerability to openat(2) (however unlikely that is) they would not be able to disable it without disabling all usage of openat(2).
Security irony
Posted Jan 7, 2026 20:28 UTC (Wed)
by cyphar (subscriber, #110703)
[Link]
> If we ever added some horrific security vulnerability to openat(2) (however unlikely that is) they would not be able to disable it without disabling all usage of openat(2).
And of course, I meant openat2(2) here. I also do think this is a reasonable concern in general -- though in practice nobody actually blocks open(2) flags because the primary risk with open(2) is path access, which is meant to be mediated by LSMs and regular DAC/ACL permissions.
Security irony
Posted Jan 7, 2026 20:00 UTC (Wed)
by cyphar (subscriber, #110703)
[Link]
It is a bit ironic but is kind of understandable. For what it's worth, at LPC 2024 I proposed an approach to enabling seccomp filtering of pointers for extensible structure arguments (which Kees seemed quite receptive to) and so this may be solved soon™ -- unfortunately I've was quite busy for most of 2025 (with all of this runc hardening stuff)... Also, a lot of this fallback stuff started being developed back when openat(2) was new and so it was a more pressing thing. These days we would mostly need it to support older machines.
Alternatives to path traversal
Posted Jan 7, 2026 13:37 UTC (Wed)
by iabervon (subscriber, #722)
[Link] (1 responses)
It's a pretty recent addition, but anything else the kernel could provide would obviously be even more recent (like, in the future).
Alternatives to path traversal
Posted Jan 7, 2026 19:42 UTC (Wed)
by cyphar (subscriber, #110703)
[Link]
Unfortunately not. PIDFD_SELF provides some useful information but it isn't a /proc subdir. I did propose something like this privately but this is too hard to do, so using fsopen(2) is what we ended up with. Unfortunately fsopen(2) alone is not enough (in the generic open_tree(2) case), you need to also layer this lookup stuff on top.
In particular, some features like /proc/self/fd/$n reopening or using /proc/self/exe still requires /proc magic-links, which entails doing all of this stuff. You can imagine implementing these hacks with proper kernel interfaces in a different way (which I have proposals and prototypes for) but that's all future stuff and we need to support older kernels anyway...
And ultimately, sysctls need /proc/sys which (I suspect) is probably never going to be switched to a non-filesystem interface.
Solution to a non-problem
Posted Jan 8, 2026 23:48 UTC (Thu)
by milesrout (subscriber, #126894)
[Link] (1 responses)
>Sarai expected some people to be skeptical that this was a real problem, on the basis that only root can mount things. That's not really true in the context of runc, though. There are, of course, mount namespaces to deal with. But there is also the fact that runc is used as a backend by a large number of different high-level containerization programs. Those programs often give users a lot of flexibility to configure container mounts, which can result in unprivileged mounting into containers that runc is working with, sometimes in ways that permit a program to escape the container. Almost all of the vulnerabilities in runc have been misconfiguration bugs, he said. Sometimes, runc can prevent those by recognizing an obviously invalid configuration, and sometimes the higher-level programs need to be patched to avoid generating the problematic configuration in the first place.
Don't mount namespaces only allow you to mount inside the namespace. Why would you try to run code inside that namespace?
It isn't a runc vulnerability, in any circumstances, for someone to misconfigure it and that to lead to a security vulnerability. That is a vulnerability in their configuration.
Solution to a non-problem
Posted Jan 9, 2026 7:25 UTC (Fri)
by cyphar (subscriber, #110703)
[Link]
Believe me, I understand the skepticism. In theory all of this stuff shouldn't be necessary, and if I could've gotten out of working on this by rejecting this attack model I would've done so several years ago. Unfortunately there are two compounding problems with doing so:
[...] runc init by runc) as part of finishing up the configuration before execve(2)-ing the user's code. During this process we need to write to /proc and do a bunch of other similar operations. The three options we have are:
[...] /proc (which can lead to container escapes if a bad configuration can reference the file descriptor (this is what LXC does) -- see CVE-2024-21626 for an example of this style of attack (the key takeaway from this is that PR_SET_DUMPABLE and O_CLOEXEC is not actually sufficient),
[...] /proc (this is what runc did until now), which leads to the kinds of issues I talked about in this talk; or
[...]
[...] Dockerfile because there is a RUN --mount=... directive which can be combined with multi-stage builds (to hit the race condition). This makes even something as simple as building an image enough to trigger these kinds of exploits (of course, they are also exploitable from basic Kubernetes pod configurations and reasonable-looking Podman/Docker invocations too).
runc has not had a security vulnerability that did not involve a malicious configuration since 2016, but we have to deal with issues that practically arise due to how these systems are put together. This is also the reason we cannot reject security issues that require user namespaces to be disabled -- most Kubernetes and Docker workloads do not use user namespaces, so we cannot just ignore them. I cannot put into words how much I wish this were not the case.
More openat2
Posted Jan 9, 2026 15:55 UTC (Fri)
by alip (subscriber, #170176)
[Link] (2 responses)
[1]: https://man.exherbo.org/syd.2.html#trace/deny_dotdot
[2]: https://man.exherbo.org/syd.7.html#Path_Resolution_Restri...
[3]: https://cgit.freebsd.org/src/tree/sys/kern/vfs_lookup.c#n351
More openat2
Posted Jan 9, 2026 19:07 UTC (Fri)
by cyphar (subscriber, #110703)
[Link] (1 responses)
Yes, this is also on my list of things to do. When I first wrote openat2(2) I didn't think it'd be necessary but I just recently encountered a case where it would've been really handy to have this capability...
More openat2
Posted Jan 10, 2026 12:43 UTC (Sat)
by cyphar (subscriber, #110703)
[Link]
That being said, you should still combine RESOLVE_NO_DOTDOT with RESOLVE_BENEATH because of absolute symlinks and also because RESOLVE_BENEATH includes a final sanity check in the kernel to detect escapes. RESOLVE_NO_DOTDOT is kinda helpful but I think people overestimate how useful blocking ".." entirely actually is -- the scoped lookup logic for RESOLVE_BENEATH is safe even though it allows "..".
