
The difficulty of safe path traversal

By Daroc Alden
January 6, 2026

LPC

Aleksa Sarai, as the maintainer of the runc container runtime, faces a constant battle against security problems. Recently, runc has seen another instance of a security vulnerability that can be traced back to the difficulty of handling file paths on Linux. Sarai spoke at the 2025 Linux Plumbers Conference (slides; video) about some of the problems runc has had with path-traversal vulnerabilities, and asked people to use libpathrs, the library that he has been developing for safe path traversal.

Sarai began by defining what he meant by path safety. There are two kinds of path safety that are relevant to runc, he said. The first is "regular" path safety, which applies to any application working with files: when operating on a path, one of the components might change unexpectedly. For example, a program could be reading several files from a directory using absolute paths, only for a directory in the middle of the path to be changed, causing the program to see a mixture of files from different directories. That kind of time-of-check-to-time-of-use (TOCTTOU) error comes up all of the time in path-handling code. He shared a slide showing 14 different CVEs in runc since 2017, all of which involved this kind of problem. LWN covered one in 2024 and one in 2019 that were particularly noteworthy. The second kind of path safety deals with the peculiarities of virtual filesystems, and needs to be handled separately, Sarai added.
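The directory-swap scenario can be reproduced in a few lines. The following is a minimal sketch, not code from the talk: the helper names and the temporary-directory layout are invented for illustration, and the "attacker" is simulated by a rename in the same process. It contrasts a read that re-resolves the path each time with a read that starts from a directory file descriptor opened before the swap.

```python
import os
import tempfile


def read_via_path(directory, name):
    """Path-based read: the directory name is resolved again on every call."""
    with open(os.path.join(directory, name)) as f:
        return f.read()


def read_via_dirfd(dir_fd, name):
    """Descriptor-based read: the lookup starts from an already-open directory."""
    fd = os.open(name, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dir_fd)
    with os.fdopen(fd) as f:
        return f.read()


def demo():
    base = tempfile.mkdtemp()
    target = os.path.join(base, "config")  # hypothetical directory name
    os.mkdir(target)
    with open(os.path.join(target, "a"), "w") as f:
        f.write("original")

    # Pin the directory before anything changes underneath us.
    dir_fd = os.open(target, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # Simulated race: a different directory appears at the same path.
        os.rename(target, os.path.join(base, "config.old"))
        os.mkdir(target)
        with open(os.path.join(target, "a"), "w") as f:
            f.write("swapped")

        return read_via_path(target, "a"), read_via_dirfd(dir_fd, "a")
    finally:
        os.close(dir_fd)
```

Calling demo() returns the path-based result first and the descriptor-based result second; only the second is guaranteed to come from the directory that was originally opened.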


There are some partial solutions to regular path safety, Sarai said. The openat2() system call, for example, lets programmers specify to the kernel how it should handle certain kinds of ambiguities. On the other hand, openat2() is often blocked by seccomp() filters: its flags are passed in a structure behind a pointer, which a seccomp filter cannot inspect. So, the only real alternative is to implement path traversal in user space, by opening each directory along the path by hand. That is "quite finicky, but you can do it." The Go programming language recently added standard library support for doing that kind of traversal, for instance. Sarai's recommendation is to use libpathrs, the MPL-and-LGPL-licensed Rust library (with C bindings) that he has written to do this delicate dance correctly.
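To give a sense of what "opening each directory along the path by hand" involves, here is a deliberately simplified sketch. It is not how libpathrs or Go's standard library work: the function name open_beneath() is hypothetical, and the sketch takes the blunt approach of refusing every symbolic link and every ".." component instead of resolving them safely, which real code usually cannot afford to do.

```python
import os
import tempfile


def open_beneath(root_fd, path, flags=os.O_RDONLY):
    """Open `path` below root_fd one component at a time.

    Simplification (an assumption of this sketch): absolute paths, ".."
    components, and symbolic links are all refused outright.
    """
    parts = [p for p in path.split("/") if p not in ("", ".")]
    if path.startswith("/") or ".." in parts or not parts:
        raise ValueError(f"unsafe or empty path: {path!r}")

    dir_fd = os.dup(root_fd)
    try:
        for name in parts[:-1]:
            # O_NOFOLLOW makes the open fail if this component is a symlink;
            # O_DIRECTORY makes it fail if it is not a directory.
            next_fd = os.open(
                name,
                os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW,
                dir_fd=dir_fd,
            )
            os.close(dir_fd)
            dir_fd = next_fd
        return os.open(parts[-1], flags | os.O_NOFOLLOW, dir_fd=dir_fd)
    finally:
        os.close(dir_fd)


def demo():
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "data"), "w") as f:
        f.write("hello")
    os.symlink("/", os.path.join(root, "escape"))  # symlink pointing outside

    root_fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
    results = {}
    try:
        with os.fdopen(open_beneath(root_fd, "sub/data")) as f:
            results["sub/data"] = f.read()
        for bad in ("escape/etc", "../outside", "/etc/passwd"):
            try:
                os.close(open_beneath(root_fd, bad))
                results[bad] = "opened"
            except (OSError, ValueError):
                results[bad] = "refused"
    finally:
        os.close(root_fd)
    return results
```

Because every lookup is relative to a descriptor that is already open, swapping a parent directory for a symbolic link mid-traversal causes an error instead of a silent redirection. Handling symbolic links and ".." correctly, rather than refusing them, is where the finicky part begins.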

The second kind of path safety that runc has to deal with, above and beyond what most applications deal with, is what Sarai calls "strict" path safety. Some virtual filesystems, such as procfs, require extra attention to detail to use safely. On a normal filesystem, one does not typically care which exact inode one opens, so long as it is for a file located at the right path. But on procfs, it's important to operate on the exact file the program was expecting: "overmounts or fake mounts can trick us into doing dangerous things." For example, a program that reexecutes itself using /proc/self/exe would be quite broken if a user bind-mounted a different file into that path.

This isn't a new problem. For regular files, one can use the RESOLVE_NO_XDEV option to openat2() to avoid crossing any filesystem boundaries. But that doesn't work for procfs's magic links (such as /proc/self/exe) "for some reason." [Sarai later wrote to me to clarify that the reason is actually straightforward: since /proc/self/exe refers to an executable file on another file system, of course RESOLVE_NO_XDEV blocks it. The trick is making sure that the magic link hasn't been overmounted with a reference to the wrong other filesystem.] The solution is to use the filesystem mount API to obtain a file descriptor for the root of procfs that no other process can interfere with.

That solution doesn't work inside nested user namespaces, however. So, programs that need to work inside nested namespaces (such as runc, when people want to nest containers) have to resort to some extremely delicate checks that particular paths haven't been overmounted.

Sarai expected some people to be skeptical that this was a real problem, on the basis that only root can mount things. That's not really true in the context of runc, though. There are, of course, mount namespaces to deal with. But there is also the fact that runc is used as a backend by a large number of different high-level containerization programs. Those programs often give users a lot of flexibility to configure container mounts, which can result in unprivileged mounting into containers that runc is working with, sometimes in ways that permit a program to escape the container. Almost all of the vulnerabilities in runc have been misconfiguration bugs, he said. Sometimes, runc can prevent those by recognizing an obviously invalid configuration, and sometimes the higher-level programs need to be patched to avoid generating the problematic configuration in the first place.

He then went through a series of pop quizzes to illustrate the difficulties that come up in configuring a container. Two of runc's most recent vulnerabilities, CVE-2025-31133 and CVE-2025-52565, involved using a mount and a symbolic link to trick runc into giving a container access to /proc/sys/kernel/core_pattern, the procfs file used to configure the kernel's core-dump handler. Writing into that file could let processes escape from inside a container.

Sarai didn't go into detail on why setting the core-dump handler could enable an escape in the talk, but he was probably referring to the problem Christian Brauner highlighted in May. When the kernel launches the configured core-dump handler, it runs with full privileges in the root namespace; if the container can configure the core-dump handler to be a file that it controls, this allows it to effectively take over the system.

The solution in runc was to use much stricter validation of special inodes, to move to libpathrs for path traversal, and to use TIOCGPTPEER to validate that console files are really console files and not sneakily overmounted regular files. But the work also brought to light some potential kernel changes that could make writing this kind of path-handling code much safer. Sarai suggested adding a RESOLVE_NO_DOTDOT option to openat2(), to ban the use of ".." in paths entirely. He said that it would also help to block all overmounts of procfs magic links; most have been blocked since kernel version 6.12, but "most" and "all" are different prospects in the world of security.
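The suggested RESOLVE_NO_DOTDOT would be stricter than the existing RESOLVE_BENEATH flag to openat2(): RESOLVE_BENEATH tolerates ".." as long as the lookup never climbs above the starting directory, while the new flag would reject ".." anywhere. The following purely lexical model illustrates the difference. The function names are hypothetical, and the model ignores symbolic links and concurrent renames, both of which the kernel's real resolution has to take into account.

```python
def escapes_start(path):
    """Lexical approximation of what RESOLVE_BENEATH rejects."""
    if path.startswith("/"):
        return True
    depth = 0
    for part in path.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            depth -= 1
            if depth < 0:  # would climb above the starting directory
                return True
        else:
            depth += 1
    return False


def contains_dotdot(path):
    """Lexical approximation of what RESOLVE_NO_DOTDOT would reject."""
    return ".." in path.split("/")


def compare(path):
    """Return (rejected by BENEATH model, rejected by NO_DOTDOT model)."""
    return escapes_start(path), contains_dotdot(path)
```

Under this model, compare("../foo") is rejected by both checks, compare("foo/../bar") only by the second, and compare("foo/bar") by neither.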

For people writing user-space applications, Sarai's recommendation is to switch to a more file-descriptor based design, rather than relying on paths. Ideally, use openat2() or libpathrs to handle path manipulation. Every system call that works with path names is potentially dangerous, he said.
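One small piece of a descriptor-based design is to open first and check afterward, on the descriptor, instead of checking a path and then opening it. The sketch below illustrates that pattern under stated assumptions: open_regular() is a hypothetical helper, it accepts only a single path component, and it treats "is a regular file" as the property being validated.

```python
import os
import stat
import tempfile


def open_regular(dir_fd, name):
    """Open `name` inside dir_fd, then verify on the descriptor itself
    that it refers to a regular file."""
    if "/" in name or name in ("", ".", ".."):
        raise ValueError(f"expected a single path component: {name!r}")
    # O_NOFOLLOW refuses a symlink as the final component; O_NONBLOCK
    # avoids blocking if the name turns out to be a FIFO.
    fd = os.open(name, os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK, dir_fd=dir_fd)
    try:
        # fstat() describes the object we actually opened, so nothing can
        # be swapped between the check and the use.
        if not stat.S_ISREG(os.fstat(fd).st_mode):
            raise OSError(f"{name!r} is not a regular file")
    except BaseException:
        os.close(fd)
        raise
    return fd


def demo():
    base = tempfile.mkdtemp()
    with open(os.path.join(base, "file"), "w") as f:
        f.write("data")
    os.symlink("file", os.path.join(base, "link"))
    os.mkdir(os.path.join(base, "dir"))

    dir_fd = os.open(base, os.O_RDONLY | os.O_DIRECTORY)
    results = {}
    try:
        for name in ("file", "link", "dir"):
            try:
                with os.fdopen(open_regular(dir_fd, name)) as f:
                    results[name] = f.read()
            except OSError:
                results[name] = "refused"
    finally:
        os.close(dir_fd)
    return results
```

The same shape applies to other properties (owner, device number, filesystem type): whatever is checked should be checked on the open descriptor, and every later operation should use that descriptor and not the path.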

One member of the audience asked whether Sarai had any advice for safely dealing with paths in cgroupfs (the virtual filesystem for manipulating control groups). Sarai replied that the RESOLVE_NO_XDEV flag to openat2() was quite helpful for making sure that one stays within cgroupfs. Version 1 control groups are "annoying," and he didn't have much advice for dealing with them correctly. For version 2 control groups, however, he recommended opening a file descriptor for the root of the filesystem and checking the inode number. If that is correct, it's much more certain that the program is interacting with the real cgroupfs.

Another member of the audience asked how the CVEs that Sarai had highlighted were discovered. "Not through fuzzing or anything, just people looking," he answered. In 2018, he "provided a very general script" for creating scenarios that can result in path-traversal problems. Since then, people have been poking at runc and slowly evolving the attacks to reach increasingly obscure corner cases. Each year's CVEs tend to be an evolution of the previous year's, he said.

Someone else asked whether he thought that using virtual filesystems such as procfs and cgroupfs to present kernel interfaces was a mistake. "Well, there's several attacks that could never happen if it were [designed using system calls instead], and that makes you wonder, right?" Sarai replied. On the other hand, virtual filesystems do provide some nice benefits. The LXC container runtime has a fake procfs implementation to help control-group-unaware programs to identify process and memory limits, for example, he said.

That didn't satisfy some people, who pointed out that there are also ways to intercept system calls. At that point, however, the session ran out of time and the discussion spilled over into the hallway.

[ Thanks to the Linux Foundation, LWN's travel sponsor, for helping with travel to Tokyo to cover the Linux Plumbers Conference. ]


Index entries for this article
Conference: Linux Plumbers Conference/2025



Mounting could be a bit more regulated

Posted Jan 6, 2026 22:18 UTC (Tue) by epa (subscriber, #39769) [Link] (3 responses)

The Unix model of mounting a filesystem anywhere in the existing filesystem hierarchy has always seemed a bit chaotic. I would prefer to say that all filesystems except the root must be mounted in a single place (under /mnt or /media or whatever - or perhaps one directory for each user) and then if you want you can symlink to them. That would mean one less weird case for userspace to worry about: you have to check for symlinks anyway in security-sensitive code, and they are much more obvious in directory listings, whereas crossing a mount point is hidden magic.

Mounting could be a bit more regulated

Posted Jan 7, 2026 20:22 UTC (Wed) by cyphar (subscriber, #110703) [Link]

I guess this is (or would eventually be) possible to implement with BPF-LSM but it would obviously break most real systems. And unfortunately, symlinks are still an issue (though this is basically solved with openat2(2)) -- though of course plan9 did away with those too (and they were probably right to).

Mounting could be a bit more regulated

Posted Jan 8, 2026 8:24 UTC (Thu) by taladar (subscriber, #68407) [Link] (1 responses)

I would see the ability to mount a different filesystem anywhere as one of the big strengths of Unix-systems, certainly much better than the alternative e.g. Windows uses where every application can be installed in different locations and stores its data in configurable places if the data is of significant size.

That said, it is not really necessary to allow mounting over non-empty directories to achieve this, or indeed for mount points to just be regular directories that can be written into when the mount that was supposed to be there failed to mount properly. Something similar to the git model for submodules where "mounts" are essentially an entirely different type of file would work for mount points and might fix some other issues too.

Mounting could be a bit more regulated

Posted Jan 8, 2026 15:07 UTC (Thu) by epa (subscriber, #39769) [Link]

Yes, I'd be happy with that too. I suggested reusing the machinery of symlinks because that already exists. If you see a symlink to /mnt/whatever then it's obviously a different volume, and if the volume isn't mounted you see it as a broken symlink (which again we already know how to handle).

I agree that it's a strength to let you arrange your filesystem on physical storage as you wish; I just think it could be implemented in a more organized way. Like how device special files could, in principle, be created anywhere on the filesystem (at least in classical Unix) but in practice it costs nothing to make them always exist under /dev -- and if you did want to put a device file elsewhere in the namespace for whatever bizarre reason, you could always symlink to it.

Openat2 resolve_beneath

Posted Jan 7, 2026 1:14 UTC (Wed) by MrWim (subscriber, #47432) [Link] (3 responses)

>Sarai suggested adding a RESOLVE_NO_DOTDOT option to openat2(), to prevent traversing into a parent directory by accident

How is that different to RESOLVE_BENEATH?

Openat2 resolve_beneath

Posted Jan 7, 2026 19:54 UTC (Wed) by cyphar (subscriber, #110703) [Link] (2 responses)

It seems this bit of my talk was a bit confusing, sorry about that -- I'm going to send a mail to the editor to suggest a slight clean-up, but in short:

  1. Back when I developed openat2(2) there was a push to have RESOLVE_BENEATH block "..". My argument at the time (which was quoted in LWN) was that this would break a lot of real use-cases because symlinks containing .. are very widespread.
  2. However, for the very minimal userspace /proc resolver we cannot support walking into .. so for parity we should probably have this also be the case for openat(2) (this is why I brought it up in the talk).
  3. Also, in some very specific scenarios, you do actually just want to block .. but this is not really true in the general case (which is why I didn't do it for RESOLVE_BENEATH).

Openat2 resolve_beneath

Posted Jan 7, 2026 20:20 UTC (Wed) by daroc (editor, #160859) [Link] (1 responses)

I'm sure that the fault in understanding here is all mine.

Is the difference that RESOLVE_BENEATH permits using '..' as long as doing so doesn't go up past the parent directory, but RESOLVE_NO_DOTDOT would ban '..' at all? So they would both block '../foo', but only the latter would block 'foo/../bar'?

Openat2 resolve_beneath

Posted Jan 7, 2026 20:23 UTC (Wed) by cyphar (subscriber, #110703) [Link]

Yes, that is correct.

Security irony

Posted Jan 7, 2026 9:34 UTC (Wed) by kleptog (subscriber, #1183) [Link] (5 responses)

> The openat2() system call, for example, lets programmers specify to the kernel how it should handle certain kinds of ambiguities. On the other hand, openat2() is often blocked by seccomp(), since one of the arguments is a pointer.

There is a certain irony in blocking, for security, a system call that would make it easier to write secure code.

Security irony

Posted Jan 7, 2026 14:06 UTC (Wed) by chris_se (subscriber, #99706) [Link] (3 responses)

> > The openat2() system call, for example, lets programmers specify to the kernel how it should handle certain kinds of ambiguities. On the other hand, openat2() is often blocked by seccomp(), since one of the arguments is a pointer.
> There is a certain irony in blocking, for security, a system call that would make it easier to write secure code.

I think the main issue here is probably inertia in older code / older configurations, rather than security. While it's true that you can't use seccomp to match against flags and mode of openat2() (in contrast to open()/openat() where that's possible), in a container context I don't really see why you'd restrict the flags and/or mode of the open()-family of system calls. And for applications that apply sandboxes for privilege separation themselves they'd probably just disallow open() entirely.

The only reason I could think for disallowing openat2() but not open() would also lead to disallowing openat() - namely that in specific circumstances they could be used to escape specific incomplete attempts at sandboxes. But in that case the reason that openat2() takes a pointer is not relevant here, because openat() would also be affected. I don't think this makes much sense in a container context though, because container runtimes actually perform proper sandboxing steps so that having openat()/openat2() inside the container is not an issue.

Security irony

Posted Jan 7, 2026 20:10 UTC (Wed) by roc (subscriber, #30627) [Link]

One goal for sandboxes is to simply reduce the amount of kernel attack surface the sandboxee can reach. For that, they often want to check syscall flags and allow only a small subset of common flags. So it can make sense to want to check the flags of open() or openat2() if possible.

Security irony

Posted Jan 7, 2026 20:17 UTC (Wed) by cyphar (subscriber, #110703) [Link] (1 responses)

No, it is unfortunately an active decision (systemd actively does this). The main reason is that of paranoia -- currently you cannot restrict flag arguments at all with seccomp which makes security-conscious people very antsy. If we ever added some horrific security vulnerability to openat(2) (however unlikely that is) they would not be able to disable it without disabling all usage of openat(2).

Security irony

Posted Jan 7, 2026 20:28 UTC (Wed) by cyphar (subscriber, #110703) [Link]

> If we ever added some horrific security vulnerability to openat(2) (however unlikely that is) they would not be able to disable it without disabling all usage of openat(2).

And of course, I meant openat2(2) here. I also do think this is a reasonable concern in general -- though in practice nobody actually blocks open(2) flags because the primary risk with open(2) is path access, which is meant to be mediated by LSMs and regular DAC/ACL permissions.

Security irony

Posted Jan 7, 2026 20:00 UTC (Wed) by cyphar (subscriber, #110703) [Link]

It is a bit ironic but is kind of understandable. For what it's worth, at LPC 2024 I proposed an approach to enabling seccomp filtering of pointers for extensible structure arguments (which Kees seemed quite receptive to) and so this may be solved soon™ -- unfortunately I was quite busy for most of 2025 (with all of this runc hardening stuff)...

Also, a lot of this fallback stuff started being developed back when openat(2) was new and so it was a more pressing thing. These days we would mostly need it to support older machines.

Alternatives to path traversal

Posted Jan 7, 2026 13:37 UTC (Wed) by iabervon (subscriber, #722) [Link] (1 responses)

Could runc use PIDFD_SELF instead of traversing a path? It has always seemed to me that the weakest part of the UNIX "everything is a file" model is that you access everything through a generic namespace, rather than getting those files out of system calls that handle specific requests. Submitting queries as strings that mix control and data and having them handled by a generic engine is the sort of thing that we've learned to avoid when possible over the years.

It's a pretty recent addition, but anything else the kernel could provide would obviously be even more recent (like, in the future).

Alternatives to path traversal

Posted Jan 7, 2026 19:42 UTC (Wed) by cyphar (subscriber, #110703) [Link]

Unfortunately not.

PIDFD_SELF provides some useful information but it isn't a /proc subdir. I did propose something like this privately but this is too hard to do, so using fsopen(2) is what we ended up with. Unfortunately fsopen(2) alone is not enough (in the generic open_tree(2) case), you need to also layer this lookup stuff on top.

In particular, some features like /proc/self/fd/$n reopening or using /proc/self/exe still require /proc magic-links, which entails doing all of this stuff. You can imagine implementing these hacks with proper kernel interfaces in a different way (which I have proposals and prototypes for) but that's all future stuff and we need to support older kernels anyway...

And ultimately, sysctls need /proc/sys which (I suspect) is probably never going to be switched to a non-filesystem interface.

Solution to a non-problem

Posted Jan 8, 2026 23:48 UTC (Thu) by milesrout (subscriber, #126894) [Link] (1 responses)

Surely there is no security issue here because unprivileged users cannot mount filesystems?

>Sarai expected some people to be skeptical that this was a real problem, on the basis that only root can mount things. That's not really true in the context of runc, though. There are, of course, mount namespaces to deal with. But there is also the fact that runc is used as a backend by a large number of different high-level containerization programs. Those programs often give users a lot of flexibility to configure container mounts, which can result in unprivileged mounting into containers that runc is working with, sometimes in ways that permit a program to escape the container. Almost all of the vulnerabilities in runc have been misconfiguration bugs, he said. Sometimes, runc can prevent those by recognizing an obviously invalid configuration, and sometimes the higher-level programs need to be patched to avoid generating the problematic configuration in the first place.

Don't mount namespaces only allow you to mount inside the namespace? Why would you try to run code inside that namespace?

It isn't a runc vulnerability, in any circumstances, for someone to misconfigure it and that to lead to a security vulnerability. That is a vulnerability in their configuration.

Solution to a non-problem

Posted Jan 9, 2026 7:25 UTC (Fri) by cyphar (subscriber, #110703) [Link]

Believe me, I understand the skepticism. In theory all of this stuff shouldn't be necessary, and if I could've gotten out of working on this by rejecting this attack model I would've done so several years ago. Unfortunately there are two compounding problems with doing so:

  • runc has to execute in the context of the configured container (i.e., in its mount namespace -- this stage is called runc init by runc) as part of finishing up the configuration before execve(2)-ing the user's code. During this process we need to write to /proc and do a bunch of other similar operations. The three options we have are:
    1. keep around a handle to the host /proc (which can lead to container escapes if a bad configuration can reference the file descriptor (this is what LXC does) -- see CVE-2024-21626 for an example of this style of attack (the key takeaway from this is that PR_SET_DUMPABLE and O_CLOEXEC are not actually sufficient)),
    2. use the container's /proc (this is what runc did until now), which leads to the kinds of issues I talked about in this talk; or
    3. use the techniques that libpathrs does to protect against attacks without leaking anything into the container.
  • Higher-level runtimes (like Kubernetes, Podman, or Docker) allow for a lot of configuration that we (runc) would prefer not be possible, which means we cannot just wholesale reject all configuration-based vulnerabilities, we have to treat them on a case-by-case-basis (which was the point of the "pop quiz" section of my talk). One thing we found this time is that you can actually exploit these bugs from a Dockerfile because there is a RUN --mount=... directive which can be combined with multi-stage builds (to hit the race condition). This makes even something as simple as building an image enough to trigger these kinds of exploits (of course, they are also exploitable from basic Kubernetes pod configurations and reasonable-looking Podman/Docker invocations too).

runc has not had a security vulnerability that did not involve a malicious configuration since 2016, but we have to deal with issues that practically arise due to how these systems are put together. This is also the reason we cannot reject security issues that require user namespaces to be disabled -- most Kubernetes and Docker workloads do not use user namespaces, so we cannot just ignore them.

I cannot put into words how much I wish this were not the case.

More openat2

Posted Jan 9, 2026 15:55 UTC (Fri) by alip (subscriber, #170176) [Link] (2 responses)

Extending openat2(2) further is for sure a good idea. RESOLVE_NO_DOTDOT can prevent path traversal attacks before they happen. Having a separate root-fd reference to enforce RESOLVE_IN_ROOT would also go a long way in constructing unprivileged chroots. Syd has trace/deny_dotdot[1][2] option which was inspired by FreeBSD's vfs.lookup_cap_dotdot[3]. Web servers would be one obvious pick to enforce this option.

[1]: https://man.exherbo.org/syd.2.html#trace/deny_dotdot
[2]: https://man.exherbo.org/syd.7.html#Path_Resolution_Restri...
[3]: https://cgit.freebsd.org/src/tree/sys/kern/vfs_lookup.c#n351

More openat2

Posted Jan 9, 2026 19:07 UTC (Fri) by cyphar (subscriber, #110703) [Link] (1 responses)

> Having a separate root-fd reference to enforce RESOLVE_IN_ROOT would also go a long way in constructing unprivileged chroots

Yes, this is also on my list of things to do. When I first wrote openat2(2) I didn't think it'd be necessary but I just recently encountered a case where it would've been really handy to have this capability...

More openat2

Posted Jan 10, 2026 12:43 UTC (Sat) by cyphar (subscriber, #110703) [Link]

That being said, you should still combine RESOLVE_NO_DOTDOT with RESOLVE_BENEATH because of absolute symlinks and also because RESOLVE_BENEATH includes a final sanity check in the kernel to detect escapes. RESOLVE_NO_DOTDOT is kinda helpful but I think people overestimate how useful blocking ".." entirely actually is -- the scoped lookup logic for RESOLVE_BENEATH is safe even though it allows "..".


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds