New AT_ flags for restricting pathname lookup

By Jonathan Corbet
October 4, 2018

System calls like openat() have access to the entire filesystem — or, at least, that part of the filesystem that exists in the current mount namespace and which the caller has the permission to access. There are times, though, when it is desirable to reduce that access, usually for reasons of security; that has proved to be especially true in many container use cases. A new patch set from Aleksa Sarai has revived an old idea: provide a set of AT_ flags that can be used to control the scope of a given pathname lookup operation.

There have been previous attempts at restricting pathname lookup, but none of them have been merged thus far. David Drysdale posted an O_BENEATH option to openat() in 2014 that would require the eventual target to be underneath the starting directory (as provided to openat()) in the filesystem hierarchy. More recently, Al Viro suggested AT_NO_JUMPS as a way of preventing lookups from venturing outside of the current directory hierarchy or the starting directory's mount point. Both ideas have attracted interest, but neither has yet been pushed long or hard enough to make it into the mainline.

Sarai's venture into this territory takes the form of several new AT_ flags that can be used with system calls like openat():

AT_BENEATH would, similar to O_BENEATH, prevent the pathname lookup from moving above the starting point in the filesystem hierarchy. So, as a simple example, an attempt to open ../foo would be blocked. This option does allow the use of ".." in a pathname as long as the result remains below the starting point, though, so opening foo/../bar would work.
AT_XDEV prevents the lookup from crossing a mount-point boundary in either the upward or downward direction.
AT_NO_PROCLINK prevents the following of symbolic links found in the /proc hierarchy; in particular, it is aimed at the links found under fd/ in any specific process's directory.
AT_NO_SYMLINK prevents following any symbolic links at all, including those blocked by AT_NO_PROCLINK.
AT_THIS_ROOT performs the equivalent of a chroot() call (to the starting directory) prior to the beginning of pathname lookup. This option, too, is meant to constrain lookups to the given directory hierarchy; it will also change how absolute symbolic links are interpreted.
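
The difference between the two AT_BENEATH examples above (../foo blocked, foo/../bar allowed) can be illustrated with a purely lexical check in C. This is only a sketch under stated assumptions: path_stays_beneath() is a hypothetical userspace helper, not code from the patch set. It ignores symbolic links and mount points, which the in-kernel lookup would have to handle. Treating an absolute path as an escape is also an assumption not stated in the description above.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if "path", interpreted relative to
 * some starting directory, never climbs above that directory.
 * Purely lexical: symbolic links are NOT taken into account.
 */
static bool path_stays_beneath(const char *path)
{
    int depth = 0;              /* components below the starting point */
    const char *p = path;

    /* Assumption: an absolute path always leaves the starting directory. */
    if (p[0] == '/')
        return false;

    while (*p) {
        while (*p == '/')       /* skip repeated separators */
            p++;
        size_t len = strcspn(p, "/");
        if (len == 0)
            break;              /* trailing slash, nothing left */

        if (len == 2 && p[0] == '.' && p[1] == '.') {
            if (depth == 0)
                return false;   /* ".." would move above the start */
            depth--;
        } else if (!(len == 1 && p[0] == '.')) {
            depth++;            /* ordinary component; "." is a no-op */
        }
        p += len;
    }
    return true;
}
```

With this sketch, path_stays_beneath("../foo") is false and path_stays_beneath("foo/../bar") is true. A check like this in user space remains racy and blind to symbolic links, which is part of the argument for doing the enforcement inside the kernel's lookup code.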

There are numerous use cases for these new flags, but the driving force this time around would appear to be container workloads and, in particular, runtime systems for containers. Those systems often have to look inside a container and, perhaps, act on files within a container's directory hierarchy. If the container itself is compromised or otherwise malicious, it can attempt to play games with its filesystems to confuse the runtime system and gain access to the host.

This posting got a reception that was positive overall, but with a number of concerns about the details. For example, Jann Horn liked AT_BENEATH, but would rather that it forbade the use of ".." entirely, even if the result remains beneath the starting point. Doing so would help to block exploitation of various types of directory-traversal bugs, he said. Sarai responded that 37% of all the symbolic links on his system contained ".."; "this indicates to me that you would be restricting a large amount of reasonable resolutions because of this restriction". That said, he indicated a willingness to change the behavior if need be.

Horn also complained about the "footgun potential" of AT_THIS_ROOT which, he said, shares all of the security failings of chroot(). He described a scenario where a hostile container could force an escape by moving directories around: "If the root of your walk is below an attacker-controlled directory, this of course means that you lose instantly". A possible mitigation here would be to require the starting directory in AT_THIS_ROOT lookups to be a mount point; Sarai was amenable to making this change as well.

Horn, along with Andy Lutomirski, questioned the container-management use case; as Lutomirski put it: "Any sane container is based on pivot_root or similar, so the runtime can just do the walk in the container context". In this particular case, it turns out that part of the problem is the result of the fact that the container runtime in question is written in Go:

You're right about this -- for C runtimes. In Go we cannot do a raw clone() or fork() (if you do it manually with RawSyscall you'll end with broken runtime state). So you're forced to do fork+exec (which then means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.

Since the system cannot use the relatively cheap ways to get into a container's context, it has to use an expensive workaround instead; this expense could be avoided if files could be opened with the new AT_ flags. Lutomirski responded that he is "not very sympathetic to the argument that 'Go's runtime model is incompatible with the simpler solution'". He proposed an alternative that might work in this setting without adding the new flags.

That alternative might work, but the fact remains that there are other use cases for restricting the scope of pathname lookups; that is why the idea continues to pop up on the kernel's mailing lists. And Lutomirski, too, agreed that some of the flags seem useful. Whether this implementation will be the one that manages to go all the way to the mainline remains to be seen, but it seems likely that, one of these years, the kernel will gain the ability to control lookups in a way similar to the one that has been proposed here.


New AT_ flags for restricting pathname lookup

Posted Oct 4, 2018 21:36 UTC (Thu) by wahern (subscriber, #37304) [Link] (3 responses)

I don't understand why the Go team is so resistant to adding the ability to explicitly pin a goroutine to a machine thread. Goroutines are an amazing, almost ideal construct. But there's a very obvious and unresolvable impedance mismatch between how goroutines implement threading (linear flow of logical execution) and how traditional operating systems do. A similar mismatch exists with FFI ABIs (i.e. stack details) and with the blocking semantics of some syscalls. In those cases a goroutine *is* pinned to a machine thread; indeed, the very architecture of the Go runtime (the [G]oroutine, OS [M]achine thread, and [P]rocessor scheduling abstractions) is built around this mismatch. It's inexplicable to me why they refuse to expose the scheduling levers that must necessarily exist.

New AT_ flags for restricting pathname lookup

Posted Oct 4, 2018 21:45 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

You can pin a goroutine to a thread using LockOSThread, but it basically locks this thread out of running other goroutines.

(Personally, I'd like for them to add goroutine IDs)

New AT_ flags for restricting pathname lookup

Posted Oct 5, 2018 7:13 UTC (Fri) by kostix (guest, #119803) [Link] (1 responses)

That wouldn't have helped anyway: the problem with not being able to do the classic fork+exec in Go programs is that the code executing in each of them heavily relies on the live Go runtime (which is linked with/into any compiled Go executable and actually manages the whole lifecycle of the program), and that runtime exploits multiple OS threads — both to run the program's goroutines and do its own chores.

Since fork() clones the state of just a single thread (the one that happened to execute the syscall), as soon as control resumes in the child process there is literally no Go runtime left around the goroutine "awoken" in the cloned thread; once it calls anything that would normally reach for the runtime, it is hosed. And normally such a call would happen pretty soon.

So basically the only sensible thing one might safely do after forking a process running a Go program is to do a controlled set of preparations and exec().
And actually that's what the syscall.ForkExec does — with some added complexity stemming from Go having an execution model other than C ;-)

You can look at ForkExec in https://golang.org/src/syscall/exec_unix.go and then at forkAndExecInChild in https://golang.org/src/syscall/exec_linux.go — the code is very easy to follow for any programmer with a C background, and it is extensively commented.
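
The point that fork() copies only the calling thread can be observed from plain C on Linux. The following is a rough sketch, not anything taken from the Go runtime: the helper names (thread_count(), start_idle_thread(), thread_count_in_forked_child()) are made up for illustration, and reading the "Threads:" line of /proc/self/status is just one convenient, Linux-specific way to count threads.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: number of threads in the calling process,
 * parsed from /proc/self/status (Linux only); -1 on failure. */
static int thread_count(void)
{
    char buf[4096];
    int fd = open("/proc/self/status", O_RDONLY);
    if (fd == -1)
        return -1;
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n <= 0)
        return -1;
    buf[n] = '\0';
    const char *p = strstr(buf, "Threads:");
    return p ? atoi(p + strlen("Threads:")) : -1;
}

/* Thread body that does nothing but sleep. */
static void *idle_thread(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return NULL;
}

/* Hypothetical helper: start one extra, detached idle thread. */
static int start_idle_thread(void)
{
    pthread_t t;
    if (pthread_create(&t, NULL, idle_thread, NULL) != 0)
        return -1;
    return pthread_detach(t);
}

/* Hypothetical helper: fork, count threads in the child, and report
 * that count back to the parent through the exit status. */
static int thread_count_in_forked_child(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        int n = thread_count();
        _exit(n > 0 && n < 255 ? n : 255);   /* 255 signals failure */
    }
    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

After start_idle_thread() succeeds, thread_count() in the parent reports at least two threads, while thread_count_in_forked_child() reports one: the idle thread simply does not exist in the child. A language runtime that depends on its helper threads is in the same position after a bare fork().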

New AT_ flags for restricting pathname lookup

Posted Oct 6, 2018 1:37 UTC (Sat) by wahern (subscriber, #37304) [Link]

Shouldn't it be possible to quiesce the runtime (pause GC, park all other goroutines, and join all kernel threads)? All the machinery in the scheduler must already be there, more or less. Maybe some component is currently running in a dedicated thread in an infinite loop, but conceptually it could be refactored to be able to enter and exit its core loop.

It might not be particularly efficient and come with a ton of gotchas, but it would at least make some currently impossible things possible, such as using geteuid and forking helper processes. Those things tend to happen early on, anyhow, so performance and other limitations wouldn't matter much.

New AT_ flags for restricting pathname lookup

Posted Oct 4, 2018 22:52 UTC (Thu) by neilbrown (subscriber, #359) [Link] (7 responses)

Surely this could be vastly simplified by allowing an eBPF program to be attached to a file descriptor so that when a path_lookup starts from that file descriptor, the eBPF program is used to vet or modify the lookup of each component.

New AT_ flags for restricting pathname lookup

Posted Oct 4, 2018 23:03 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

No......

Please, no more eBPF. It never ever works outside of kernel developers' machines.

New AT_ flags for restricting pathname lookup

Posted Oct 5, 2018 7:31 UTC (Fri) by flewellyn (subscriber, #5047) [Link] (4 responses)

I believe neilbrown was joking. I have no evidence for this, but I am desperately choosing to believe it anyway.

New AT_ flags for restricting pathname lookup

Posted Oct 5, 2018 7:34 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

I hope so. I've just spent a day debugging an eBPF filter written by somebody else and it's NOT a nice experience at all.

Debugging infrastructure is sorely lacking for it.

New AT_ flags for restricting pathname lookup

Posted Oct 5, 2018 12:10 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

eBPF is a nice thing to have if machine-generated (it's a rather nice and orthogonal assembler, and the ability to add helpers is just a killer feature that I wish real assemblers had!), but it's about as pleasant to debug programs written in it as any other assembler: i.e. fairly easy if you're familiar with the code generator, a nightmare otherwise, doubly so if this is the less regular land of handwritten code, disassembled and devoid of comments.

New AT_ flags for restricting pathname lookup

Posted Oct 5, 2018 17:14 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

It's way worse than assembly. With assembly you can typically use debuggers to trace the execution and inspect the environment. Nothing comparable exists for eBPF.

New AT_ flags for restricting pathname lookup

Posted Oct 5, 2018 22:24 UTC (Fri) by nix (subscriber, #2304) [Link]

Generally I do the same thing when debugging eBPF that I do when debugging other programs: printf()! In the case of eBPF you throw in a helper that does a printk() and chuck in calls to the helper liberally. (This is not so useful if you can't modify the eBPF, mind you.)

New AT_ flags for restricting pathname lookup

Posted Oct 4, 2018 23:55 UTC (Thu) by luto (guest, #39314) [Link]

It would be “simple” in the sense that getting the eBPF right would be at least as difficult as getting the kernel code with the AT flags right would be. But with eBPF, no one would ever review it carefully or fix the bugs.

eBPF is flexible, but it’s not magic.

New AT_ flags for restricting pathname lookup

Posted Oct 5, 2018 4:13 UTC (Fri) by eru (subscriber, #2753) [Link] (6 responses)

openat() is one of those Linux system calls whose rationale I don't quite understand. It allows opening files relative to a particular directory, but can't you do the same thing by manipulating the path name, or by using chdir() first?

New AT_ flags for restricting pathname lookup

Posted Oct 5, 2018 4:33 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

You can't, not in a race-free way anyway.

New AT_ flags for restricting pathname lookup

Posted Oct 5, 2018 10:47 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

For one, chdir affects the entire process rather than the current thread only.

New AT_ flags for restricting pathname lookup

Posted Oct 5, 2018 12:08 UTC (Fri) by nix (subscriber, #2304) [Link]

Others have commented on the problems with chdir(). The problem with using long absolute pathnames is twofold: firstly, you race with people modifying symlinks and/or renaming out from underneath you (*at() can at least reduce this by nailing the walk to specific directory inodes). Secondly, the length of pathnames is capped at pathconf(..., _PC_PATH_MAX): but you can make directory trees of arbitrary depth, with absolute paths much deeper than this and indeed deeper than the hardware page size. Nobody does this manually, but it can and does happen with machine-generated hierarchies, and the deep parts of such hierarchies are *only* traversable via chdir() or the *at() syscalls: while you can compose an absolute path that should reach those parts, the kernel will reject it with -ENAMETOOLONG.

So generic code has no choice but to use chdir() or *at() to traverse hierarchies or fail on such deep hierarchies, and generic multithreaded code or library code which might be run in multithreaded contexts has no choice but to use *at().
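
The component-at-a-time traversal described above might look roughly like this in C. It is a sketch only: walk_components() is a hypothetical helper, and the choice of flags is an assumption rather than anything prescribed in this thread. Each step passes the kernel a single short name relative to an already-open directory, so no pathname near the PATH_MAX limit is ever built, and the process-wide working directory is left alone.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: starting from the directory startfd (AT_FDCWD
 * is accepted too), descend through names[0..n-1] one component at a
 * time. Returns a new directory descriptor owned by the caller, or -1
 * with errno set. startfd itself is never closed.
 */
static int walk_components(int startfd, const char *const *names, size_t n)
{
    /* Private descriptor for the starting directory. */
    int fd = openat(startfd, ".", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    for (size_t i = 0; i < n; i++) {
        /* Resolve one name relative to the directory reached so far. */
        int next = openat(fd, names[i], O_RDONLY | O_DIRECTORY | O_CLOEXEC);
        int saved = errno;      /* close() must not clobber the error */
        close(fd);
        errno = saved;
        if (next == -1)
            return -1;
        fd = next;
    }
    return fd;
}
```

Because every step is anchored to a descriptor rather than to a string, a concurrent rename of a parent directory does not redirect steps that have already been taken, although symbolic links inside each component are still followed with these flags.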

New AT_ flags for restricting pathname lookup

Posted Oct 7, 2018 17:22 UTC (Sun) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

Manipulating pathnames means "doing string operations", something that's fairly cumbersome in C. For an example, consider the following toy-program:

#define _GNU_SOURCE

#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Default argument list: just the current directory. */
static char *cwd[] = {
    ".",
    NULL
};

int main(int argc, char **argv)
{
    DIR *dir;
    struct dirent *d_ent;
    struct stat st;
    int dirfd, rc;

    (void)argc;
    ++argv;
    if (!*argv) argv = cwd;
    do {
        dirfd = open(*argv, O_RDONLY, 0);
        if (dirfd == -1) {
            perror("open");
            continue;
        }

        /* On success the DIR stream owns dirfd; closedir() closes it. */
        dir = fdopendir(dirfd);
        if (!dir) {
            perror("fdopendir");
            close(dirfd);
            continue;
        }

        printf("-----\nfiles in %s\n-----\n", *argv);
        while ((d_ent = readdir(dir))) {
            /* Look the name up relative to the open directory. */
            rc = fstatat(dirfd, d_ent->d_name, &st, 0);
            if (rc == -1) {
                /* The entry may have vanished since readdir(). */
                if (errno != ENOENT) perror("fstatat");
                continue;
            }

            if (S_ISREG(st.st_mode))
                printf("%s\t\t%zu bytes\n", d_ent->d_name, (size_t)st.st_size);
        }

        closedir(dir);
    } while (*++argv);

    return 0;
}

This takes a list of directory pathnames as arguments and prints the names and sizes of all files in any of the directories. It uses fstatat because the names returned by readdir are filenames relative to the directory being read. Thanks to the *at-call, they can be accessed without doing dynamic string manipulation and buffer management and also without changing the cwd of the process forward and backward for each directory.

Also, chdir is basically unusable in multi-threaded processes as it changes the working directory of the process, ie, it affects all threads, not just the one executing it and, as seen by another thread, the cwd change is an unpredictable, asynchronously occurring event. Eg, a thread desiring to create two files in the same directory might end up creating them in different directories.

Lastly, the directory a process was started in might have been picked intentionally, eg, as location where core dumps should go to, and the process shouldn't change it except if there's a very good reason for that (and this should be documented).

New AT_ flags for restricting pathname lookup

Posted Oct 7, 2018 19:27 UTC (Sun) by rweikusat2 (subscriber, #117920) [Link]

rc = fstatat(dirfd, d_ent->d_name, &, 0);

This should have been

rc = fstatat(dirfd, d_ent->d_name, &st, 0);

and was but got deleted when "htmlifying" the source ... :-(

New AT_ flags for restricting pathname lookup

Posted Oct 8, 2018 4:18 UTC (Mon) by eru (subscriber, #2753) [Link]

Thanks to all for explaining the need and use of the somethingat() calls.

Places to block filesystem traversal

Posted Oct 5, 2018 7:11 UTC (Fri) by epa (subscriber, #39769) [Link] (1 responses)

It’s not just containers. Path-traversal bugs are a common exploit in archivers like tar or unzip, where unpacking a malicious archive file overwrites things elsewhere in the filesystem. I imagine web servers might also use this flag as an additional defence to make sure they only serve content from the right directory. If the flag existed on all operating systems, a lot of userspace path sanitizing code could be removed.

Places to block filesystem traversal

Posted Oct 5, 2018 14:24 UTC (Fri) by smurf (subscriber, #17840) [Link]

Also, userspace sanitization depends on the fact that no second thread exists that modifies the sanitized path before it's passed to the kernel. In-kernel defenses against that sort of thing at least work.

New AT_ flags for restricting pathname lookup

Posted Oct 7, 2018 0:34 UTC (Sun) by judas_iscariote (guest, #47386) [Link] (1 responses)

It is quite unfortunate that kernel developers insist on extending openat() with more and more contrived semantics; I wish they just added new syscalls with well defined behaviour.

New AT_ flags for restricting pathname lookup

Posted Oct 7, 2018 4:36 UTC (Sun) by cyphar (subscriber, #110703) [Link]

Something like resolveat(2)? The problem is that this would necessarily be conceptually identical to openat(O_PATH). Maybe O_PATH should've been a different syscall but we are mostly stuck with it now, and I think it would be strange to have two methods of opening an O_PATH descriptor. Though, there are some aspects of O_PATH that I think need to be fixed (and would require more convoluted O_ flags -- maybe a new syscall is warranted to fix some of the semantics of O_PATH. I'm not sure.)

And remember that the widespread utility of any resolveat(2) syscall would likely require having AT_EMPTY_PATH support for every *at(2) syscall (which is unfortunately far from the case currently).
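
For readers unfamiliar with the existing interfaces mentioned here, the following C sketch shows openat() with O_PATH used purely to resolve a name, followed by fstatat() with AT_EMPTY_PATH operating on the resulting descriptor. Both flags exist on Linux today; the helper names resolve_only() and is_directory_fd() are hypothetical, and this is not the proposed resolveat(2), which does not exist.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: resolve "path" relative to dirfd without opening
 * the object for reading or writing. The returned O_PATH descriptor
 * only names a location in the filesystem. Returns -1 on error.
 */
static int resolve_only(int dirfd, const char *path)
{
    return openat(dirfd, path, O_PATH | O_CLOEXEC);
}

/*
 * Hypothetical helper: 1 if fd refers to a directory, 0 if not,
 * -1 on error. AT_EMPTY_PATH with an empty pathname makes fstatat()
 * operate on the descriptor itself.
 */
static int is_directory_fd(int fd)
{
    struct stat st;

    if (fstatat(fd, "", &st, AT_EMPTY_PATH) == -1)
        return -1;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

A caller would resolve once, for example with resolve_only(AT_FDCWD, "/"), and then hand the descriptor to whichever *at() calls accept AT_EMPTY_PATH; the comment's point is that not all of them do.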