Restricting pathname resolution with AT_NO_JUMPS

May 17, 2017

This article was contributed by Nur Hussein

On April 29, Al Viro posted a patch on the linux-api mailing list adding a new flag to be used in conjunction with the ...at() family of system calls. The flag is for containing pathname resolution to the same filesystem and subtree as the given starting point. This is a useful feature to have for implementing file I/O in programs that accept pathnames as untrusted user input. The ensuing discussion made it clear that there were multiple use cases for such a feature, especially if the granularity of its restrictions could be increased.

As an example use case, consider a web server that accepts requests for documents along with a relative path to said documents from the root of the server data directory. It is imperative that the user-supplied pathname sent to a web server is not allowed to name files outside of the web-server subtree. If the web server could use file I/O system calls that guaranteed that any given path will never break out of a certain subdirectory tree, it would give an additional layer of security to the server.

The ...at() family of system calls (such as openat(), fstatat(), mknodat(), etc.) is a relatively new series of POSIX filesystem interfaces, similar to open(), fstat(), mknod(), and friends, but allowing the start point for relative pathnames to be something other than the current working directory. For example, the openat() system call's prototype is:

    int openat(int dirfd, const char *pathname, int flags);

A call to openat() will try to open the file specified by the pathname. If the pathname is a relative path, such as ../home/user1, openat() will try to walk the directory structure relative to the directory bound to the file descriptor dirfd, instead of the current working directory (which is what open() does). The behavior of this and other calls in the same family can be further altered with the flags parameter.
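As a minimal sketch of that relative lookup, a program could open a directory once and then resolve names against it. The helper name open_under() is hypothetical and is not part of any patch discussed here; only open(), openat(), and close() are real interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open "relpath" read-only, resolved relative to
 * the directory "dirpath" instead of the current working directory.
 * Returns a file descriptor, or -1 with errno set.
 *
 * Nothing here confines the lookup: a relpath containing "..", an
 * absolute path, or a symbolic link can still lead outside dirpath.
 */
int open_under(const char *dirpath, const char *relpath)
{
    int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    int fd = openat(dirfd, relpath, O_RDONLY);

    /* Preserve the errno from openat() across close(). */
    int saved_errno = errno;
    close(dirfd);
    errno = saved_errno;

    return fd;
}
```

A call such as open_under("/srv/www", "index.html") would open /srv/www/index.html, but open_under("/srv/www", "../../etc/passwd") would succeed too, which is the gap the proposed flag is meant to close.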

Viro's proposed flag (initially called AT_NO_JUMPS) for the ...at() calls would restrict the directory traversal of those calls to the same subtree as the starting point and, in any case, within the same filesystem. When the flag is set, the ...at() call affected would fail with -ELOOP upon encountering an absolute pathname, an absolute symbolic link, a procfs-style symbolic link, a pathname that results in the traversal of a mount point, or a ".." component that would step above the starting point (a relative path that starts with "..", for example). This flag thus confines pathname lookups to subdirectories of the starting point that are contained within the same mount point, while allowing the traversal of relative symbolic links that do not violate the aforementioned rules. The error code -ELOOP is used by Unix-like operating systems to tell the user that too many symbolic links were encountered during pathname lookup, but it is used here as a placeholder. The patch only implements AT_NO_JUMPS for fstatat() and friends, but the proposal is for it to be extended to all the ...at() calls.
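The subtree rule can be modeled in user space, purely lexically, by tracking how deep the walk is below the starting point and refusing any ".." seen at depth zero. The function below is an illustrative sketch, not the kernel's implementation, and stays_beneath() is a hypothetical name. It looks only at the pathname string, so it knows nothing about symbolic links or mount points, both of which the kernel would have to check during the real lookup.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical lexical model of the subtree rule: returns true if
 * "path" is relative and no ".." component ever steps above the
 * starting directory. Symbolic links and mount points are NOT
 * considered; only the string is examined.
 */
bool stays_beneath(const char *path)
{
    if (path[0] == '/')
        return false;                   /* absolute pathname */

    long depth = 0;                     /* levels below the start */
    const char *p = path;

    while (*p != '\0') {
        size_t len = strcspn(p, "/");   /* length of this component */

        if (len == 2 && p[0] == '.' && p[1] == '.') {
            if (depth == 0)
                return false;           /* ".." at the starting point */
            depth--;
        } else if (len > 0 && !(len == 1 && p[0] == '.')) {
            depth++;                    /* ordinary name: one level down */
        }

        p += len;
        while (*p == '/')               /* skip one or more separators */
            p++;
    }
    return true;
}
```

Under this model a/b/c/../d is accepted, while a/../../etc/passwd is rejected at its second "..". A lexical check like this is unsafe on its own, because a symbolic link inside the tree can redirect the walk. That is one reason to do the check in the kernel during resolution.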

Jann Horn commented that this proposal is somewhat similar to the O_BENEATH functionality that was sent to the Linux kernel list by David Drysdale in November 2014, but was ultimately not merged. The O_BENEATH flag is a port of a feature from Capsicum, a project that seeks to provide the ability to use capabilities to let applications sandbox themselves with a high degree of control. Horn noted that, while the functionality is similar, the intended use case for O_BENEATH is application sandboxing, whereas the AT_NO_JUMPS flag is to enable user programs to limit their own filesystem access. Viro commented that, unlike O_BENEATH, AT_NO_JUMPS does allow relative symbolic links, which he thinks is the saner option:

It's not quite O_BENEATH, and IMO it's saner that way - a/b/c/../d is bloody well allowed, and so are relative symlinks that do not lead out of the subtree. If somebody has a good argument in favour of flat-out ban on .. (_other_ than "other guys do it that way, and it doesn't need to make sense 'cuz security!!1!!!", please), I'd be glad to hear it.

Andy Lutomirski suggested splitting the flag into two, "one to prevent moving between mounts and one for everything else". This is because web servers and the like will probably be fine with mount-point traversal; they only need the directory containment feature. Horn concurred with Lutomirski about the usefulness of the split.

Viro agreed that the no-mount-point-crossing policy from AT_NO_JUMPS could be split out into a separate flag. He proposed AT_XDEV for preventing mount-point crossing, and suggested that the original flag be renamed to AT_BENEATH to match the functionality of O_BENEATH, which does allow crossing mount points. The returned error for crossing mount points when AT_XDEV is enabled could be the obvious -EXDEV, while the error for AT_BENEATH would still be -ELOOP (which Viro isn't too satisfied with, but nothing else has been suggested thus far).
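Without such a flag, a program that cares about mount-point crossing can only approximate the check itself, for example by comparing the st_dev field of the starting directory with that of the object it reached. The sketch below shows that approximation. The name check_same_device() is hypothetical, and reporting the mismatch as EXDEV simply mirrors the error proposed above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

/*
 * Hypothetical user-space approximation of the proposed AT_XDEV:
 * returns 0 if "relpath" (resolved relative to dirfd) ends up on the
 * same device as dirfd itself. Returns -1 with errno set to EXDEV if
 * the devices differ, or to the underlying error if a stat call fails.
 *
 * Limitations: only the final object is compared, so a walk that
 * leaves the filesystem and comes back is not noticed, and a bind
 * mount from the same filesystem has the same st_dev and is not
 * detected either. The check is also racy with respect to concurrent
 * mounts and renames.
 */
int check_same_device(int dirfd, const char *relpath)
{
    struct stat start, target;

    if (fstat(dirfd, &start) < 0)
        return -1;
    if (fstatat(dirfd, relpath, &target, 0) < 0)
        return -1;

    if (start.st_dev != target.st_dev) {
        errno = EXDEV;
        return -1;
    }
    return 0;
}
```

The limitations listed in the comment are the practical argument for an in-kernel flag, which could reject the crossing at the moment it happens during the walk.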

Linus Torvalds liked the split, but wanted to go even further:

So I would still like to split that NO_JUMP flag even more. I like the AT_BENEATH | AT_XDEV split, but I think XDEV should be split further, and I think the symlink avoidance should be split more too.

As mentioned last time, at least for the git usage, even relative symlinks are a no-no - not because they'd escape, but simply because git wants to see the *unique* name, and resolve relative symlinks to either the symlink, or to the actual file it points to. So I think that we'd want an additional flag that says "no symlinks at all". And I think the "no mountpoint" traversal might be splittable too.

Torvalds went on to say that, sometimes, the use case is just to guarantee that pathname resolution does not go above a certain point in the directory tree, regardless of whether the directory walk crosses mount points. However, this should only be the case for non-bind mount points. Bind mounts are basically views of directories (or files) that are mounted in another place in the directory tree. Since bind mounts can be used to break the directory containment, they should not be allowed. However, from the system's point of view, there is no difference between a mounted filesystem and a bind-mounted directory, and thus Torvalds was not sure whether a mount point can be tested to see if it is a bind mount. Viro agreed that it wasn't testable, and thus cannot be handled as a special case.

Viro then proposed that the flags become AT_BENEATH, AT_XDEV, and AT_NO_SYMLINKS, the last of which would be used when no symbolic links are allowed at all. This proposal raises a few questions about how to handle some of the combinations of these three flags. Viro asked what AT_XDEV should do with absolute symbolic links. Torvalds replied that, while it might be more consistent to allow an absolute symbolic link to be traversed with AT_XDEV (but without AT_BENEATH) as long as the root directory is on the same mount point as the starting point, it would be more straightforward to just return -EXDEV on all absolute symbolic links.

Next, if the apparently conflicting flags AT_BENEATH (which allows symbolic links) and AT_NO_SYMLINKS (which disallows all symbolic links) are invoked simultaneously, Viro suggested that AT_NO_SYMLINKS take precedence since it was convenient to implement, to which Torvalds agreed.

Finally, Viro asked what should happen when the final component of a pathname is a symbolic link when AT_NO_SYMLINKS is applied. Torvalds thinks it should be an error if the symbolic link is followed, except if paired with AT_SYMLINK_NOFOLLOW, which indicates that the user is fine with a "dangling symlink" at the end, which will not be followed.

If Viro's proposed changes are picked up in the mainline kernel, we should see more robust directory containment options in the Linux API, which application writers can in turn use for file I/O. Since the pathname traversal protection mechanism will be in the kernel, there will no longer be a need for each user program to do its own pathname checking. This should be a welcome feature for anyone writing applications that work with file and directory names as user input.

Restricting pathname resolution with AT_NO_JUMPS

Posted May 17, 2017 23:33 UTC (Wed) by jra (subscriber, #55261) [Link] (9 responses)

Cautiously optimistic - this could be very useful for Samba and other userspace file servers!

Restricting pathname resolution with AT_NO_JUMPS

Posted May 17, 2017 23:47 UTC (Wed) by nix (subscriber, #2304) [Link] (8 responses)

With a caveat: application authors should consider that every time they use any of these flags, they are restricting administration flexibility and ruining techniques that sysadmins have used for many decades. Most seriously, mounting in extra disk space where needed on a full filesystem breaks, where right now you can bind-mount or NFS-mount or whatever you want to get more space in there. Tools to manage locally-built software that use symlinks extensively like stow, graft, and DEPOT might break (though one would assume that programs wouldn't use this facility for looking up parts of themselves, but mostly stuff under $HOME etc -- but even then, that's an *assumption*, and God knows what application authors who don't consider or know about such cases will do).

Frankly I think this stuff should be controllable via a (root-only?) prctl() or /proc flag or something, so that it can be flipped off when it's being used wrongly and breaking legitimate use cases. We don't want to end up with a Windows-like world in which mountpoints look just like directories to applications except when the application decides they don't, and you can't tell it otherwise. The inability to rmdir() them is bad enough (and why isn't rmdir() of a mountpoint with no content just like umount() anyway?).

Restricting pathname resolution with AT_NO_JUMPS

Posted May 18, 2017 5:59 UTC (Thu) by epa (subscriber, #39769) [Link] (5 responses)

The ability to mount stuff at any point in the filesystem nowadays looks like a bit of a misdesign. It would be better if all mounts appeared only under /mnt (or perhaps ~/mnt if you allow ordinary users to mount filesystems) and then can be symlinked as necessary. Of course, original Unix didn't have symlinks, so the flexibility of mounting anywhere in the tree was needed, not least for having /usr on a separate filesystem.

I agree, arbitrary rules that stop following symlinks are annoying and I turn them off, but I understand that they do block whole classes of attack. Safe path traversal without going 'up' is certainly a useful thing to have and will simplify logic in many applications, even simple ones like tar(1).

Restricting pathname resolution with AT_NO_JUMPS

Posted May 18, 2017 9:35 UTC (Thu) by ballombe (subscriber, #9523) [Link]

> The ability to mount stuff at any point in the filesystem nowadays looks like a bit of a misdesign.

Actually this is the best feature of UNIX.

Mountpoint security is simple: you can only access a mount point if you have access to the parent directory, and creating a mountpoint is a privileged operation.

By contrast, symlinks are much messier: creating a symlink is unprivileged, symlinks have no permissions of their own, and they do not prevent the target from being accessed through a different path.

Restricting pathname resolution with AT_NO_JUMPS

Posted May 18, 2017 9:49 UTC (Thu) by nix (subscriber, #2304) [Link] (3 responses)

Symlinks don't always cut it -- again, because applications can easily tell they are not directories, and can refuse to traverse them. I have a number of cases where I *had* to use an fs mount because nothing else would work. Your solution breaks real programs.

Another use case for bind mounts being near-invisible to applications, from just this month. I have a system with disks mostly devoted to a bcached, journalled spinning-rust RAID-6 array, but it's bcached onto an SSD that is not really rated for massive write loads, so I'd prefer only to write stuff to the journal that I'm going to read again, and read more than once. That means that I don't want object files from autobuilders on there, nor really video storage, most certainly not half-terabyte uncompressed video intermediates, or QEMU disk images I plan to use twice with a reboot in the middle, etc. I could put the things on a tmpfs, but when you build real monsters like LibreOffice or Chromium you need really rather a lot of RAM or swap for that to not run you short (and the QEMU disk images can be bigger yet).

So I turned the first quarter-terabyte of each disk in the array (1.25TiB in total) into a "transient store": an uncached RAID-0 with an unjournalled ext4 fs on it. As you'd expect of a RAID-0 array made of the fast front portion of individual disks clocking 240MiB/s each, this is blisteringly fast. It's mounted on /.transient and contains only directories that represent transient mounts, with a file /.transient/.transient giving the names of each mountpoint: each directory is named after the SHA-256 of the mountpoint name, and has an accompanying dotfile with the same SHA-256 name after the dot giving its original mountpoint name again (letting me easily build /.transient/.transient whenever I add or remove a mount, just by catting all those dotfiles together).

There are scripts to create and remove transient mounts, but after those have run *nothing notices* that their objdir or whatever is actually on a different filesystem and not using up SSD lifetime. They could tell, of course, by comparing statfs() output or struct stat.st_dev, but that would take extra work and nobody does it. If they were symlinks it would be much easier to tell.

Bind mounts are really *useful* here -- at boot, we mount the fs (fscking it if needed, *remkfsing* it if that fails, because this is a *transient* filesystem and I don't much care if I lose everything on it, though I'd rather keep it if necessary), cat /.transient/.transient to get the targets of the mountpoints (their names are one SHA away) then bind-mount everything into place, so from my perspective all this machinery is almost invisible: it just happens at boot. I was thinking "I have to buy Al Viro a drink" after hacking this up. It works amazingly well and with almost no effort, and it took less than half an hour to write, and it's all down to bind mounts.

If people start using AT_XDEV liberally, or frankly using it at all for anything not utterly crucial, this use case too collapses, since it absolutely depends on things routinely traversing mount points without noticing: every user of a transient fs does it at least once. I frankly cannot see any case other than things like backup software or rsync where the program should know or care that it's traversing mount points: only root can create them (a restriction I'd rather was lifted, but is helpful here), what does the application think it's doing second-guessing the admin's decisions? I'm damn sure that if this happens to me even once or twice I'm going to be stubbing AT_XDEV out completely. An application that second-guesses and prohibits admin decisions -- and that's what AT_XDEV is doing -- is an insulting application. If it provides a way to turn that behaviour off, fine, but how many applications do you think will bother? That's extra work! Samba will, obviously, but other things?

In particular, Jeremy's example, userspace file servers: mounting extra storage into place so you don't have to reconfigure your clients is *exactly* the sort of thing a userspace file server's admin might find useful, but it won't work if the fileserver is using AT_XDEV, even though you could move the entire thing it was using onto a big enough fs (if you had one!) and it would suddenly start working again: what's the sense in *those* semantics? (Userspace fileservers *are* one case where AT_NO_JUMPS is likely to be desirable: it's useful for building jails at points other than / without chrooting. However, even there, it seems to me this is a process attribute too -- something you will often want such programs to do, but not always, i.e. only when both some prctl() or /proc flag *and* AT_NO_JUMPS are turned on. Sometimes the sysadmin will actually want a symlink farm, and the application should not be allowed to gainsay her.)

Restricting pathname resolution with AT_NO_JUMPS

Posted May 20, 2017 5:26 UTC (Sat) by linuxrocks123 (subscriber, #34648) [Link] (1 responses)

If it comes to it, you should be able to write a .so that intercepts any calls to the glibc functions using these flags and manually takes them out, then LD_PRELOAD that .so when running a problem application.

Restricting pathname resolution with AT_NO_JUMPS

Posted May 20, 2017 11:44 UTC (Sat) by nix (subscriber, #2304) [Link]

It's almost as easy just to hack the kernel to add a prctl() (inherited across fork() and exec()) to dike out AT_NO_JUMPS and/or AT_XDEV. With that in place, an exec()ing wrapper is all that's needed.

Restricting pathname resolution with AT_NO_JUMPS

Posted May 25, 2017 22:56 UTC (Thu) by Wol (subscriber, #4433) [Link]

> In particular, Jeremy's example, userspace file servers: mounting extra storage into place so you don't have to reconfigure your clients is *exactly* the sort of thing a userspace file server's admin might find useful,

My moan with samba, though, is this seems to be a server-level option, not a share-level option. I have a lot of symlinks in home directories, and I have symlink traversal in samba disabled by default. I would like to enable it for CERTAIN SHARES ONLY, but that doesn't seem to be an option :-(

Cheers,
Wol

Why isn't rmdir of a mountpoint just like umount?

Posted May 19, 2017 0:46 UTC (Fri) by ebiederm (subscriber, #35028) [Link] (1 responses)

History and the general caution of breaking applications.

You can remove a directory a mount is on and it does work, but you have to be in a mount namespace where that directory is not a mountpoint.

Why isn't rmdir of a mountpoint just like umount?

Posted May 19, 2017 3:24 UTC (Fri) by zlynx (guest, #2285) [Link]

I saw that once, heh. It was one of those very surprising things when the mount failed after reboot, because its mount point was gone.

Restricting pathname resolution with AT_NO_JUMPS

Posted May 18, 2017 9:50 UTC (Thu) by nix (subscriber, #2304) [Link]

I'm trying to figure out what it means if an application uses an absolute path with AT_XDEV (without AT_NO_JUMPS). It's bad enough with AT_NO_JUMPS because it's likely there are symlinks on the fs somewhere, quite possibly symlinks to directories, but use of AT_XDEV suddenly means that you have a program that works on the developer's default-configured RH box (with one big / fs and not much else) but not on production systems with more filesystems, because if there are mount points *anywhere* that app will fail when it tries to look into them.

I can hear the vendor of crapware who decided to use AT_XDEV everywhere for "security" or "speed" now: "no, we only support one big filesystem". Of course they won't fix it, they'll just make excuses, and the only reason they'd bother to do this is because AT_XDEV makes rarely-sensible behaviour so easy to do.

This is *exactly* the problem we see on Windows systems that only allow you to install to C:, and MS just went through hell trying to fake things out so applications thought they were installing to C: even though they aren't. Do we really want to bring the same problem to Linux?

Restricting pathname resolution with AT_NO_JUMPS

Posted May 19, 2017 1:17 UTC (Fri) by mattrose (guest, #19610) [Link] (2 responses)

So, AT_BENEATH would check for absolute pathnames (beginning with "/") and check for leading ".." but allow "a/b/../c" because not allowing .. is insane?
But the reason that most of these checks disallow ".." anywhere in the pathname is so that you can't break out of the restriction by doing "a/../../../etc/passwd"

I didn't read the patches, but I assume it has some other way of detecting that; if not, that seems like an obvious shortfall.

Restricting pathname resolution with AT_NO_JUMPS

Posted May 19, 2017 21:11 UTC (Fri) by ssmith32 (subscriber, #72404) [Link]

I skimmed the article first and thought the same, but then read it more closely, and, no, you can't break out like that:

"Viro's proposed flag (initially called AT_NO_JUMPS) for the ...at() calls would restrict the directory traversal of those calls to the same subtree as the starting point"

Restricting pathname resolution with AT_NO_JUMPS

Posted May 20, 2017 3:40 UTC (Sat) by viro (subscriber, #7872) [Link]

Among the prohibited things there's this: "traversal of .. in the starting point of pathname resolution". So your a/../../<whatever> will start at some directory, traverse the first component and go into subdirectory named "a", traverse the first ".." and go back to the starting point, then try to traverse the second ".." and run afoul of that restriction.

Restricting pathname resolution with AT_NO_JUMPS

Posted May 19, 2017 2:07 UTC (Fri) by helsleym (guest, #92730) [Link] (4 responses)

What if, instead of a simple flag there was a numeric limit? For example at most <limit> mount points could be traversed if userspace takes the following steps:
0) Optionally use fcntl() to set the F_AT_XDEV_LIMIT (default: 0) on the fd that will be used in *at() calls.
1) Call *at() and pass an AT_LIMIT_XDEV flag to enforce the limit.

bikeshed paint: Not a fan of the name AT_NO_XDEV -- why not XMNT instead of XDEV to better reflect the semantics?

Restricting pathname resolution with AT_NO_JUMPS

Posted May 19, 2017 11:09 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

Because -xdev is what find(1) uses, and every other command to speak of that spells it using more than one letter.

Restricting pathname resolution with AT_NO_JUMPS

Posted May 19, 2017 18:16 UTC (Fri) by mbunkus (subscriber, #87248) [Link] (1 responses)

Both tar and rsync use "--one-file-system", not "-(-)xdev". Not that "one file system" is any more correct than "do not cross devices" if you consider bind mounts, of course.

Restricting pathname resolution with AT_NO_JUMPS

Posted May 20, 2017 9:17 UTC (Sat) by ballombe (subscriber, #9523) [Link]

rm also uses "--one-file-system"

Restricting pathname resolution with AT_NO_JUMPS

Posted May 21, 2017 11:34 UTC (Sun) by smcv (subscriber, #53363) [Link]

Perhaps a more compelling argument (than the existence of find -xdev) for XDEV over something longer is that the POSIX-standardized error raised by syscalls like link() that already can't work across mounts is EXDEV.

Restricting pathname resolution with AT_NO_JUMPS

Posted May 24, 2017 4:15 UTC (Wed) by cyphar (subscriber, #110703) [Link]

I've actually implemented something like this in Go (and Docker has a similar implementation), and am trying to get it into the standard library (though it has semantics more similar to chroot(2) when it comes to symlinks and absolute links)[1]. It is used heavily by Docker and other container runtimes in order to make it safe for a management process outside of a container to mutate a container root filesystem.

I'm not convinced that the API described is all that useful for the usecases I would want to use it in. I feel like we need to make it possible to scope resolution in a more sane way than "just use chroot(2)" -- which doesn't work for unprivileged users for example. Though ultimately I'm sort of on the fence whether this should be done in the kernel.

[1]: https://github.com/cyphar/filepath-securejoin

Restricting pathname resolution with AT_NO_JUMPS

Posted May 27, 2017 0:59 UTC (Sat) by kmeyer (subscriber, #50720) [Link]

> Viro commented that, unlike O_BENEATH, AT_NO_JUMPS does allow relative symbolic links, which he thinks is the saner option:

> It's not quite O_BENEATH, and IMO it's saner that way - a/b/c/../d is bloody well allowed, and so are relative symlinks that do not lead out of the subtree.

For what it's worth, Capsicum (as implemented in FreeBSD) allows ".." and relative symlinks that don't lead out of the directory.