Restricting pathname resolution with AT_NO_JUMPS
On April 29, Al Viro posted a patch on the linux-api mailing list adding a new flag to be used in conjunction with the ...at() family of system calls. The flag is for containing pathname resolution to the same filesystem and subtree as the given starting point. This is a useful feature to have for implementing file I/O in programs that accept pathnames as untrusted user input. The ensuing discussion made it clear that there were multiple use cases for such a feature, especially if the granularity of its restrictions could be increased.
As an example use case, consider a web server that accepts requests for documents along with a relative path to said documents from the root of the server data directory. It is imperative that the user-supplied pathname sent to a web server is not allowed to name files outside of the web-server subtree. If the web server could use file I/O system calls that guaranteed that any given path will never break out of a certain subdirectory tree, it would give an additional layer of security to the server.
The ...at() family of system calls (such as openat(), fstatat(), and mknodat()) is a relatively new series of POSIX filesystem interfaces, similar to open(), fstat(), mknod(), and friends, but allowing the starting point for relative pathnames to be something other than the current working directory. For example, the openat() system call's prototype is:
int openat(int dirfd, const char *pathname, int flags);
A call to openat() will try to open the file specified by the pathname. If the pathname is a relative path, such as ../home/user1, openat() will try to walk the directory structure relative to the directory bound to the file descriptor dirfd, instead of the current working directory (which is what open() does). The behavior of this and other calls in the same family can be further altered with the flags parameter.
Viro's proposed flag (initially called AT_NO_JUMPS) for the ...at() calls would restrict the directory traversal of those calls to the same subtree as the starting point and, in any case, within the same filesystem. When the flag is set, the affected ...at() call would fail with -ELOOP upon encountering an absolute pathname, an absolute symbolic link, a procfs-style symbolic link, a pathname that results in the traversal of a mount point, or any relative path that starts with "..". This flag thus confines pathname lookups to subdirectories of the starting point that are contained within the same mount point, while allowing the traversal of relative symbolic links that do not violate the aforementioned rules. The error code -ELOOP is used by Unix-like operating systems to tell the user that too many symbolic links were encountered during pathname lookup, but it is used here as a placeholder. The patch only implements AT_NO_JUMPS for fstatat() and friends, but the proposal is for it to be extended to all the ...at() calls.
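The purely lexical part of these rules can be approximated in user space. The sketch below is not the kernel's algorithm. It only inspects the pathname string, so it knows nothing about symbolic links or mount points, and it assumes that the intent is to reject any ".." component that would climb above the starting directory. The function name check_no_jumps_lexical() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical user-space approximation of the lexical checks only.
 * Returns 0 if the path is relative and never climbs above its
 * starting point, -ELOOP otherwise (mirroring the placeholder error).
 * Symbolic links and mount points are NOT examined.
 */
int check_no_jumps_lexical(const char *path)
{
    int depth = 0;              /* components below the start point */
    const char *p = path;

    assert(path != NULL);
    if (path[0] == '/')         /* absolute pathname */
        return -ELOOP;

    while (*p) {
        size_t len;

        while (*p == '/')       /* skip repeated slashes */
            p++;
        len = strcspn(p, "/");  /* length of this component */
        if (len == 0)
            break;
        if (len == 2 && p[0] == '.' && p[1] == '.') {
            if (depth == 0)     /* ".." would leave the subtree */
                return -ELOOP;
            depth--;
        } else if (!(len == 1 && p[0] == '.')) {
            depth++;            /* ordinary name; "." is a no-op */
        }
        p += len;
    }
    return 0;
}
```

Under these assumptions, "a/b/c/../d" is accepted while "../x", "/etc/passwd", and "a/../../../etc/passwd" are rejected. Because a component such as "a" may itself be a symbolic link, a string check like this cannot provide real containment; that is the argument for doing the check in the kernel during resolution.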
Jann Horn commented that this proposal is somewhat similar to the O_BENEATH functionality that was sent to the Linux kernel list by David Drysdale in November 2014, but was ultimately not merged. The O_BENEATH flag is a port of a feature from Capsicum, a project that seeks to provide the ability to use capabilities to let applications sandbox themselves with a high degree of control. Horn noted that, while the functionality is similar, the intended use case for O_BENEATH is application sandboxing, whereas the AT_NO_JUMPS flag is meant to let user programs limit their own filesystem access. Viro commented that, unlike O_BENEATH, AT_NO_JUMPS does allow relative symbolic links, which he thinks is the saner option:
It's not quite O_BENEATH, and IMO it's saner that way - a/b/c/../d is bloody well allowed, and so are relative symlinks that do not lead out of the subtree.
Andy Lutomirski suggested splitting the flag into two, "one to prevent moving between mounts and one for everything else". This is because web servers and the like will probably be fine with mount-point traversal; they only need the directory containment feature. Horn concurred with Lutomirski about the usefulness of the split.
Viro agreed that the no-mount-point-crossing policy from AT_NO_JUMPS could be split out into a separate flag. He proposed a new flag, AT_XDEV, for preventing mount-point crossing, and renaming the original flag to AT_BENEATH to match the functionality of O_BENEATH, which does allow crossing mount points. The returned error for crossing mount points when AT_XDEV is enabled could be the obvious -EXDEV, while the error for AT_BENEATH would still be -ELOOP (which Viro isn't too satisfied with, but nothing else has been suggested thus far).
Linus Torvalds liked the split, but wanted to go even further:
As mentioned last time, at least for the git usage, even relative symlinks are a no-no - not because they'd escape, but simply because git wants to see the *unique* name, and resolve relative symlinks to either the symlink, or to the actual file it points to. So I think that we'd want an additional flag that says "no symlinks at all". And I think the "no mountpoint" traversal might be splittable too.
Torvalds went on to say that, sometimes, the use case is just to guarantee that pathname resolution does not go above a certain point in the directory tree, regardless of whether the directory walk crosses mount points. However, this should only apply to non-bind mount points. Bind mounts are basically views of directories (or files) that are mounted at another place in the directory tree. Since bind mounts can be used to break directory containment, they should not be allowed. From the system's point of view, however, there is no difference between a mounted filesystem and a bind-mounted directory, so Torvalds was not sure whether a mount point can be tested to see if it is a bind mount. Viro agreed that it wasn't testable, and thus cannot be handled as a special case.
Viro then proposed that the flags become AT_BENEATH, AT_XDEV, and AT_NO_SYMLINKS, which is the flag used when no symbolic links are allowed at all. This proposal raises a few questions on how to handle some of the combinations of these three flags. Viro asked what AT_XDEV should do with absolute symbolic links. Torvalds replied that, while it might be more consistent to allow an absolute symbolic link to be traversed with AT_XDEV (but without AT_BENEATH) as long as the root directory is on the same mount point as the starting point, it would be more straightforward to just return -EXDEV on all absolute symbolic links.
Next, if the apparently conflicting flags AT_BENEATH (which allows symbolic links) and AT_NO_SYMLINKS (which disallows all symbolic links) are invoked simultaneously, Viro suggested that AT_NO_SYMLINKS take precedence since it was convenient to implement, to which Torvalds agreed.
Finally, Viro asked what should happen when the final component of a pathname is a symbolic link and AT_NO_SYMLINKS is applied. Torvalds thinks that following the symbolic link should be an error, unless the flag is paired with AT_SYMLINK_NOFOLLOW, which indicates that the user is fine with a "dangling symlink" at the end that will not be followed.
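The combination rules discussed above can be summarized as a small decision function. This is only a sketch of the outcome of the discussion, not kernel code. The SK_ flag names and their bit values are hypothetical and chosen for this illustration (they are not the real AT_* constants), the use of -ELOOP for AT_NO_SYMLINKS is an assumption, and so is the choice to check AT_BENEATH before AT_XDEV when both are set, which the discussion did not settle.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bit values for this sketch only. */
enum {
    SK_AT_BENEATH          = 1 << 0,
    SK_AT_XDEV             = 1 << 1,
    SK_AT_NO_SYMLINKS      = 1 << 2,
    SK_AT_SYMLINK_NOFOLLOW = 1 << 3,
    SK_ALL                 = (1 << 4) - 1
};

/*
 * Decide what should happen when resolution meets a symbolic link.
 * Returns 0 if the link may be handled normally, or a negative errno.
 */
int symlink_policy(unsigned flags, bool target_is_absolute,
                   bool is_final_component)
{
    assert((flags & ~(unsigned)SK_ALL) == 0);

    /* A trailing link that will not be followed is acceptable. */
    if (is_final_component && (flags & SK_AT_SYMLINK_NOFOLLOW))
        return 0;
    /* AT_NO_SYMLINKS takes precedence over AT_BENEATH. */
    if (flags & SK_AT_NO_SYMLINKS)
        return -ELOOP;          /* assumed error code */
    if (target_is_absolute) {
        if (flags & SK_AT_BENEATH)
            return -ELOOP;      /* absolute links leave the subtree */
        if (flags & SK_AT_XDEV)
            return -EXDEV;      /* the simpler rule Torvalds suggested */
    }
    return 0;                   /* relative links remain allowed */
}
```

For instance, under these assumptions symlink_policy(SK_AT_BENEATH | SK_AT_NO_SYMLINKS, false, false) yields -ELOOP even though AT_BENEATH alone would permit a relative link.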
If Viro's proposed changes are picked up in the mainline kernel, we should see more robust directory containment options in the Linux API, which application writers can in turn use for file I/O. Since the pathname traversal protection mechanism will be in the kernel, there will no longer be a need for each user program to do its own pathname checking. This should be a welcome feature for anyone writing applications that work with file and directory names as user input.
Index entries for this article | |
---|---|
Kernel | Filesystems/Virtual filesystem layer |
Security | Linux kernel/Filesystems |
GuestArticles | Hussein, Nur |
Posted May 17, 2017 23:33 UTC (Wed)
by jra (subscriber, #55261)
[Link] (9 responses)
Posted May 17, 2017 23:47 UTC (Wed)
by nix (subscriber, #2304)
[Link] (8 responses)
Frankly I think this stuff should be controllable via a (root-only?) prctl() or /proc flag or something, so that it can be flipped off when it's being used wrongly and breaking legitimate use cases. We don't want to end up with a Windows-like world in which mountpoints look just like directories to applications except when the application decides they don't, and you can't tell it otherwise. The inability to rmdir() them is bad enough (and why isn't rmdir() of a mountpoint with no content just like umount() anyway?).
Posted May 18, 2017 5:59 UTC (Thu)
by epa (subscriber, #39769)
[Link] (5 responses)
I agree, arbitrary rules that stop following symlinks are annoying and I turn them off, but I understand that they do block whole classes of attack. Safe path traversal without going 'up' is certainly a useful thing to have and will simplify logic in many applications, even simple ones like tar(1).
Posted May 18, 2017 9:35 UTC (Thu)
by ballombe (subscriber, #9523)
[Link]
Actually this is the best feature of UNIX.
Mountpoint security is simple: you can only access a mount point if you have access to the parent directory, and creating a mountpoint is a privileged operation.
By contrast, symlinks are much messier: creating a symlink is unprivileged, symlinks have no permissions of their own, and they do not prevent the target from being accessed through a different path.
Posted May 18, 2017 9:49 UTC (Thu)
by nix (subscriber, #2304)
[Link] (3 responses)
Another use case for bind mounts being near-invisible to applications, from just this month. I have a system with disks mostly devoted to a bcached, journalled spinning-rust RAID-6 array, but it's bcached onto an SSD that is not really rated for massive write loads, so I'd prefer only to write stuff to the journal that I'm going to read again, and read more than once. That means that I don't want object files from autobuilders on there, nor really video storage, most certainly not half-terabyte uncompressed video intermediates, etc or QEMU disk images I plan to use twice with a reboot in the middle, etc. I could put the things on a tmpfs but when you build real monsters like LibreOffice or Chromium you need really rather a lot of RAM or swap for that to not run you short (and the QEMU disk images can be bigger yet).
So I turned the first quarter-terabyte of each disk in the array (1.25TiB in total) into a "transient store": an uncached RAID-0 with an unjournalled ext4 fs on it. As you'd expect of a RAID-0 array made of the fast front portion of individual disks clocking 240MiB/s each, this is blisteringly fast. It's mounted on /.transient and contains only directories that represent transient mounts, with a file /.transient/.transient giving the names of each mountpoint: each directory is named after the SHA-256 of the mountpoint name, and has an accompanying dotfile with the same SHA-256 name after the dot giving its original mountpoint name again (letting me easily build /.transient/.transient whenever I add or remove a mount, just by catting all those dotfiles together).
There are scripts to create and remove transient mounts, but after those have run *nothing notices* that their objdir or whatever is actually on a different filesystem and not using up SSD lifetime. They could tell, of course, by comparing statfs() output or struct stat.st_dev, but that would take extra work and nobody does it. If they were symlinks it would be much easier to tell.
Bind mounts are really *useful* here -- at boot, we mount the fs (fscking it if needed, *remkfsing* it if that fails, because this is a *transient* filesystem and I don't much care if I lose everything on it, though I'd rather keep it if necessary), cat /.transient/.transient to get the targets of the mountpoints (their names are one SHA away) then bind-mount everything into place, so from my perspective all this machinery is almost invisible: it just happens at boot. I was thinking "I have to buy Al Viro a drink" after hacking this up. It works amazingly well and with almost no effort, and it took less than half an hour to write, and it's all down to bind mounts.
If people start using AT_XDEV liberally, or frankly using it at all for anything not utterly crucial, this use case too collapses, since it absolutely depends on things routinely traversing mount points without noticing: every user of a transient fs does it at least once. I frankly cannot see any case other than things like backup software or rsync where the program should know or care that it's traversing mount points: only root can create them (a restriction I'd rather was lifted, but is helpful here), what does the application think it's doing second-guessing the admin's decisions? I'm damn sure that if this happens to me even once or twice I'm going to be stubbing AT_XDEV out completely. An application that second-guesses and prohibits admin decisions -- and that's what AT_XDEV is doing -- is an insulting application. If it provides a way to turn that behaviour off, fine, but how many applications do you think will bother? That's extra work! Samba will, obviously, but other things?
In particular, Jeremy's example, userspace file servers: mounting extra storage into place so you don't have to reconfigure your clients is *exactly* the sort of thing a userspace file server's admin might find useful, but it won't work if the fileserver is using AT_XDEV, even though you could move the entire thing it was using onto a big enough fs (if you had one!) and it would suddenly start working again: what's the sense in *those* semantics? (Userspace fileservers *are* one case where AT_NO_JUMPS is likely to be desirable: it's useful for building jails at points other than / without chrooting. However, even there, it seems to me this is a process attribute too -- something you will often want such programs to do, but not always, i.e. only when both some prctl() or /proc flag *and* AT_NO_JUMPS are turned on. Sometimes the sysadmin will actually want a symlink farm, and the application should not be allowed to gainsay her.)
Posted May 20, 2017 5:26 UTC (Sat)
by linuxrocks123 (subscriber, #34648)
[Link] (1 responses)
Posted May 20, 2017 11:44 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted May 25, 2017 22:56 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
My moan with samba, though, is this seems to be a server-level option, not a share-level option. I have a lot of symlinks in home directories, and I have symlink traversal in samba disabled by default. I would like to enable it for CERTAIN SHARES ONLY, but that doesn't seem to be an option :-(
Cheers,
Wol
Posted May 19, 2017 0:46 UTC (Fri)
by ebiederm (subscriber, #35028)
[Link] (1 responses)
You can remove a directory a mount is on and it does work, but you have to be in a mount namespace where that directory is not a mountpoint.
Posted May 19, 2017 3:24 UTC (Fri)
by zlynx (guest, #2285)
[Link]
Posted May 18, 2017 9:50 UTC (Thu)
by nix (subscriber, #2304)
[Link]
I can hear the vendor of crapware who decided to use AT_XDEV everywhere for "security" or "speed" now: "no, we only support one big filesystem". Of course they won't fix it, they'll just make excuses, and the only reason they'd bother to do this is because AT_XDEV makes rarely-sensible behaviour so easy to do.
This is *exactly* the problem we see on Windows systems that only allow you to install to C:, and MS just went through hell trying to fake things out so applications thought they were installing to C: even though they aren't. Do we really want to bring the same problem to Linux?
Posted May 19, 2017 1:17 UTC (Fri)
by mattrose (guest, #19610)
[Link] (2 responses)
I didn't read the patches, but I assume it has some other way of detecting that, but if not that seems like an obvious shortfall
Posted May 19, 2017 21:11 UTC (Fri)
by ssmith32 (subscriber, #72404)
[Link]
"Viro's proposed flag (initially called AT_NO_JUMPS) for the ..at() calls would restrict the directory traversal of those calls to the same subtree as the starting point"
Posted May 20, 2017 3:40 UTC (Sat)
by viro (subscriber, #7872)
[Link]
Posted May 19, 2017 2:07 UTC (Fri)
by helsleym (guest, #92730)
[Link] (4 responses)
bikeshed paint: Not a fan of the name AT_NO_XDEV -- why not XMNT instead of XDEV to better reflect the semantics?
Posted May 19, 2017 11:09 UTC (Fri)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted May 19, 2017 18:16 UTC (Fri)
by mbunkus (subscriber, #87248)
[Link] (1 responses)
Posted May 20, 2017 9:17 UTC (Sat)
by ballombe (subscriber, #9523)
[Link]
Posted May 21, 2017 11:34 UTC (Sun)
by smcv (subscriber, #53363)
[Link]
Posted May 24, 2017 4:15 UTC (Wed)
by cyphar (subscriber, #110703)
[Link]
I'm not convinced that the API described is all that useful for the usecases I would want to use it in. I feel like we need to make it possible to scope resolution in a more sane way than "just use chroot(2)" -- which doesn't work for unprivileged users for example. Though ultimately I'm sort of on the fence whether this should be done in the kernel.
Posted May 27, 2017 0:59 UTC (Sat)
by kmeyer (subscriber, #50720)
[Link]
> It's not quite O_BENEATH, and IMO it's saner that way - a/b/c/../d is bloody well allowed, and so are relative symlinks that do not lead out of the subtree.
For what it's worth, Capsicum (as implemented in FreeBSD) allows ".." and relative symlinks that don't lead out of the directory.
But the reason that most of these checks disallow ".." anywhere in the pathname is so that you can't break out of the restriction by doing "a/../../../etc/passwd"
0) Optionally use fcntl() to set the F_AT_XDEV_LIMIT (default: 0) on the fd that will be used in *at() calls.
1) Call *at() and pass an AT_LIMIT_XDEV flag to enforce the limit.