Retrying revoke()
There is only one problem: Linux does not support revoke(), and every attempt to add it over the years has ended in failure. The functionality behind revoke() turns out to be quite difficult to implement in a safe way. The latest attempt at a revoke() implementation may well come to a similar conclusion; there is not even a proof-of-concept patch to evaluate, after all. But, since the developer behind it is Al Viro, one assumes that its chances of success are mildly better than average.
Not every file or device will support revoke(); in some cases, it may still prove too hard to do properly. With Al's proposal, in cases where revocation is supported, there would be a new structure associated with the relevant device (or other) structure:
    struct revokable {
        atomic_t in_use;    // number of threads in methods
        spinlock_t lock;
        hlist_head list;
        struct completion *c;
        void (*kick)(struct revokable *);
    };
The in_use field is charged with tracking how many threads are actively executing in the file_operations methods associated with this object. Performing this tracking would require changing every method call site throughout the kernel to call a couple of helper functions and check for a revoked file. So a call that currently looks like:
    ret = file->f_op->read(...);
Would be turned into something like:
    if (start_using(file)) {
        ret = file->f_op->read(...);
        stop_using(file);
    } else {
        ret = -EIO;  /* File revoked */
    }
The start_using() and stop_using() helper functions increment and decrement the in_use counter. If that counter is negative, though, access is being revoked and start_using() will return false; in such cases, the file_operations method should not be called and an appropriate error code should be returned. Naturally, the details of these helper functions are a bit more complex than this; see Al's posting for a more complete story. As Al notes, there are quite a few call sites for file_operations methods in the kernel, so this particular change would be relatively intrusive.
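As a rough illustration of the counter logic just described, here is a minimal userspace sketch using C11 atomics. It is not Al's code: the `revokable_sketch` structure is invented for the example, the helpers take the revocable object directly rather than a struct file, and the completion handling that would wake a waiting revoke() is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the in_use field of struct revokable. */
struct revokable_sketch {
    atomic_int in_use;  /* >= 0: threads inside methods; < 0: being revoked */
};

/* Enter a method: increment in_use unless it has gone negative. */
static bool start_using(struct revokable_sketch *r)
{
    int old = atomic_load(&r->in_use);

    while (old >= 0) {
        /* On failure, old is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&r->in_use, &old, old + 1))
            return true;
    }
    return false;   /* revoked: the caller should return an error like -EIO */
}

/* Leave a method: drop the count taken by a successful start_using(). */
static void stop_using(struct revokable_sketch *r)
{
    int old = atomic_fetch_sub(&r->in_use, 1);

    assert(old != 0);   /* a zero here would mean an unbalanced call */
    (void)old;
}
```

The compare-and-swap loop is what keeps a new caller from slipping in once the counter has gone negative; a plain increment followed by a check would briefly disturb the count that revoke() is waiting on.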
The purpose of the kick() callback is to instruct the object's driver that access is being revoked and any outstanding I/O operations should be brought to an end. Processes waiting on I/O should return with an error code, and the I/O itself should be canceled. After the kick() call, the number of threads running within the object's file_operations should quickly drop to zero.
When open() is called on an object that supports revocation, the associated file structure will gain a pointer to a structure like:
    struct revoke {
        struct file *file;
        struct revokable *revokable;
        struct hlist_node list;
        bool closing;
        struct completion *c;
    };
The list field is used to track all open files associated with a given revocable object. As the last step in an open() implementation, the make_revokable() helper will be called to allocate the revoke structure and attach it to the list in the object's revokable structure.
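The bookkeeping done at open() time might look something like the following userspace sketch. Everything here is an assumption made for illustration: the real make_revokable() signature is not given in the article, a plain singly linked list stands in for the kernel's hlist, and the locking that the kernel would need around the list update is only noted in a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct revokable_sketch;

/* Hypothetical per-open-file tracking structure, modeled on struct revoke. */
struct revoke_sketch {
    void *file;                          /* stand-in for struct file * */
    struct revokable_sketch *revokable;  /* object this open file belongs to */
    struct revoke_sketch *next;          /* stand-in for the hlist_node */
    bool closing;
};

struct revokable_sketch {
    struct revoke_sketch *list;          /* head of the open-file list */
};

/* Allocate the tracking structure and link it onto the object's list.
 * Returns NULL if the allocation fails. */
static struct revoke_sketch *make_revokable(void *file,
                                            struct revokable_sketch *r)
{
    struct revoke_sketch *rev;

    assert(r != NULL);
    rev = calloc(1, sizeof(*rev));
    if (!rev)
        return NULL;

    rev->file = file;
    rev->revokable = r;

    /* A kernel implementation would hold the object's lock here. */
    rev->next = r->list;
    r->list = rev;
    return rev;
}
```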
With this infrastructure in place, an implementation of revoke() becomes possible. The steps, roughly, are these:
- Mark the object as being revoked by subtracting a large number from
its in_use counter, turning that counter negative. That will
prevent any further calls to the object's file_operations
methods.
- If in_use indicates that threads are currently running in the
object's file_operations, call kick() to encourage
them all to finish and wait until they all complete.
- For each open file, call the release() method to close that file, and remove the file from the list.
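The three steps above can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the proposed kernel code: `REVOKE_BIAS`, `open_file`, `release_file()` and `revoke_sketch()` are invented names, the wait is a busy loop where the kernel would sleep on a completion, and the list is a simple linked list rather than an hlist.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define REVOKE_BIAS (1 << 30)   /* hypothetical "large number" */

/* Hypothetical stand-in for one open file on the object's list. */
struct open_file {
    struct open_file *next;
    bool released;
};

struct revokable_sketch {
    atomic_int in_use;                        /* threads inside methods */
    struct open_file *files;                  /* list of open files */
    void (*kick)(struct revokable_sketch *);  /* asks the driver to abort I/O */
};

/* Stand-in for calling the file's release() method. */
static void release_file(struct open_file *f)
{
    f->released = true;
}

static void revoke_sketch(struct revokable_sketch *r)
{
    /* Step 1: push in_use negative so that no new method calls start. */
    int before = atomic_fetch_sub(&r->in_use, REVOKE_BIAS);

    assert(before >= 0);    /* this sketch does not handle a second revoke */

    /* Step 2: if threads were inside methods, kick them and wait. */
    if (before > 0) {
        if (r->kick)
            r->kick(r);
        while (atomic_load(&r->in_use) != -REVOKE_BIAS)
            ;   /* a kernel would sleep on a completion, not spin */
    }

    /* Step 3: release every open file and take it off the list. */
    while (r->files) {
        struct open_file *f = r->files;

        r->files = f->next;
        f->next = NULL;
        release_file(f);
    }
}
```

Subtracting a bias, rather than storing a flag separately, lets a single atomic value answer both questions at once: whether revocation has begun and how many threads have yet to leave.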
There is, of course, one other thorny little problem: what to do about processes that have used mmap() to map the object into their address space. One possibility is to forcibly unmap the memory, tearing down the associated page tables and marking the virtual memory area (VMA) structure accordingly; the process would then most likely receive a SIGSEGV signal if it attempted to access that address space. That approach is secure, but also risks causing programs to crash unexpectedly. In cases where device memory has been mapped, a better solution might be to just cause all accesses to return 0xff (extended out to the correct width for the specific access). Proper handling of mmap() in this situation is an open question, and one apparently without precedent in the current implementations of revoke() in other systems; revoke() on BSD systems works only on devices without mapped memory.
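To make "0xff extended out to the correct width" concrete, the small helper below computes the all-ones value for a given access size. The function name `revoked_read_value()` is hypothetical and nothing like it appears in the proposal; it only shows what a 1-, 2-, 4- or 8-byte read of a revoked device mapping would observe under that option.

```c
#include <assert.h>
#include <stdint.h>

/* Value a read of width_bytes (1 to 8) would see if every byte is 0xff. */
static uint64_t revoked_read_value(unsigned int width_bytes)
{
    assert(width_bytes >= 1 && width_bytes <= 8);

    if (width_bytes == 8)
        return UINT64_MAX;  /* avoid an undefined shift by 64 bits */
    return (UINT64_C(1) << (8 * width_bytes)) - 1;
}
```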
There is a fair gap between an RFC posting with a clever idea and an
actual, working implementation; it may well be that this approach to
revoke() will, like its predecessors, run aground in the real
world. But the lack of a working revoke() has been seen as a
shortcoming in Linux for many years; it would be nice to finally get this
functionality into place. So, just maybe, things will work out this time
around.
Posted Apr 11, 2013 2:14 UTC (Thu)
by butlerm (subscriber, #13312)
[Link] (15 responses)
SIGSEGV is the only reasonable option here. Corrupting the data the process reads is much more likely to make a program crash in a bad way. Think banking, finance, or industrial control. Instead of a stopped process, you could have corrupted transactions.
Posted Apr 11, 2013 2:42 UTC (Thu)
by nybble41 (subscriber, #55106)
[Link] (1 responses)
Posted Apr 14, 2013 12:54 UTC (Sun)
by Karellen (subscriber, #67644)
[Link]
Processes could then try to catch SIGBUS and internally mark any mmap()ed regions as invalid if they want. Or mremap() them? Which could return EINVAL rather than EFAULT for revoked mappings?
There will presumably still be race conditions where one thread might access the mapped region before the signal handler completes, but presumably that's still better than the alternative - no notice, guaranteed segv.
Posted Apr 11, 2013 11:23 UTC (Thu)
by cavok (subscriber, #33216)
[Link] (10 responses)
Posted Apr 11, 2013 16:06 UTC (Thu)
by nix (subscriber, #2304)
[Link] (9 responses)
Personally I'm hoping this will *finally* let us have a non-root X server :}
Posted Apr 11, 2013 16:20 UTC (Thu)
by apoelstra (subscriber, #75205)
[Link] (8 responses)
Can you elaborate on this? I've been running 'startx' as an unprivileged user for a couple years and haven't noticed anything awful happening.
Posted Apr 11, 2013 16:39 UTC (Thu)
by walters (subscriber, #7396)
[Link] (5 responses)
(Note: this is a huge attack surface, and at least in e.g. gnome-ostree I simply don't make Xorg setuid, and don't ship startx; you have to log in via GDM)
Posted Apr 13, 2013 19:32 UTC (Sat)
by guillemj (subscriber, #49706)
[Link] (4 responses)
If that's a Debian-based distribution, then the X binary is just a pretty small setuid wrapper that checks if the user can invoke the real non-setuid Xorg binary based off some policies from a wrapper-specific configuration file.
<http://anonscm.debian.org/gitweb/?p=pkg-xorg/debian/xorg.git;...>
> (Note: this is a huge attack surface, and at least in e.g. gnome-ostree I simply don't make Xorg setuid, and don't ship startx; you have to log in via GDM)
Doesn't GDM also run as root, and consequently also the executed Xorg process?
Posted Apr 13, 2013 19:35 UTC (Sat)
by apoelstra (subscriber, #75205)
[Link] (1 responses)
I'm running Fedora -- if I remove the setuid bit, X won't start because it lacks permission to hijack a tty. (Maybe I can fix this, but I don't know how. There are so many special groups on modern desktops..)
Posted Apr 14, 2013 12:36 UTC (Sun)
by mathstuf (subscriber, #69389)
[Link]
On a related note, that's the reason why a systemd --user session doesn't work right now: I get denied taking over the TTY, but I can't use a different TTY because PolicyKit denies nice things like suspend and shutdown.
Posted Apr 15, 2013 16:31 UTC (Mon)
by walters (subscriber, #7396)
[Link] (1 responses)
You are also conflating the setuid bit on Xorg with running as root - these are two independent things.
Posted Apr 21, 2013 19:07 UTC (Sun)
by guillemj (subscriber, #49706)
[Link]
I was referring to apoelstra's or nix's systems but anyway, nice to know. :)
> You are also conflating the setuid bit on Xorg with running as root - these are two independent things.
Not really. You mentioned that Xorg is running as root because it's setuid root, and that this was a "huge attack surface", without specifying which part. So while I agree making the full-blown Xorg setuid root is an attack vector, to me it's just tiny (because it's easy to avoid with the Debian wrapper for example) in comparison to running the X server as root, which I assume is still the case with something like GDM. The whole point of this subthread was the possibility of being able to finally run the X server as non-root, which would get rid of the actual (IMO) huge attack surface.
Posted Apr 11, 2013 16:40 UTC (Thu)
by dark (guest, #8483)
[Link] (1 responses)
Posted Apr 13, 2013 9:55 UTC (Sat)
by mlankhorst (subscriber, #52260)
[Link]
I haven't figured out how to close the mmap race there, but for the revoke case it might be important.
Posted Apr 15, 2013 3:58 UTC (Mon)
by jzbiciak (guest, #5246)
[Link] (1 responses)
What about just converting the mapping to a private, anonymous, no-permission mapping? That is, leave a mapping there (so future mmap() won't use the address space), but mark it no-access (PROT_NONE). Furthermore, block mprotect() from adding access back to the memory, i.e. always return EACCES. That'll prevent the mremap/mmap address reuse mentioned downthread, by reserving the address range, and it will still allow an munmap to unmap the memory without incident if the process finds out the resource went away through some other mechanism. It'll give SIGSEGV, not SIGBUS as some argued might be better, but I'm not really sure I see why SIGBUS is better. SIGBUS means you tried to access physical memory that wasn't there, or in a way that it doesn't support (i.e. misaligned access on architectures that don't support it). SIGSEGV means that you're trying to access a virtual mapping you do not have rights to. In this case, it's because your rights have been revoke()'d.
Posted Apr 15, 2013 17:10 UTC (Mon)
by nybble41 (subscriber, #55106)
[Link]
Isn't that exactly what happens when you try to access memory which has been revoke()'d? You mapped a range of a file/device into memory, but when you tried to access it the backing device wasn't available.
> SIGSEGV means that you're trying to access a virtual mapping you do not have rights to. In this case, it's because your rights have been revoke()'d.
Generally these rights are the ones the program set up via mmap() and mprotect(). They don't change asynchronously. SIGBUS, on the other hand, can already occur due to asynchronous events, like an I/O error reading from an mmap()'d file. (Which could be due to e.g. unplugging a USB drive, one of the use cases for revoke(). Choosing SIGBUS would mean unmodified programs continue to see the same behavior in such cases.)
Posted Jul 13, 2013 1:40 UTC (Sat)
by vomlehn (guest, #45588)
[Link]
No, please, no! Solutions which return some value or which allow reuse of the process' virtual address space without notification are going to cause the process to use bad data without any way to know this has happened. Huge possibilities for *bad* things to happen. Signals may be the best way to handle this because they, at least, guarantee that naive programs, i.e. those that are not programmed to handle revocation of a file descriptor, will quickly draw attention to themselves as requiring a fix. One other possibility would be to disallow revoke() if the device is mapped unless the process has indicated it can handle it via some other system call (revoke_ok(), perhaps). Then it can choose to handle it with a signal, by getting some magic value from memory, or something else entirely. This may be the most backward compatible approach.
> Personally I'm hoping this will *finally* let us have a non-root X server :}
The sad truth:
-rwsr-sr-x 1 root root 14256 Mar 3 2012 /usr/bin/X
Handling of revoke() with mapped memory
...a better solution might be to just cause all accesses to return 0xff...