Retrying revoke()

By Jonathan Corbet
April 9, 2013

The revoke() system call has one conceptually simple job: close any open file descriptors for the pathname given as its argument and prevent any further access via those descriptors. The classic use case is to ensure that no evil programs are holding a terminal or console device open before allowing logins there, but others exist as well. For example, a functioning revoke() implementation could be used within the kernel to cleanly disconnect any file descriptors referring to a device that has been removed from the system; filesystems like /proc could also use it to cleanly remove no-longer-needed virtual files.

There is only one problem: Linux does not support revoke(), and every attempt to add it over the years has ended in failure. The functionality behind revoke() turns out to be quite difficult to implement in a safe way. The latest attempt at a revoke() implementation may well come to a similar conclusion; there is not even a proof-of-concept patch to evaluate, after all. But, since the developer behind it is Al Viro, one assumes that its chances of success are mildly better than average.

Not every file or device will support revoke(); in some cases, it may still prove too hard to do properly. With Al's proposal, in cases where revocation is supported, there would be a new structure associated with the relevant device (or other) structure:

    struct revokable {
	atomic_t in_use;		// number of threads in methods; negative once revoked
	spinlock_t lock;
	struct hlist_head list;
	struct completion *c;
	void (*kick)(struct revokable *);
    };

The in_use field is charged with tracking how many threads are actively executing in the file_operations methods associated with this object. Performing this tracking would require changing every method call site throughout the kernel to call a couple of helper functions and check for a revoked file. So a call that currently looks like:

    ret = file->f_op->read(...);

would be turned into something like:

    if (start_using(file)) {
	ret = file->f_op->read(...);
	stop_using(file);
    } else {
	ret = -EIO;  /* File revoked */
    }

The start_using() and stop_using() helper functions increment and decrement the in_use counter. If that counter is negative, though, access is being revoked and start_using() will return false; in such cases, the file_operations method should not be called and an appropriate error code should be returned. Naturally, the details of these helper functions are a bit more complex than this; see Al's posting for a more complete story. As Al notes, there are quite a few call sites for file_operations methods in the kernel, so this particular change would be relatively intrusive.
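As a rough illustration of the counting scheme, here is a minimal userspace sketch using C11 atomics. It is not Al's code: struct revokable_sketch, the REVOKE_BIAS constant, and the begin_revoke() helper are names invented for this example, and the helpers take the object directly rather than a struct file.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumption: the "large number" subtracted to mark revocation. */
#define REVOKE_BIAS INT_MIN

/* Hypothetical stand-in for the in_use field of struct revokable. */
struct revokable_sketch {
	atomic_int in_use;	/* >= 0: callers in methods; < 0: being revoked */
};

/* Enter a method: bump in_use unless revocation has already started. */
static bool start_using(struct revokable_sketch *r)
{
	int old = atomic_load(&r->in_use);

	while (old >= 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&r->in_use, &old, old + 1))
			return true;
	}
	return false;	/* counter is negative: the object has been revoked */
}

/*
 * Leave a method. Returns true if this caller was the last one out
 * while a revocation was pending, which is the point at which a real
 * implementation would wake up the revoker.
 */
static bool stop_using(struct revokable_sketch *r)
{
	return atomic_fetch_sub(&r->in_use, 1) == REVOKE_BIAS + 1;
}

/* Start revoking; returns how many callers were inside at that moment. */
static int begin_revoke(struct revokable_sketch *r)
{
	return atomic_fetch_add(&r->in_use, REVOKE_BIAS);
}
```

The compare-and-swap loop is what keeps a late caller from slipping in: once the counter has gone negative, no increment can succeed, so the set of threads the revoker must wait for can only shrink.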

The purpose of the kick() callback is to instruct the object's driver that access is being revoked and any outstanding I/O operations should be brought to an end. Processes waiting on I/O should return with an error code and the I/O canceled. After the kick() call, the number of threads running within the object's file_operations should quickly drop to zero.

When open() is called on an object that supports revocation, the associated file structure will gain a pointer to a structure like:

    struct revoke {
	struct file *file;
	struct revokable *revokable;
	struct hlist_node list;
	bool closing;
	struct completion *c;
    };

The list field is used to track all open files associated with a given revocable object. As the last step in an open() implementation, the make_revokable() helper will be called to allocate the revoke structure and attach it to the list in the object's revokable structure.
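What make_revokable() has to do can be sketched in a few lines of userspace C. Everything here is an assumption made for illustration: the *_sketch types are cut-down stand-ins for the kernel structures, the signature is a guess, a plain singly linked list replaces the hlist, and the locking that the real code would need is only noted in a comment.

```c
#include <stdbool.h>
#include <stdlib.h>

struct revoke_sketch;

/* Hypothetical cut-down versions of struct file and struct revokable. */
struct file_sketch {
	struct revoke_sketch *revoke;	/* set for revocable objects only */
};

struct revokable_sketch {
	struct revoke_sketch *list;	/* all open files for this object */
};

struct revoke_sketch {
	struct file_sketch *file;
	struct revokable_sketch *revokable;
	struct revoke_sketch *next;	/* stands in for the hlist_node */
	bool closing;
};

/*
 * Called as the last step of open(): allocate the per-file structure
 * and link it into the object's list. Returns 0 on success, -1 if the
 * allocation fails (the kernel would return -ENOMEM instead).
 */
static int make_revokable(struct file_sketch *file, struct revokable_sketch *r)
{
	struct revoke_sketch *rev = calloc(1, sizeof(*rev));

	if (!rev)
		return -1;
	rev->file = file;
	rev->revokable = r;
	/* A real implementation would hold the object's lock here. */
	rev->next = r->list;
	r->list = rev;
	file->revoke = rev;
	return 0;
}
```

After two opens of the same object, its list holds two entries, newest first, and each file points back at its own entry; that back pointer is what lets a later close() find and unlink the entry.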

With this infrastructure in place, an implementation of revoke() becomes possible. The steps, roughly, are these:

1. Mark the object as being revoked by subtracting a large number from its in_use counter, turning that counter negative. That will prevent any further calls to the object's file_operations methods.
2. If in_use indicates that threads are currently running in the object's file_operations, call kick() to encourage them all to finish and wait until they all complete.
3. For each open file, call the release() method to close that file, and remove the file from the list.
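The steps above can be sketched as a single-threaded userspace simulation. This is not the proposed kernel code: the types, REVOKE_BIAS, release_file(), and demo_kick() are all invented for the example, the list is walked without locking, and the wait is a polling loop where the kernel would presumably sleep on the completion.

```c
#include <limits.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define REVOKE_BIAS INT_MIN	/* assumed "large number" */

/* Hypothetical stand-in for one open file on the object's list. */
struct open_file {
	struct open_file *next;
	int released;
};

struct revokable_sketch {
	atomic_int in_use;
	struct open_file *files;
	void (*kick)(struct revokable_sketch *);
};

/* Stand-in for calling the file's release() method. */
static void release_file(struct open_file *f)
{
	f->released = 1;
}

/*
 * Demo-only kick(): pretends the driver aborted all I/O and every
 * blocked caller returned, each one decrementing in_use on its way out.
 */
static void demo_kick(struct revokable_sketch *r)
{
	while (atomic_load(&r->in_use) != REVOKE_BIAS)
		atomic_fetch_sub(&r->in_use, 1);
}

static void revoke_object(struct revokable_sketch *r)
{
	/* Step 1: drive the counter negative so no new caller can enter. */
	int active = atomic_fetch_add(&r->in_use, REVOKE_BIAS);

	/* Step 2: if anybody is inside, kick them and wait for them. */
	if (active > 0) {
		r->kick(r);
		while (atomic_load(&r->in_use) != REVOKE_BIAS)
			sched_yield();
	}

	/* Step 3: release every open file and unlink it from the list. */
	while (r->files) {
		struct open_file *f = r->files;

		r->files = f->next;
		f->next = NULL;
		release_file(f);
	}
}
```

The ordering is the point of the exercise: release() is not called on any file until the counter shows that nobody is left inside the object's methods.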

At the end of this process, there should be no open files for the given object and no threads will be running in any file operations associated with that object. The latter point is important; a robust and secure revoke() implementation is possible only if the kernel can be sure that all previous references to the revoked object are truly gone. Once that has happened, it should then be possible to free any associated resources or allow new processes to open the object.

There is, of course, one other thorny little problem: what to do about processes that have used mmap() to map the object into their address space. One possibility is to forcibly unmap the memory, tearing down the associated page tables and marking the virtual memory area (VMA) structure accordingly; the process would then most likely receive a SIGSEGV signal if it attempted to access that address space. That approach is secure, but also risks causing programs to crash unexpectedly. In cases where device memory has been mapped, a better solution might be to just cause all accesses to return 0xff (extended out to the correct width for the specific access). Proper handling of mmap() in this situation is an open question, and one apparently without precedent in the current implementations of revoke() in other systems; revoke() on BSD systems works only on devices without mapped memory.

There is a fair gap between an RFC posting with a clever idea and an actual, working implementation; it may well be that this approach to revoke() will, like its predecessors, run aground in the real world. But the lack of a working revoke() has been seen as a shortcoming in Linux for many years; it would be nice to finally get this functionality into place. So, just maybe, things will work out this time around.


Retrying revoke()

Posted Apr 11, 2013 2:14 UTC (Thu) by butlerm (subscriber, #13312) [Link] (15 responses)

> the process would then most likely receive a SIGSEGV signal if it attempted to access that address space. That approach is secure, but also risks causing programs to crash unexpectedly

SIGSEGV is the only reasonable option here. Corrupting the data the process reads is much more likely to make a program crash in a bad way. Think banking, finance, or industrial control. Instead of a stopped process, you could have corrupted transactions.

Retrying revoke()

Posted Apr 11, 2013 2:42 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (1 responses)

I would think SIGBUS would make more sense than SIGSEGV. IIRC that is the signal you get when the underlying device reports an I/O error. Simply unmapping the memory could still lead to the wrong data being read (or written!) if a later mmap call reuses the now-free address space.

Retrying revoke()

Posted Apr 14, 2013 12:54 UTC (Sun) by Karellen (subscriber, #67644) [Link]

How about, if a process has a mapping on a revoke()d fd, it gets a SIGBUS as a notification *at the time revoke() happens*. If it tries to access the mapping, it gets a SIGSEGV.

Processes could then try to catch SIGBUS and internally mark any mmap()ed regions as invalid if they want. Or mremap() them? Which could return EINVAL rather than EFAULT for revoked mappings?

There will presumably still be race conditions where one thread might access the mapped region before the signal handler completes, but presumably that's still better than the alternative - no notice, guaranteed segv.

Retrying revoke()

Posted Apr 11, 2013 11:23 UTC (Thu) by cavok (subscriber, #33216) [Link] (10 responses)

Why not fail the revoke() if the file is mmapped anywhere?

Retrying revoke()

Posted Apr 11, 2013 16:06 UTC (Thu) by nix (subscriber, #2304) [Link] (9 responses)

Because generally things that call revoke() don't deal well with it failing. It's meant to say 'this device is going away or has gone away': nobody's allowed to say 'no it hasn't' in response to that.

Personally I'm hoping this will *finally* let us have a non-root X server :}

Retrying revoke()

Posted Apr 11, 2013 16:20 UTC (Thu) by apoelstra (subscriber, #75205) [Link] (8 responses)

> Because generally things that call revoke() don't deal well with it failing. It's meant to say 'this device is going away or has gone away': nobody's allowed to say 'no it hasn't' in response to that.
> Personally I'm hoping this will *finally* let us have a non-root X server :}

Can you elaborate on this? I've been running 'startx' as an unprivileged user for a couple years and haven't noticed anything awful happening.

Retrying revoke()

Posted Apr 11, 2013 16:39 UTC (Thu) by walters (subscriber, #7396) [Link] (5 responses)

That "works" because Xorg is setuid root on your system, so it's actually running as root.

(Note: this is a huge attack surface, and at least in e.g. gnome-ostree I simply don't make Xorg setuid, and don't ship startx; you have to log in via GDM)

Retrying revoke()

Posted Apr 13, 2013 19:32 UTC (Sat) by guillemj (subscriber, #49706) [Link] (4 responses)

> That "works" because Xorg is setuid root on your system, so it's actually running as root.

If that's a Debian-based distribution, then the X binary is just a pretty small setuid wrapper that checks if the user can invoke the real non-setuid Xorg binary based off some policies from a wrapper-specific configuration file.

<http://anonscm.debian.org/gitweb/?p=pkg-xorg/debian/xorg.git;...>

> (Note: this is a huge attack surface, and at least in e.g. gnome-ostree I simply don't make Xorg setuid, and don't ship startx; you have to log in via GDM)

Doesn't GDM also run as root, and consequently also the executed Xorg process?

Retrying revoke()

Posted Apr 13, 2013 19:35 UTC (Sat) by apoelstra (subscriber, #75205) [Link] (1 responses)

> If that's a Debian-based distribution, then the X binary is just a pretty small setuid wrapper that checks if the user can invoke the real non-setuid Xorg binary based off some policies from a wrapper-specific configuration file.

I'm running Fedora -- if I remove the setuid bit, X won't start because it lacks permission to hijack a tty. (Maybe I can fix this, but I don't know how. There are so many special groups on modern desktops..)

Retrying revoke()

Posted Apr 14, 2013 12:36 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

I'm assuming you're using startx for this? If that's the case, I had filed a bug about programs in the X session being denied PolicyKit since the TTY didn't match the login TTY. You can pass "vt02" to launch on a separate TTY, but I think you still need suid to do that.

On a related note, that's the reason why a systemd --user session doesn't work right now: I get denied taking over the TTY, but I can't use a different TTY because PolicyKit denies nice things like suspend and shutdown.

Retrying revoke()

Posted Apr 15, 2013 16:31 UTC (Mon) by walters (subscriber, #7396) [Link] (1 responses)

No, it's based on OpenEmbedded.

You are also conflating the setuid bit on Xorg with running as root - these are two independent things.

Retrying revoke()

Posted Apr 21, 2013 19:07 UTC (Sun) by guillemj (subscriber, #49706) [Link]

> No, it's based on OpenEmbedded.

I was referring to apoelstra's or nix's systems but anyway, nice to know. :)

> You are also conflating the setuid bit on Xorg with running as root - these are two independent things.

Not really. You mentioned that Xorg is running as root because it's setuid root, and that this was a "huge attack surface", without specifying which part. So while I agree making the full-blown Xorg setuid root is an attack vector, to me it's just tiny (because it's easy to avoid with the Debian wrapper for example) in comparison to running the X server as root, which I assume is still the case with something like GDM. The whole point of this subthread was the possibility of being able to finally run the X server as non-root, which would get rid of the actual (IMO) huge attack surface.

Retrying revoke()

Posted Apr 11, 2013 16:40 UTC (Thu) by dark (guest, #8483) [Link] (1 responses)

The sad truth:

-rwsr-sr-x 1 root root 14256 Mar  3  2012 /usr/bin/X

Retrying revoke()

Posted Apr 13, 2013 9:55 UTC (Sat) by mlankhorst (subscriber, #52260) [Link]

I'm curious about revoke. For forced drm module unloading of drivers (nouveau/ati) I've done something similar, but the real problem is racing with mmaps: during mmap and munmap there is a point where the mapping has been removed from (or not yet added to) the list but is still valid, just not tracked.

I haven't figured out how to close the mmap race there, but for the revoke case it might be important.

Retrying revoke()

Posted Apr 15, 2013 3:58 UTC (Mon) by jzbiciak (guest, #5246) [Link] (1 responses)

What about just converting the mapping to a private, anonymous, no-permission mapping? That is, leave a mapping there (so future mmap() won't use the address space), but mark it no-access (PROT_NONE). Furthermore, block mprotect() from adding access back to the memory, i.e. always return EACCES.

That'll prevent the mremap/mmap address reuse mentioned downthread, by reserving the address range, and it will still allow an munmap to unmap the memory without incident if the process finds out the resource went away through some other mechanism.

It'll give SIGSEGV, not SIGBUS as some argued might be better, but I'm not really sure I see why SIGBUS is better. SIGBUS means you tried to access physical memory that wasn't there, or in a way that it doesn't support (i.e. misaligned access on architectures that don't support it). SIGSEGV means that you're trying to access a virtual mapping you do not have rights to. In this case, it's because your rights have been revoke()'d.

Retrying revoke()

Posted Apr 15, 2013 17:10 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

> SIGBUS means you tried to access physical memory that wasn't there ...

Isn't that exactly what happens when you try to access memory which has been revoke()'d? You mapped a range of a file/device into memory, but when you tried to access it the backing device wasn't available.

> SIGSEGV means that you're trying to access a virtual mapping you do not have rights to. In this case, it's because your rights have been revoke()'d.

Generally these rights are the ones the program set up via mmap() and mprotect(). They don't change asynchronously. SIGBUS, on the other hand, can already occur due to asynchronous events, like an I/O error reading from an mmap()'d file. (Which could be due to e.g. unplugging a USB drive, one of the use cases for revoke(). Choosing SIGBUS would mean unmodified programs continue to see the same behavior in such cases.)

Handling of revoke() with mapped memory

Posted Jul 13, 2013 1:40 UTC (Sat) by vomlehn (guest, #45588) [Link]

> ...a better solution might be to just cause all accesses to return 0xff...

No, please, no! Solutions which return some value or which allow reuse of the process' virtual address space without notification are going to cause the process to use bad data without any way to know this has happened. Huge possibilities for *bad* things to happen.

Signals may be the best way to handle this because they, at least, guarantee that naive programs, i.e. those that are not programmed to handle revocation of a file descriptor, will quickly draw attention to themselves as requiring a fix.

One other possibility would be to disallow revoke() if the device is mapped unless the process has indicated it can handle it via some other system call (revoke_ok(), perhaps). Then it can choose to handle it with a signal, by getting some magic value from memory, or something else entirely. This may be the most backward compatible approach.