
Expanding seccomp

By Jake Edge
May 4, 2011

Sandboxing processes such that they cannot make "dangerous" system calls is an attractive feature that has already been implemented in a limited way for Linux with seccomp. Two years ago, we looked at a proposal to expand seccomp to allow more fine-grained control over which system calls would be allowed. That proposal has been mostly dormant since then, but was recently resurrected after incorporating some of the suggestions made at that time. The reaction to the current proposal so far seems positive, and it might just be gaining some traction that the previous patchset lacked.

Seccomp (from "secure computing") is enabled via a prctl() call and, once enabled, restricts the process from making any further system calls beyond read(), write(), exit(), and sigreturn()—any other system call will abort the process. That creates a pretty secure sandbox, but it is also extremely limited as there are other things that developers might want to do from within such a sandbox. In fact, the Chromium browser has gone to great lengths to implement its own sandbox that uses seccomp, but expands the range of legal system calls through some contortions.
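As a rough illustration of the existing mode described above, the following sketch forks a child, puts it into strict seccomp mode, and reports what happened to it. The function name run_in_strict_seccomp() and its return-value convention are made up for this example; the prctl() constants come from the kernel headers. It assumes a Linux system, and it returns -1 when seccomp cannot be enabled (for example, when the kernel lacks seccomp support or the process is already running under another seccomp policy).

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: fork a child, enable strict seccomp in it, and
 * optionally have it make a system call outside the permitted set.
 * Returns 0 if the child exited cleanly, the signal number if it was
 * killed, or -1 if seccomp could not be enabled (or fork/wait failed).
 */
int run_in_strict_seccomp(int make_forbidden_call)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Still unrestricted here, so a normal _exit() is fine on failure. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);

        if (make_forbidden_call)
            syscall(SYS_getpid);   /* not on the permitted list */

        /*
         * Use the raw exit() system call: glibc's _exit() issues
         * exit_group(), which strict mode does not permit.
         */
        syscall(SYS_exit, 0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}
```

Where strict mode is available, calling run_in_strict_seccomp(1) should report that the child was killed, while run_in_strict_seccomp(0) should report a clean exit. The need to bypass the C library's own exit path hints at how restrictive the existing mode is.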

That led Adam Langley of the Chromium team to propose adding a bitmask of allowable system calls for a new seccomp mode. That would have allowed processes to make a binary choice (allowed or disallowed) for each system call. At the time, Ingo Molnar suggested using the Ftrace filter code to make the interface even more flexible by allowing filters to be applied to the system call arguments. Essentially, that would make for three choices for each system call: enabled, disabled, or filtered.
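The bitmask idea can be pictured with a small user-space model; this is not Langley's patch, just a sketch of the data structure. The names syscall_mask, mask_allow(), and mask_permits() are invented for illustration, and MAX_SYSCALLS is an assumed bound that fits x86-64 system call numbers but not those of every architecture.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>
#include <sys/syscall.h>

#define MAX_SYSCALLS 512   /* assumed upper bound, for illustration only */

/* One bit per system call number: 1 means allowed, 0 means disallowed. */
struct syscall_mask {
    unsigned char bits[MAX_SYSCALLS / CHAR_BIT];
};

/* Start from "nothing allowed". */
void mask_clear(struct syscall_mask *m)
{
    memset(m->bits, 0, sizeof m->bits);
}

/* Mark system call number nr as allowed; out-of-range numbers are ignored. */
void mask_allow(struct syscall_mask *m, long nr)
{
    if (nr >= 0 && nr < MAX_SYSCALLS)
        m->bits[nr / CHAR_BIT] |= (unsigned char)(1u << (nr % CHAR_BIT));
}

/* The binary decision: is system call number nr allowed? */
bool mask_permits(const struct syscall_mask *m, long nr)
{
    if (nr < 0 || nr >= MAX_SYSCALLS)
        return false;
    return (m->bits[nr / CHAR_BIT] >> (nr % CHAR_BIT)) & 1u;
}
```

A mask like this can only answer yes or no for each call; it has no way to express "read() is fine, but only on this descriptor", which is the gap the filter suggestion addresses.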

Fast-forward to today, and that is what a patchset from Will Drewry implements. It should come as no surprise that Molnar was pleased to see his idea result in working code: "Ok, color me thoroughly impressed - AFAICS you implemented my suggestions [...] and you made it work in practice!". Eric Paris was likewise impressed, noting that an expanded seccomp could be used for QEMU. Molnar and Paris did not agree about replacing the LSM approach using filters, but that was something of an aside. Serge E. Hallyn also pointed out that the new feature would be useful for containers: "to try and provide some bit of mitigation for the fact that they are sharing a kernel with the host".

The proposed interface, which is likely to change based on comments in the thread, looks like:

    const char *filters =
      "sys_read: (fd == 1) || (fd == 2)\n"
      "sys_write: (fd == 0)\n"
      "sys_exit: 1\n"
      "sys_exit_group: 1\n"
      "on_next_syscall: 1";
    prctl(PR_SET_SECCOMP, 2, filters);

That example is taken from Drewry's documentation file that accompanies the patches. It would allow reading from two file descriptors (1 and 2) and writing to one (0), while allowing any calls to the two other system calls listed. The on_next_syscall means that the rules would not be enforced until after one more system call is made. That would allow a parent to fork(), set up the seccomp sandbox in the child process, then exec a new program which would be governed by the new rules.
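To make the effect of those rules concrete, here is a user-space model of the decisions the example filter would produce. It is only a sketch: the real checks would run inside the kernel through the Ftrace filter engine, the names rule, example_rules, and call_permitted() are invented here, and it assumes that any system call not listed is refused. The on_next_syscall delay is not modeled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* The three per-call choices: disabled, enabled, or filtered. */
enum policy { DENY, ALLOW, FILTER };

struct rule {
    long nr;                        /* system call number */
    enum policy policy;
    bool (*check)(long first_arg);  /* consulted only when policy == FILTER */
};

static bool fd_is_1_or_2(long fd) { return fd == 1 || fd == 2; }
static bool fd_is_0(long fd)      { return fd == 0; }

/* Mirrors the example filter string, one entry per line of that string. */
static const struct rule example_rules[] = {
    { SYS_read,       FILTER, fd_is_1_or_2 },
    { SYS_write,      FILTER, fd_is_0 },
    { SYS_exit,       ALLOW,  NULL },
    { SYS_exit_group, ALLOW,  NULL },
};

/* Would the example filter let this call through? */
bool call_permitted(long nr, long first_arg)
{
    size_t count = sizeof example_rules / sizeof example_rules[0];

    for (size_t i = 0; i < count; i++) {
        const struct rule *r = &example_rules[i];
        if (r->nr != nr)
            continue;
        if (r->policy == ALLOW)
            return true;
        if (r->policy == FILTER)
            return r->check(first_arg);
        return false;               /* explicit DENY entry */
    }
    return false;                   /* assumption: unlisted calls are refused */
}
```

Under this model, call_permitted(SYS_read, 1) is true while call_permitted(SYS_read, 0) is false, matching the description of the filter string above.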

That on_next_syscall piece drew a few comments. As it turns out, there are really only two cases that need to be handled: either the rules should go into effect immediately (for a process that wants to restrict itself before handling untrusted input, for example), or they should go into effect after an exec (for a parent that is spawning an untrusted child). Making the "after exec" case the default, while still allowing a process to request immediate application, seems to be the way things are headed.

There were also questions about using kernel-internal symbol names like sys_read. Exporting those as a kernel ABI is not likely to pass muster, as it might restrict the option of changing those function names down the road—or require a messy compatibility layer if they did change. Drewry wanted to avoid using the system call numbers as Langley's original patch did, but as Frederic Weisbecker pointed out, those numbers are already part of the kernel ABI. Drewry is planning to make that switch and users of the interface will need to use the unistd.h header file or a library to map system call names to numbers.
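For a sense of what mapping names to numbers might look like from user space, the sketch below builds a tiny lookup table from the SYS_* constants that glibc's <sys/syscall.h> provides (it pulls in the kernel's unistd.h definitions). The table contents and the syscall_number() helper are hypothetical; a real program or library would need a much fuller table.

```c
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

struct syscall_name {
    const char *name;
    long nr;
};

/* A deliberately tiny table; the numbers differ between architectures. */
static const struct syscall_name known_syscalls[] = {
    { "read",       SYS_read },
    { "write",      SYS_write },
    { "exit",       SYS_exit },
    { "exit_group", SYS_exit_group },
};

/* Returns the system call number for name, or -1 if it is not in the table. */
long syscall_number(const char *name)
{
    size_t count = sizeof known_syscalls / sizeof known_syscalls[0];

    for (size_t i = 0; i < count; i++) {
        if (strcmp(known_syscalls[i].name, name) == 0)
            return known_syscalls[i].nr;
    }
    return -1;
}
```

Because the constants are resolved at compile time from the headers, the same source yields the right numbers on each architecture it is built for.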

The patches also modify the /proc/PID/status file to output any existing filters that are applied to the process. Given that most applications that read that file don't need the extra information, though, Motohiro Kosaki suggested that seccomp get its own file. Drewry's plan is to provide that information in the /proc/PID/seccomp_filter file instead, and remove it from status.

Since it uses the Ftrace infrastructure and hooks, the new seccomp mode only works for those system calls that have Ftrace events associated with them. Using one of those non-instrumented system calls in the filters will result in an EINVAL from the prctl() call. Enabling CONFIG_SECCOMP_FILTER (which depends on CONFIG_FTRACE_SYSCALLS) will allow the use of the new mode.

Overall, Drewry has been very receptive to suggestions for changes, and the feedback to the concept has been pretty uniformly positive. Molnar suggested breaking out the Ftrace filter engine further—beyond the minimal changes that Drewry's patches make—so that it would be available for more widespread use in the kernel. Molnar does wonder whether Linus Torvalds or Andrew Morton might object to more use of the filter mechanism, however: "are you guys opposed to such flexible, dynamic filters conceptually? I think we should really think hard about the actual ABI as this could easily spread to more applications than Chrome/Chromium." So far, neither has spoken up one way or the other.

Currently it would seem that Drewry is off working on the next revision of the patchset, and it certainly doesn't seem like anything that would be merged in the upcoming 2.6.40 cycle. As Molnar notes, the ABI needs to be carefully thought out, there are still some RCU issues that are being discussed, and it probably needs some soaking time in the -next tree, but barring some major complaint cropping up, it's a feature that will likely make its way into the mainline relatively soon. While that won't allow Chromium to immediately ditch its complicated sandboxing arrangement, it may well be able to do so a few years down the road. Other applications will benefit from an expanded seccomp as well.



Expanding seccomp

Posted May 6, 2011 13:48 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

very interesting.

what is the cost of compiling this in to the kernel if no filters are defined?

Expanding seccomp

Posted May 10, 2011 22:03 UTC (Tue) by mhelsley (subscriber, #11324) [Link]

Very interesting indeed. Checkpoint/restart (out-of-tree at present) might be able to use expanded seccomp to nicely detect and limit unsupported interface usage in forward-compatible ways.

Expanding seccomp

Posted May 11, 2011 8:47 UTC (Wed) by cras (guest, #7000) [Link]

I had long been wishing SELinux worked like this. I as a developer know better what my application is allowed to do than some package maintainer / sysadmin stracing the process and guessing what is ok and what is not..

Expanding seccomp

Posted May 12, 2011 3:05 UTC (Thu) by TomMD (guest, #56998) [Link]

You could write your own SELinux policies. That was once the idea, SELinux policies would be provided as modules along with each program. Mind if I ask why you don't write policies for your own program? Evidently you are interested in the benefits SELinux has to offer.

Expanding seccomp

Posted May 12, 2011 6:41 UTC (Thu) by cras (guest, #7000) [Link]

The problems I see with writing my own policy are (assuming my SELinux understanding is correct):

* They couldn't automatically be used by anyone. Distros might pick them up for their own packages and admins might manually copy the file somewhere to use it (maybe replace the distro's own), but there is no good way to start using the policy automatically with just "make install".

* People seem to disable SELinux often, because it breaks some software. Having a policy isn't useful if the whole SELinux is disabled. It would be nice if SELinux had also a new mode: Globally disabled, but enabled for apps that explicitly enable it for themselves.

* The policies can't be dynamic. I might want slightly different policies depending on what my config file contains.

Expanding seccomp

Posted May 12, 2011 18:20 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

as I understand it, it's not possible to write a SELinux policy for just one application, due to the simple fact that SELinux policies work on the basis of each file having a single tag.

so all policies that have to touch a file (or directory) have to agree on what tag to use for that file or directory.

this makes it impossible to ship a policy for your software, as you have to coordinate the tags with everything else on the system.

this is one of the things that I see as making AppArmor so much better in the real world. since it doesn't depend on global tags, but instead lists what files are allowed, the AA policy for a particular app really can be independent of the policy for all other apps. So it could be provided by the software developer.

Expanding seccomp

Posted May 13, 2011 0:06 UTC (Fri) by cras (guest, #7000) [Link]

I'm not too interested in the "tag" vs "path" debate. For my use case (IMAP server) I don't think it makes any real difference. Many people use virtual users, where all users' mails are stored using the same UNIX UID. For extra security it is possible to chroot into a user's mail directory though. So I'd primarily want to avoid any potential ways to get around that chroot into other users' mails, by preventing syscalls that just aren't necessary.

Expanding seccomp

Posted May 11, 2011 19:59 UTC (Wed) by Baylink (subscriber, #755) [Link]

Well, this is all good cheese... but as someone who's spent the larger part of my career as a sysadmin rather than a programmer... *I can't see into it*.

SUID is pretty easy to audit. Capabilities, though I haven't used them much, are -- so I gather -- similar to audit from the sysadmin viewpoint.

This is going to affect security *down inside the source code where I can't see it*, is it not? Now, sure, it *reduces* the things a process can do.

But from what? If this *expands* the universe of stuff I gotta audit *because it inspires people to require more capabilities than they really need, and then drop the stuff they don't want... then it's going to make sysadmins' lives harder.

Expanding seccomp

Posted May 11, 2011 20:48 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

> But from what? If this *expands* the universe of stuff I gotta audit *because it inspires people to require more capabilities than they really need, and then drop the stuff they don't want... then it's going to make sysadmins' lives harder.

I don't really see the problem. Before file capabilities, if a process required *any* extra capabilities it needed to be SUID to root. Now processes can start out with a subset of those capabilities rather than full SUID. Worst case, I would think you could simply treat any executable file with a non-empty set of file capabilities as if it were SUID.

Or are you concerned that people will add individual capabilities to programs that formerly didn't require any, where the stigma of requiring full SUID would have dissuaded them?

Expanding seccomp

Posted May 12, 2011 8:06 UTC (Thu) by joib (guest, #8541) [Link]

Isn't this conceptually somewhat similar to the Capsicum project ( http://www.cl.cam.ac.uk/research/security/capsicum/ )?

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds