
Seccomp and sandboxing

By Jonathan Corbet
May 13, 2009
Back in 2005, Andrea Arcangeli, mostly known for memory management work in those days, wandered into the security field with the "secure computing" (or "seccomp") feature. Seccomp was meant to support a side business of his which would enable owners of Linux systems to rent out their CPUs to people doing serious processing work. Allowing strangers to run arbitrary code is something that people tend to be nervous about; they require some pretty strong assurance that this code will not have general access to their systems.

Seccomp solves this problem by putting a strict sandbox around processes running code from others. A process running in seccomp mode is severely limited in what it can do; there are only four system calls - read(), write(), exit(), and sigreturn() - available to it. Attempts to call any other system call result in immediate termination of the process. The idea is that a control process could obtain the code to be run and load it into memory. After setting up its file descriptors appropriately, this process would call:

    prctl(PR_SET_SECCOMP, 1);

to enable seccomp mode. Once straitjacketed in this way, it would jump into the guest code, knowing that no real harm could be done. The guest code can run in the CPU and communicate over the file descriptors given to it, but it has no other access to the system.
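A minimal sketch of such a control process might look like this (illustrative only, with error handling mostly omitted):

    /*
     * Minimal sketch of a mode 1 seccomp process: once the prctl()
     * call succeeds, only read(), write(), exit(), and sigreturn()
     * remain available; any other system call is fatal.
     */
    #include <unistd.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>

    int main(void)
    {
        char buf[64];
        ssize_t n;

        if (prctl(PR_SET_SECCOMP, 1) != 0)
            return 1;      /* seccomp unavailable; still unrestricted */

        /* Echo stdin to stdout using only the permitted calls. */
        while ((n = read(0, buf, sizeof(buf))) > 0)
            write(1, buf, n);

        /* glibc's _exit() makes the exit_group() system call, which
         * strict seccomp forbids; make the raw exit() call instead. */
        syscall(SYS_exit, 0);
    }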

Andrea's CPUShare never quite took off, but seccomp remained in the kernel. Last February, when a security hole was found in the seccomp code, Linus wondered whether it was being used at all. It seems likely that there were, in fact, no users at that time, but there was one significant prospective user: Google.

Google is not looking to use seccomp to create a distributed computing network; one assumes that, by now, they have developed other solutions to that problem. Instead, Google is looking for secure ways to run plugins in its Chrome browser. The Chrome sandbox is described this way:

Sandbox leverages the OS-provided security to allow code execution that cannot make persistent changes to the computer or access information that is confidential. The architecture and exact assurances that the sandbox provides are dependent on the operating system. Currently the only finished implementation is for Windows.

It seems that the Google developers thought that seccomp would make a good platform on which to create a "finished implementation" for Linux. Google developer Markus Gutschke said:

Simplicity is really the beauty of seccomp. It is very easy to verify that it does the right thing from a security point of view, because any attempt to call unsafe system calls results in the kernel terminating the program. This is much preferable to most ptrace solutions, which are more difficult to audit for correctness.

The downside is that the sandbox'd code needs to delegate execution of most of its system calls to a monitor process. This is slow and rather awkward. Although due to the magic of clone(), (almost) all system calls can in fact be serialized, sent to the monitor process, have their arguments safely inspected, and then executed on behalf of the sandbox'd process. Details are tedious but we believe they are solvable with current kernel APIs.

There is, however, the little problem that sandboxed code can usefully (and safely) invoke more than the four allowed system calls. That limitation can be worked around ("tedious details"), but performance suffers. What the Chrome developers would like is a more flexible way of specifying which system calls can be run directly by code inside the sandbox.
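The delegation pattern itself is straightforward to sketch. In this illustration (which is not Chrome's actual mechanism), a strictly-sandboxed child asks its monitor, over an ordinary pipe, to run gettimeofday() on its behalf:

    /*
     * Hedged sketch of system call delegation: the child, limited by
     * strict seccomp to read()/write()/exit()/sigreturn(), writes a
     * request to a pipe; the monitor executes the call and writes the
     * result back.  Illustrative only - the clone()-based scheme
     * described in the quote above is considerably more involved.
     */
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        int req[2], rep[2];
        char c = 't';                  /* "please call gettimeofday" */
        struct timeval tv;

        pipe(req);
        pipe(rep);

        if (fork() == 0) {             /* the sandboxed child */
            prctl(PR_SET_SECCOMP, 1);
            write(req[1], &c, 1);            /* serialized request */
            read(rep[0], &tv, sizeof(tv));   /* result from monitor */
            syscall(SYS_exit, 0);            /* raw exit(), not exit_group() */
        }

        /* The monitor: read the request, run the call, reply. */
        read(req[0], &c, 1);
        gettimeofday(&tv, NULL);
        write(rep[1], &tv, sizeof(tv));
        return 0;
    }

The hard part, as the quote above suggests, is doing this safely for calls whose arguments point into the sandboxed process's memory.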

One suggestion that came out was to add a new "mode" to seccomp. The API was designed with the idea that different applications might have different security requirements; it includes a "mode" value which specifies the restrictions that should be put in place. Only the original mode has ever been implemented, but others can certainly be added. Creating a new mode which allowed the initiating process to specify which system calls would be allowed would make the facility more useful for situations like the Chrome sandbox.

Adam Langley (also of Google) has posted a patch which does just that. The new "mode 2" implementation accepts a bitmask describing which system calls are accessible. If one of those is prctl(), then the sandboxed code can further restrict its own system calls (but it cannot restore access to system calls which have been denied). All told, it looks like a reasonable solution which could make life easier for sandbox developers.
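The interface might be used something like the following purely hypothetical sketch; the calling convention of the actual (unmerged) patch may well differ:

    /*
     * Hypothetical illustration of a bitmask-based "mode 2" seccomp;
     * the real patch's interface may differ.  One bit per system call
     * number, set = allowed.
     */
    #include <string.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>

    #define ALLOW(mask, nr)  ((mask)[(nr) / 8] |= 1 << ((nr) % 8))

    int main(void)
    {
        unsigned char mask[512 / 8];   /* covers syscall numbers 0-511 */

        memset(mask, 0, sizeof(mask));
        ALLOW(mask, SYS_read);
        ALLOW(mask, SYS_write);
        ALLOW(mask, SYS_exit);
        ALLOW(mask, SYS_exit_group);
        ALLOW(mask, SYS_prctl);        /* so we can restrict further later */

        prctl(PR_SET_SECCOMP, 2, mask);   /* hypothetical mode 2 call */

        /* From here on, only the whitelisted calls are available, and
         * prctl() can now only clear bits, never set them. */
        return 0;
    }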

That said, this code may never be merged because the discussion has since moved on to other possibilities. Ingo Molnar, who has been arguing for the use of the ftrace framework in a number of situations, thinks that ftrace is a perfect fit for the Chrome sandbox problem as well. He might be right, but only for a version of ftrace which is not yet generally available.

Using ftrace for sandboxing may seem a little strange; a tracing framework is supposed to report on what is happening while perturbing the situation as little as possible. But ftrace has a couple of tools which may be useful in this situation. The system call tracer is there now, making it easy to hook into every system call made by a given process. In addition, the current development tree (perhaps destined for 2.6.31) includes an event filter mechanism which can be used to filter out events based on an arbitrary boolean expression. By using ftrace's event filters, the sandbox could go beyond just restricting system calls; it could also place limits on the arguments to those system calls. An example supplied by Ingo looks like this:

    { "sys_read",		"fd == 0" },
    { "sys_write",		"fd == 1" },
    { "sys_sigreturn",		"1" },
    { "sys_gettimeofday",	"tz == NULL" },

These expressions implement something similar to mode 1 seccomp. But, additionally, read() is limited to the standard input and write() to the standard output. The sandboxed process is also allowed to call gettimeofday(), but it is not given access to the time zone information.
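For the curious, the event filter machinery in the development tree is configured through per-event "filter" files in debugfs; whether a sandboxing mechanism would reuse that interface is an open question, and the exact event path below is an assumption:

    /*
     * Sketch of installing an event filter through debugfs.  The
     * filter syntax is real; the event path is an assumption, and a
     * sandbox might well be configured some other way entirely.
     */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *expr = "fd == 0";   /* mirror the sys_read rule above */
        int fd = open("/sys/kernel/debug/tracing/events/syscalls/"
                      "sys_enter_read/filter", O_WRONLY);

        if (fd >= 0) {
            write(fd, expr, strlen(expr));
            close(fd);
        }
        return 0;
    }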

The expressions can be arbitrarily complex. They are also said to be very fast; Ingo claims that they evaluate more quickly than security module hooks do. And, if straight system call filtering is not enough, arbitrary tracepoints can be placed elsewhere. All told, it does seem like a fairly general mechanism for restricting what a given process can do.

The problem cannot really be seen as solved yet, though. The event tracing code is very new and mostly unused so far. It is still out of the mainline, meaning that it could easily be a year or so before it shows up in kernels shipped by distributions. The code allowing this mechanism to be used to control execution has yet to be written. So Chrome will not have a sandbox based on anything other than mode 1 seccomp for some time (though the Chrome developers are also evaluating SELinux for this purpose).

Beyond that, there are some real doubts about whether system call interception is the right way to sandbox a process. There are well-known difficulties with verifying parameters which are stored in user space: a hostile process can change them between the execution of the security checks and the actual use of the data - the classic time-of-check-to-time-of-use race, sketched below. There are also subtle interactions between system calls, and often several different ways to accomplish the same operation, any of which can lead to a leaky sandbox. All of this has led James Morris to complain:

I'm concerned that we're seeing yet another security scheme being designed on the fly, without a well-formed threat model, and without taking into account lessons learned from the seemingly endless parade of similar, failed schemes.
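The argument-verification race mentioned above is easy to demonstrate. In this illustrative program (not drawn from any of the proposals), one thread repeatedly rewrites a pathname while another passes it to open(); a monitor which approved the buffer's contents could be fooled in the window between its check and the kernel's use of the data:

    /*
     * Illustration of the time-of-check-to-time-of-use race: a second
     * thread rewrites a system call argument after a hypothetical
     * monitor has approved it but before the kernel copies it in.
     * Build with: cc -pthread toctou.c
     */
    #include <fcntl.h>
    #include <pthread.h>
    #include <string.h>
    #include <unistd.h>

    static char path[32] = "/tmp/harmless";

    static void *flipper(void *unused)
    {
        for (;;) {
            strcpy(path, "/etc/shadow");    /* the real target */
            strcpy(path, "/tmp/harmless");  /* what the monitor sees */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, flipper, NULL);

        for (;;) {
            /* A monitor inspecting 'path' here could approve it just
             * before flipper swaps in the forbidden name. */
            int fd = open(path, O_RDONLY);
            if (fd >= 0)
                close(fd);
        }
    }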

Ingo is not worried, though; he notes that the ability to place arbitrary tracepoints allows filtering at any spot, not just at system call entry. So the problems associated with system call interception are not necessarily an issue with the ftrace-based scheme. Beyond that, this is a specific sort of security problem:

Your argument really pertains to full-system security solutions - while maximising compatibility and capability and minimizing user inconvenience. _That_ is an extremely hard problem with many pitfalls and snake-oil merchants flooding the roads. But that is not our goal here: the goal is to restrict execution in very brutal but still performant ways.

This has the look of a discussion which will take some time to play out. There is sure to be opposition to turning the event filtering code into another in-kernel security policy language. It may turn out that the simple seccomp extension is more generally palatable. Or something completely different could come along. What is clear is that the sandboxing problem is hard; many smart people have tried to implement it in a number of different ways with varying levels of success. There is no assurance that the solution will be easier this time around.



Posted May 14, 2009 3:30 UTC (Thu) by jamesmrh (guest, #31622)

It's like TCP or Unix, which people keep reinventing poorly.

They start out with an idea which superficially seems simple and efficient, yet once all of the hard-learned lessons of the past are applied with all of their subtleties and nuances, the end result is just some variation on an existing scheme, but without the benefit of having been closely scrutinized and shaken-out over time.

That's what I'm sensing in this case, although I'm more than happy to be proven wrong.

hammer

Posted May 14, 2009 13:41 UTC (Thu) by fuhchee (subscriber, #40059)

It may just be a case of a new shiny hammer being thought perfect for all suddenly nail-resembling problems.

Posted May 17, 2009 14:05 UTC (Sun) by davecb (subscriber, #1574)

jamesmrh wrote: It's like TCP or Unix, which people keep reinventing poorly.

A useful area to look at for previous successful solutions is MAC, or Mandatory Access Control, which is a necessary and sufficient component of a secure system, from work done back in 1985.

Besides being part of SE Linux, it's also one of the building blocks of the Solaris version of kernel virtual machines, "zones", so it's not just well-understood, it's well-tested.

For the original wheel, see the Department of Defense Trusted Computer System Evaluation Criteria. Accept no substitutes: the "common criteria" are watered-down political compromises with no technical content (;-))

--dave

Losing the simplicity

Posted May 14, 2009 4:28 UTC (Thu) by felixfix (subscriber, #242)

It does seem like a fun project, but I shudder thinking of how the utter simplicity will be thrown out for endless complexity. It's bad enough having a bitmask to allow arbitrary syscalls, but to then monitor syscall arguments? Seems to me the word "security" hardly deserves to be treated so shabbily as to be associated with this.

Simplicity is useful

Posted May 14, 2009 14:41 UTC (Thu) by job (guest, #670)

The most successful sandbox must be chroot+setuid. Probably because it is portable, simple, and easy to understand. Both the administrator and the programmer know directly what they can trust such a process with.

That's why I think something like seccomp would be usable. Anything outside of pure computation must be done outside it. No flexibility, nothing. Attack vectors are isolated to the monitor process.

Simplicity is useful

Posted May 20, 2009 17:40 UTC (Wed) by sfink (subscriber, #6405)

I agree, chroot + setuid is one of the most successful models out there -- assuming you're measuring success by popularity. If you factor in effectiveness, on the other hand, I was under the impression that it's a disaster.

setuid is good, but privilege escalation flaws are not that hard to come by. And once you have root privileges, chroot is no longer a security mechanism, it's just a convenient filesystem remapping trick. Nothing prevents you from creating your own special device and mounting the entire filesystem within your chroot jail. And that's only one of many, many ways to escape chroot.
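One classic example, as a sketch which assumes the process has gained root inside the jail:

    /*
     * Sketch of the classic chroot() escape, assuming root privileges
     * inside the jail: chroot() to a subdirectory without chdir()ing
     * into it, then walk ".." back out past the old root.
     */
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int i;

        mkdir("hole", 0755);
        chroot("hole");        /* cwd is now outside the new root */
        for (i = 0; i < 64; i++)
            chdir("..");       /* climb up to the real root */
        chroot(".");           /* and make it our root again */
        return execl("/bin/sh", "sh", (char *)0);
    }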

Simplicity is useful

Posted May 20, 2009 18:32 UTC (Wed) by dlang (✭ supporter ✭, #313)

if you assume that privilege escalation exploits are everywhere, nothing less than a fully locked down SELinux system can do you any good (and note that _no_ distro is shipping a _fully_ locked down SELinux system)

if privilege escalation exploits are not everywhere, then chroot is much stronger.

and even though it's not as strong as other security mechanisms could be, the fact that those other mechanisms aren't used makes them pretty useless

however, I will disagree slightly with chroot being the most successful model, I'll point out that it builds on the basic unix user/group permissions, and I would call _that_ the most successful model

Simplicity is useful

Posted Jul 17, 2009 22:44 UTC (Fri) by job (guest, #670)

Nothing is a security mechanism against privilege escalation flaws in the mechanism itself. That's what appeals to me about seccomp: it should be possible to make it secure, as opposed to complex stuff such as LSM- or SELinux-arbitrated access control.

Seccomp and sandboxing

Posted May 14, 2009 16:26 UTC (Thu) by deater (subscriber, #11746)

what? Ingo re-inventing the wheel while ignoring all the lessons learned from existing implementations? What an unexpected development

Seccomp and sandboxing

Posted May 21, 2009 17:46 UTC (Thu) by robert_s (subscriber, #42402)

This google sandbox does sound like the dumbest thing ever. It's bad enough having Adobe not supplying Flash for all architectures, but actually having web content that is fundamentally x86-only makes my toes curl.

Seccomp and sandboxing

Posted Aug 19, 2009 15:57 UTC (Wed) by walters (subscriber, #7396)

That's Native Client, which is not the same thing as what's being discussed here.

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds