Seccomp and sandboxing
Seccomp solves this problem by putting a strict sandbox around processes running code from others. A process running in seccomp mode is severely limited in what it can do; there are only four system calls - read(), write(), exit(), and sigreturn() - available to it. Attempts to call any other system call result in immediate termination of the process. The idea is that a control process could obtain the code to be run and load it into memory. After setting up its file descriptors appropriately, this process would call:
prctl(PR_SET_SECCOMP, 1);
to enable seccomp mode. Once straitjacketed in this way, it would jump into the guest code, knowing that no real harm could be done. The guest code can run in the CPU and communicate over the file descriptors given to it, but it has no other access to the system.
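The sequence described above can be sketched in a few lines of C. This is a minimal illustration rather than anything taken from CPUShare or Chrome: the names guest() and run_guest_sandboxed() are hypothetical, and a pipe stands in for "file descriptors set up appropriately." It assumes a Linux system with glibc. Note that the child must leave via the raw exit() system call, since glibc's _exit() uses exit_group(), which mode 1 does not allow.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for untrusted guest code; once sandboxed it may
 * only use read(), write(), exit(), and sigreturn(). */
static void guest(int out_fd)
{
    static const char msg[] = "hello from the sandbox";
    ssize_t rc = write(out_fd, msg, sizeof(msg) - 1);
    (void)rc;
}

/*
 * Run guest() in a child under mode 1 seccomp and collect its output.
 * Returns the number of bytes received (buf is NUL-terminated), or -1 if
 * the sandbox could not be set up or the child did not exit cleanly.
 */
long run_guest_sandboxed(char *buf, size_t len)
{
    int fds[2];
    assert(len > 0);
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        /* Mode 1 is the original "strict" mode described in the article. */
        if (prctl(PR_SET_SECCOMP, 1) != 0)
            _exit(2);          /* not sandboxed yet, so _exit() is safe */
        guest(fds[1]);
        syscall(SYS_exit, 0);  /* raw exit(); exit_group() would be fatal */
    }

    close(fds[1]);
    ssize_t n = read(fds[0], buf, len - 1);
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (n < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    buf[n] = '\0';
    return (long)n;
}
```

A caller would pass a buffer and check for -1; the prctl() call can fail, for example, when the process is already running under a different seccomp mode, as is common inside containers.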
Andrea's CPUShare never quite took off, but seccomp remained in the kernel. Last February, when a security hole was found in the seccomp code, Linus wondered whether it was being used at all. It seems likely that there were, in fact, no users at that time, but there was one significant prospective user: Google.
Google is not looking to use seccomp to create a distributed computing network; one assumes that, by now, they have developed other solutions to that problem. Instead, Google is looking for secure ways to run plugins in its Chrome browser. The Chrome sandbox is described this way:
It seems that the Google developers thought that seccomp would make a good platform on which to create a "finished implementation" for Linux. Google developer Markus Gutschke said:
The downside is that the sandbox'd code needs to delegate execution of most of its system calls to a monitor process. This is slow and rather awkward. Although due to the magic of clone(), (almost) all system calls can in fact be serialized, sent to the monitor process, have their arguments safely inspected, and then executed on behalf of the sandbox'd process. Details are tedious but we believe they are solvable with current kernel APIs.
There is, however, the little problem that sandboxed code can usefully (and safely) invoke more than the four allowed system calls. That limitation can be worked around ("tedious details"), but performance suffers. What the Chrome developers would like is a more flexible way of specifying which system calls can be run directly by code inside the sandbox.
One suggestion that came out was to add a new "mode" to seccomp. The API was designed with the idea that different applications might have different security requirements; it includes a "mode" value which specifies the restrictions that should be put in place. Only the original mode has ever been implemented, but others can certainly be added. Creating a new mode which allowed the initiating process to specify which system calls would be allowed would make the facility more useful for situations like the Chrome sandbox.
Adam Langley (also of Google) has posted a patch which does just that. The new "mode 2" implementation accepts a bitmask describing which system calls are accessible. If one of those is prctl(), then the sandboxed code can further restrict its own system calls (but it cannot restore access to system calls which have been denied). All told, it looks like a reasonable solution which could make life easier for sandbox developers.
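The "restrict but never restore" property falls out naturally from a bitmask representation. The following user-space sketch is not the code from the patch; the structure, the function names, and the MAX_SYSCALLS bound are all assumptions made for illustration. It shows why intersecting the current mask with a requested one can only ever remove system calls.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYSCALLS  512   /* assumed upper bound on system call numbers */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS    ((MAX_SYSCALLS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical per-process mask: bit N set means system call N is allowed. */
struct syscall_mask {
    unsigned long bits[MASK_WORDS];
};

/* Mark system call nr as allowed when building a mask. */
void mask_allow(struct syscall_mask *m, unsigned nr)
{
    assert(nr < MAX_SYSCALLS);
    m->bits[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Check whether nr is allowed; out-of-range numbers are always denied. */
bool mask_allows(const struct syscall_mask *m, unsigned nr)
{
    if (nr >= MAX_SYSCALLS)
        return false;
    return (m->bits[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

/*
 * Apply a new request from already-sandboxed code.  Taking the
 * intersection means bits can be cleared but never set again.
 */
void mask_restrict(struct syscall_mask *cur, const struct syscall_mask *req)
{
    for (size_t i = 0; i < MASK_WORDS; i++)
        cur->bits[i] &= req->bits[i];
}
```

With this scheme, a request that names a previously denied system call is silently ignored for that bit; whether a real implementation would ignore it or return an error is a design decision that this sketch does not address.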
That said, this code may never be merged, because the discussion has since moved on to other possibilities. Ingo Molnar, who has been arguing for the use of the ftrace framework in a number of situations, thinks that ftrace is a perfect fit for the Chrome sandbox problem as well. He might be right, but only for a version of ftrace which is not yet generally available.
Using ftrace for sandboxing may seem a little strange; a tracing framework is supposed to report on what is happening while perturbing the situation as little as possible. But ftrace has a couple of tools which may be useful in this situation. The system call tracer is there now, making it easy to hook into every system call made by a given process. In addition, the current development tree (perhaps destined for 2.6.31) includes an event filter mechanism which can be used to filter out events based on an arbitrary boolean expression. By using ftrace's event filters, the sandbox could go beyond just restricting system calls; it could also place limits on the arguments to those system calls. An example supplied by Ingo looks like this:
{ "sys_read", "fd == 0" },
{ "sys_write", "fd == 1" },
{ "sys_sigreturn", "1" },
{ "sys_gettimeofday", "tz == NULL" },
These expressions implement something similar to mode 1 seccomp. But, additionally, read() is limited to the standard input and write() to the standard output. The sandboxed process is also allowed to call gettimeofday(), but it is not given access to the time zone information.
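To make the semantics of such a table concrete, here is a user-space model of it. It does not use ftrace or any kernel interface: the filter strings are replaced by small C predicate functions, and call_permitted() and the argument layout are hypothetical. The sketch assumes a default-deny policy for any system call that does not appear in the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A predicate over a system call's arguments, in call order. */
typedef bool (*predicate)(const long *args);

static bool fd_is_stdin(const long *a)  { return a[0] == 0; }  /* "fd == 0" */
static bool fd_is_stdout(const long *a) { return a[0] == 1; }  /* "fd == 1" */
static bool always(const long *a)       { (void)a; return true; }  /* "1" */
/* gettimeofday(tv, tz): tz is the second argument. */
static bool tz_is_null(const long *a)   { return a[1] == 0; }  /* "tz == NULL" */

struct rule {
    const char *name;
    predicate allow;
};

/* Mirrors the four entries shown above. */
static const struct rule rules[] = {
    { "sys_read",         fd_is_stdin  },
    { "sys_write",        fd_is_stdout },
    { "sys_sigreturn",    always       },
    { "sys_gettimeofday", tz_is_null   },
};

/* Returns true only if a rule exists for the call and its filter passes. */
bool call_permitted(const char *name, const long *args)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (strcmp(rules[i].name, name) == 0)
            return rules[i].allow(args);
    return false;  /* assumed policy: anything unlisted is denied */
}
```

Under this model a read() from file descriptor 3 is refused even though read() itself is on the list, which is the extra precision that argument filtering is meant to provide.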
The expressions can be arbitrarily complex. They are also said to be very fast; Ingo claims that they are quicker to evaluate than security module hooks. And, if straight system call filtering is not enough, arbitrary tracepoints can be placed elsewhere. All told, it does seem like a fairly general mechanism for restricting what a given process can do.
The problem cannot really be seen as solved yet, though. The event tracing code is very new and mostly unused so far. It is still outside of the mainline, meaning that it could easily be a year or so until it shows up in kernels shipped by distributions. The code allowing this mechanism to be used to control execution has yet to be written. So Chrome will not have a sandbox based on anything other than mode 1 seccomp for some time (though the Chrome developers are also evaluating using SELinux for this purpose).
Beyond that, there are some real doubts about whether system call interception is the right way to sandbox a process. There are well-known difficulties with trying to verify parameters if they are stored in user space; a hostile process can attempt to change them between the execution of security checks and the actual use of the data. There are also interesting interactions between system calls and multiple ways to do a number of things, all of which can lead to a leaky sandbox. All of this has led James Morris to complain:
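The parameter-verification race can be shown with a deterministic simulation. Nothing here is kernel code: the race_hook callback is a hypothetical stand-in for a hostile thread that rewrites the buffer at the worst possible moment, and the "/tmp/" policy is invented for the example. The second function shows the usual mitigation of taking a private copy first, which only helps if the later use can be made to operate on that copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PATH_LEN 64

/* Simulates a hostile thread that gets to run between check and use. */
typedef void (*race_hook)(char *user_buf);

/* Invented policy for the example: only paths under /tmp/ are allowed. */
static bool path_ok(const char *p)
{
    return strncmp(p, "/tmp/", 5) == 0;
}

/*
 * Vulnerable pattern: validate memory the guest controls, then read it
 * again for the real operation.  'used' receives the path acted upon.
 */
bool check_then_use(char *user_buf, race_hook hook, char *used)
{
    if (!path_ok(user_buf))
        return false;
    if (hook)
        hook(user_buf);                       /* the race window */
    snprintf(used, PATH_LEN, "%s", user_buf); /* second fetch */
    return true;
}

/* Safer pattern: fetch once into private memory, check and use the copy. */
bool copy_then_check(char *user_buf, race_hook hook, char *used)
{
    char copy[PATH_LEN];
    snprintf(copy, sizeof copy, "%s", user_buf);
    if (!path_ok(copy))
        return false;
    if (hook)
        hook(user_buf);                       /* too late to matter */
    snprintf(used, PATH_LEN, "%s", copy);
    return true;
}

/* Example attacker: swap in a path the policy would have rejected. */
void swap_to_shadow(char *user_buf)
{
    strcpy(user_buf, "/etc/shadow");
}
```

In the first function the check passes on "/tmp/data" but the operation ends up acting on "/etc/shadow"; in the second, the substitution has no effect on the path that is actually used.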
Ingo is not worried, though; he notes that the ability to place arbitrary tracepoints allows filtering at any spot, not just at system call entry. So the problems associated with system call interception are not necessarily an issue with the ftrace-based scheme. Beyond that, this is a specific sort of security problem:
This has the look of a discussion which will take some time to play out. There is sure to be opposition to turning the event filtering code into another in-kernel security policy language. It may turn out that the simple seccomp extension is more generally palatable. Or something completely different could come along. What is clear is that the sandboxing problem is hard; many smart people have tried to implement it in a number of different ways with varying levels of success. There is no assurance that the solution will be easier this time around.
