By Jake Edge
May 4, 2011
Sandboxing processes such that they cannot make "dangerous" system calls is
an attractive feature that has already been implemented in a limited way
for Linux with seccomp. Two years ago, we looked at a proposal to expand seccomp to
allow more fine-grained control over which system calls would be allowed.
That proposal has been mostly dormant since then, but was recently
resurrected after incorporating some of the suggestions made at that time.
The reaction to the current proposal so far seems positive, and it might
just be gaining some traction that the previous patchset lacked.
Seccomp (from "secure computing") is enabled via a prctl() call
and, once enabled, restricts the process from making any further system calls beyond
read(), write(), exit(), and
sigreturn()—any other system call will abort the
process. That creates a pretty secure sandbox, but it is also
extremely limited as there are other things that developers might want to
do from within such a sandbox. In fact, the Chromium browser has gone to great lengths to implement its own
sandbox that uses seccomp, but expands the range of legal system calls
through some contortions.
That led Adam Langley of the Chromium team to propose adding a bitmask of allowable system
calls for a new seccomp mode. That would have allowed processes to make a
binary choice (allowed or disallowed) for each system call. At the time,
Ingo Molnar suggested using the Ftrace filter
code to make the interface even more flexible by allowing filters to be
applied to the system call arguments. Essentially, that would make for
three choices for each system call: enabled, disabled, or filtered.
Fast-forward to today, and that is what a patchset from Will Drewry implements. It
should come as no surprise that Molnar was pleased to see his idea result in working
code: "Ok, color me thoroughly
impressed - AFAICS you implemented my suggestions [...] and you made it
work in practice!". Eric Paris was likewise impressed, noting that an expanded seccomp
could be used for QEMU. Molnar and Paris did not agree about replacing the
LSM approach using filters, but that was something of an aside. Serge
E. Hallyn also pointed out that the new feature
would be useful for containers: "to try and provide some bit of
mitigation for the fact that they are sharing a kernel with the
host".
The proposed interface, which is likely to change based on comments in the
thread, looks like:
const char *filters[] =
"sys_read: (fd == 1) || (fd == 2)\n"
"sys_write: (fd == 0)\n"
"sys_exit: 1\n"
"sys_exit_group: 1\n"
"on_next_syscall: 1";
prctl(PR_SET_SECCOMP, 2, filters);
That example is taken from Drewry's
documentation file that accompanies the patches.
It would allow reading from two file descriptors (1 and 2) and writing to
one (0), while
allowing any calls to the two other system calls listed. The
on_next_syscall means that the rules would not be enforced until
after one more system call is made. That would allow a parent to
fork(), set up the seccomp sandbox in the child process, then exec
a new
program which would be governed by the new rules.
That on_next_syscall piece drew a few comments. As it turns out,
there are really only two cases that need to be handled, either the rules
should go into effect immediately (for a process that wants to restrict
itself before handling untrusted input for example), or they should go into
effect after an exec (for a parent that is spawning an untrusted child).
Making the "after exec" case the default, while still allowing a
process
to request immediate application, seems to be the way things are headed.
There were also questions about using kernel-internal symbol names like
sys_read. Exporting those as a kernel ABI is not likely to pass
muster, as it might restrict the option of changing those function names
down the road—or require a messy compatibility layer if they did
change. Drewry wanted
to avoid using the system call numbers as Langley's original patch did, but
as Frederic Weisbecker pointed out, those
numbers are already part of the kernel ABI. Drewry is planning to make
that switch and users of the interface will need to use the
unistd.h header file or a library to map system call names to
numbers.
The patches also modify the /proc/PID/status file to output any
existing filters that are applied to the process. Given that most applications
that read that file don't need the extra information, though, Motohiro
Kosaki suggested that seccomp get its own
file. Drewry's plan is to provide that information in the
/proc/PID/seccomp_filter file instead, and remove it from
status.
Since it uses the Ftrace infrastructure and hooks, the new seccomp mode
only works for those system calls that have Ftrace events associated with
them. Using one of those non-instrumented system calls in the filters will
result in an
EINVAL from the prctl() call.
Enabling CONFIG_SECCOMP_FILTER (which depends on
CONFIG_FTRACE_SYSCALLS) will allow the use of the new mode.
Overall, Drewry has been very receptive to suggestions for changes, and
the feedback to the concept has been pretty uniformly positive. Molnar
suggested breaking
out the Ftrace filter engine further—beyond the minimal changes that
Drewry's patches make—so that it would be available for more
widespread use in the kernel. Molnar does wonder whether Linus Torvalds or
Andrew Morton might object to more use of the filter mechanism, however: "are you guys opposed to such flexible, dynamic
filters conceptually? I think we should really think hard about the actual ABI
as this could easily spread to more applications than
Chrome/Chromium." So far, neither has spoken up one way or the other.
Currently it would seem that Drewry is off working on the next revision of
the patchset, and it certainly doesn't seem like anything that would be
merged in the upcoming 2.6.40 cycle. As Molnar notes, the ABI needs to be carefully
thought-out, there are still some RCU issues that are being discussed, and
it probably needs some soaking time in the -next tree, but barring some major
complaint
cropping up, it's a feature that will likely make its way into the
mainline relatively soon. While that won't allow Chromium to immediately ditch
its complicated sandboxing arrangement, it may well be able to
do so a few years down the road. Other applications will benefit
from an expanded seccomp as
well.
(
Log in to post comments)