
Limiting system calls via control groups?

By Jake Edge
October 19, 2011

Limiting the system calls available to processes is a fairly hot topic in the kernel security community these days. There have been several different proposals, and the topic was discussed at some length at the recent Linux Security Summit but, so far, no solution has made its way into the mainline. Łukasz Sowa recently posted an RFC for a different mechanism to restrict syscalls, which may have advantages over other approaches. It also has a potential disadvantage as it uses a feature that is unpopular with some kernel hackers: control groups.

Conceptually, Sowa's idea is pretty straightforward. An administrator could place a process or processes into a control group and then restrict which syscalls those processes (and their children) could make. The current interface uses system call numbers that are written to the syscalls.allow and syscalls.deny cgroup control files. Any system call can be denied, but only those available to the parent cgroup can be enabled. Any process that makes a denied system call would get an ENOSYS error return.
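The allow/deny semantics described above can be modeled in a few lines of Python. This is only a sketch of the rules as the article states them, not of Sowa's patch: the SyscallCgroup class, its method names, and the use of 101 as an example syscall number are all assumptions made for illustration.

```python
import errno


class SyscallCgroup:
    """Toy model of a cgroup with syscalls.allow / syscalls.deny files."""

    def __init__(self, parent=None):
        self.parent = parent
        self.denied = set()

    def permits(self, nr):
        # A call is usable only if this group and every ancestor permit it.
        if nr in self.denied:
            return False
        return True if self.parent is None else self.parent.permits(nr)

    def deny(self, nr):
        # Any system call can be denied.
        self.denied.add(nr)

    def allow(self, nr):
        # A call can be enabled only if it is available to the parent group.
        if self.parent is not None and not self.parent.permits(nr):
            return False
        self.denied.discard(nr)
        return True

    def syscall(self, nr):
        # A denied call fails with ENOSYS; 0 stands in for "call proceeds".
        return 0 if self.permits(nr) else -errno.ENOSYS


EXAMPLE_NR = 101  # illustrative syscall number; real numbers vary by architecture

root = SyscallCgroup()
sandbox = SyscallCgroup(parent=root)
jail = SyscallCgroup(parent=sandbox)

sandbox.deny(EXAMPLE_NR)
jail_allow_while_parent_denies = jail.allow(EXAMPLE_NR)  # refused
jail_result_while_denied = jail.syscall(EXAMPLE_NR)      # -ENOSYS

sandbox.allow(EXAMPLE_NR)
jail_result_after_allow = jail.syscall(EXAMPLE_NR)       # call proceeds
```

Walking up the hierarchy on every check keeps the model short; a real implementation would presumably precompute the effective set for each group.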

Using system call numbers seems somewhat painful (and those numbers are not the same across architectures), but may be unavoidable. There are other, bigger problems, though, starting with performance. Sowa reports 5% more system time used by processes in the root cgroup, which is a hefty penalty to pay. His patch hooks into the assembly language syscall fastpath, which is probably not going to fly. It is also architecture-specific and only implemented for x86 currently. Paul Menage points out that hooking into the ptrace() path may avoid those problems:

Can't you hook into the ptrace callpath? That's already implemented on every architecture. Set the thread bit that triggers diverting to syscall_trace_enter() only when any of the thread's syscalls are denied, and then you don't have to work in assembly.
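Menage's suggestion amounts to gating the check behind a per-thread flag, so that unrestricted threads pay only for a single flag test. A rough Python model of that control flow follows; the Thread class, do_syscall(), and the counter are hypothetical names used for illustration, and the syscall_trace_enter() function here merely stands in for the kernel function of the same name.

```python
import errno


class Thread:
    def __init__(self, denied=()):
        self.denied = frozenset(denied)
        # Analogue of the thread bit: set only when some syscall is denied.
        self.trace_flag = bool(self.denied)
        self.slow_path_hits = 0


def syscall_trace_enter(thread, nr):
    """Slow path: consulted only for threads that have the flag set."""
    thread.slow_path_hits += 1
    return nr not in thread.denied


def do_syscall(thread, nr, handler):
    # Fast path: one flag test, no per-syscall lookup for unrestricted threads.
    if thread.trace_flag and not syscall_trace_enter(thread, nr):
        return -errno.ENOSYS
    return handler()


free = Thread()
restricted = Thread(denied={2})

free_result = do_syscall(free, 2, lambda: 0)
denied_result = do_syscall(restricted, 2, lambda: 0)
permitted_result = do_syscall(restricted, 3, lambda: 7)
```

In this model the unrestricted thread never enters the slow path, which is the property that would keep the cost near zero for processes in the root cgroup.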

Menage also mentions some other technical issues with the patch, but he is skeptical overall of the need for it. "I'd guess that most vulnerabilities in a system can be exploited just using system calls that almost all applications need in order to get regular work done (open, write, exec, mmap, etc) which limits the utility of only being able to turn them off by syscall number." Because the approach only allows a binary on or off choice for the system calls, he doesn't necessarily think that it has the right level of granularity. The granularity argument echoes the one made by Ingo Molnar on a 2009 proposal to extend seccomp by adding a bitmask of allowed system calls.

But there have been a number of projects that have expressed interest in having a more flexible seccomp-like feature in the kernel, starting with the Chromium browser team who have proposed several ways to do so. Seccomp provides a way to restrict processes to a few syscalls (read(), write(), exit(), and sigreturn()), but that is too inflexible for many projects. But Molnar has been very vocal in opposition to approaches that only allow binary decisions about system call usage, and he prefers a mechanism that filters system calls using Ftrace-style conditionals. That approach, however, is not popular with some of the other tracing and instrumentation developers.

It is a quandary. There are a number of projects (e.g. QEMU, vsftpd, LXC) interested in such a feature, but no implementation (so far) has passed muster. Sowa's cgroup-based solution may well be yet another. Certainly the current performance for processes that are not in a cgroup (i.e. are in the root cgroup) is not going to be popular—an understatement—but even if Menage's suggestion (or some other mechanism) leads to a solution with little or no performance impact, there may be complaints because of the unpopularity of cgroups.

There may be hope on the horizon in the form of a proposed discussion about expanding seccomp (or providing a means to disable certain syscalls) at the upcoming Kernel Summit, though it does not seem to have made it onto the agenda. Certainly many of the participants in the mailing list discussions will be present. Control groups are on the agenda, though, so there will be some discussion of that rather contentious topic. Look for LWN's coverage of the summit on next week's Kernel page.



Limiting system calls via control groups?

Posted Oct 20, 2011 8:06 UTC (Thu) by dw (subscriber, #12017) [Link]

I guess it's been considered already, but something like iptables or the Linux socket filter makes sense to me. Provide unprivileged userspace with a small handful of operations for testing syscall number, doing comparisons and jumps, and looking strings up in a set, then leave the rest to a userspace library (actually I guess underneath this is probably what OS X's sandbox looks like too). That way extending the interface later is only a matter of adding extra operations.

If the overhead for running the pseudocode on each syscall was too high, then perhaps a declarative approach would be possible, where the kernel could transform the supplied rules into lookup tables, or some hybrid combination of both.

Limiting system calls via control groups?

Posted Oct 20, 2011 8:09 UTC (Thu) by dw (subscriber, #12017) [Link]

Grumble, slightly incomprehensible comment. By mention of the library and "unprivileged userspace", I meant something like how BPF or iptables works, where complexity of parsing some expression (or rule set) is handled by a library, which produces easily verifiable byte code, which is then handed off to the kernel.
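The split described in this comment (a userspace library compiles rules into simple byte code, and the kernel verifies and runs it) might look roughly like the following sketch. The two-instruction set, the tuple encoding, and the function names are invented for illustration and do not correspond to BPF's actual instruction format.

```python
ALLOW, DENY = 1, 0


def compile_allowlist(numbers):
    """'Library' side: turn a set of syscall numbers into byte code."""
    prog = []
    for nr in sorted(numbers):
        # On a match fall through to "ret ALLOW", otherwise skip over it.
        prog.append(("jeq", nr, 0, 1))
        prog.append(("ret", ALLOW))
    prog.append(("ret", DENY))
    return prog


def verify(prog):
    """'Kernel' side: accept only programs that are certain to terminate."""
    if not prog or prog[-1][0] != "ret":
        return False
    for pc, insn in enumerate(prog):
        if insn[0] == "jeq":
            # Jump offsets are relative to the next instruction and must be
            # non-negative and in bounds, so control only moves forward.
            _, _, jt, jf = insn
            if jt < 0 or jf < 0 or pc + 1 + max(jt, jf) >= len(prog):
                return False
        elif insn[0] != "ret":
            return False
    return True


def run(prog, nr):
    """Interpret a verified program against one syscall number."""
    pc = 0
    while True:
        insn = prog[pc]
        if insn[0] == "ret":
            return insn[1]
        _, value, jt, jf = insn
        pc += 1 + (jt if nr == value else jf)


prog = compile_allowlist({0, 1, 60})  # illustrative syscall numbers
prog_is_valid = verify(prog)
verdicts = {nr: run(prog, nr) for nr in (0, 1, 2, 60)}
```

Because every jump moves forward and the last instruction is a return, the verifier can guarantee termination without running the program, which is what makes the byte code "easily verifiable".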

Limiting system calls via control groups?

Posted Oct 20, 2011 13:43 UTC (Thu) by alonz (subscriber, #815) [Link]

I wonder if it wouldn't be better to start from the other end of the solution space—small, incremental extensions to seccomp.

For example: just adding recvmsg and poll to the set of system calls permitted by seccomp is already a huge increase in the capabilities of sandboxed processes—the “controller” process will be able to open files on behalf of the sandbox (after applying proper policy), or pipes, or sockets, or even supply interfaces to signals (using signalfd) and interval timers (using timerfd), and pass these fd's to the sandbox via a UNIX domain socket. And by using poll, the sandbox will have full control over the way it processes the available data / events.
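The descriptor-passing pattern this comment relies on can be sketched with Python's socket.send_fds() and socket.recv_fds() (Python 3.9 or later, Unix only), which wrap sendmsg() and recvmsg() with SCM_RIGHTS. For brevity both ends live in one process here, and the temporary file and message contents are made up; in a real sandbox the two ends would be separate processes and the controller would apply its policy before opening anything.

```python
import os
import socket
import tempfile

# A connected pair of UNIX domain sockets: one end per role.
controller, sandbox = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

with tempfile.NamedTemporaryFile() as f:
    f.write(b"hello from the controller\n")
    f.flush()

    # Controller side: open the file (policy checks would go here) and
    # pass the descriptor to the sandbox over the socket.
    fd = os.open(f.name, os.O_RDONLY)
    socket.send_fds(controller, [b"fd"], [fd])
    os.close(fd)

    # Sandbox side: only recvmsg() and read() are needed to use the file.
    msg, fds, _flags, _addr = socket.recv_fds(sandbox, 16, 1)
    received = os.read(fds[0], 64)
    os.close(fds[0])

controller.close()
sandbox.close()
```

The sandboxed side never calls open() itself, which is why adding recvmsg to the permitted set goes such a long way.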

Limiting system calls via control groups?

Posted Oct 20, 2011 20:05 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

filtering for a syscall looks like a very hard problem (both in what filters make sense and in performance)

it would seem to me that the right approach is to implement a limiting function that can do either

allow
block
filter

but initially just implement the allow/block modes, and have some sort of experimental loadable module support for the filter mode so that different filters can be experimented with easily

Limiting system calls via control groups?

Posted Oct 21, 2011 12:31 UTC (Fri) by davecb (subscriber, #1574) [Link]

Another possible approach is to retarget the mechanism once used for SCO emulation to do something quite close to what dw suggested.

If a process is started under a cgroup with syscall control enabled, it gets a different "interpreter" with a different syscall mapping table. Cgroups without syscall limitations get the standard one.

One then has the ability to permit, deny or filter in an arbitrary way the syscalls a given cgroup sees. The management would be in user-space, the implementation a hook and a set of "interpreter" syscall tables in a kernel module. The rest of the interpreter mechanisms would continue unchanged, which is important as they're still used for running alien binaries on Linux.

--dave

Limiting system calls via control groups?

Posted Oct 22, 2011 17:20 UTC (Sat) by alonz (subscriber, #815) [Link]

Unfortunately, the “personality” mechanism (used for SCO emulation) hinges on the difference in syscall ABIs between Linux and SCO (specifically: Linux uses sysenter/syscall instructions, while SCO used lcall7).

The existing seccomp uses the trace path, which is a nice compromise—it requires a single hook in the (performance-critical) system-call-entry code for any non-standard behavior, which translates to either tracing or seccomp-limitation of the system calls. To be workable, any solution will need to maintain this level of performance (= nearly zero impact when disabled).

Limiting system calls via control groups?

Posted Oct 27, 2011 2:26 UTC (Thu) by kevinm (guest, #69913) [Link]

It seems like railing against cgroups is rapidly shifting into Don Quixote territory. You may not like the interface, but the likelihood of it being removed seems remote at this point - like ptrace() and /proc, for better or for worse we're stuck with it now.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds