Limiting system calls via control groups?
Limiting the system calls available to processes is a fairly hot topic in the kernel security community these days. There have been several different proposals, and the topic was discussed at some length at the recent Linux Security Summit but, so far, no solution has made its way into the mainline. Łukasz Sowa recently posted an RFC for a different mechanism to restrict syscalls, which may have advantages over other approaches. It also has a potential disadvantage, as it uses a feature that is unpopular with some kernel hackers: control groups.
Conceptually, Sowa's idea is pretty straightforward. An administrator could place a process or processes into a control group and then restrict which syscalls those processes (and their children) could make. The current interface uses system call numbers that are written to the syscalls.allow and syscalls.deny cgroup control files. Any system call can be denied, but only those available to the parent cgroup can be enabled. Any process that makes a denied system call would get an ENOSYS error return.
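The allow/deny rules described above can be modelled in a few lines of Python. This is only a sketch of the semantics as the article states them, not Sowa's kernel code: the SyscallCgroup class and its method names are hypothetical, raising PermissionError for a refused "allow" is an assumption (the article does not say how the patch reports that case), and the syscall numbers are the x86-64 ones.

```python
import errno

# x86-64 syscall numbers; other architectures number these differently.
SYS_READ, SYS_WRITE, SYS_OPEN, SYS_EXECVE = 0, 1, 2, 59


class SyscallCgroup:
    """Toy model of a cgroup with syscalls.allow and syscalls.deny (hypothetical)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.denied = set()  # syscall numbers denied in this group

    def permits(self, nr):
        """A call is permitted only if no group up to the root denies it."""
        group = self
        while group is not None:
            if nr in group.denied:
                return False
            group = group.parent
        return True

    def deny(self, nr):
        """Model writing a number to syscalls.deny: any call can be denied."""
        self.denied.add(nr)

    def allow(self, nr):
        """Model writing a number to syscalls.allow."""
        if self.parent is not None and not self.parent.permits(nr):
            # Assumption: the refusal is reported as an error to the writer.
            raise PermissionError("syscall %d is not available to the parent" % nr)
        self.denied.discard(nr)

    def syscall(self, nr):
        """Return 0 if the call may proceed, -ENOSYS if it is denied."""
        return 0 if self.permits(nr) else -errno.ENOSYS


if __name__ == "__main__":
    root = SyscallCgroup()
    sandbox = SyscallCgroup(parent=root)
    sandbox.deny(SYS_EXECVE)
    child = SyscallCgroup(parent=sandbox)  # children inherit the restriction
    print(child.syscall(SYS_WRITE))
    print(child.syscall(SYS_EXECVE) == -errno.ENOSYS)
```

In this model a nested group can only narrow what its ancestors permit; an attempt by the child group to re-enable execve() is refused until the sandbox group allows it again.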
Using system call numbers seems somewhat painful (and those numbers are not the same across architectures), but may be unavoidable. There are other, bigger problems, though, starting with performance. Sowa reports 5% more system time used by processes in the root cgroup, which is a hefty penalty to pay. His patch hooks into the assembly language syscall fastpath, which is probably not going to fly. It is also architecture-specific and only implemented for x86 currently. Paul Menage points out that hooking into the ptrace() path may avoid those problems.
Menage also mentions some other technical issues with the patch, but he is skeptical overall of the need for it. "I'd guess that most vulnerabilities in a system can be exploited just using system calls that almost all applications need in order to get regular work done (open, write, exec, mmap, etc) which limits the utility of only being able to turn them off by syscall number." Because the approach only allows a binary on or off choice for the system calls, he doesn't necessarily think that it has the right level of granularity. The granularity argument echoes the one made by Ingo Molnar on a 2009 proposal to extend seccomp by adding a bitmask of allowed system calls.
Still, a number of projects have expressed interest in having a more flexible seccomp-like feature in the kernel, starting with the Chromium browser team, which has proposed several ways to provide one. Seccomp restricts a process to just four syscalls (read(), write(), exit(), and sigreturn()), which is too inflexible for many projects. Molnar, however, has been very vocal in opposing approaches that only allow binary decisions about system call usage; he prefers a mechanism that filters system calls using Ftrace-style conditionals. That approach, in turn, is not popular with some of the other tracing and instrumentation developers.
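The difference in granularity between the two styles can be shown with a toy comparison. The sketch below is purely illustrative: neither function corresponds to a real kernel interface, the rule format is invented for this example, and the syscall number is the x86-64 one.

```python
SYS_WRITE = 1  # x86-64 syscall number


def binary_filter(allowed, nr, args):
    """On/off decision per syscall: the arguments are never consulted."""
    return nr in allowed


def conditional_filter(rules, nr, args):
    """Per-syscall predicate over the arguments; calls with no rule are refused."""
    rule = rules.get(nr)
    return rule is not None and rule(args)


# Binary style: write() is either available for every descriptor or for none.
allowed = {SYS_WRITE}

# Conditional style (hypothetical rule): write() only to stdout or stderr.
rules = {SYS_WRITE: lambda args: args[0] in (1, 2)}

if __name__ == "__main__":
    to_stdout = (1, b"hello", 5)
    to_fd_5 = (5, b"hello", 5)
    print(binary_filter(allowed, SYS_WRITE, to_fd_5))       # True
    print(conditional_filter(rules, SYS_WRITE, to_stdout))  # True
    print(conditional_filter(rules, SYS_WRITE, to_fd_5))    # False
```

A binary mask can only say whether write() exists for the process at all, while a conditional filter can distinguish a write to the terminal from a write to some other open descriptor.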
It is a quandary. There are a number of projects (e.g. QEMU, vsftpd, LXC) interested in such a feature, but no implementation (so far) has passed muster. Sowa's cgroup-based solution may well be yet another that falls short. Certainly the current performance cost for processes that are not in a cgroup (i.e. are in the root cgroup) is not going to be popular, to put it mildly. But even if Menage's suggestion (or some other mechanism) leads to a solution with little or no performance impact, there may be complaints because of the unpopularity of cgroups.
There may be hope on the horizon in the form of a proposed discussion about expanding seccomp (or providing a means to disable certain syscalls) at the upcoming Kernel Summit, though it does not seem to have made it onto the agenda. Certainly many of the participants in the mailing list discussions will be present. Control groups are on the agenda, though, so there will be some discussion of that rather contentious topic. Look for LWN's coverage of the summit on next week's Kernel page.
