LWN.net Logo

System call filtering and no_new_privs

By Jake Edge
January 18, 2012

We briefly covered a proposal for restricting system calls using the kernel packet filtering mechanism on the January 12 Kernel page, but, at that time, there hadn't been any comments on the proposal. Since then there have been several rounds of comments and revisions of the patch set, along with a revival of an older idea to let a process limit itself and its children to its current privilege level. So far, both sets of patches have received generally positive feedback, to the point where it seems like general-purpose system call filtering just might make it into the mainline sometime in the not-too-distant future.

For some time now, Will Drewry has been trying to find an acceptable way to enhance the seccomp ("secure computing") facility in the kernel so that more flexible system call filtering can be done. His target for the feature is the Chrome/Chromium web browser in order to sandbox untrusted code, but other projects (including QEMU, openssh, vsftpd, and others) have expressed interest in the feature as well. He (and others) have tried various approaches over the last few years without finding one that passed muster. His latest attempt, which uses the BPF (Berkeley Packet Filter) engine to filter the system calls, seems like it avoids many of the problems that were noted in the earlier attempts.

The basic idea is that instead of examining packet contents, the filters will examine system calls and any arguments passed in registers (that means that it won't follow pointers to avoid time-of-check-to-time-of-use races). The code will only allow those calls that pass the filter tests to be executed. The filtering fails "closed" so that any calls not listed in the filter, or whose arguments don't correspond to the filter rules, will return an EACCES error. The syntax for creating a filter, as described in the documentation file, is fairly painful, but Eric Paris has already started on a translator to turn a more readable form into the BPF rules needed.

In order to avoid a longstanding problem with the interactions between binaries that can change their privileges (e.g. setuid or file-based capabilities) and mechanisms to reduce privileges for a process, Drewry's initial patch would restrict the ability of a process to make an execve() call once a filter had been installed. The problem is that privilege-changing binaries can get confused when faced with an environment with fewer privileges than are expected. That confusion can lead to privilege escalation or other security holes. This is why things like chroot(), bind mounts, and, eventually, user namespaces are restricted to root-privileged processes.

If a filtered process can't successfully call execve(), though, all of the concerns about confusing those binaries is gone. It does make using the system call filtering a little clunky, however. One would expect that a parent could set up filters and then spawn a child that would be bound by those filters, but, without a way to exec, that won't work. That can be worked around for most existing programs with some LD_PRELOAD trickery, but in the discussion another potential solution was proposed.

Andrew Lutomirski pointed to his execve_nosecurity proposal as a possible solution. That would allow processes to set a flag so that they (and their children) would be unable to call execve() and would add a new variant (called, somewhat confusingly, execve_nosecurity()) that could be used instead but would not allow any security transitions for the executed program. That means that setuid, LSM context changes, changing capabilities, and so on would not be allowed. Linus Torvalds agreed that adding a way to restrict privilege changes would be useful:

We could easily introduce a per-process flag that just says "cannot escalate privileges". Which basically just disables execve() of suid/sgid programs (and possibly other things too), and locks the process to the current privileges. And then make the rule be that *if* that flag is set, you can then filter across an execve, or chroot as a normal user, or whatever.

That led Lutomirski to propose a flag in struct task_struct called no_new_privs that would be set via the PR_SET_NO_NEW_PRIVS flag to prctl(). It would be a one-way gate as there would be no way to unset the flag. If set, the flag would restrict executing binaries in much the same way that the nosuid mount flag works. In addition, it would disallow processes changing capabilities on exec or SELinux security context transitions.

But, Lutomirski's patch does not implement a sandbox, as it can still be subverted via ptrace() as Alan Cox points out. Cox was also concerned that preventing SELinux, AppArmor, or other LSMs from changing privileges could lead to other problems because those transitions may actually be changing the context to a less privileged state. Simply keeping the previous context, as Lutomirski's patch does, could lead to executing programs in a more-privileged context. But Eric Paris clarifies that SELinux, at least, will still make the same policy decision even without the transition (as it does for nosuid mounts), so that the execution will still fail if the process has the wrong context.

Lutomirski also notes that a sandbox will be much less useful if execve() has to fail when there is any kind of security transition, as Cox suggested. The presence of a policy on a particular binary would make that binary unusable from within a sandbox, no matter what the policy is. A better solution, Lutomirski said, is to set the no_new_privs bit, then set up a sandbox (using Drewry's seccomp system call filtering for example), then execute the binary, which will succeed or fail based on the actual mandatory access control (MAC) policy. That solves the problem of ptrace() and other circumvention methods as well because a sandbox requires both the no_new_privs patch and some other mechanism to filter system calls:

no_new_privs is not intended to be a sandbox at all -- it's a way to make it safe for a task to manipulate itself in a way that would allow it to subvert its own children (or itself after execve). So ptrace isn't a problem at all -- PR_SET_NO_NEW_PRIVS + chroot + ptrace is exactly as unsafe as ptrace without PR_SET_NO_NEW_PRIVS. Neither one allows privilege escalation beyond what you started with.

If you want a sandbox, call PR_SET_NO_NEW_PRIVS, then enable seccomp (or whatever) to disable ptrace, evil file access, connections on unix sockets that authenticate via uid, etc.

Meanwhile, Drewry has been revising his patches to take advantage of no_new_privs. One of those revisions brought about some other concerns regarding whether dropping privileges should be allowed after the bit is set. Torvalds is worried that allowing privilege dropping will somehow lead to confusing other programs: "We've had security bugs that were *due* to dropped capabilities - people dropped one capability but not another, and fooled code into doing things they weren't expecting it to do." Lutomirski's patches do not restrict things like calls to setuid() because they are not meant to implement a sandbox—that's what the existing seccomp, or an enhanced version from Drewry's patches (aka seccomp mode 2) will do. As Lutomirski explains:

Another way of saying this is: no_new_privs is not a sandbox. It's just a way to make it safe for sandboxes and other such weird things processes can do to themselves safe across execve. If you want a sandbox, use seccomp mode 2, which will require you to set no_new_privs.

It's clear that Lutomirski, at least, thinks the no_new_privs changes cannot lead to the problems that Torvalds and others (notably Smack developer Casey Schaufler) are concerned about. But, any program that uses no_new_privs needs to be aware of what it does (and doesn't) do. Coupling it with a system call filtering mechanism seems like it could only increase the security of the system. But, interactions between security mechanisms often have unforeseen effects, typically resulting in security holes, so it makes sense to be cautious.

So far, these changes are still being discussed, and no subsystem maintainer has volunteered to take them, but the two proposals seem to have support that other similar ideas have lacked. Whether Lutomirski can convince the other kernel hackers that no_new_privs can't lead to other problems, or whether he needs to figure out how to stop the dropping of privileges is unclear. But it does seem like there may now be a path for an enhanced seccomp to reach the mainline.


(Log in to post comments)

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds