By Jake Edge
January 18, 2012
We briefly covered a proposal for
restricting system calls using the kernel packet filtering mechanism on the
January 12 Kernel page, but, at that time,
there hadn't been any comments on the proposal. Since then there have been
several rounds of comments and revisions of the patch set, along with a
revival of an older idea to let a process limit itself and its children
to its current privilege level. So far, both sets of patches have received
generally positive feedback, to the point where it seems like
general-purpose system call filtering just might make it into the mainline
sometime
in the not-too-distant future.
For some time now, Will Drewry has been trying to find an acceptable way to
enhance the
seccomp ("secure computing") facility in the kernel so that more flexible
system call filtering can be done. His target for the feature is the
Chrome/Chromium
web browser in order to sandbox untrusted code, but
other projects (including QEMU, openssh, vsftpd, and others) have expressed
interest in the feature as well. He and others have tried various
approaches over the last few years without finding one that passed muster.
His latest attempt, which uses the BPF (Berkeley
Packet Filter) engine to filter the system calls, seems like it avoids
many of the problems that were noted in the earlier attempts.
The basic idea is that instead of examining packet contents, the filters
will examine system calls and any arguments passed
in registers (it won't follow pointers, which avoids
time-of-check-to-time-of-use races). The code will only allow those calls
that pass the
filter tests to
be executed. The filtering fails "closed" so that any calls not listed in
the filter, or whose arguments don't
correspond to the filter rules, will return an EACCES
error. The syntax for creating a filter, as described in the documentation file, is fairly painful, but
Eric Paris has already started on a translator to turn a more readable form into
the BPF rules needed.
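To give a feel for what such a filter looks like, here is a minimal sketch in C. It assumes the interface found in current mainline kernels (struct seccomp_data from <linux/seccomp.h> and the SECCOMP_RET_* return values), which may differ in detail from the patch set under discussion. The helper name build_allowlist() and the choice of allowed calls are purely illustrative.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/*
 * Hypothetical helper: write a six-instruction BPF program into `out`
 * that allows read, write and exit_group and fails every other system
 * call with EACCES. Returns the number of instructions written.
 *
 * Assumption: a production filter would first check seccomp_data.arch;
 * that check is left out here to keep the sketch short.
 */
static size_t build_allowlist(struct sock_filter *out)
{
    const struct sock_filter prog[] = {
        /* Load the system call number into the accumulator. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* On a match, skip ahead to the ALLOW instruction. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read, 3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 1, 0),
        /* Fail closed: anything not listed above gets EACCES. */
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    memcpy(out, prog, sizeof(prog));
    return sizeof(prog) / sizeof(prog[0]);
}
```

The resulting array would be wrapped in a struct sock_fprog and handed to prctl(). Only the system call number is examined here; arguments passed in registers could be tested the same way by loading from the args[] array of struct seccomp_data.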
In order to avoid a longstanding problem
with the interactions between
binaries that can change their privileges (e.g. setuid or file-based
capabilities) and mechanisms to reduce privileges for a process, Drewry's
initial patch would restrict the ability of a process to make an
execve() call once a filter had been installed. The problem
is that privilege-changing binaries can get confused when faced with an
environment with fewer privileges than are expected. That confusion can
lead to privilege
escalation or other security holes. This is why things like
chroot(), bind mounts, and, eventually, user namespaces are
restricted to root-privileged processes.
If a filtered process can't successfully call execve(), though,
all of the concerns about confusing those binaries are gone. It does make using
the system call filtering a little clunky, however. One would expect that
a parent could set up filters and then spawn a child that would be bound by
those filters, but, without a way to exec, that won't work. That can be
worked around for most existing programs with some
LD_PRELOAD trickery, but in the discussion another potential
solution was proposed.
Andrew Lutomirski pointed to his execve_nosecurity proposal as a possible
solution. That would allow processes to set a flag so that they (and their
children) would be unable to call execve() and would add a new
variant (called, somewhat confusingly, execve_nosecurity()) that
could be used instead but would not allow any security transitions for the
executed program. That
means that setuid, LSM context changes, changing capabilities, and so on
would not
be allowed. Linus Torvalds agreed that
adding a way to restrict privilege changes would be useful:
We could easily introduce a per-process flag that just says "cannot
escalate privileges". Which basically just disables execve() of
suid/sgid programs (and possibly other things too), and locks the
process to the current privileges. And then make the rule be that *if*
that flag is set, you can then filter across an execve, or chroot as a
normal user, or whatever.
That led Lutomirski to propose a flag in
struct task_struct called no_new_privs that would be set
via the PR_SET_NO_NEW_PRIVS flag to prctl(). It would be
a one-way gate as there would be no way to unset the flag. If set, the flag
would restrict executing binaries in much the same way that the
nosuid mount flag works. In addition, it would prevent exec from changing
the process's capabilities or making SELinux security context transitions.
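From user space, setting the flag might look like the following sketch. It assumes the prctl() option names and numeric values used by current mainline kernels (PR_SET_NO_NEW_PRIVS and its companion PR_GET_NO_NEW_PRIVS), which the proposal under discussion may not match exactly. The helper name lock_privileges() is hypothetical.

```c
#include <sys/prctl.h>

/* Fallback values for older headers; assumed to match mainline. */
#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#endif
#ifndef PR_GET_NO_NEW_PRIVS
#define PR_GET_NO_NEW_PRIVS 39
#endif

/*
 * Hypothetical helper: set the no_new_privs bit for the calling task,
 * then read it back. Returns 1 if the bit is set, -1 on error.
 */
static int lock_privileges(void)
{
    /* The unused prctl() arguments must be zero. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;

    /* Reads the current value of the bit (0 or 1). */
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

Because the gate is one-way, a later attempt to clear the bit by passing 0 instead of 1 is rejected, and the setting is inherited by children and preserved across execve().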
But Lutomirski's patch does not by itself implement a sandbox; as Alan Cox
points out, it can still be subverted via ptrace(). Cox was also concerned that
preventing SELinux, AppArmor, or other LSMs from changing privileges could
lead to other problems because those transitions may actually be changing
the context to a less privileged state. Simply keeping the previous
context, as Lutomirski's patch does, could lead to executing programs in a
more-privileged context. But Eric Paris clarifies that SELinux, at least, will still
make the same policy decision even without the transition (as it does for
nosuid mounts), so that the execution will still fail if the
process has the wrong context.
Lutomirski also notes that a sandbox will
be much less useful if execve() has to fail when there is any kind
of security transition, as Cox suggested. The presence of a policy on a
particular binary would make that binary unusable from within a sandbox, no
matter what the policy is. A better solution, Lutomirski said, is to set the
no_new_privs bit, then set up a sandbox (using Drewry's seccomp
system call filtering for example), then execute the binary, which will
succeed or fail based on the actual mandatory access control (MAC) policy.
That solves the problem of ptrace() and other circumvention
methods as well
because a sandbox requires both the no_new_privs patch and some
other mechanism to filter system calls:
no_new_privs is not intended to be a sandbox at all -- it's a way to
make it safe for a task to manipulate itself in a way that would allow
it to subvert its own children (or itself after execve). So ptrace
isn't a problem at all -- PR_SET_NO_NEW_PRIVS + chroot + ptrace is
exactly as unsafe as ptrace without PR_SET_NO_NEW_PRIVS. Neither one
allows privilege escalation beyond what you started with.
If you want a sandbox, call PR_SET_NO_NEW_PRIVS, then enable seccomp
(or whatever) to disable ptrace, evil file access, connections on unix
sockets that authenticate via uid, etc.
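The sequence Lutomirski describes (set no_new_privs, install a filter, then exec) can be sketched as follows. This is only an illustration under the assumption that the interface matches current mainline kernels (PR_SET_SECCOMP with SECCOMP_MODE_FILTER); the helper names enter_sandbox() and exec_sandboxed() are hypothetical, and the filter blocks only ptrace() rather than everything a real sandbox would need to block.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/*
 * Hypothetical helper: lock privileges, then install a filter that
 * fails ptrace() with EACCES and allows everything else.
 *
 * Assumption: a real filter would also check seccomp_data.arch, since
 * system call numbers differ between ABIs; omitted for brevity.
 */
static int enter_sandbox(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* ptrace: fall through to the error; otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_ptrace, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EACCES),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };

    /* Step 1: the one-way gate, so no later exec can gain privileges. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;

    /* Step 2: install the filter; it stays in force across execve(). */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;

    return 0;
}

/* Step 3: run the target program inside the sandbox. */
static int exec_sandboxed(const char *path, char *const argv[],
                          char *const envp[])
{
    if (enter_sandbox() != 0)
        return -1;
    execve(path, argv, envp);
    return -1; /* only reached if execve() itself failed */
}
```

A parent would normally call exec_sandboxed() in a freshly forked child, so that the restrictions apply to the child and the program it runs rather than to the parent.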
Meanwhile, Drewry has been revising his patches to take advantage of
no_new_privs. One of those revisions brought about some other
concerns regarding whether dropping privileges should be allowed
after the bit is set. Torvalds is worried
that allowing privilege dropping will
somehow lead to confusing other programs:
"We've had security bugs that were *due* to dropped capabilities -
people dropped one capability but not another, and fooled code into
doing things they weren't expecting it to do." Lutomirski's patches
do not restrict things like calls to setuid() because they are not
meant to implement a sandbox—that's what the existing seccomp, or an
enhanced version from Drewry's patches (aka seccomp mode 2), will do. As Lutomirski explains:
Another way of saying this is: no_new_privs is not a sandbox. It's
just a way to make it safe for sandboxes and other such weird things
processes can do to themselves safe across execve. If you want a
sandbox, use seccomp mode 2, which will require you to set
no_new_privs.
It's clear that Lutomirski, at least, thinks the no_new_privs
changes cannot lead to the problems that Torvalds and others (notably Smack
developer Casey Schaufler) are concerned
about. But any program that uses no_new_privs needs to be aware
of what it does (and doesn't) do. Coupling it with a system call filtering
mechanism seems like it could only increase the security of the system.
Still, interactions between security mechanisms often have unforeseen
effects, typically resulting in security holes, so it makes sense to be
cautious.
So far, these changes are still being discussed, and no subsystem
maintainer has volunteered to take them, but the two proposals seem to have
support that other similar ideas have lacked. It is unclear whether
Lutomirski can convince the other kernel hackers that no_new_privs can't
lead to other problems, or whether he will need to figure out how to stop
the dropping of privileges. But it does seem like there may now be a path
for an enhanced seccomp to reach the mainline.