Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call
From: Ingo Molnar <mingo-AT-elte.hu>
To: James Morris <jmorris-AT-namei.org>
Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call
Date: Fri, 13 May 2011 14:10:34 +0200
Cc: linux-mips-AT-linux-mips.org, linux-sh-AT-vger.kernel.org,
    Peter Zijlstra <peterz-AT-infradead.org>,
    Frederic Weisbecker <fweisbec-AT-gmail.com>,
    Heiko Carstens <heiko.carstens-AT-de.ibm.com>,
    Oleg Nesterov <oleg-AT-redhat.com>, David Howells <dhowells-AT-redhat.com>,
    Paul Mackerras <paulus-AT-samba.org>,
    Eric Paris <eparis-AT-redhat.com>, "H. Peter Anvin" <hpa-AT-zytor.com>,
    sparclinux-AT-vger.kernel.org, Jiri Slaby <jslaby-AT-suse.cz>,
    Russell King <linux-AT-arm.linux.org.uk>, x86-AT-kernel.org,
    Linus Torvalds <torvalds-AT-linux-foundation.org>,
    Ingo Molnar <mingo-AT-redhat.com>, linux-arm-kernel-AT-lists.infradead.org,
    Benjamin Herrenschmidt <benh-AT-kernel.crashing.org>,
    kees.cook-AT-canonical.com, "Serge E. Hallyn" <serge-AT-hallyn.com>,
    Peter Zijlstra <a.p.zijlstra-AT-chello.nl>, microblaze-uclinux-AT-itee.uq.edu.au,
    Steven Rostedt <rostedt-AT-goodmis.org>,
    Martin Schwidefsky <schwidefsky-AT-de.ibm.com>
* James Morris <email@example.com> wrote:
> On Thu, 12 May 2011, Ingo Molnar wrote:
> > Funnily enough, back then you wrote this:
> > " I'm concerned that we're seeing yet another security scheme being designed on
> > the fly, without a well-formed threat model, and without taking into account
> > lessons learned from the seemingly endless parade of similar, failed schemes. "
> > so when and how did your opinion of this scheme turn from it being an
> > "endless parade of failed schemes" to it being a "well-defined and readily
> > understandable feature"? :-)
> When it was defined in a way which limited its purpose to reducing the attack
> surface of the syscall interface.
Let me outline a simple example of a new filter expression based security
feature that could be implemented outside the narrow system call boundary you
find acceptable, and please tell me what is bad about it.
Say i'm a user-space sandbox developer who wants to enforce that sandboxed code
should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/.
It is a simple and sensible security feature, agreed? It allows most code to
run well and link to countless libraries - but no access to other files is
allowed.
I would also like my sandbox app to be able to install this policy without
having to be root. I do not want the sandbox app to have permission to create
labels on /lib and /usr/lib and what not.
Firstly, using the filter code i deny the various link creation syscalls so
that sandboxed code cannot escape for example by creating a symlink to outside
the permitted VFS namespace. (Note: we opt-in to syscalls, that way new
syscalls added by new kernels are denied by default. The current symlink
creation syscalls are not opted in to.)
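The opt-in idea can be sketched in plain user-space C. This is only an illustration of default-deny bookkeeping, not the seccomp filter code from the patch series: the struct, the function names, the MAX_SYSCALLS bound and the syscall numbers used in the usage note are all assumptions made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <string.h>

#define MAX_SYSCALLS  512	/* assumed upper bound, for this sketch only */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* One bit per syscall number; a cleared bit means "denied". */
struct syscall_allowlist {
	unsigned long bits[MAX_SYSCALLS / BITS_PER_WORD + 1];
};

/* Start from an empty list: nothing is permitted until opted in. */
static void allowlist_init(struct syscall_allowlist *al)
{
	memset(al, 0, sizeof(*al));
}

/* Opt a single syscall number in. */
static int allowlist_permit(struct syscall_allowlist *al, unsigned int nr)
{
	if (nr >= MAX_SYSCALLS)
		return -EINVAL;
	al->bits[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
	return 0;
}

/* Returns 0 if the syscall was opted in, -EPERM otherwise. */
static int allowlist_check(const struct syscall_allowlist *al, unsigned int nr)
{
	assert(al != NULL);
	if (nr >= MAX_SYSCALLS)
		return -EPERM;	/* numbers unknown to the policy are denied */
	if (al->bits[nr / BITS_PER_WORD] & (1UL << (nr % BITS_PER_WORD)))
		return 0;
	return -EPERM;
}
```

With, say, numbers 0 and 2 opted in, a check for any other number fails with -EPERM, including a number that a newer kernel might assign later. Nothing has to be listed explicitly in order to be denied.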
But the next step, actually checking filenames, poses a big hurdle: i cannot
implement the filename checking at the sys_open() syscall level in a secure
way: because the pathname is passed to sys_open() by pointer, and if i check it
at the generic sys_open() syscall level, another thread in the sandbox might
modify the underlying filename *after* i've checked it.
But if i had a VFS event at the fs/namei.c::getname() level, i would have
access to a central point where the VFS string becomes stable to the kernel and
can be checked (and denied if necessary).
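The difference between checking the caller's buffer and checking a stable copy can be shown with a small single-threaded simulation. It is a sketch under stated assumptions, not kernel code: racy_open(), stable_open(), path_allowed() and swap_to_secret() are hypothetical names, and the "racer" callback merely stands in for a second sandboxed thread that rewrites the shared buffer at the worst possible moment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PATH_BUF 64	/* every buffer passed in is assumed to be this large */

/* Hypothetical policy: only paths under /home/sandbox/ are allowed. */
static int path_allowed(const char *name)
{
	return strncmp(name, "/home/sandbox/", 14) == 0;
}

/* Stand-in for a second thread rewriting the shared pathname buffer. */
static void swap_to_secret(char *user_buf)
{
	snprintf(user_buf, PATH_BUF, "%s", "/etc/shadow");
}

/* Racy: checks the caller's buffer, then reads it again for the open. */
static int racy_open(char *user_buf, void (*racer)(char *), char *opened)
{
	assert(user_buf != NULL && opened != NULL);
	if (!path_allowed(user_buf))
		return -1;
	if (racer)
		racer(user_buf);	/* buffer changes after the check */
	snprintf(opened, PATH_BUF, "%s", user_buf);
	return 0;
}

/* Stable: copy once, then check and use only the private copy. */
static int stable_open(char *user_buf, void (*racer)(char *), char *opened)
{
	char kname[PATH_BUF];

	assert(user_buf != NULL && opened != NULL);
	snprintf(kname, sizeof(kname), "%s", user_buf);
	if (racer)
		racer(user_buf);	/* too late: kname is already fixed */
	if (!path_allowed(kname))
		return -1;
	snprintf(opened, PATH_BUF, "%s", kname);
	return 0;
}
```

Starting from "/home/sandbox/a.txt" in the shared buffer, racy_open() passes the check yet ends up with "/etc/shadow" as the opened name, while stable_open() opens exactly the string it checked. The private copy plays the role that the getname() result plays in the argument above.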
A sidenote, and not surprisingly, the audit subsystem already has an event
callback there.
Unfortunately this audit callback cannot be used for my purposes, because the
event is single-purpose for auditd and because it allows no feedback (no
deny/accept discretion for the security policy).
But if i had this simple event there:
	err = event_vfs_getname(result);
I could implement this new filename based sandboxing policy, using a filter
like this installed on the vfs::getname event and inherited by all sandboxed
tasks (which cannot uninstall the filter, obviously):
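	/* Deny any name containing "..". */
	if (strstr(name, ".."))
		return -EACCES;
	/* Deny unless the name starts with one of the permitted prefixes. */
	if (strncmp(name, "/home/sandbox/", 14) &&
	    strncmp(name, "/lib/", 5) &&
	    strncmp(name, "/usr/lib/", 9))
		return -EACCES;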
( Note1: Obviously the filter engine would be extended to allow such simple string
  match functions. )
( Note2: ".." is disallowed so that sandboxed code cannot escape the restrictions
  using "/..". )
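For anyone who wants to try the policy outside the kernel, the intent of the filter (deny ".." anywhere, and deny every name that does not start with a permitted prefix) can be sketched as an ordinary C function. The function name and the prefix table are assumptions of this sketch; the filter engine syntax itself is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Prefixes taken from the example policy in the text. */
static const char *const allowed_prefixes[] = {
	"/home/sandbox/",
	"/lib/",
	"/usr/lib/",
};

/* Returns 0 if the (already stable) name is permitted, -EACCES otherwise. */
static int sandbox_getname_filter(const char *name)
{
	size_t i;

	assert(name != NULL);

	/* Any ".." is refused, so "/lib/../etc" cannot walk out. */
	if (strstr(name, ".."))
		return -EACCES;

	/* Accept as soon as one permitted prefix matches. */
	for (i = 0; i < sizeof(allowed_prefixes) / sizeof(allowed_prefixes[0]); i++) {
		size_t len = strlen(allowed_prefixes[i]);

		if (strncmp(name, allowed_prefixes[i], len) == 0)
			return 0;
	}

	/* Everything else, including relative names, is denied. */
	return -EACCES;
}
```

Note that the substring test is deliberately blunt: it also refuses harmless names such as "/home/sandbox/a..b". That is the price of keeping the match functions as simple as the notes above assume.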
This kind of flexible and dynamic sandboxing would allow a wide range of file
ops within the sandbox, while still isolating it from files not included in the
specified VFS namespace.
( Note that there are tons of other examples as well, for useful security features
that are best done using events outside the syscall boundary. )
The security event filters code tied to seccomp and syscalls at the moment is
useful, but limited in its future potential.
So i argue that it should go slightly further and should become:
 - unprivileged: application-definable, allowing the embedding of security
                 policy in *apps* as well, not just the system
 - flexible: can be added/removed at runtime unprivileged, and cheaply so
 - transparent: does not impact executing code that meets the policy
 - nestable: it is inherited by child tasks and is fundamentally stackable,
             multiple policies will have the combined effect and they
             are transparent to each other. So if a child task within a
             sandbox adds *more* checks then those add to the already
             existing set of checks. We only narrow permissions, never
             widen them.
 - generic: allowing observation and (safe) control of security relevant
            parameters not just at the system call boundary but at other
            relevant places of kernel execution as well: which
            points/callbacks could also be used for other types of event
            extraction such as perf. It could even be shared with audit ...
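The "nestable" property is perhaps the easiest one to make concrete. The following is a minimal user-space sketch, assuming a per-task stack of filter callbacks; the struct, the function names and the two example policies are all hypothetical and are not taken from the patch series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_FILTERS 8	/* assumed limit, for this sketch only */

typedef int (*name_filter_fn)(const char *name);

/* Per-task list of filters; there is deliberately no "pop" operation. */
struct filter_stack {
	name_filter_fn fn[MAX_FILTERS];
	size_t count;
};

/* Adding a filter can only add checks, never remove existing ones. */
static int filter_stack_push(struct filter_stack *fs, name_filter_fn fn)
{
	if (fs->count >= MAX_FILTERS)
		return -ENOSPC;
	fs->fn[fs->count++] = fn;
	return 0;
}

/* A child task starts with a copy of everything its parent had. */
static struct filter_stack filter_stack_inherit(const struct filter_stack *parent)
{
	return *parent;
}

/* Combined effect: the first filter that objects decides the result. */
static int filter_stack_check(const struct filter_stack *fs, const char *name)
{
	size_t i;

	assert(fs != NULL && name != NULL);
	for (i = 0; i < fs->count; i++) {
		int err = fs->fn[i](name);

		if (err)
			return err;
	}
	return 0;
}

static int has_prefix(const char *name, const char *prefix)
{
	return strncmp(name, prefix, strlen(prefix)) == 0;
}

/* Example parent policy: only /home/sandbox/ is visible. */
static int only_sandbox_home(const char *name)
{
	return has_prefix(name, "/home/sandbox/") ? 0 : -EACCES;
}

/* Example child policy: additionally hide a private subdirectory. */
static int deny_private(const char *name)
{
	return has_prefix(name, "/home/sandbox/private/") ? -EACCES : 0;
}
```

If a parent installs only_sandbox_home() and a child that inherited the stack pushes deny_private(), the child is refused "/home/sandbox/private/key" while the parent is still allowed to open it. Neither policy needs to know about the other, and the child has no way to get back what the parent already took away.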
I argue that this is the LSM and audit subsystems designed right: in the long
run it could allow everything that LSM does at the moment - and so much more.
And you argue that allowing this would be bad? That if it was extended like
that, you'd consider it a failed scheme? Why?