From: Ingo Molnar <mingo-AT-elte.hu>
To: Adam Langley <agl-AT-google.com>, Andrew Morton <akpm-AT-linux-foundation.org>, Frédéric Weisbecker <fweisbec-AT-gmail.com>, Tom Zanussi <tzanussi-AT-gmail.com>, Li Zefan <l
Subject: Re: [RFC 1/1] seccomp: Add bitmask of allowed system calls.
Date: Fri, 8 May 2009 00:14:47 +0200

(i've restored the Cc: line of the previous thread)

* Adam Langley <email@example.com> wrote:

> (This is a discussion email rather than a patch which I'm
> seriously proposing be landed.)
>
> In a recent thread my colleague, Markus, mentioned that we
> (Chrome Linux) are investigating using seccomp to implement our
> rendering sandbox on Linux.
>
> In the same thread, Ingo mentioned that he thought a bitmap of
> allowed system calls would be reasonable. If we had such a thing,
> many of the acrobatics that we currently need could be avoided.
> Since we need to support the currently existing kernels, we'll
> need to have the code for both, but allowing signal handling,
> gettimeofday, epoll etc would save a lot of overhead for common
> operations.
>
> The patch below implements such a scheme. It's written on top of
> the current seccomp for the moment, although it looks like seccomp
> might be written in terms of ftrace soon.
>
> Briefly, it adds a second seccomp mode (2) where one uploads a
> bitmask. Syscall n is allowed if, and only if, bit n is true in
> the bitmask. If n is beyond the range of the bitmask, the syscall
> is denied.
>
> If prctl is allowed by the bitmask, then a process may switch to
> mode 1, or may set a new bitmask iff the new bitmask is a subset
> of the current one. (Possibly moving to mode 1 should only be
> allowed if read, write, sigreturn, exit are in the currently
> allowed set.)
>
> If a process forks/clones, the child inherits the seccomp state of
> the parent. (And hopefully I'm managing the memory correctly
> here.)
>
> Ingo subsequently floated the idea of a more expressive interface
> based on ftrace which could introspect the arguments, although I
> think the discussion had fallen off list at that point.
>
> He suggested using an ftrace parser which I'm not familiar with, but can
> be summed up with:
>   seccomp_prctl("sys_write", "fd == 3")  // allow writes only to fd 3

It's the ftrace filter parser and execution engine. I.e.
we first parse the filter expression when setting up a seccomp
context. Each syscall has the following attributes:

   on         # enabled unconditionally
   off        # disabled unconditionally
   filtered

In the filtered case, the filter can be simple:

   "fd == 0"

To restrict sys_write() to a single fd (but still allow sys_read()
from other fds). Or as complex as:

   (fd == 4 || fd == 5) && (buf == 0x12340000) && (size <= 4096)

To restrict IO to two specific fds and to restrict output to a
specific memory address and to restrict size to 4K or smaller.

This is how the filter engine works: we parse the string and save it
into a binary expression structure (cache) that can later on be run
by the engine in a pretty fast way. (without any string parsing or
formatting overhead in the validation fastpath)

The filter is thus evaluated in the sandbox task's context, without
the need for any context-switching. It's very, very fast. It is i
think faster than LSM rules, and it is also atomic and lockless (RCU
based).

> In general, I believe that ftrace based solutions cannot safely
> validate arguments which are in user-space memory when multiple
> threads could be racing to change the memory between ftrace and
> the eventual copy_from_user. Because of this, many useful
> arguments (such as the sockaddr to connect, the filename to open
> etc) are out of reach. LSM hooks appear to be the best way to
> impose limits in such cases. (Which we are also experimenting
> with).

That assessment is incorrect: there's really no difference in safety
here. LSM cannot magically inspect user-space memory either when
multiple threads may access it. The point would be to define filters
for system call _arguments_, which are inherently thread-local and
safe.

> However, such a parser could be very useful in one particular
> case: socketcall on IA32. Allowing recvmsg and sendmsg, but not
> socket, connect etc is certainly something that we would be
> interested in.

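The parse-once, evaluate-many design described above can be illustrated with a small user-space sketch. This is not the ftrace filter code: the grammar, the nested-tuple tree, and the names `parse_filter`, `evaluate` and `allowed` are all assumptions made up for illustration, and the deny-by-default handling of unlisted syscalls is likewise assumed. It only compares argument values (so `buf` is compared as a pointer value, never dereferenced), which is the thread-local case discussed above.

```python
import operator
import re

# Hypothetical mini-grammar, modeled on the two example filters above:
#   expr := term ('||' term)*
#   term := atom ('&&' atom)*
#   atom := '(' expr ')' | NAME OP NUMBER
TOKEN = re.compile(
    r"\s*(\|\||&&|==|!=|<=|>=|<|>|\(|\)|0[xX][0-9a-fA-F]+|\d+|[A-Za-z_]\w*)")
COMPARE = {"==": operator.eq, "!=": operator.ne, "<=": operator.le,
           ">=": operator.ge, "<": operator.lt, ">": operator.gt}


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"bad filter near {text[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_filter(text):
    """Setup path: parse the string once into a nested-tuple tree."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of filter")
        pos += 1
        return tok

    def atom():
        if peek() == "(":
            take()
            node = expr()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        name, op, num = take(), take(), take()
        if not (name[0].isalpha() or name[0] == "_") or op not in COMPARE:
            raise ValueError(f"bad comparison: {name} {op} {num}")
        return ("cmp", COMPARE[op], name, int(num, 0))

    def term():
        node = atom()
        while peek() == "&&":
            take()
            node = ("and", node, atom())
        return node

    def expr():
        node = term()
        while peek() == "||":
            take()
            node = ("or", node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise ValueError(f"trailing token {peek()!r}")
    return tree


def evaluate(tree, args):
    """Fast path: walk the pre-parsed tree; no string parsing happens here."""
    kind = tree[0]
    if kind == "cmp":
        _, compare, name, value = tree
        return compare(args[name], value)
    _, left, right = tree
    if kind == "and":
        return evaluate(left, args) and evaluate(right, args)
    return evaluate(left, args) or evaluate(right, args)


def allowed(policy, syscall, args):
    """Each syscall is 'on', 'off', or filtered (a parsed tree)."""
    rule = policy.get(syscall, "off")  # assumption: unlisted means denied
    if rule in ("on", "off"):
        return rule == "on"
    return evaluate(rule, args)


# Example policy keyed by syscall name, as in the string-enumerated scheme.
policy = {
    "sys_read": "on",
    "sys_write": parse_filter(
        "(fd == 4 || fd == 5) && (buf == 0x12340000) && (size <= 4096)"),
}
```

With this policy, `allowed(policy, "sys_write", {"fd": 4, "buf": 0x12340000, "size": 100})` evaluates the cached tree against the argument values only; the filter string itself is never touched again after setup.
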
There are two problems with the bitmap scheme, which i also suggested
in a previous thread but then found to be lacking:

 1) enumeration: you define a bitmap. That will be problematic
    between compat and native 64-bit (both have different syscall
    vectors).

 2) flexibility. It's an on/off selection per syscall. With the
    filter we have on, off, or filtered. That's a _whole_ lot more
    flexible.

The filter expression based solution does not suffer from this: it is
string enumerated. "sys_read" means that syscall, and we could
specify whether it's the compat or the native one.

	Ingo
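
For comparison, the bitmask semantics quoted from Adam's mail (syscall n allowed iff bit n is set, out-of-range numbers denied, a new mask accepted only if it is a subset of the current one) can be sketched as follows. This is an illustration, not the RFC patch: the function names, the byte-string representation and the LSB-first bit order are assumptions. The syscall numbers used to show the enumeration problem are the standard x86-64 and i386 ones.

```python
def make_bitmask(numbers, size_bytes=8):
    """Build a mask with one bit per syscall number (assumed LSB-first)."""
    mask = bytearray(size_bytes)
    for nr in numbers:
        mask[nr // 8] |= 1 << (nr % 8)
    return bytes(mask)


def syscall_allowed(bitmask, nr):
    """Syscall nr is allowed iff bit nr is set; out of range means denied."""
    byte_index, bit = divmod(nr, 8)
    if nr < 0 or byte_index >= len(bitmask):
        return False
    return bool(bitmask[byte_index] & (1 << bit))


def is_subset(new, current):
    """True if every bit set in `new` is also set in `current`."""
    for i, byte in enumerate(new):
        cur = current[i] if i < len(current) else 0
        if byte & ~cur & 0xFF:
            return False
    return True


# Enumeration problem: the same bit names different syscalls per ABI.
X86_64 = {"read": 0, "write": 1}
I386 = {"exit": 1, "read": 3, "write": 4}

native_mask = make_bitmask([X86_64["read"], X86_64["write"]])
```

A mask built from the native numbers for read and write has bits 0 and 1 set. Interpreted against the i386 table, bit 1 is exit and write (bit 4) is not allowed at all, which is the compat-versus-native mismatch raised in point 1.
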
Copyright © 2009, Eklektix, Inc.