|| ||Frederic Weisbecker <fweisbec-AT-gmail.com> |
|| ||Will Drewry <wad-AT-chromium.org> |
|| ||Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call
|| ||Thu, 28 Apr 2011 19:36:17 +0200|
|| ||linux-kernel-AT-vger.kernel.org, kees.cook-AT-canonical.com,
eparis-AT-redhat.com, agl-AT-chromium.org, mingo-AT-elte.hu,
Ingo Molnar <mingo-AT-redhat.com>,
Andrew Morton <akpm-AT-linux-foundation.org>,
Tejun Heo <tj-AT-kernel.org>, Michal Marek <mmarek-AT-suse.cz>,
Oleg Nesterov <oleg-AT-redhat.com>,
Roland McGrath <roland-AT-redhat.com>,
Peter Zijlstra <a.p.zijlstra-AT-chello.nl>,
Jiri Slaby <jslaby-AT-suse.cz>,
David Howells <dhowells-AT-redhat.com>,
"Serge E. Hallyn" <serge-AT-hallyn.com>|
|| ||Article, Thread
On Thu, Apr 28, 2011 at 11:48:47AM -0500, Will Drewry wrote:
> On Thu, Apr 28, 2011 at 11:13 AM, Frederic Weisbecker
> <email@example.com> wrote:
> > On Thu, Apr 28, 2011 at 10:29:11AM -0500, Will Drewry wrote:
> >> On Thu, Apr 28, 2011 at 10:12 AM, Frederic Weisbecker
> >> <firstname.lastname@example.org> wrote:
> >> > Instead of having such multiline filter definition with syscall
> >> > names prepended, it would be nicer to make the parsing simplier.
> >> >
> >> > You could have either:
> >> >
> >> > prctl(PR_SET_SECCOMP, mode);
> >> > /* Works only if we are in mode 2 */
> >> > prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter);
> >> It'd need to be syscall_name instead of syscall_nr. Otherwise we're
> >> right back to where Adam's patch was 2+ years ago :) Using the event
> >> names from the syscalls infrastructure means the consumer of the
> >> interface doesn't need to be confident of the syscall number.
> > Is it really a problem?
> I'd prefer using numbers myself since it gives more flexibility around
> filtering entries that haven't yet made it into ftrace syscalls hooks.
> However, I think there's some complexity in ensuring it matches the
> host kernel. (It would also allow compat_task support which doesn't
> seem to be possible now.)
You'd have some problems with compat and syscall naming, I think some
of them are not the same between compat architectures.
Note that your new prctl() request is turning syscall handler names (in-kernel API)
into a user ABI. We won't be able to change the name of those kernel handlers
after that. I'm not sure that's desired.
OTOH, syscall numbers are a user ABI,
> > There are libraries that can resolve that. Of course I can't recall their name.
> asm-generic/unistd.h will provide the mapping, if available. It would
> mean that any helper libraries would need to be matched to the locally
> running kernel.
Yeah. But syscall numbers don't move inside a given architecture (could they?)
Or if so then you'd a dynamic library.
> >> > or:
> >> > /*
> >> > * If mode == 2, set the filter to syscall_nr
> >> > * Recall this for each syscall that need a filter.
> >> > * If a filter was previously set on the targeted syscall,
> >> > * it will be overwritten.
> >> > */
> >> > prctl(PR_SET_SECCOMP, mode, syscall_nr, filter);
> >> >
> >> > One can erase a previous filter by setting the new filter "1".
> >> >
> >> > Also, instead of having a bitmap of syscall to accept. You could
> >> > simply set "0" as a filter to those you want to deactivate:
> >> >
> >> > prctl(PR_SET_SECCOMP, 2, 1, 0); <- deactivate the syscall_nr 1
> >> >
> >> > Hm?
> >> I like the simplicity in not needing to parse anything extra, but it
> >> does add the need for extra state - either a bit or a new field - to
> >> represent "enabled/enforcing".
> > And by the way I'm really puzzled about these. I don't understand
> > well why we need this.
> Right now, if you are in seccomp mode !=0 , your system calls will be
> hooked (if TIF_SECCOMP is set too). The current patch begins filtering
> immediately. And once filtering is enabled, the process is not
> allowed to expand its access to the kernel system cals - only shrink
> it. An "enabled" bit would be needed to tell it that even though it
> is running in mode=2, it should/shouldn't kill the process for system
> call filter violations and it should allow filters to be added, not
> just dropped. It's why the current patch does it in one step: set the
> filters and the mode. While an enabled bit could be hidden behind
> whether TIF_SECCOMP is applied, I'm not sure if that would just be
> confusing. There'd still be a need to APPLY_FILTERS to get the
> TIF_SECCOMP flag set to start hooking the system calls.
It seems the filter is only ever needed for the childs, right?
And moreover when they exec. So Steve's suggestion to apply
the filters only after the next exec makes sense.
That would solve the problem of both ON_NEXT_SYSCALL and APPLY_FILTER.
I may be missing something obvious though. Do you see a limitation there?
> > As for the enable_on_next_syscall. The documentation says
> > it's useful if you want the filter to only apply to the
> > child. So if fork immediately follows, you will be able to fork
> > but if the child doesn't have the right to exec, it won't be able
> > to do so. Same for the mmap() that involves...
> > So I'm a bit confused about that.
> Here's an example:
> You'd use it by calling fork() ... prctl(..., ON_NEXT_SYSCALL) then
> execve since you'll already have fork()d. It seems to work, but maybe
> I'm missing something :)
Nope, I see.
> > But yeah if that's really needed, it looks to me better to
> > reduce the parsing and cut it that way:
> > prctl(PR_SET_SECCOMP, 2, syscall_name_or_nr, filter);
> > prctl(PR_SECCOMP_APPLY_FILTERS, enable_on_next_syscall?)
> > or something...
> Cool - if there are any other flags, it could be something like:
> prctl(PR_SECCOMP_SET_FLAGS, SECCOMP_ENFORCE|SECCOMP_DELAYED_ENFORCEMENT);
> But only if there are other flags worth considering. I had a few
> others in mind, but now I've completely forgotten :/
to post comments)