Seccomp user-space notification and signals

By Jonathan Corbet
April 9, 2021

The seccomp() mechanism allows the imposition of a filter program (expressed in "classic" BPF) that makes policy decisions on whether to allow each system call invoked by the target process. The user-space notification feature further allows those decisions to be deferred to another process. As this recent patch set from Sargun Dhillon shows, though, user-space notification still has some rough edges, especially when it comes to signals. This patch makes a simple change to try to address a rather complex problem brought to the fore by changes in the Go language's preemption model.

Normally, seccomp() is used to implement a simple sort of attack-surface reduction, making much of the system-call space off limits for the affected process. User-space notification can be used to that end, but the objective there is often different: it allows a supervisor process to emulate system calls for the target process. An example might be a container manager that wishes to make mount() available inside a container, but with some strict limits on what can actually be mounted. User-space notification allows the (privileged) supervisor to actually perform the mount operations it approves of and return the results to the target process.

While the supervisor is handling an intercepted system call, the target process will be blocked in the kernel, waiting for a response to come back. Should that process receive a signal, though, it will stop waiting and respond immediately to the signal; if the signal itself is not fatal, the result may well be the system call returning an EINTR error to the target process. The supervisor, meanwhile, will not learn of the signal until it tries to give the kernel its answer to the original notification; at that point, it will get an ENOENT error indicating that the notification is no longer alive.
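As a rough illustration of the supervisor's side of this, the sketch below shows how a supervisor written in Go might interpret the error that comes back when it tries to send its response. It is a minimal model rather than working supervisor code: the names SendOutcome, classifySendError, and errNotifGone are hypothetical, and the ioctl() call itself is left out. The only fact taken from the text above is that ENOENT means the notification is no longer alive.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// SendOutcome is a hypothetical summary of what happened when the
// supervisor tried to answer a notification.
type SendOutcome int

const (
	Delivered  SendOutcome = iota // the target received the result
	TargetGone                    // ENOENT: the notification is no longer alive
	Failed                        // any other error
)

func (o SendOutcome) String() string {
	switch o {
	case Delivered:
		return "delivered"
	case TargetGone:
		return "target gone"
	default:
		return "failed"
	}
}

// errNotifGone is the error the supervisor sees when the target was
// interrupted by a signal (or died) before the response arrived.
var errNotifGone error = syscall.ENOENT

// classifySendError interprets the error returned by the (omitted) call
// that sends the response back to the kernel.
func classifySendError(err error) SendOutcome {
	switch {
	case err == nil:
		return Delivered
	case errors.Is(err, errNotifGone):
		return TargetGone
	default:
		return Failed
	}
}

func main() {
	errs := []error{
		nil,
		fmt.Errorf("send response: %w", syscall.ENOENT), // wrapped errors work too
		syscall.EINVAL,
	}
	for _, err := range errs {
		fmt.Printf("%v -> %v\n", err, classifySendError(err))
	}
}
```

In the "target gone" case, any work the supervisor did on the target's behalf has been wasted, and it may need to be undone.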

This sort of interruption can be inconvenient, especially if the supervisor has carried out some sort of long task on the target's behalf. If the signal does not kill the target process, it is likely that the same operation will be retried shortly, leading to extra work being done. Most of the time, though, non-fatal signals of this type are likely to be rare in programs running under seccomp() monitoring.
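The retry pattern described above can be sketched in a few lines of Go. This is an illustration only: retryOnEINTR and interruptedCall are hypothetical helpers, and the "system call" is a stand-in function that pretends to be interrupted a fixed number of times. Each extra attempt corresponds to another round trip through the supervisor.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// retryOnEINTR calls fn until it returns something other than EINTR and
// reports how many attempts that took.
func retryOnEINTR(fn func() error) (attempts int, err error) {
	for {
		attempts++
		err = fn()
		if !errors.Is(err, syscall.EINTR) {
			return attempts, err
		}
	}
}

// interruptedCall returns a stand-in for a supervised system call that is
// interrupted by a signal the first n times it is invoked.
func interruptedCall(n int) func() error {
	return func() error {
		if n > 0 {
			n--
			return syscall.EINTR
		}
		return nil
	}
}

func main() {
	attempts, err := retryOnEINTR(interruptedCall(2))
	fmt.Println("attempts:", attempts, "err:", err)
}
```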

Go signal

More accurately, that was once true, but the developers of the Go language had a problem of their own to solve. That language's "goroutine" lightweight thread model requires that the Go runtime handle scheduling, switching between goroutines as needed so that they all get a chance to run. Beyond that, there is a need for occasional "stop the world" events where all goroutines are paused so that the garbage collector can do its job. This has been handled by having the compiler put preemption checks at the beginning of each function.

What happens, though, if a goroutine runs for a long time without calling any functions? This can happen if the routine is running inside of some sort of tight loop; in the worst case, that loop could be spinning on a lock and preventing the lock holder from running to release it, a situation that tends to increase the overall level of user disgruntlement. Another way to delay preemption is to make a long-running system call.

The Go developers have tried a few ways of solving this problem. One of them involved inserting preemption checks at backward jumps in the code (thus at the end of a loop, for example). Even when that check was reduced to a single instruction, the resulting performance penalty was deemed to be too high; this approach also doesn't help in the long-running system-call case. So the Go community decided to address this problem with a non-cooperative preemption mechanism instead. In simple terms, any goroutine that runs for 10ms without yielding will receive a SIGURG signal from the runtime, which will then reschedule the thread, initiate garbage collection, or do whatever else needs to be done at that time.
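The effect of non-cooperative preemption can be observed with a small experiment, sketched below under some assumptions: a Go release with asynchronous preemption (1.14 or later) on a platform where it is enabled, and a hypothetical helper named preemptionDelay. The program restricts the runtime to a single processor, starts a goroutine that spins without calling any functions, and measures how long the main goroutine has to wait before it runs again. On a runtime without non-cooperative preemption, the call would be expected never to return.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// preemptionDelay reports how long the calling goroutine had to wait to
// get the processor back from a goroutine stuck in a tight loop.
func preemptionDelay() time.Duration {
	// With a single P, only one goroutine can run Go code at a time.
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	started := make(chan time.Time, 1)
	go func() {
		started <- time.Now() // record when the spinner took over
		for {
			// Tight loop: no function calls, so no cooperative
			// preemption check is ever reached. This goroutine is
			// never stopped; it goes away when the process exits.
		}
	}()

	// Blocking here hands the only P to the spinner. This goroutine can
	// resume only after the runtime has preempted the spinner.
	begin := <-started
	return time.Since(begin)
}

func main() {
	fmt.Println("regained the processor after", preemptionDelay())
}
```

The measured delay will vary from run to run and from machine to machine.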

System calls that end up being referred to another process via seccomp() tend to run longer than usual, and the sorts of tasks that a supervisor process might carry out — mounting a filesystem, for example — can take longer yet. This has evidently led to a lot of interrupted, seccomp()-mediated system calls in Go programs and an associated desire to find a way to stop those interruptions.

Masking non-fatal signals

To address this problem, Dhillon's patch set adds a new flag (called SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE) to the SECCOMP_IOCTL_NOTIF_RECV ioctl() command that is used by the supervisor process to receive notifications. If that flag is set when a notification is given to the supervisor, the target process will be put into a "killable" wait, meaning that fatal signals will still be delivered, but any others will be masked until after the supervisor has responded to the notification. Non-fatal signals will thus no longer interrupt system calls while the supervisor process is working on them.

Note that if a non-fatal signal arrives before the supervisor reads the notification, the target's system call will be interrupted as usual. The notification will be canceled, and the supervisor will get an error if it tries to read that notification. The end result in that case is as if the system call never happened in the first place. Once the notification is delivered, though, the system call will run to completion. It is a relatively small change that solves this problem, though that solution comes at the expense of adding arbitrary delays to Go's preemption mechanism when seccomp() and user-space notification are in use. That is just the sort of delay that the preemption mechanism was created to prevent, but it will at least be under the control of the supervisor and, presumably, bounded.
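The rules described in the last two paragraphs amount to a small truth table, which the following Go sketch spells out. It is a model of the behavior as described here, not kernel code; the names Phase, Pending, Received, and interruptsSyscall are invented for the illustration.

```go
package main

import "fmt"

// Phase is where a notification is in its life cycle.
type Phase int

const (
	Pending  Phase = iota // queued, not yet read by the supervisor
	Received              // read by the supervisor, response outstanding
)

// interruptsSyscall models whether a signal arriving during the given
// phase interrupts the target's blocked system call. waitKillable says
// whether the supervisor asked for the proposed "killable" wait.
func interruptsSyscall(phase Phase, fatal, waitKillable bool) bool {
	if fatal {
		return true // fatal signals are always delivered
	}
	if waitKillable && phase == Received {
		return false // held back until the supervisor has responded
	}
	return true // the existing behavior: the call is interrupted
}

func main() {
	phases := map[Phase]string{Pending: "pending", Received: "received"}
	for _, phase := range []Phase{Pending, Received} {
		for _, fatal := range []bool{false, true} {
			for _, flag := range []bool{false, true} {
				fmt.Printf("phase=%-8s fatal=%-5v flag=%-5v interrupts=%v\n",
					phases[phase], fatal, flag,
					interruptsSyscall(phase, fatal, flag))
			}
		}
	}
}
```

Only one of the eight combinations changes with the flag set: a non-fatal signal arriving after the supervisor has read the notification.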

This patch set has been posted twice as of this writing; it has not received much in the way of responses. That may suggest that few people have looked at it so far, which is not an ideal situation for a security-related change to the user-space API. Until that situation changes, this work seems unlikely to advance and users of Go with seccomp() and user-space notifications will continue to have problems.

Seccomp user-space notification and signals

Posted Apr 9, 2021 16:58 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> Even when that check was reduced to a single instruction, the resulting performance penalty was deemed to be too high; this approach also doesn't help in the long-running system-call case.
This can be solved by adding a preemption check after the syscall. This way the GC can just work in parallel with an active syscall, and once the syscall returns we can be sure it's not going to race with the active GC stop-the-world phase.

Seccomp user-space notification and signals

Posted Apr 9, 2021 17:36 UTC (Fri) by quanstro (guest, #77996) [Link] (1 responses)

the premise is that the syscall per se runs too long.

Seccomp user-space notification and signals

Posted Apr 9, 2021 17:45 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

So? If a thread is in a syscall then there's no inherent problem with the syscall taking its time.

GC needs to do stop-the-world to make sure that it can correctly identify all the root pointers. If regular Go code is allowed to run, then it can create new root pointers that GC will miss. But there's no chance a syscall can create new root pointers, so it's absolutely OK to let it run.

You just need to add a check for the GC-in-progress flag immediately after the syscall. And in this case this is not a problem, because the cost of one branch is completely negligible compared to the cost of a syscall.

Seccomp user-space notification and signals

Posted Apr 9, 2021 19:04 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (2 responses)

What you are missing is that goroutines are not 1:1 with "real" threads (i.e. kernel-level processes). The Go language explicitly supports creating and running thousands of goroutines at once; the Go runtime schedules these goroutines onto threads, just like the kernel schedules threads onto CPUs. If a goroutine blocks for a long time, for any reason, then it blocks the underlying thread, and the runtime has to 1) recognize the problem, 2) spin up a replacement thread, and 3) move other goroutines off of the blocked thread. For simple and infrequent cases, this is no big deal. But if it happens too often, or in a fashion which the runtime is unable to consistently recognize as a blockage, it can cause unfair scheduling and/or total loss of forward progress (for example, by maxing out the number of threads which the runtime is willing to create, or by various forms of livelock).

So instead, they moved to a model of preemptive multitasking, just like various operating systems did a million years ago.

Seccomp user-space notification and signals

Posted Apr 9, 2021 19:11 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I'm very familiar with Go's internals.

> If a goroutine blocks for a long time, for any reason, then it blocks the underlying thread
A goroutine can't actually block. It either executes Go code (and in this case it's pre-emptible) or it's in a C call or a syscall, in which case the scheduler knows about it.

And the blocking calls per se are not a problem with Go and they absolutely happen already. Go has "work stealing" support, so if a thread is blocked for too long then goroutines assigned to it will be picked up by another thread.

Here's a quick explanation: https://medium.com/a-journey-with-go/go-work-stealing-in-...

Seccomp user-space notification and signals

Posted Apr 9, 2021 20:22 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

> It either executes Go code (and in this case it's pre-emptible)

It is *now* preemptible, but I was discussing the prior state of affairs, before they introduced non-cooperative preemption. I probably should have used the past tense for that.

> or it's in a C call or a syscall, in which case the scheduler knows about it.

Sure, but that won't save you if the OS implementation is crazy and has decided to trap and emulate every syscall with a network round-trip or something equally ridiculous. You can't just keep spinning up threads indefinitely.

Seccomp user-space notification and signals

Posted Apr 9, 2021 22:15 UTC (Fri) by sargun (guest, #123301) [Link]

Rodrigo Campos (Rata) also contributed to this patchset! Just pointing it out.

Seccomp user-space notification and signals

Posted Apr 10, 2021 18:04 UTC (Sat) by luto (guest, #39314) [Link] (4 responses)

Wait, how does it work without this patch? Not all system calls should be interruptible?

Seccomp user-space notification and signals

Posted Apr 10, 2021 18:14 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (3 responses)

IIRC, under the existing model a process which enters the D state (traditionally "disk wait") during a system call cannot actually be interrupted until it leaves D state – even if you send it SIGKILL, which is uncatchable.

Seccomp user-space notification and signals

Posted Apr 11, 2021 15:44 UTC (Sun) by Paf (subscriber, #91811) [Link] (2 responses)

This is correct. Interruptibility of syscalls is patchy anyway - it’s more about how a particular piece of kernel code chooses to wait than anything else. (Normally, the kernel only checks for signals when a process goes to sleep - this includes user processes executing a syscall.)

If a piece of code chooses to wait unkillably, then that syscall can’t be interrupted at that point. It might be interruptible at some other point. There’s some theory behind which waits are chosen, but it is definitely not consistent. (Source: Me, I spent a bunch of time debugging issues related to syscall interruption in the Lustre file system and searched in vain for a consistent pattern in the kernel.)

Seccomp user-space notification and signals

Posted Apr 11, 2021 15:50 UTC (Sun) by Paf (subscriber, #91811) [Link]

Historically the wait mode is also tied to load average calculations - processes waiting unkillably contribute to load average, because they’re waiting on disk I/O, you see. This plays very badly with a heavily multithreaded distributed file system, which can have many many threads waiting for remote operations as part of normal operation. So is it correct for them to contribute to load average? Well, sure, except that because of the multi-threading, it generates load averages that are way outside the bounds of most monitoring tool defaults. Because if you had 250 threads waiting on your local disk, you’re in bad shape, at least on a spinning disk. But that’s fine in a many-threaded file system client.

This was a minor pain point for years. The kernel just added - a few years ago - a wait type that combines waiting unkillably with *not* contributing to load average.

Seccomp user-space notification and signals

Posted Apr 11, 2021 18:44 UTC (Sun) by rweikusat2 (subscriber, #117920) [Link]

The theory is roughly: A system call waiting for an external event which may never occur (e.g., input from a terminal or data arriving from the network) should sleep interruptibly so that the process/thread gets woken up when a signal needs to be handled.

How that's actually implemented in the kernel would be a different conversation. On top of that, there's some historical misbehaviour here (which I had to work around again recently): An epoll_wait call can - in the absence of any user-defined signal handlers - fail with EINTR on SIGTRAP (e.g., attaching strace to a running process) or SIGCONT (continuing a stopped process).