Seccomp user-space notification and signals
Normally, seccomp() is used to implement a simple sort of attack-surface reduction, making much of the system-call space off limits for the affected process. User-space notification can be used to that end, but the objective there is often different: it allows a supervisor process to emulate system calls for the target process. An example might be a container manager that wishes to make mount() available inside a container, but with some strict limits on what can actually be mounted. User-space notification allows the (privileged) supervisor to actually perform the mount operations it approves of and return the results to the target process.
While the supervisor is handling an intercepted system call, the target process will be blocked in the kernel, waiting for a response to come back. Should that process receive a signal, though, it will stop waiting and respond immediately to the signal; if the signal itself is not fatal, the result may well be the system call returning an EINTR error to the target process. The supervisor, instead, will not know about the signal until it tries to give the kernel its answer to the original notification; at that point, it will get an ENOENT error indicating that the notification is no longer alive.
This sort of interruption can be inconvenient, especially if the supervisor has carried out some sort of long task on the target's behalf. If the signal does not kill the target process, it is likely that the same operation will be retried shortly, leading to extra work being done. Most of the time, though, non-fatal signals of this type are likely to be rare in programs running under seccomp() monitoring.
Go signal
More accurately, that was once true, but the developers of the Go language had a problem of their own to solve. That language's "goroutine" lightweight thread model requires that the Go runtime handle scheduling, switching between goroutines as needed so that they all get a chance to run. Beyond that, there is a need for occasional "stop the world" events where all goroutines are paused so that the garbage collector can do its job. This has been handled by having the compiler put preemption checks at the beginning of each function.
What happens, though, if a goroutine runs for a long time without calling any functions? This can happen if the routine is running inside of some sort of tight loop; in the worst case, that loop could be spinning on a lock and preventing the lock holder from running to release it, a situation that tends to increase the overall level of user disgruntlement. Another way to delay preemption is to make a long-running system call.
The Go developers have tried a few ways of solving this problem. One of them involved inserting preemption checks at backward jumps in the code (thus at the end of a loop, for example). Even when that check was reduced to a single instruction, the resulting performance penalty was deemed to be too high; this approach also doesn't help in the long-running system-call case. So the Go community decided to address this problem with a non-cooperative preemption mechanism instead. In simple terms, any goroutine that runs for 10ms without yielding will receive a SIGURG signal from the runtime, which will then reschedule the thread, initiate garbage collection, or do whatever else needs to be done at that time.
System calls that end up being referred to another process via seccomp() tend to run longer than usual, and the sorts of tasks that a supervisor process might carry out — mounting a filesystem, for example — can take longer yet. This has evidently led to a lot of interrupted, seccomp()-mediated system calls in Go programs and an associated desire to find a way to stop those interruptions.
Masking non-fatal signals
To address this problem, Dhillon's patch set adds a new flag (called SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE) to the SECCOMP_IOCTL_NOTIF_RECV ioctl() command that is used by the supervisor process to receive notifications. If that flag is set when a notification is given to the supervisor, the target process will be put into a "killable" wait, meaning that fatal signals will still be delivered, but any others will be masked until after the supervisor has responded to the notification. Non-fatal signals will thus no longer interrupt system calls while the supervisor process is working on them.
Note that if a non-fatal signal arrives before the supervisor reads the notification, the target's system call will be interrupted as usual. The notification will be canceled, and the supervisor will get an error if it tries to read that notification. The end result in that case is as if the system call never happened in the first place. Once the notification is delivered, though, the system call will run to completion.
It is a relatively small change that solves this problem, though that solution comes at the expense of adding arbitrary delays to Go's preemption mechanism when seccomp() and user-space notification are in use. That is just the sort of delay that the preemption mechanism was created to prevent, but it will at least be under the control of the supervisor and, presumably, bounded.
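The signal-handling rules for the proposed flag can be summarized as a small decision function. What follows is a toy model of the behavior as described in this article, not kernel code; the types Phase and Effect, their values, and the function onSignal are all invented for illustration.

```go
package main

import "fmt"

// Phase is where a notification is in its life cycle.
type Phase int

const (
	Queued   Phase = iota // generated, but not yet read by the supervisor
	Received              // read by the supervisor, no response sent yet
)

// Effect is what happens to the target's blocked system call.
type Effect string

const (
	Killed      Effect = "target killed"
	Interrupted Effect = "system call interrupted, notification canceled"
	Deferred    Effect = "signal held until the supervisor responds"
)

// onSignal models what a signal does to a target that is waiting on a
// notification. waitKillable says whether the supervisor asked for the
// killable wait when it received the notification.
func onSignal(phase Phase, waitKillable, fatal bool) Effect {
	if fatal {
		return Killed // fatal signals are always delivered
	}
	if waitKillable && phase == Received {
		return Deferred // non-fatal signals wait for the response
	}
	return Interrupted // the pre-existing behavior
}

func main() {
	fmt.Println("queued, flag set:     ", onSignal(Queued, true, false))
	fmt.Println("received, flag set:   ", onSignal(Received, true, false))
	fmt.Println("received, flag unset: ", onSignal(Received, false, false))
	fmt.Println("received, fatal:      ", onSignal(Received, true, true))
}
```

The model makes the remaining window visible: the flag only changes the outcome once the supervisor has actually read the notification.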
This patch set has been posted twice as of this writing; it has not received much in the way of responses. That may suggest that few people have looked at it so far, which is not an ideal situation for a security-related change to the user-space API. Until that situation changes, this work seems unlikely to advance and users of Go with seccomp() and user-space notifications will continue to have problems.
Posted Apr 9, 2021 16:58 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
Posted Apr 9, 2021 17:36 UTC (Fri)
by quanstro (guest, #77996)
Posted Apr 9, 2021 17:45 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
GC needs to do stop-the-world to make sure that it can correctly identify all the root pointers. If regular Go code is allowed to run, then it can create new root pointers that GC will miss. But there's no chance a syscall can create new root pointers, so it's absolutely OK to let it run.
You just need to add a check for the GC-in-progress flag immediately after the syscall. And in this case this is not a problem, because the cost of one branch is completely negligible compared to the cost of a syscall.
Posted Apr 9, 2021 19:04 UTC (Fri)
by NYKevin (subscriber, #129325)
So instead, they moved to a model of preemptive multitasking, just like various operating systems did a million years ago.
Posted Apr 9, 2021 19:11 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
> If a goroutine blocks for a long time, for any reason, then it blocks the underlying thread
And the blocking calls per se are not a problem with Go and they absolutely happen already. Go has "work stealing" support, so if a thread is blocked for too long then goroutines assigned to it will be picked up by another thread.
Here's a quick explanation: https://medium.com/a-journey-with-go/go-work-stealing-in-...
Posted Apr 9, 2021 20:22 UTC (Fri)
by NYKevin (subscriber, #129325)
It is *now* preemptible, but I was discussing the prior state of affairs, before they introduced non-cooperative preemption. I probably should have used the past tense for that.
> or it's in a C call or a syscall, in which case the scheduler knows about it.
Sure, but that won't save you if the OS implementation is crazy and has decided to trap and emulate every syscall with a network round-trip or something equally ridiculous. You can't just keep spinning up threads indefinitely.
Posted Apr 9, 2021 22:15 UTC (Fri)
by sargun (guest, #123301)
Posted Apr 10, 2021 18:04 UTC (Sat)
by luto (guest, #39314)
Posted Apr 10, 2021 18:14 UTC (Sat)
by mpr22 (subscriber, #60784)
Posted Apr 11, 2021 15:44 UTC (Sun)
by Paf (subscriber, #91811)
If a piece of code chooses to wait unkillably, then that syscall can’t be interrupted at that point. It might be interruptible at some other point. There’s some theory behind which waits are chosen, but it is definitely not consistent. (Source: Me, I spent a bunch of time debugging issues related to syscall interruption in the Lustre file system and searched in vain for a consistent pattern in the kernel.)
Posted Apr 11, 2021 15:50 UTC (Sun)
by Paf (subscriber, #91811)
This was a minor pain point for years. The kernel just added - a few years ago - a wait type that combines waiting unkillably with *not* contributing to load average.
Posted Apr 11, 2021 18:44 UTC (Sun)
by rweikusat2 (subscriber, #117920)
How that's actually implemented in the kernel would be a different conversation. On top of that, there's some historical misbehaviour here (which I had to work around again recently): an epoll_wait call can, in the absence of any user-defined signal handlers, fail with EINTR on SIGTRAP (e.g. attaching strace to a running process) or SIGCONT (continuing a stopped process).
This can be solved by adding a preemption check after the syscall. This way the GC can just work in parallel with an active syscall, and once the syscall returns we can be sure it's not going to race with the active GC stop-the-world phase.
A goroutine can't actually block. It is either executing Go code (in which case it's preemptible) or it's in a C call or a syscall, in which case the scheduler knows about it.
