
Redox to implement POSIX signals in user space

Redox has received a grant to work on implementing POSIX-compatible signals. The draft design calls for them to be implemented nearly completely in user space.

So far, the signals project has been going according to plan, and hopefully, POSIX support for signals will be mostly complete by the end of summer, with in-kernel improvements to process management. After that, work on the userspace process manager will begin, possibly including new kernel performance and/or functionality improvements to facilitate this.



Not really userspace

Posted Jul 17, 2024 1:03 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (22 responses)

This is not really signal handling in userspace. It's more like signal handling with some of the parts delegated to userspace, while the kernel still contains almost all of the signal-delivery logic.

How I would do truly userspace signals:

1. Listen for signals in a separate background thread that is always active. They arrive through some kind of an IPC channel.
2. Add a syscall for the kernel to interrupt and pause the given process/thread, and return its runtime context (register file).
3. That's all.

The signal handling implementation has two options:
1. It can "steal" the thread, point its next instruction at the user's signal-handling code, and then un-pause it.
2. Take a leaf out of Windows NT and just run the signal handling from within the background thread.

The second approach will probably be slightly non-POSIX and can be opt-in. But it provides a _sane_ way to handle everything, even asynchronous signals like SIGSEGV. The monitoring thread can even be hardened with a separate address space.

Not really userspace

Posted Jul 17, 2024 8:24 UTC (Wed) by ddevault (subscriber, #99589) [Link] (11 responses)

I agree that the Redox implementation mostly lives in kernel space, not user space, at least for the important bits. I would correct this, though:

>Listen for signals in a separate background thread that is always active. They arrive through some kind of an IPC channel.

A background thread (thread here implying that it lives in the same address space as the signaled process) is not necessary and probably not desirable. In a microkernel context, what you're more likely to have is some kind of process server that oversees running processes and implements process-management semantics for them, and that has capabilities (or whatever) to instruct the kernel to suspend/resume the target process, read/write the register file, and modify the target process's address space. This is (almost) enough to implement signals in user space. Such a process server would also probably be responsible for implementing things like fork/exec, pthread_create, etc.

But the only real clarification I'd add to your comment is that it'd probably be better as a supervisor server rather than as a thread in the process.

Not really userspace

Posted Jul 17, 2024 17:40 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (10 responses)

A separate process is definitely a possibility, especially for async signals like SIGSEGV. But signals are also used as a means of intra-process communication, so handlers should be runnable within the same address space.

Ultimately, if signals are split into two primitives, thread suspension and code injection into a thread's context, then combining them in different ways becomes a much more powerful tool than the current signals.

Not really userspace

Posted Jul 17, 2024 17:50 UTC (Wed) by ddevault (subscriber, #99589) [Link] (9 responses)

With a supervisor process the signal handler still runs in the target process and its address space. This is assuming POSIX-compatible signals. The supervisor receives the request to deliver a signal, instructs the kernel to pause the supervised process, injects a signal frame and writes the register file, and resumes the supervised process. Then the signal handler function runs in the supervised process.
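That delivery sequence can be sketched in pseudocode; every name here (proc_pause, proc_write_memory, and so on) is hypothetical, not an actual Redox or POSIX interface:

```
# Supervisor-side signal delivery, per the description above.
# All capability/function names are made up for illustration.
deliver_signal(target, sig):
    regs  = proc_pause(target)              # suspend target, read register file
    frame = build_signal_frame(regs, sig)   # ucontext + siginfo layout
    proc_write_memory(target, regs.sp - size(frame), frame)
    regs.sp -= size(frame)
    regs.ip  = handler_for(target, sig)     # point at the sigaction handler
    proc_set_regs(target, regs)
    proc_resume(target)                     # handler now runs in the target
```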

Not really userspace

Posted Jul 17, 2024 21:08 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

Ah, OK. This makes sense. You still need tools to poke the target memory across the process boundaries, but it's really no different than ptrace().

In your case, the supervisor (signald?) would be responsible for all the signals in the system. That would be OK for POSIX software, and native software would just use the "suspend thread" call directly instead, like the Go runtime, which uses signals to pre-empt threads running inside tight inner loops for its user-space scheduling.

Not really userspace

Posted Jul 18, 2024 9:22 UTC (Thu) by 4lDO2 (guest, #172237) [Link] (4 responses)

> You still need tools to poke the target memory across the process boundaries, but it's really no different than ptrace().

Assuming the current implementation proposal does not significantly change when the process manager takes over the signal sending role from the kernel, this is to some extent true, but it doesn't require any dynamic mapping of other processes' memory, or using kernel interfaces for register modification. The target threads themselves handle register save/restore, and the temporary old register values (like the instruction pointer before being overwritten) are stored in a shared memory page, so apart from the suspend/interrupt logic, the kernel only needs to be able to set the target instruction pointer. It's too early to say, but maybe this will be reduced to a userspace-controlled IPI primitive?

(The kernel does already support ptrace though.)

Not really userspace

Posted Jul 18, 2024 19:51 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> It's too early to say, but maybe this will be reduced to a userspace-controlled IPI primitive?

Looks like it. Basically, the whole design can be:

1. A separate signald process that provides the API for the signal masking and queueing.
2. Signal functions in libc simply do RPC calls to the signald.

The kernel then needs to have this additional functionality:
1. A syscall to pause a given thread, and return the thread context (register file and whatever additional information needed). The pause functionality can work even if the thread is in the kernel space, or it can be deferred to the syscall return time.
2. A syscall to resume a given thread with the provided thread context.
3. Asynchronous exceptions (like SIGBUS/SIGSEGV) in the kernel automatically pause the offending thread, and submit the thread context to the signald via some kind of IPC.

Signald can then do all the processing and masking logic. It also neatly removes from the kernel all the corner cases with double signals, signal targeting, and so on.

It also opens the way for a better API in the future.

Not really userspace

Posted Jul 19, 2024 15:34 UTC (Fri) by 4lDO2 (guest, #172237) [Link] (2 responses)

> 1. A separate signald process that provides the API for the signal masking and queueing.
> 2. Signal functions in libc simply do RPC calls to the signald.

This is not how the current implementation works, and would probably be too inefficient for signals to be meaningful for non-legacy software. Currently, sigprocmask/pthread_sigmask, sigaction, sigpending, and the sigentry asm which calls actual signal handlers, are implemented without any syscalls/IPC calls; they only modify shared memory locations. Sending process signals (kill, sigqueue) requires calling the kernel (later, the process manager) for synchronization reasons. And although sending thread signals (raise, pthread_kill) currently also calls the kernel, the latter may become possible in userspace too, only calling the kernel if the target thread was blocked at the time the signal was sent (like futex), which is what I meant by "userspace-controlled IPI primitive".

> The kernel then needs to have this additional functionality:
> 1. A syscall to pause a given thread, and return the thread context (register file and whatever additional information needed). The pause functionality can work even if the thread is in the kernel space, or it can be deferred to the syscall return time.
> 2. A syscall to resume a given thread with the provided thread context.
> 3. Asynchronous exceptions (like SIGBUS/SIGSEGV) in the kernel automatically pause the offending thread, and submit the thread context to the signald via some kind of IPC.

That is exactly what ptrace allows, but this signals implementation is not based on tracing the target thread and externally saving/restoring the context, it's based on *internally* saving/restoring the context on the same thread. Very similar to how an interrupt handler would work. The kernel only needs to be able to save the old instr pointer, jump userspace to the sigentry asm, and mask signals; the target context will *itself* obtain a stack and push registers, etc. The same applies to exceptions, which will be *synchronously* handled (using a similar mechanism as signals), also analogous to CPU exceptions like page faults. Though it might make sense to allow configuring exceptions as asynchronous as an alternative, so that a (new) tracer is always notified when e.g. a crash occurs, if a program is not explicitly prepared for such events.

Not really userspace

Posted Jul 19, 2024 17:40 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> This is not how the current implementation works, and would probably be too inefficient for signals to be meaningful for non-legacy software

Honestly, signals shouldn't be used for non-legacy software. It's a bad primitive, they're not composable, there's a limited number of them, and so on.

If instead you have a primitive specifically designed as a way to manipulate running threads, then it might be more useful. A great example is the Go runtime, which uses signals to interrupt inner loops. Once a thread is interrupted, it runs conservative pointer scanning on the most recent stack frame and registers, to protect new objects from being garbage-collected.

Additionally, the handler doesn't _have_ to be in a different process. It can be in a background thread within the same process, so the amount of context switches can be the same compared to regular signal handling.

> That is exactly what ptrace allows, but this signals implementation is not based on tracing the target thread and externally saving/restoring the context, it's based on *internally* saving/restoring the context on the same thread.

Yeah, that has been a constant issue with signals. It depends on the thread's environment being sane, so sigaltstack() was an inevitability. And if you have sigaltstack(), then why not just extend it to handling via an IPC?

Not really userspace

Posted Jul 19, 2024 21:14 UTC (Fri) by 4lDO2 (guest, #172237) [Link]

> Honestly, signals shouldn't be used for non-legacy software. It's a bad primitive, they're not composable, there's a limited number of them, and so on.

I agree signals are a bad primitive for high-level code, and it's a shame POSIX has reserved almost all of the standard signals, many of which signals are too low-level for (SIGWINCH, for example). Signalfd or sigwait are much better in those cases, as is a high-level queue-based abstraction like the `signal-hook` crate. It would probably be better if the 'misc' signals were instead queue-only, or not signals at all, if exceptions and signals were separated, and possibly if SIGKILL and SIGSTOP were made non-signals.

> If instead you have a primitive specifically designed as a way to manipulate running threads, then it might be more useful.

This is sort of what I'm trying to reduce the kernel part of the implementation into: just a way to IPI a running thread and set its instruction pointer, and then let that thread decide what it should do. Possibly even literally using IPIs, such as Intel's SENDUIPI feature, and possibly using "switch back from timer interrupt" hooks (with the additional benefit of automagically supporting restartable sequences). This would involve no context switches at all, although a mode switch for the receiver, if the sender and receiver are running simultaneously.

This is of course useful for runtimes like Go, the JVM, and possibly even async Rust runtimes (maybe a Redox driver can be signaled directly if a hardware interrupt occurs, coordinated with the runtime), which aren't (necessarily) based on switching stacks.

> Additionally, the handler doesn't _have_ to be in a different process. It can be in a background thread within the same process, so the amount of context switches can be the same compared to regular signal handling.

> Yeah, that has been a constant issue with signals. It depends on the thread's environment being sane, so sigaltstack() was an inevitability. And if you have sigaltstack(), then why not just extend it to handling via an IPC?

Switching stacks is, apart from TLS (assuming x86 psabi TLS is required), virtually the same thing as switching between green threads, and on some OSes regular threads and green threads are even the same (pre-Windows 11 with UMS, AFAIK). That could perhaps eventually include Redox. I don't understand why one would want IPC (assuming you mean process and not processor) except when tracing, as that'd suffer from the usual context switch overhead, which is probably too high for a language/async runtime.

Not really userspace

Posted Jul 18, 2024 9:28 UTC (Thu) by 4lDO2 (guest, #172237) [Link] (2 responses)

Although it's possible this may change, it's currently sufficient for the kernel to just set the instruction pointer (and save the old one inside a shared memory page), so no suspend is required. This is done when the kernel switches back from a timer interrupt, whereas EINTR syscall returns are handled explicitly by userspace itself, by similarly jumping to the sigentry asm.

Not really userspace

Posted Jul 19, 2024 7:42 UTC (Fri) by ddevault (subscriber, #99589) [Link] (1 responses)

Won't this fall apart when SMP is added to the mix? (Does Redox support SMP yet?)

Not really userspace

Posted Jul 19, 2024 15:10 UTC (Fri) by 4lDO2 (guest, #172237) [Link]

Redox has supported SMP from very early on, even though it's not that optimized yet. As for the mechanism of signal delivery, that's mostly unaffected by SMP, since there's a (sigatomic) flag to disable signals (like cli/sti) which is set when the kernel jumps to sigentry, and which is also set inside many redox-rt functions like pthread_sigmask. And this location is only jumped to when the kernel is already running the same context.

The synchronization works as follows: for thread signals (pthread_kill, raise), the thread pending and thread mask bits are inside the same 64-bit atomic word, so the thread is always either correctly unblocked by the kernel/proc manager, or pthread_sigmask will see that signals were pending at the time it blocked/unblocked signals. For process signals (kill, sigqueue, etc.), it's a little trickier; the process pending bits are cleared by notified threads competitively, when detected, and in rare cases spurious signals may occur (which are properly handled). Both process and thread signals currently require the kernel to hold an exclusive lock when *generating* the signal (especially the SIGCHLD logic which really can't be synchronized otherwise), but it's possible pthread_sigmask will be able to even bypass the kernel later.
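The "same 64-bit atomic word" trick for thread signals can be illustrated with C11 atomics. This is a hedged sketch of the idea only, not Redox's actual encoding: the low half holds pending bits, the high half holds the mask, and because a single atomic read-modify-write touches both halves, the sender and the masking thread can never miss each other:

```c
#include <stdatomic.h>
#include <stdint.h>

/* One 64-bit word per thread: low 32 bits = pending signals,
   high 32 bits = blocked (masked) signals. Illustrative layout. */
static _Atomic uint64_t sigword;

#define PENDING(w)  ((uint32_t)(w))
#define MASK(w)     ((uint32_t)((w) >> 32))

/* Sender side: setting the pending bit also atomically reveals
   whether the signal was blocked at that instant, i.e. whether the
   target must instead be woken through the kernel. */
int send_signal(int sig) {
    uint64_t old = atomic_fetch_or(&sigword, UINT64_C(1) << sig);
    int was_blocked = (MASK(old) >> sig) & 1;
    return !was_blocked;   /* 1 = deliverable right away */
}

/* Receiver side: blocking/unblocking is one atomic op too; because
   pending and mask share a word, the returned snapshot cannot miss
   a signal that raced with the mask change. */
uint32_t block(uint32_t block_mask) {
    uint64_t old = atomic_fetch_or(&sigword, (uint64_t)block_mask << 32);
    return MASK(old);      /* previous mask */
}

uint32_t unblock_and_fetch_pending(uint32_t unblock_mask) {
    uint64_t old = atomic_fetch_and(&sigword,
                                    ~((uint64_t)unblock_mask << 32));
    return PENDING(old) & unblock_mask;   /* signals now deliverable */
}
```

The key property: send_signal learns atomically whether the target had the signal blocked, and unblock_and_fetch_pending cannot race with a concurrent sender, which is the "either correctly unblocked, or pthread_sigmask sees the pending bits" guarantee described above.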

Not really userspace

Posted Jul 17, 2024 8:56 UTC (Wed) by Wol (subscriber, #4433) [Link] (8 responses)

> The second approach will probably be slightly non-POSIX and can be opt-in. But it provides a _sane_ way to handle everything, even asynchronous signals like SIGSEGV. The monitoring thread can even be hardened with a separate address space.

Given that Linus regularly ignores what he considers stupid Posix design decisions, that's the way I'd go. "This is the Redox way. This is the design rationale. If you want Posix here's a bodge to support it".

If you can implement Posix within a sane design rationale, then go for it! If you can't, well don't let Posix break the rest of your design!

Cheers,
Wol

Not really userspace

Posted Jul 18, 2024 5:25 UTC (Thu) by wahern (subscriber, #37304) [Link] (7 responses)

SIGSEGV is usually a synchronous signal, not asynchronous. That is, it's delivered to the same thread which triggered it, so you can rely on the control flow being paused rather than continuing to run, and having the ability to inspect and modify it safely from the same context, address space and all. POSIX permits longjmp'ing from a signal handler and this can be used in a well-defined manner to greatly increase (as in 10-100x) the performance of certain operations, such as hardware-accelerated array boundary checking, for which you would usually use a synchronous signal like SIGSEGV. In hardware environments like CHERI, most array bounds checking could be implemented by language runtimes using direct hardware facilities, rather than only explicitly in niche scenarios. The JVM uses a similar trick ubiquitously for NULL dereference checking; its implementation doesn't rely merely on POSIX, but would require a similar facility all the same.[1]

These are considered "hacks" now, but this was the original purpose of signals: to provide a thin and safe[2] abstraction over hardware interrupts that userspace could make use of. As common usage and experience of signals evolved, e.g. relying on SIGHUP as a notice to reload a configuration file, signals began to seem hopelessly baroque and anachronistic. Most developers now live in a world much further up the stack of abstractions. But the original motivation and need still exists, and if we really care about software security, we need *more* *safe* programmatic access to hardware facilities from userspace to remove complexity from the kernel. I agree userspace signals sounds like a good idea (notwithstanding that kernel implementations are already quite simple and thin, shunting much of the complexity to userspace by design), but don't throw the baby out with the bath water by only catering to the usages that don't really benefit from Unix signal semantics.

[1] In microbenchmarking branch prediction makes explicit boundary and NULL checking seem free, but branch prediction is a limited resource and is better spent on other application-specific logic. SIGSEGV effectively lets you delegate these tasks to the MMU, which it has to do anyhow. Why do it twice, soaking up other precious hardware resources in the process?

[2] From the kernel's and system's perspective, as opposed to letting processes poke at hardware facilities more directly using more complex protocols.

Not really userspace

Posted Jul 18, 2024 7:50 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> POSIX permits longjmp'ing from a signal handler and this can be used in a well-defined manner to greatly increase (as in 10-100x) the performance of certain operations, such as hardware-accelerated array boundary checking, for which you would usually use a synchronous signal like SIGSEGV.

"Press [X] to doubt".

Signals are not at all fast because they necessarily involve several context switches. I wrote a short benchmark: https://gist.github.com/Cyberax/5bd53bff3308d6e026d414f93... - it takes about 2 seconds on my Linux machine to run 1 million iterations of SIGSEGV handling.

Perf data: https://gist.github.com/Cyberax/aa96a237a9b04ed6f25e09c63... - so around 500k signals per second. That's not bad, but it's also not that _great_ as a primitive for high-performance apps.

Not really userspace

Posted Jul 18, 2024 8:31 UTC (Thu) by mgb (guest, #3226) [Link] (1 responses)

But what if most of your array accesses are in bounds and don't cause a signal?

Not really userspace

Posted Jul 18, 2024 19:08 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Probably. But it can safely work only with hardware-assisted bounds control, where each pointer is annotated with a length.

And if you have such hardware capabilities, you should use a different primitive for flow control.

Not really userspace

Posted Jul 18, 2024 9:40 UTC (Thu) by 4lDO2 (guest, #172237) [Link] (2 responses)

A SIGSEGV signal, which on Redox will be handled as an 'exception' that userspace can later call a signal handler for (basically the same type of jump, passing the page fault address, old instr pointer, page fault flags, etc), does *not* necessarily require several context switches. In fact, it can in theory be just as fast as a kernel-corrected page fault, which should be slightly but not significantly slower than a syscall, i.e. two mode switches. It wouldn't surprise me if Linux's implementation is not as efficient though.

The point of catching SIGSEGV this way, is not that exceptions are faster than checks, but that they *are* faster overall if the probability of the check failing, is low enough to justify avoiding this check in the general case. After all, this is why Linux (and Redox) implements copy_from_user as a catchable memcpy function, rather than walking the page tables every time. Most applications won't EFAULT more than once.

Not really userspace

Posted Jul 18, 2024 19:30 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> A SIGSEGV signal, which on Redox will be handled as an 'exception' that userspace can later call a signal handler

FWIW, Windows also uses a similar model. Memory faults can be caught via SEH ("Structured Exception Handling"). It still has to round-trip through the kernel, but it's more lightweight compared to signals.

> The point of catching SIGSEGV this way, is not that exceptions are faster than checks, but that they *are* faster overall if the probability of the check failing, is low enough to justify avoiding this check in the general case. After all, this is why Linux (and Redox) implements copy_from_user as a catchable memcpy function, rather than walking the page tables every time. Most applications won't EFAULT more than once.

Sure, but then it's also fine if the signal handling is done through a userland process/thread. It will add a couple of context switches, but it'll still be fast enough.

Not really userspace

Posted Jul 19, 2024 16:16 UTC (Fri) by 4lDO2 (guest, #172237) [Link]

> FWIW, Windows also uses a similar model. Memory faults can be caught via SEH ("Structured Exception Handling"). It still has to round-trip through the kernel, but it's more lightweight compared to signals.

That's pretty cool. My understanding is that Windows generally offloads much more logic to userspace, as they can freely change the kernel/user ABI.

> Sure, but then it's also fine if the signal handling is done through a userland process/thread. It will add a couple of context switches, but it'll still be fast enough.

Yeah probably. Redox will most likely implement userspace exception handling synchronously, like signals, but for many workloads those edge cases would presumably not be that noticeable either way.

Not really userspace

Posted Jul 20, 2024 0:53 UTC (Sat) by am (subscriber, #69042) [Link]

I'm not sure about bounds checking, but it's true that the JVM uses SIGSEGV for optimizations. It's optimizing for cases where it doesn't happen, though.

If you have hot code with branches checking for null that aren't being taken, the JIT will simply optimize out the checks, giving you better throughput. If a value happens to become null after all, it will dereference it, catch the SIGSEGV, throw out the optimized code ("deoptimize"), and then continue where it left off, giving you a little latency spike.

Not really userspace

Posted Jul 17, 2024 9:21 UTC (Wed) by sthibaul (✭ supporter ✭, #54477) [Link]

> How I would do truly userspace signals:

This is indeed what GNU/Hurd does.

It's a bit hairy in some places to manage the interrupted context, but it does work.


Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds