User-space interrupts

By Jonathan Corbet
September 30, 2021

The term "interrupt" brings to mind a signal that originates in the hardware and which is handled in the kernel; even software interrupts are a kernel concept. But there is, it seems, a use case for enabling user-space processes to send interrupts directly to each other. An upcoming Intel processor generation includes support for this capability; at the 2021 Linux Plumbers Conference, Sohil Mehta ran a Kernel-Summit session on how Linux might support that feature.

At its core, Mehta began, the user-space interrupts (or simply "user interrupts") feature is a fast way to do event signaling. It delivers signals directly to user space, bypassing the kernel to achieve lower latency. Our existing interprocess communication mechanisms all have limitations, he said. The synchronous mechanisms often require a dedicated thread, have high latency, and are generally inefficient. Asynchronous mechanisms (signals, for example) have even higher latency. So often the only alternative is polling, which wastes CPU time. It would be nice to have a fast, efficient alternative.

That alternative is user-space interrupts, which will first appear in Intel's "Sapphire Rapids" CPUs. RFC patches supporting this feature were posted in mid-September. Those patches support user-to-user signaling without going through the kernel; instead, the new SENDUIPI instruction allows one process to send an interrupt directly to another process. Future versions will also include kernel-to-user signaling and, eventually, interrupts sent directly to user space from devices.

Mehta put up some benchmark results (which can be seen in the slides) showing that user-space interrupts are nine times faster than using eventfd(), and 16 times faster than using pipes or signals. The advantage is lower if the receiving process is blocked in the kernel, since it is not possible to avoid a context switch in that case. Even then, user-space interrupts are 10% faster for the recipient, and significantly faster for the sender, which need not enter the kernel at all. Florian Weimer asked how user-space interrupts compared to futexes, but evidently that testing has not been done.

Use cases for this feature include fast interprocess communication, of course. User-mode CPU schedulers should be able to benefit from it, as can user-space I/O stacks (networking, for example). Getting the full benefit from this feature will require enhancements to libraries like libevent and liburing. There are no real-world applications using this feature yet, Mehta said; he is interested in hearing about other applications that might benefit from it. Ted Ts'o suggested host-to-guest wakeups in virtualization environments; evidently that use case is being investigated, but there are no real results yet.

For any number of good reasons, user-space processes cannot just arbitrarily send interrupts to others; there is some setup required. On the receiving side, it all starts with a call to:

    uintr_register_handler(handler, flags);

Where handler() is the function that is called to handle a user-space interrupt, and flags must be zero. The definition of the handler function requires a bit of special care; its prototype is:

    void __attribute__ ((interrupt))
        handler(struct __uintr_frame *frame, unsigned long long vector);

The next step is to create at least one file descriptor associated with this handler:

    int uintr_create_fd(u64 vector, unsigned int flags);

Here, vector is a number between zero and 63; one file descriptor can be created for each vector. The process then hands that file descriptor to the sending side. If the sender is another thread in the same process, the hand-off is trivial; otherwise a Unix-domain socket can be used to transfer the descriptor. The sender then performs its setup with:

   int uintr_register_sender(int fd, unsigned int flags);

Where fd is the file descriptor passed by the recipient and flags, as always, is zero. The return value is a handle that can be used with the _senduipi() intrinsic that is supported by GCC 11 to actually send an interrupt to the receiver.

Actual delivery of the interrupt depends on what the receiver is doing at the time; if that process is running in user space, the handler function will be called immediately with the appropriate vector number. Once the handler returns, execution will continue at the point of interruption. If the receiver is blocked in a system call in the kernel, the interrupt will be delivered on return to user space without interrupting the in-progress system call. There is a uintr_wait() system call in the patch set that will block until a user-space interrupt arrives then return immediately, but it is described as a "placeholder" until the desired behavior for this case is worked out.

Prakesh Sangappa asked whether it was really necessary to exchange the file descriptor with all senders; in a system where there could be large numbers of senders, that could get expensive. Mehta replied that there are a couple of optimizations that are being looked at. Ts'o asked whether user-space interrupts could be broadcast to multiple recipients; the answer is that broadcast is not supported.

Arnd Bergmann wanted to know if any thought had been given to emulating this feature on older CPUs. The answer appears to be yes; the kernel will trap the relevant instructions and transparently emulate them. Mehta asked for feedback on the emulation mechanism and, in particular, whether it should be implemented for other architectures. Bergmann discouraged that idea, saying that if user-space interrupts are implemented for those architectures, they will surely not be compatible with the emulated version. Emulation for other architectures, he said, should only be done once those architectures have defined their own instructions.

Greg Kroah-Hartman asked about whether the Clang compiler has support for the _senduipi() intrinsic; that support is being worked on, but is not yet ready. Kroah-Hartman also asked about more details on workloads that benefit from this feature, to which Mehta replied that he did not have anything specific to point to yet.

Mehta closed the session (which was running out of time) by asking what should happen when the recipient is blocked in a system call. As mentioned, the current patch set waits for the system call to return before delivering the interrupt. Should the behavior be changed to be closer to signals, with the interrupt delivered immediately and the system call returning with an EINTR status? Nobody had an opinion to share on that question, so the session ended there.

The video of this talk is available on YouTube.

Index entries for this article
Kernel	Architectures/x86
Conference	Linux Plumbers Conference/2021

User-space interrupts

Posted Sep 30, 2021 20:11 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (19 responses)

> Mehta closed the session (which was running out of time) by asking what should happen when the recipient is blocked in a system call. As mentioned, the current patch set waits for the system call to return before delivering the interrupt. Should the behavior be changed to be closer to signals, with the interrupt delivered immediately and the system call returning with an EINTR status? Nobody had an opinion to share on that question, so the session ended there.

IMHO it depends on multiple factors:

* EINTR is annoying to handle from userspace, so much so that some languages (Python at least) just transparently handle it for you. But you already have to handle it anyway, unless you're exclusively doing SIG_IGN/SIG_DFL for all signals.
* The "wait for something to happen" syscalls (nanosleep, epoll_wait, etc.) should probably be interrupted, regardless of what is decided for other syscalls.
* Since this is supposed to be a "low latency" mechanism, for when signals aren't fast enough, it would be very odd if you had to wait around for the kernel before you could get the interrupt, given that signals actually do interrupt the kernel...

User-space interrupts

Posted Oct 1, 2021 12:10 UTC (Fri) by droundy (subscriber, #4559) [Link] (2 responses)

Would it not be possible to service the interrupt and then finish the system call? Or would that put too strong a constraint on what could be done by the handler?

User-space interrupts

Posted Oct 1, 2021 13:46 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (1 responses)

The handler could twiddle with memory the syscall is working with from the userspace side, no? You'll need a context swap at least anyways so the handler has the right memory mappings.

User-space interrupts

Posted Oct 1, 2021 16:30 UTC (Fri) by nybble41 (subscriber, #55106) [Link]

Another thread in the same process can already change memory (and mappings) while the first thread is in a system call, either because a reschedule occurred during the system call or because the other thread is running on a different CPU. A signal handler doesn't seem like it would change anything in that regard.

User-space interrupts

Posted Oct 3, 2021 2:25 UTC (Sun) by kepstin (subscriber, #72391) [Link] (1 responses)

I'm kinda confused about why hardware support is needed for this. For normally scheduled programs, as likely as not, the process being "interrupted" isn't actually running, so wouldn't the kernel need to step in and schedule the process before the interrupt could be delivered anyways? What's the benefit over a syscall there?

Or... Is this primarily for real time processes maybe? If the processes are carefully scheduled and known to be running concurrently on different cpus, I guess a direct userspace interrupt would be a win for latency.

User-space interrupts

Posted Oct 3, 2021 18:35 UTC (Sun) by Bigos (subscriber, #96807) [Link]

I think the idea is: when two threads are running at the same time on different CPUs one can interrupt the other at an order of magnitude lower latency than the alternatives. The thread might be in the middle of execution of a long running task that does not touch any syscalls. And if the receiving thread is not running, the latency is not worse.

User mode scheduling (aka "green threads") was given as an example use case. If I understand correctly, one thread can preempt the other using a userspace interrupt. The interrupt handler can then modify the state so that on return something else is called, similar to how kernel scheduler works, but without any context switches. However, green threads are usually used when scheduling small short-living (or often-waiting) tasks, so cooperative preemption is enough. And when forceful preemption is necessary, it happens seldom enough for it not to be a bottleneck.

Jens Axboe mentioned io_uring cq notification [1], though that is about kernel -> userspace which has not been implemented yet.

There might be a better use case example that I am not thinking about. In fact, this has been stated as one of the issues on the mailing list [2].

At first, I wanted to point out that userspace RCU (URCU) could be a possible use case as well, but that was already resolved by membarrier() syscall years ago [3]. However, userspace interrupts might improve the performance of this even without broadcast support.

[1] https://lwn.net/ml/linux-kernel/ecf3cf2e-685d-afb9-1a5d-1...
[2] https://lwn.net/ml/linux-kernel/456bf9cf-87b8-4c3d-ac0c-7...
[3] https://lwn.net/Articles/369567/

User-space interrupts

Posted Oct 12, 2021 21:57 UTC (Tue) by anton (subscriber, #25547) [Link] (10 responses)

EINTR is annoying to handle from userspace, so much so that some languages (Python at least) just transparently handle it for you.

My understanding of EINTR (especially after reading Gabriel's Worse-is-better paper) is that it's too complicated to deal with the situation (dealing with a signal) in the kernel, so the kernel returns to user space with EINTR, the signal can be delivered, and afterwards user space sees the EINTR and restarts the system call.

Based on this understanding, I would expect the system-call wrapper to deal with EINTR, i.e., higher-level user programs should never see it. But strangely, AFAIK the system-call wrappers just deliver EINTR to the higher-level user code. Why? Is my understanding of EINTR wrong?

User-space interrupts

Posted Oct 12, 2021 22:54 UTC (Tue) by mpr22 (subscriber, #60784) [Link] (1 responses)

Imagine a daemon in an call to select() with a long or infinite timeout on a collection of sockets, or in a blocking read() on a socket.

Imagine further that you send that daemon SIGHUP, telling it to reload its configuration files.

Do you really want it to wait until it receives its next packet before reloading its configuration files? After all, your config change might mean it needs to close one or more of the sockets it's currently waiting on and/or open new sockets.

I don't.

I'd much rather that its signal handler for SIGHUP just sets a flag saying "config must be reloaded", and that the system call wrapper then returns -1 and sets errno to EINTR, so that my daemon can check its config-reload flag and go "oh hey I've been reconfigured".

User-space interrupts

Posted Oct 13, 2021 5:51 UTC (Wed) by anton (subscriber, #25547) [Link]

In this scenario the signal will be handled right after the system call (not the wrapper) returns, i.e., immediately. The default for SIGHUP is to terminate the process, so the parent process can reread the configuration file and restart the process (just one example on how to deal with that).

User-space interrupts

Posted Oct 12, 2021 23:45 UTC (Tue) by neilbrown (subscriber, #359) [Link] (7 responses)

Your description of EINTR roughly matches my understanding of ERESTART.

In many cases a syscall that returns EINTR should *not* be restarted. A blocking read is an obvious case. The application should be given the opportunity to choose whether to restart.
It *might* be possible for a platform (such as python) to require that syscalls which cannot be restarted don't get used. e.g. reads must be non-blocking. But that is a higher-level design choice than the libc syscall wrapper.

User-space interrupts

Posted Oct 13, 2021 6:37 UTC (Wed) by anton (subscriber, #25547) [Link] (6 responses)

Looking at the wrapper of read() (one of the system calls that , it does not handle ERESTART, and ERESTART is not documented as an error returned from read(), while EINTR is.

Why do you think that a blocking read() should not just continue (by restarting the system call) if the signal handler just returns after doing whatever it was doing? Why should every user of read() have to deal with EINTR?

User-space interrupts

Posted Oct 13, 2021 13:19 UTC (Wed) by madscientist (subscriber, #16861) [Link]

That doesn't work: the user needs to deal with the signal that was received, and since signal handlers are so restricted it usually needs to happen in "normal context" not inside the handler. And after dealing with it, they might or might not want to restart the read(). Perhaps the program got a SIGCHLD and before we restart reading we need to deal with our dead child process. Or maybe the interrupt was SIGINT and we want to shut down the process in a reliable way after doing cleanup, so the signal handler set a flag but now we definitely don't want to re-enter the read() at all.

User-space interrupts

Posted Oct 13, 2021 22:47 UTC (Wed) by neilbrown (subscriber, #359) [Link] (4 responses)

> Why should every user of read() have to deal with EINTR?

It's not "every user of read()", but only every user of read() reading from a file descriptor on which reads can block.
That does NOT include regular files - mainly char devices and sockets.
When you read from something that can block, you usually want more than you can be sure of getting in a single read. You'll usually need to be prepared to read some more anyway (not always, but often).
So you need to be prepared for a short read, and handling EINTR as well is not a whole lot more effort.

You certainly *could* have a platform where read() always retries EINTR, and signal handlers have to use longjmp if they want to abort a system call. But I don't think that would be *clearly* better than the current situation. Maybe marginally better - I don't know.

User-space interrupts

Posted Oct 13, 2021 23:23 UTC (Wed) by mpr22 (subscriber, #60784) [Link] (1 responses)

> But I don't think that would be *clearly* better than the current situation. Maybe marginally better - I don't know.

When an idea involves normalizing the use of longjmp() in code written by mere mortals, that creates in my mind the (theoretically, but probably not practically, rebuttable) presumption that the idea is absolutely terrible.

User-space interrupts

Posted Oct 14, 2021 17:18 UTC (Thu) by anton (subscriber, #25547) [Link]

I found writing a signal handler more of a challenge than performing the longjmp() (and I don't remember ever having a bug due to the longjmp()). The problem is that asynchronous signals can be invoked anywhere, including in the middle of updating some data structure, so you have a chance of corrupting the data structure if you longjmp() out of a signal handler for an asynchronous signal.

User-space interrupts

Posted Oct 14, 2021 17:13 UTC (Thu) by anton (subscriber, #25547) [Link] (1 responses)

Unix has the very good idea that everything is a file. One of the benefits is that you can write some general routine on top of the system calls, and it is useful for all kinds of things. This admittedly does not work all the time, but we should strive for it.

In the present case, requiring the general routine to know whether it is used on a file that can result in EINTR, and how to behave in that case (which is very likely application-dependent) breaks modularity.

The application and its signal handler know how to deal with the situation, and the longjmp() approach is a good one in that sense. Of course, leaving an asynchronous signal with longjmp() has its dangers, but that's still the way we chose in Gforth (where asynchronous signals are rare).

User-space interrupts

Posted Oct 14, 2021 19:51 UTC (Thu) by nybble41 (subscriber, #55106) [Link]

IMHO the unnecessary complication here is that "regular files" are treated specially. When your "regular file" could be backed by a network filesystem, FUSE, NBD, etc. it really ought to be considered more like a socket, subject to potential short reads and returning EINTR on signals whether or not any data has been read.

Users generally expect to be able to use sockets and pipes in place of regular files, e.g. using process substitution in Bash or named FIFOs or Unix-domain sockets in the filesystem, or arbitrary paths under /proc/$PID/fd/. Unless there is a good reason to require capabilities specific to regular files, for example lseek() or mmap()—or the application creates the file itself with O_EXCL—then applications ought to expect that read() and write() may process less data than requested even if the normal case involves regular files.

As for the longjmp() approach, that only works because the kernel backs out of the blocking call before invoking the signal handler. (A longjmp() call from a signal handler can't perform a non-local return out of arbitrary *kernel* stack frames.) At that point it's mostly a matter of policy whether the kernel restarts the system call after the handler returns or just returns EINTR to the caller—either always restarting or always returning EINTR would not simplify the kernel signficantly—and in general matters of policy are best left to application or library code rather than the kernel. Wrapping every non-interruptable read() in a loop to restart it until you get all the data you wanted is not substantially more code, or more *complex* code, than wrapping every read() which you might want to interrupt in a call to setjmp() and communicating that fact to the signal handler so it can decide whether to call longjmp().

POSIX also has these caveats regarding longjmp() from a signal handler:

> It is recommended that applications do not call longjmp() or siglongjmp() from signal handlers. To avoid undefined behavior when calling these functions from a signal handler, the application needs to ensure one of the following two things: … After the call to longjmp() or siglongjmp() the process only calls async-signal-safe functions and does not return from the initial call to main(). … Any signal whose handler calls longjmp() or siglongjmp() is blocked during *every* call to a non-async-signal-safe function, and no such calls are made after returning from the initial call to main().

It would be difficult to guarantee either of these restrictions are met in a complex application with many library dependencies. For example, if you return from a signal handler with longjmp() and then call printf() without masking every signal whose handler could call longjmp() then you've already broken both of those rules and invoked undefined behavior.

User-space interrupts

Posted Sep 18, 2024 6:23 UTC (Wed) by renox (guest, #23785) [Link] (2 responses)

IMHO the CPU should provide a 'wait for interrupt' instruction (maybe with different power saving level configurable).

User-space interrupts

Posted Sep 18, 2024 6:43 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

x86 has it: https://en.wikipedia.org/wiki/HLT_(x86_instruction)

User-space interrupts

Posted Sep 18, 2024 7:09 UTC (Wed) by intelfx (subscriber, #130118) [Link]

Like Cyberax said... this is exactly the instruction(s) that the kernel runs in each and every idle loop :-) And, incidentally, this is exactly how cpuidle works at least on x86 — via different "wait for interrupt" instructions.

User-space interrupts

Posted Oct 4, 2021 9:23 UTC (Mon) by taladar (subscriber, #68407) [Link] (1 responses)

Another half-baked Intel CPU feature that promises performance gains? Sounds to me as if Intel hasn't really learned much from overly focussing on performance at the cost of correctness and security in the past.

User-space interrupts

Posted Oct 7, 2021 3:17 UTC (Thu) by wahern (subscriber, #37304) [Link]

Now that Intel has lost the lead in silicon and raw performance, they're focusing on specialized hardware features that can directly benefit highly specific aspects of the software engineering problem space, particularly those aspects most important to their biggest customers (e.g. Google and Facebook application fleets, cloud hosting providers, etc), while also promoting vendor lock-in. Of course, the stagnation in clock rates and the weight of Amdahl's law also promote this shift. But we've seen this shift to specialized processor features many times before, most recently with HP and especially Sun as they struggled to compete with Intel. Many of Intel's recent hardware features are conspicuous for leveraging peculiarities of their ISA and architecture, for example TSX.

I don't know how self-aware Intel's strategies are, but it matters very little as their strategic options are strongly dictated by classic market dynamics. You can easily predict the directions they'll take, though that's not the same thing as predicting their success, and definitely not a prediction of their imminent demise.

User-space interrupts

Posted Oct 7, 2021 2:52 UTC (Thu) by wahern (subscriber, #37304) [Link]

Ah, the eternal dance between interrupts and polling. It plays out across the entire design space of hardware and software, with preferences seasonly shifting according to predominating concerns and constraints, but rarely aligning even between adjacent spaces. The majority of people working on internet application software might know this dance as events vs threads, and people who implement such frameworks eventually (hopefully) come to understand that it's turtles all the way down--thread scheduler could be based on events/signals/interrupts, which in turn might be triggered by some blocking/waiting/polling mechanism, etc. Processes, signals, actors, CSP, coroutines, even banalities like call stacks, are all just variations on the same basic theme centering around questions of state and control flow management and their reification. They sometimes purported to be the final, concluding step in the dance, at least for large swathes of the problem space, yet invariably the dance continued without missing a beat.

There's no right answer except that it would especially harmonious and pleasing if a user-space interrupts facility was merged at the same time as Google's user-space context switching patches for its User-Managed Concurrency Groups (UCMG) framework: https://lwn.net/Articles/856816/ An earlier incarnation, SwitchTo, is described at https://pdxplumbers.osuosl.org/2013/ocw/system/presentati... (https://web.archive.org/web/20200926200828/https://pdxplu...).

User-space interrupts

Posted Oct 8, 2021 0:34 UTC (Fri) by dancol (guest, #142293) [Link] (1 responses)

> Here, vector is a number between zero and 63; one file descriptor can be created for each vector.

Ugh. So do I understand correctly that we have yet another process-wide resource --- uipi vector numbers in this case --- not centrally allocated by any process-wide authority? How is a shared library supposed to know what vector number to use? Should it guess? What if two libraries do that?

Either the vector number namespace needs to be expanded or libc needs to grow an allocator.

User-space interrupts

Posted Oct 8, 2021 19:00 UTC (Fri) by BenHutchings (subscriber, #37955) [Link]

Plus you can only have one handler function per thread (or process? it's not clear). So there will have to be some definitive user-space library (not necessarily the C library) to manage this resource.

How does CPU state save/restore work with user interrupts?

Posted Oct 8, 2021 17:03 UTC (Fri) by tpo (subscriber, #25713) [Link] (2 responses)

I'm confused:

> [user interrupts] deliver[s] signals directly to user space, bypassing the kernel to achieve lower latency. [...] interprocess [...]

As far as I understand an interrupt on Intel architectures will essentially only store the PC/IP to the stack. To save/restore of the rest of the CPU registers is up to the interrupt handler. Similar for the TLB and the process' memory map.

Enter user interrupts where a process can trigger an interrupt without going through the kernel.

So who will do the CPU state save/restore? The process itself in its own interrupt handler? Thus it will be able to set up random virtual memory maps for itself? And if the interrupting process forgets to flush CPU state, then that will be a information leak from the interrupting process to the interrupted?

What is it that I am missing?

How does CPU state save/restore work with user interrupts?

Posted Oct 8, 2021 18:57 UTC (Fri) by BenHutchings (subscriber, #37955) [Link] (1 responses)

The receiving thread has to already be running on a different CPU, for the interrupt to be delivered directly. If any context switch is needed, it's mediated by the kernel, unless I've very much misunderstood.

How does CPU state save/restore work with user interrupts?

Posted Oct 8, 2021 20:14 UTC (Fri) by tpo (subscriber, #25713) [Link]

Oh! I see, thanks a lot for helping my understanding Ben!

Nevertheless the mechanisms looks quite tricky in practice: the inter-CPU communication/synchronization needs to be flawless otherwise it seems like anarchy and chaos could ensue... let's hope for rock solid CPU engineering!

User-space interrupts

Posted Nov 6, 2021 10:04 UTC (Sat) by mcortese (guest, #52099) [Link]

Isn't an important piece missing here? This patch set must impose additional burden on every context switch in order to keep track of where the receiving process is actually running. Inter-process communications might get some benefits, but has anyone evaluated the penalty for the rest of the code?

User-space interrupts

Posted Apr 5, 2022 11:13 UTC (Tue) by pskocik (guest, #130865) [Link]

Looking forward to this.

I feel the interface/semantics should follow signals as closely as possible. In particular I strongly think those interrupts should break system calls with EINTR if it's not possible to choose (like with SA_RESTART in the case of signals). You can always get SA_RESTART semantics by looping around a EINTR-returning systemcalls (possibly in a library) but if a hung system call won't return, there's nothing userspace can do to react to a signal/interrupt.

As for the fd interface—I don't know if it's the right one. Intuitively it feels like threads should be able to interrupt themselves via their id (so there could be an interrupt version of pthread_kill). Same-process threads implicitly via pid_t and foreign threads via pid_t after acquiring permission somehow, preferable without involving fds, but maybe that's wrong intuition.

I'm curious what other similarities and disimilarities with signals there will be. Blocking masks? SA_NODEFER? alternative stacks?

I also hear the Linux kernel cannot use the x86_64 red zone because interrupts would clobber it. Will these userspace interrupts be able to respect the red zone?