
Bringing restartable sequences out of the niche

By Jonathan Corbet
August 21, 2025
The restartable sequences feature, which was added to the 4.18 kernel in 2018, exists to enable better performance in certain types of threaded applications. While there are users of restartable sequences, they tend to be relatively specialized code; this is not a tool that most application developers reach for. Over time, though, the use of restartable sequences has grown, and it looks set to grow further as the feature is tied to new capabilities provided by the kernel. As restartable sequences become less of a niche feature, some problems have turned up; fixing one of them may involve an ABI change visible in user space.

A restartable sequences overview

As the number of CPUs in a system grows, so does the desire to write and run highly parallel programs. In user space, as well as in the kernel, concurrent code using locks eventually runs into scalability problems, leading to an interest in the use of lockless algorithms instead. The kernel has a distinct advantage when it comes to lockless concurrent access, though, in that code running in kernel space can prevent interruptions from interrupts, preemption, or migration to another CPU during critical sections; user space has no such guarantees. So any user-space lockless algorithm must work correctly in an environment where code execution can be interrupted at any time.

Restartable sequences are one solution to this problem. To use this feature, an application must designate a critical section that does some work, culminating in a single atomic instruction that commits whatever change is being made. One of the earliest uses of restartable sequences was a memory allocator that would, within its critical section, identify an object to allocate from a per-CPU list, then remove it from that list with an atomic compare-and-exchange operation.

It would obviously be problematic if this critical section were to be interrupted between the identification of the object to allocate and the final step committing the change. To work around this problem, the application informs the kernel of the critical section by filling a structure (struct rseq_cs) describing the address range of the instructions within that section, along with a special abort address. If the kernel interrupts the execution of the relevant thread (to preempt it, move it to another CPU, or to deliver a signal), it will, on return to user space, check for the existence of this structure. If the current instruction pointer lies within the critical section, the kernel will cause execution to jump to the abort address. The thread can then clean up after its aborted attempt and restart the operation.
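For reference, the critical-section descriptor is defined in the kernel's UAPI header (linux/rseq.h) essentially as follows; the comments are added here:

    struct rseq_cs {
        __u32 version;            /* currently always zero */
        __u32 flags;              /* deprecated; see below */
        __u64 start_ip;           /* address of the first instruction of the section */
        __u64 post_commit_offset; /* length of the section, in bytes */
        __u64 abort_ip;           /* where execution resumes after an abort */
    } __attribute__((aligned(4 * sizeof(__u64))));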

The hope, of course, is that most runs through the critical section will complete without interruption, accomplishing their job in about the fastest way possible. The extra overhead of dealing with aborts only enters the picture in cases where the critical section is interrupted for some reason.

Using restartable sequences requires setting up a shared memory area using the rseq() system call. The actual critical section almost certainly must be written in assembly. Until relatively recently, there was no support for restartable sequences in the GNU C Library (glibc), though that has been changing. Given the effort that is required to use this feature properly, and given that it is a Linux-only feature, it is not surprising that relatively few application developers have made use of it.
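For programs that set up the area themselves (glibc has registered one for every thread automatically since version 2.35, so this call will fail with EBUSY there), registration looks something like this minimal sketch:

    #define _GNU_SOURCE
    #include <linux/rseq.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Arbitrary signature; the same 32-bit value must be placed in the
       instruction stream just before the abort handler as a sanity check. */
    #define RSEQ_SIG 0x53053053

    /* One shared area per thread, aligned as the ABI requires. */
    static __thread struct rseq rseq_area __attribute__((aligned(32)));

    int main(void)
    {
        /* glibc provides no wrapper for rseq(); call it directly. */
        if (syscall(SYS_rseq, &rseq_area, sizeof(rseq_area), 0, RSEQ_SIG))
            perror("rseq");
        else
            printf("registered; running on CPU %u\n", rseq_area.cpu_id);
        return 0;
    }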

Not so niche anymore

The shared-memory area set up by rseq() allows a thread to tell the kernel when it is executing in a critical section, but the communication through that area goes both ways. Whenever the kernel runs a thread, it will update this area to reflect which CPU and NUMA node the thread ended up on. That information is used by glibc to accelerate functions like sched_getcpu(), which can obtain the needed information directly without calling into the kernel. Other uses of restartable sequences within glibc are under consideration; Mathieu Desnoyers (the creator of this feature) has been advocating for increased use of it within glibc to improve scalability for some time.
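As an illustration, with the rseq_area from the sketch above registered, a getcpu-style query reduces to a single memory read; glibc's real implementation differs in detail but works the same way:

    static inline unsigned int fast_getcpu(void)
    {
        /* The kernel refreshes cpu_id each time it schedules this thread,
           so a relaxed load suffices; no system call is made. */
        return __atomic_load_n(&rseq_area.cpu_id, __ATOMIC_RELAXED);
    }

(Programs that use the glibc-registered area instead locate it with the __rseq_offset and __rseq_size symbols that glibc exports.)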

Recently, there has been interest in another feature intended to help user space write efficient critical sections: time-slice extension. The idea here is that a thread running in a critical section (perhaps holding a lock) could indicate that fact to the kernel as a way of politely asking to not be preempted, even if its time slice runs out. The kernel could, if it is in a good mood, respond by giving the thread a little bit more time to finish its work and release the lock, so that other threads can proceed. It makes obvious sense to tie a feature like this to restartable sequences, so the time-slice-extension patch set added the "please can I run a little longer" bit to the shared area.

Time-slice extension is a feature that affects the core CPU scheduler, and it is likely to see wider use, so it is not surprising that it has brought more attention to restartable sequences. Thomas Gleixner dug into the patches and, not entirely liking what he saw, put together a proposal for an improved interface for time-slice extension. His changes are likely to be the way this work proceeds in the future, and may merit an article of their own, but Gleixner found a few things to improve in the restartable-sequences code as well.

The trouble with restartable sequences

His observations in this area led to the posting of a separate patch set aimed at addressing the problems he found with restartable sequences. The end result is significantly improved performance, which is a good thing given that the performance cost of restartable sequences has been starting to attract attention. Jens Axboe, for example, responded to the series saying: "I'd see rseq at 2-3% of overall CPU time last I tested". Kernel developers will work long and hard to eliminate a performance hit much smaller than that. A couple of the things that Gleixner fixed are good examples of how the combination of history and oversights can slow things down.

There are a number of ways in which a user-space thread might lose access to the CPU. The restartable-sequences code in current kernels maintains a field, rseq_event_mask, in each thread's task_struct structure to track those ways; there are separate bits for preemption, the delivery of a signal, and migration to a different CPU. So, for example, if the kernel ends up delivering a signal to a task, it will set the RSEQ_EVENT_SIGNAL bit in rseq_event_mask. When the time comes to return to user space after whatever has happened, rseq_need_restart() will be called to see if a critical section was interrupted; in that case, rseq_event_mask will be consulted to see whether the critical section should be continued, or whether the thread should instead be diverted to the abort address.
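In current kernels, those bits are defined (lightly simplified here) in include/linux/sched.h:

    /* Lightly simplified; in the real header the bit positions are tied
       to the now-deprecated rseq_cs flag bits discussed below. */
    enum rseq_event_mask_bits {
        RSEQ_EVENT_PREEMPT_BIT,  /* the thread was preempted */
        RSEQ_EVENT_SIGNAL_BIT,   /* a signal was delivered */
        RSEQ_EVENT_MIGRATE_BIT,  /* the thread moved to another CPU */
    };

    enum rseq_event_mask {
        RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
        RSEQ_EVENT_SIGNAL  = (1U << RSEQ_EVENT_SIGNAL_BIT),
        RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
    };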

Gleixner noticed a couple of interesting problems with this code. One is that the kernel is constantly setting the bits in rseq_event_mask, but those bits are only cleared in rseq_need_restart(), and only when the thread happens to be executing within a critical section at the time. Otherwise those bits just pile up there. That is likely to result in a spurious restart the next time an interrupt occurs while the thread is executing within a critical section, taking away some of the performance advantage that the restartable sequences feature was added to provide.

Perhaps more tellingly, though, those bits exist to allow a thread to specify which events should cause its critical section to be aborted. Depending on the nature of a given section, some sorts of events do not risk breaking the algorithm; a thread might be able to handle a signal without risking concurrent access to data needed by its critical sections, for example. But that feature was deprecated for the 6.0 release in 2022; it added complexity without bringing a lot of value. That deprecation was of a relatively severe variety: any process trying to use the feature is immediately killed with a SIGSEGV signal. So the chances are pretty good that there are no actual users of it in the wild now.

Gleixner replaced all of that machinery with a single "an event happened" boolean value. This elimination also allowed the removal of some accesses to user-space memory, which speeds things up. User-space access from the kernel is never entirely cheap, and the need for Spectre mitigations has made it even slower. As a separate optimization, the series changes the remaining user-space access to the masked user-space-access primitives (added in 6.12) that carry a lower mitigation cost.
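The masked pattern clamps an out-of-range pointer into a guaranteed-to-fault address instead of checking it with a branch, avoiding a speculation barrier. This sketch is not the actual rseq code, but is modeled on how these primitives are used elsewhere in the kernel:

    /* Hypothetical helper showing the masked user-access pattern. */
    static int read_user_u32(u32 __user *uptr, u32 *dest)
    {
        u32 val;

        if (can_do_masked_user_access())
            uptr = masked_user_access_begin(uptr);
        else if (!user_read_access_begin(uptr, sizeof(*uptr)))
            return -EFAULT;

        unsafe_get_user(val, uptr, Efault);
        user_read_access_end();
        *dest = val;
        return 0;

    Efault:
        user_read_access_end();
        return -EFAULT;
    }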

There are two ways that a thread running in user space can end up switching to kernel mode: an interrupt happens, or the thread makes a system call. All of the cases that restartable sequences are meant to handle will be the result of an interrupt of some type or another. Threads in critical sections are not supposed to make system calls at all. When running on a kernel built with the DEBUG_RSEQ configuration option, making a system call while in a critical section will result in the immediate termination of the offending process — a gentle hint that the restartable sequences feature is not being used correctly. Without that option (as will be the case in production systems), though, the kernel will handle a system-call return in the same way as an interrupt return: if the thread is running in a critical section and an event has occurred, control will be redirected to the abort handler.
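In rough pseudocode, the check performed on the way back to user space looks like this; the real code in kernel/rseq.c must also fetch and validate the rseq_cs descriptor from user memory:

    /* Simplified sketch, not the kernel's actual code. */
    static void rseq_fixup(struct pt_regs *regs, struct rseq_cs *cs)
    {
        unsigned long ip = instruction_pointer(regs);

        /* Divert only if the thread was interrupted inside the section. */
        if (ip - cs->start_ip < cs->post_commit_offset)
            instruction_pointer_set(regs, cs->abort_ip);
    }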

Gleixner's work separates the handling of the interrupt and system-call cases, since the time-slice-extension feature will only apply to the former. As an optimization, he removed the restartable-sequences handling for system-call returns; the relevant patch notes:

This changes the current behaviour, which just blindly fixes up the critical section unconditionally in the syscall case. But that's a user space problem when it invokes a syscall from within a critical section and expects it to work. That code was clearly never tested on a debug kernel and user space can keep the pieces.

If there are users that behave in this way, and the code has worked so far, they may be in for an unpleasant surprise once this change goes in. That would likely become the cause of an interesting conversation; "user space was always broken" is not normally considered an acceptable argument for causing a working program to break. There is a good chance that this change will have to be reverted if trouble appears. Given the relatively limited use of restartable sequences so far, though, there is reason to hope that no such user exists.

It may be a little while until we find out, anyway. This work will need review, including, presumably, a look at how it interacts with the time-slice extension work. For extra fun, Gleixner has let it be known that he has found more things to fix, so further patches will be forthcoming. But, at their core, the current changes appear to make sense, addressing problems that have managed to sneak into the code over the years. Almost any body of code can benefit from restarted development involving a fresh set of eyes.

Index entries for this article
Kernel: Restartable sequences
Kernel: Scheduler/Time-slice extension



If this threatens a userspace (meaning glibc in particular) change ...

Posted Aug 21, 2025 16:43 UTC (Thu) by davecb (subscriber, #1574) [Link] (4 responses)

... glibc is one of the Linux users of versioned interfaces.

Ask them how to support an abi change by versioning the interface that changed, using linker symbols. High-level (and oversimplified!) view in https://leaflessca.wordpress.com/2018/05/31/an-alternativ..., and the real details in David J Brown's proposal, https://www.usenix.org/legacy/publications/library/procee...

If this threatens a userspace (meaning glibc in particular) change ...

Posted Aug 22, 2025 0:17 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (3 responses)

I wish people wouldn't use ELF symbol versioning. It causes problems the moment you step off of the narrow path of running old binaries AOT-compiled native-code on new systems.

What happens if you call dlsym? What happens if you try to build a program on a new system targeting an old one and pick up the new symbol without realizing it? (glibc refuses to provide a preprocessor macro for targeting older systems.)

Much better to give new functions new names. Don't add a symbol version my_api@2. Just define a my_api2(). It looks uglier in the short term, sure, but saves everyone trouble in the end.

If this threatens a userspace (meaning glibc in particular) change ...

Posted Aug 22, 2025 0:29 UTC (Fri) by intelfx (subscriber, #130118) [Link] (2 responses)

> glibc refuses to provide a preprocessor macro for targeting older systems

Perhaps this is a glibc problem rather than a symbol versioning problem?

If this threatens a userspace (meaning glibc in particular) change ...

Posted Aug 22, 2025 0:43 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (1 responses)

People look to glibc for inspiration on how to use symbol versions in general, and what glibc does is wrong and broken. And no, "just compile on an old Docker image" isn't a solution either: what if I want to write a program that *conditionally* uses new functionality? If I compile against old glibc headers, I don't see definitions of the new functionality. If I compile against new glibc headers, I can't run on older systems. The whole situation sucks. The whole problem is avoided by simply not overloading functions by version.

How do you solve the problem of shipping bugfixes that might break older versions, e.g. the infamous memcpy direction problem? Use what macOS, Windows, and Android do: compatibility versions embedded in the binary manifest.

If this threatens a userspace (meaning glibc in particular) change ...

Posted Aug 23, 2025 1:42 UTC (Sat) by comex (subscriber, #71521) [Link]

macOS does do runtime checks based on compatibility versions, but it also does something more sophisticated.

There's a compiler flag to specify the minimum OS version you need to run on. Meanwhile, every declaration in every system header is marked with a minimum OS version, so you get an error if you accidentally use a newer function (this can be disabled for functionality you want to use conditionally). In more subtle cases, header files can manually check the target version with #if and change their behavior, though this is rarely done.

The goal is that you can always use the latest SDK, even if you are targeting an old OS version (though there is a limit to how far back you can go).

I think glibc would be well-served if it adopted a similar scheme.

Please Rewrite Glibc RWlock or Revert to Drepper's Good Old Implementation.

Posted Aug 22, 2025 3:02 UTC (Fri) by PengZheng (subscriber, #108006) [Link]

> Mathieu Desnoyers (the creator of this feature) has been advocating for increased use of it within glibc to improve scalability for some time.

I would once again propose the buggy rwlock([1]), which is a complicated mess, as a rewrite candidate.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=31477

Followup series

Posted Aug 23, 2025 18:42 UTC (Sat) by corbet (editor, #1) [Link]

For the curious: Thomas Gleixner has posted a followup series fixing another set of problems and offering some strong commentary on the situation as a whole.

I missed something about abort

Posted Aug 24, 2025 17:57 UTC (Sun) by ssmith32 (subscriber, #72404) [Link] (2 responses)

So there's this:

"the kernel will handle a system-call return in the same way as an interrupt return: if the thread is running in a critical section and an event has occurred, control will be redirected to the abort handler. "

And this:

" If the current instruction pointer lies within the critical section, the kernel will cause execution to jump to the abort address. The thread can then clean up after its aborted attempt and restart the operation. "

What's "if an event occurred"?

I read this as saying that invoking a system call in a critical section results in the abort handler being called, which restarts the sequence, which would seem to be a bit (infinite) loopy.

But the rest of the article seems to imply the current (pre-tglx) code inadvertently works on non-debug kernels.

So I missed something, probably obvious.. but I can't sort out what.

I missed something about abort

Posted Aug 24, 2025 18:02 UTC (Sun) by ssmith32 (subscriber, #72404) [Link] (1 responses)

Maybe more succinct: can a system call trigger what the restart code considers "an event" (preemption, interrupt, CPU migration)?

I assume the answer is "possibly", in which case the code loops infinitely. But wondering if I'm misunderstanding.

I missed something about abort

Posted Aug 24, 2025 18:42 UTC (Sun) by corbet (editor, #1) [Link]

Yes, a system call can result in those things. As an extreme example, think of a read() call that has to wait for data to be brought in from disk; that will result in the calling thread losing access to the CPU indefinitely. But almost any system call can result in preemption and/or migration of the calling thread.

