Extending time slices for user-space locks

By Daroc Alden
February 19, 2025

Steven Rostedt recently posted a patch set that could help improve the performance of certain user-space applications by giving the scheduler more context about when they are safe to interrupt. The patch set lets programs request a small grace window before they can be interrupted so that they can relinquish any locks, decreasing the amount of time that other threads have to spend waiting. Rostedt shared performance numbers suggesting that the patch might cut the amount of time spent acquiring locks in half for some programs — although, since his test was specifically tuned for this case, real-world projects should expect a somewhat less dramatic improvement. The change received some pushback from scheduler maintainer Peter Zijlstra, who objected to the patch set's approach.

Background

The core job of the scheduler is to decide which task ought to be running on each CPU at a given time. In order to carry out its decision, it will preempt running tasks at different points — when a task has been running for its allotted time slice, when a hardware interrupt arrives, when a higher-priority task needs to run, etc. When a task is running in user space, preempting it is relatively straightforward. But preempting kernel code requires a bit more cooperation.

The Linux kernel has several different levels of preemption that users can select, from PREEMPT_FULL, which can interrupt kernel code at (almost) any point all of the way to PREEMPT_NONE, which never interrupts kernel code except at explicit preemption points. In 2024, the kernel gained a new level, PREEMPT_LAZY. This setting makes it so that when an event might normally cause another thread to preempt the running thread, it can instead just set a flag. When the running thread returns to user space, or when the next scheduler tick occurs, the kernel looks at that flag to decide whether to reschedule the running thread.

There are complications, of course. Scheduler ticks, hardware interrupts, and realtime threads all impact preemption in slightly different ways. LWN has covered the details for years. But broadly, PREEMPT_LAZY combines the best performance characteristics of PREEMPT_FULL and PREEMPT_NONE — the running thread can be preempted at almost any time (allowing for low-latency realtime processes), but it mostly won't be. In the typical case, the thread will finish whatever its current task is, and only be preempted on return to user space or when its time slice ends, which is better for performance.

In particular, lazy preemption lets the kernel defer preemptions until a task would need to make a context switch anyway, avoiding unnecessary switches and therefore improving performance. The decision about whether to let a task keep running can only be made with information that the kernel has, however. If a thread uses a purely user-space lock, such as a spinlock, then preempting it could require other threads to waste time waiting for the thread to be scheduled again.

So Rostedt's patch set gives user-space programs a way to signal the kernel when they are in a critical section; it adds a field to the structure used for restartable sequences that the kernel will check when it preempts the program. This structure is mapped into both kernel memory and the memory of the user-space program, so keeping the kernel up to date doesn't impose the overhead of a system call. If the program indicates that it's in a critical section and a preemption event occurs, the kernel will check whether the reason for the preemption is something that can be slightly delayed (according to the flag for PREEMPT_LAZY). If so, the scheduler will wait a short time so that the program can give up its locks.

The kernel also uses the same shared memory to tell the user-space process that it has received a grace period, and it should make a system call (any system call) once it's out of the critical section to let the kernel schedule the next task. If an uncooperative user-space program doesn't do that, the scheduler will just preempt the program normally after 50µs. In exchange for that short delay, the patch set significantly reduces the amount of time that threads on other CPUs spend waiting for user-space locks. While the exact speedup will depend on the application, Rostedt did share some performance numbers where the amount of time spent waiting on a lock was cut in half.

In his justification for the patch, Rostedt was clear that this is not intended to enable any new abilities that user-space programs did not already have; it's a pure performance optimization for low-level user-space locks. The kernel is not required to honor a process's request for more time. The information is only treated as a hint to allow it to schedule tasks more efficiently.

The discussion

Zijlstra did not like the patch set, saying "I still have full hate for this approach." Despite that discouraging opening, Rostedt asked how he would like the idea to be implemented, prompting Zijlstra to elaborate. He would prefer an implementation that works with all preemption modes, not just PREEMPT_LAZY.

If this feature were implemented for all preemption modes, then that could lead to a situation where an event causes a task to be rescheduled in the kernel, but the same event wouldn't cause a user-space task to be rescheduled, Rostedt said. Although he didn't say it explicitly, the implication was that such a state of affairs wouldn't really make sense. He also explained that his ultimate goal was to apply this performance enhancement to virtual machines.

Zijlstra responded with a list of complaints, first among them that PREEMPT_LAZY is neither the default nor even a recommended preemption method right now. And even if it eventually became the default, there would always be PREEMPT_FULL as an option. So any approach that was based on the chosen preemption mode would not apply to most workloads, he said.

Rostedt didn't see the problem, saying that it was okay if most programs had to wait for PREEMPT_LAZY to become the default in order to benefit from this. Furthermore, users who enable PREEMPT_FULL are trying to minimize latency — and therefore would likely not want this feature in any case.

Suleiman Souhlal asked why PREEMPT_LAZY was needed for Rostedt's patch set at all; Rostedt explained that the code needed some way to distinguish between preemption events where a little extra delay was tolerable (such as reaching the end of a time slice) versus preemption for a realtime thread. "We should not delay any RT tasks ever", he said. Much later in the discussion, he even shared an idea for how his patch set could potentially be extended to other preemption modes in the future by changing the flags the kernel sets for different events.

The idea that realtime threads should not be delayed proved to be another point of contention with Zijlstra, however, who didn't see the problem with adding a small, bounded delay to realtime tasks. He explained that there were already places in the kernel that could hold a lock for 50µs, and didn't think that adding one more in user space would cause problems:

There is no difference between a 'malicious/broken' userspace consuming the entire window in userspace (50us, 20us whatever it will be) and doing a system call which we know will cause similar delays because it does in-kernel locking.

Rostedt remained adamant that adding any extra delay onto realtime tasks was simply not acceptable, and refused to change the patch set in that way. The two of them exchanged several more emails on the topic, but were unable to reach an agreement. Rostedt also got into an extended exchange with Joel Fernandes on the same topic.

This wasn't the first time that the idea of allowing user space to provide more information to the scheduler about locking has been raised. One of Zijlstra's complaints was that he preferred the approach that Prakash Sangappa shared a few months earlier — although he had plenty of disagreements with how that patch set was implemented as well. It had a similar ultimate effect, but implemented the communication between the kernel and user space with a new shared memory mapping and integrated with the scheduling code a bit differently.

While it seems that Zijlstra is in favor of the general idea of a mechanism to better schedule threads using user-space locks, he doesn't think any of the patches on the topic so far are how it should be done. On the other hand, he didn't explicitly refuse either patch set, so Rostedt and Sangappa will probably continue working on their respective patch sets. In time, such a change could provide a significant performance benefit to virtual machines, green-threading libraries, and other user-space programs that make use of spinlocks. If Zijlstra gets his way, the change may even benefit kernels with preemption modes other than PREEMPT_LAZY.

Index entries for this article
Kernel	Scheduler/Time-slice extension

Work environment

Posted Feb 20, 2025 6:26 UTC (Thu) by mb (subscriber, #50428) [Link] (3 responses)

>"I still have full hate for this approach."

What a lovely first reaction to an RFC (Request For Comments).
Full of hate, completely lacking any information.
That's the environment everybody wants to work in.

Work environment

Posted Feb 20, 2025 12:15 UTC (Thu) by hmh (subscriber, #3838) [Link] (1 responses)

Is it even a "first reaction" ? It sure sounds like that there was previous discussion about this approach, were presumably the technical reasons for the "hate" were explained -- note: I am talking about the approach, *not* the patchset.

I did not hunt for it, though.

The rest of the article makes it clear that the "hated" approach does have a chance of being adopted if the alternatives prove to be worse...

Work environment

Posted Feb 20, 2025 16:03 UTC (Thu) by mb (subscriber, #50428) [Link]

>Is it even a "first reaction" ?

Well, it doesn't matter. Because either way this is not a sane answer.

It
- Is literal hate.
- Has no information at all what could be wrong. It misses the important word "because". Even the sentence "I still think this is bad because of the things we discussed earlier" is way better than this.

It causes nothing to resolve the situation.
It only causes related persons to feel bad.
The only answer that can be given is "what is wrong then?". Which is what actually happened. Why not start with explaining what is wrong?

This is all true no matter if there was previous discussion on the topic.
Can we please all (including myself) try to avoid such messages altogether?

I'll stop here, because this is not really on topic. (But it's not on-topic anywhere).

Work environment

Posted Feb 20, 2025 13:00 UTC (Thu) by camhusmj38 (subscriber, #99234) [Link]

It is certainly an unfortunate way of phrasing that. The following may be more politique and certainly more professional:
"I am not convinced by this approach" or "I think this approach is not the right one."

Interesting idea

Posted Feb 20, 2025 16:16 UTC (Thu) by wtarreau (subscriber, #51152) [Link] (2 responses)

I like the idea (I don't judge the implementation details). I'm already dealing with exactly this problem when there are more total threads than CPUs and a thread can be interrupted with a lock held. This particularly happens during service reloads where the old process finishes and the new one starts doing its work, for a short time you see a CPU increase due to this. And indeed, being able to signal to the kernel at low cost "please don't play the fool now" can help.

Maybe one approach to improve the solution and try to make Steven and Peter agree would be to say that PREEMPT_LAZY can even schedule you out before the end of the slice, and that you can avoid this using that flag. This way it couldn't be used to grab more than one slice, and by default you'd just possibly get slightly less (or randomly a bit less or a bit more, i.e. some noise) to limit the benefit of an abuse.

We could imagine for example that a thread that has been granted some extra time will be "punished" by this same amount of time on the next slice. Just an idea.

Interesting idea

Posted Feb 20, 2025 23:57 UTC (Thu) by riking (guest, #95706) [Link] (1 responses)

I also feel like there's some clear gaps in the implementation details, for instance:

Let's presume for my argument here that the existing values of the flag are 0u32 for clear, 1u32 for in lock. Shouldn't the kernel cmpxchg the 1u32 to a 3u32 to indicate "in lock and requested to schedule out", and then when the userspace goes to clear the flag, it fails the cmpxchg 1u32 to 0u32, sees the 3, and does a sched_yield.

That feels like a full protocol to me. Full performance is achieved by fully implementing the protocol, partial implementations still get some benefit.

Interesting idea

Posted Mar 10, 2025 19:47 UTC (Mon) by nevets (subscriber, #11875) [Link]

The kernel doesn't race with user space. The value is per thread and when the thread is in the kernel, user space can't update the value. User space updates the value by 2 every time it enters a critical section and subtracts 2 when it exits. Both increment and decrement must be atomic to the CPU. It can't be a read modify write, as user space can race with the kernel (from an interrupt).

On subtracting 2, if the result is 0 or 1, it is out of its nested critical section (or there's a bug). If it's 1, since user space only adds or subtracts 2, it means that the kernel set it, and it needs to schedule. No cmpxchg needs to be done.

Proxy execution

Posted Feb 20, 2025 18:06 UTC (Thu) by Sesse (subscriber, #53779) [Link] (1 responses)

How does this relate to the proxy execution that was suggested earlier? It seems to me that if you have proxy execution, the benefit of this would go away almost entirely, right? (Save for some context switching cost.)

Proxy execution

Posted Feb 20, 2025 20:22 UTC (Thu) by daroc (editor, #160859) [Link]

Assuming I'm thinking of the right thing, I think proxy execution only works for cases where the kernel can tell that one thread is waiting on another thread. Which is great for things like mutexes, but doesn't work for purely user-space spinlocks.