Revisiting the kernel's preemption models (part 1)
Fast memory clearing
The patch set in question adds a function to use x86 string instructions to clear large amounts of memory more quickly; the change produces some nice performance improvements for situations (such as page-fault handling) where ranges of memory must be zeroed out before being given to the faulting process. But there is one little problem with this approach: being able to clear large ranges with a single instruction is nice, but that one instruction can execute for a long time — long enough to create unwanted latencies for other processes running on the same CPU.
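The patch code itself is not shown here, but the general shape of such a primitive is easy to sketch. The following is an illustration only, not code from the patch set; it clears a region with a single "rep stosb" instruction, which stores the byte in AL into memory RCX times:

    /*
     * Illustrative sketch, not the actual patch: clear a region of
     * memory with one x86 string instruction.  A single "rep stosb"
     * can run for as long as the region is large.
     */
    static inline void clear_region(void *dst, unsigned long len)
    {
            asm volatile("rep stosb"
                         : "+D" (dst), "+c" (len)   /* destination, count */
                         : "a" (0)                  /* store zero bytes */
                         : "memory");
    }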
Excess latency caused by long-running operations is not a new problem for the kernel. The usual response is to break those operations up, inserting cond_resched() calls to voluntarily give up the CPU for a bit if a higher-priority process needs to run. It is, however, not possible to insert such calls into a single, long-running instruction, so some other mitigation is needed. Arora chose to add a new task flag (TIF_ALLOW_RESCHED) marking the current task as being preemptible. If the kernel, at the end of handling an interrupt, sees that flag, it knows that it can switch to a higher-priority task if need be. This new flag could be set before clearing the pages, then reset afterward.
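In rough outline, the two patterns look like the sketch below; this is not the actual patch code, clear_region() is the hypothetical helper from the earlier example, and the surrounding function names are made up for illustration, but clear_page(), cond_resched(), and the thread-flag helpers are real kernel APIs:

    /* The conventional fix: do the work in chunks, with a voluntary
     * preemption point between them. */
    static void clear_pages_chunked(struct page *page, unsigned long nr_pages)
    {
            unsigned long i;

            for (i = 0; i < nr_pages; i++) {
                    clear_page(page_address(page + i));
                    cond_resched();         /* give up the CPU if needed */
            }
    }

    /* The TIF_ALLOW_RESCHED pattern described above: mark the task,
     * run the one long instruction, then unmark it. */
    static void clear_pages_allow_resched(struct page *page, unsigned long nr_pages)
    {
            set_tsk_thread_flag(current, TIF_ALLOW_RESCHED);
            clear_region(page_address(page), nr_pages * PAGE_SIZE);
            clear_tsk_thread_flag(current, TIF_ALLOW_RESCHED);
    }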
This mechanism turned out to have some problems. The code setting the flag may be preemptible, but other functions it calls may not be. Other events, such as hardware interrupts or CPU traps (page faults, for example) could put the kernel into a non-preemptible situation as well. Having a flag set that marks the current task as being preemptible anyway is just not going to lead to good things.
It looks like making this idea work with current kernels would require moving away from a task flag, and toward marking specific ranges of code as being preemptible in this way. That could limit how widely this feature could be used, though, since finding out whether the current location is preemptible would require maintaining a data structure of the preemptible ranges and searching it, in the interrupt path, to see if preemption is possible. That made Linus Torvalds a little unhappy:
I really hate that, because I was hoping we'd be able to use this to not have so many of those annoying and random "cond_resched()" calls. [...] I was hoping that we'd have some generic way to deal with this where we could just say "this thing is reschedulable", and get rid of - or at least not increasingly add to - the cond_resched() mess.
Peter Zijlstra pointed out that Torvalds was describing full preemption, which the kernel already supports quite well. That led to a bit of a shift in the discussion.
Preemption models
The traditional Unix model does not allow for preemption of the kernel at all; once the kernel gets the CPU, it keeps executing until it voluntarily gives the CPU up. In the beginning, Linux followed this model as well; over the years, though, the kernel has gained a number of preemption modes, selectable with a configuration option:
- PREEMPT_NONE is the traditional model, with no preemption at all. The kernel must give up the CPU, via a return to user space, a blocking operation, or a cond_resched() call, before another task can run.
- PREEMPT_VOLUNTARY increases (significantly) the number of points where the kernel is said to be voluntarily giving up the CPU. Each call to might_sleep(), which is otherwise a debugging function marking functions that could block, becomes a preemption point, for example.
- PREEMPT is full preemption; the kernel can be preempted at any point where something (such as a held spinlock or explicitly disabled preemption) does not prevent it.
- PREEMPT_RT is the realtime preemption mode, where even most spinlocks become preemptible and a number of other changes are made as well.
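In the kernel's configuration system, these models appear as a mutually exclusive choice; a .config built for the traditional model would contain something like the following (symbol names are the real Kconfig options):

    # Preemption Model (one of the four described above)
    CONFIG_PREEMPT_NONE=y
    # CONFIG_PREEMPT_VOLUNTARY is not set
    # CONFIG_PREEMPT is not set
    # CONFIG_PREEMPT_RT is not set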
These options represent different tradeoffs between throughput and latency. Preemption is not free; it can worsen cache behavior, and tracking the state needed to know whether preemption is safe at any given time has costs of its own. But latency hurts as well, especially for interactive use. At the PREEMPT_NONE end of the scale, only throughput matters, and latencies can be long. As the level of preemption increases, latency is reduced, but throughput might suffer as well.
As an extra complication, another option, PREEMPT_DYNAMIC, was added to the 5.12 kernel by Michal Hocko in 2021. It allows the preemption choice to be deferred until boot time, where any of the modes except PREEMPT_RT can be selected by the preempt= command-line parameter. PREEMPT_DYNAMIC allows distributors to ship a single kernel while letting users pick the preemption mode that works best for their workload.
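On a PREEMPT_DYNAMIC kernel, the mode is selected on the kernel command line with, for example:

    preempt=none          # the traditional model
    preempt=voluntary     # voluntary preemption points
    preempt=full          # full preemption

These three values correspond to the non-RT models described above.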
Torvalds, seemingly looking closely at PREEMPT_DYNAMIC for the first time, observed that it maintains all of the information about whether the current task is preemptible, even when running in the no-preemption modes. As Zijlstra responded, that suggests that the overhead of maintaining that information is not seen as being a problem; Ingo Molnar added that, while it might be nice to patch that overhead out, "it's little more than noise on most CPUs, considering the kind of horrible security-workaround overhead we have on almost all x86 CPU types". That overhead, he said, is less of a concern than preemption causing "material changes to a random subset of key benchmarks that specific enterprise customers care about", so PREEMPT_DYNAMIC works well as it is.
Zijlstra also said that, since PREEMPT_DYNAMIC seems to work for distributors, he is open to removing the other options. While the connection wasn't made in the conversation, doing so might solve the original problem as well. If the kernel is always maintaining the information needed to know when preemption is safe, that information can be used for a safe implementation of TIF_ALLOW_RESCHED. It may not come to that, though; the conversation is ongoing, and some more significant changes to preemption are being considered. Stay tuned for the second part of this series once the dust settles a bit.
Index entries for this article
Kernel | Preemption
The end of ARCH_NO_PREEMPT, at least for UML
Posted Sep 21, 2023 16:13 UTC (Thu)
by saladin (subscriber, #161355)
[Link]
Currently, alpha, hexagon, m68k (classic), and UML don't support PREEMPT, but it was argued that it shouldn't take more than 10 lines of assembly and a smattering of {enable,disable}_preempt() calls to support it [1].
UML is a bit more complicated, but even so there's a patch [2] adding (limited) support for PREEMPT that shows significant improvement over PREEMPT_NONE [3].
This area is worth watching. Maybe we'll get full support for PREEMPT across the board, or maybe we'll just get a few more architectures to support PREEMPT.
[1]: https://lore.kernel.org/linux-um/877comw8m7.ffs@tglx/
[2]: https://lore.kernel.org/linux-um/20230921155522.283582-1-...
[3]: https://lore.kernel.org/linux-um/6934482b-f210-b86f-3fab-...
Posted Sep 21, 2023 17:29 UTC (Thu)
by mtodorov (guest, #158788)
[Link] (17 responses)
I just wondered how much additional cost it would be in the code to make the choice of the PREEMPT model changeable at runtime. This might allow for changing the mode from "normal operation" to "realtime" without rebooting ...
Posted Sep 21, 2023 17:34 UTC (Thu)
by mb (subscriber, #50428)
[Link] (15 responses)
Posted Sep 21, 2023 17:58 UTC (Thu)
by mtodorov (guest, #158788)
[Link] (14 responses)
However, I can see the need for that, e.g. in changing a battleship's computer's mode from "normal operation", in which performance and computing efficiency are preferred, to a "battle mode" or "condition red", where latency would be paramount for the ship's survival.
Posted Sep 21, 2023 20:49 UTC (Thu)
by geofft (subscriber, #59789)
[Link] (1 responses)
So you can't really convert this in a running system, because locks can be held at any time. You'd have to acquire the spinlock (which can block), then turn it into a sleeping lock, then figure out who's been spinning on the spinlock while you held it and somehow convert them. For a use case like going into battle, even if this were technically possible, having the mode switch require acquiring every lock in the system isn't acceptable because it could take indefinitely long. (And even if it weren't for this, there would be an ongoing performance cost in both modes in order to support checking what mode you're in and so forth, meaning that performance will never be as good as a pure real-time or pure non-real-time kernel.)
Posted Sep 21, 2023 21:26 UTC (Thu)
by mtodorov (guest, #158788)
[Link]
The locks which are currently held in a non-preemptible mode are obviously going to stay held, but the idea behind locking is that locks are short-held for performance reasons anyway.
I recall the kernel NMI timeouts on e.g. RCU locks being held for 20 seconds, but IMHO that is a sign of something deeply wrong in the kernel code ...
However, I have given up the idea that I will ever fully understand how 32M+ lines of code work and inter-operate.
Spinlocks are a busy-waiting idea dating back to the 6502 and Z80, and I think they're evil.
Posted Sep 22, 2023 3:37 UTC (Fri)
by josh (subscriber, #17465)
[Link]
Switching modes is a complex and error-prone operation that affects the operation of all other software, so the net effect of doing this on the verge of a critical situation would be to switch into a less-tested mode for all software. Such a mode switch would need to be regularly tested, but even such testing would not be as good as simply running in that mode all the time.
Posted Sep 22, 2023 10:38 UTC (Fri)
by cloehle (subscriber, #128160)
[Link] (3 responses)
Either your application is real-time critical or it's not. Even in your case, "switching to battle mode" would be time-critical, wouldn't it?
Posted Sep 22, 2023 12:17 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (2 responses)
> Either your application is real-time critical or it's not. Even in your case, "switching to battle mode" would be time-critical, wouldn't it?
Actually if you really mean real-time, no it wouldn't.
The term "real-time" is, sadly, horribly abused and often confused with "online".
We basically have three - conflicting - targets we may want to optimise for.
(1) Batch: Do as much work as possible, as efficiently as possible, and if some jobs take forever, well they do.
(2) Online: Don't keep the user waiting. If that means background tasks get held up because we're wasting time checking to see if the user wants anything, well so be it.
(3) Real time: We have deadlines. They MUST be met. We might have a job that only takes 20 minutes, but it needs to be completed AT midnight tomorrow. Or say we're making jam. Finish too early, and the jam will run away off the toast. Finish too late and we won't be able to get it out of the pan and into the jars. Basically, for real-time it doesn't matter when the deadline is, it matters that you can't move it.
So take your battleship. Real-time doesn't mean the transition will happen quickly. Real-time means you know that, if you engage real-time mode, the cascade of actions to change OS state will take, say, pretty much exactly five minutes (or whatever time that is).
Cheers,
Wol
Posted Sep 22, 2023 18:11 UTC (Fri)
by cloehle (subscriber, #128160)
[Link] (1 responses)
The critical path for the RT reaction just increased from signal -> RT reaction (with a defined latency upper bound for RT systems) to signal -> switch to RT-mode -> RT reaction (no upper bound anymore).
Or how does the system know when to be in RT-mode?
Posted Sep 22, 2023 18:41 UTC (Fri)
by mb (subscriber, #50428)
[Link]
There is no such thing as a "system RT mode".
The kernel just provides mechanisms for tasks to be scheduled with various RT properties (e.g. RT-Fifo, RT-RR, deadline sched, locking latency guarantees, etc..).
A task has to opt-in to being RT. (And selecting a RT sched class is only part of that. There are many many many more things to consider.)
But that does *not* mean that every task suddenly becomes a RT task. In fact, in a RT system almost all tasks usually still run in normal scheduling mode.
Normal tasks on a RT kernel basically just behave like tasks on a full-preempt kernel. Just with a little bit more overhead here and there. (I'm simplifying for the sake of briefness)
And that's why it doesn't make sense at all to switch between full-preempt and RT-preempt during runtime. That would just add even more overhead with no added benefit. If your task doesn't need RT, then don't schedule it in a RT class.
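The opt-in described here is, at its simplest, a sched_setscheduler() call; a minimal sketch (the function name opt_in_to_rt() and the priority value are illustrative only):

    /*
     * Minimal sketch of a task opting in to the SCHED_FIFO real-time
     * class.  Requires CAP_SYS_NICE or a suitable RLIMIT_RTPRIO;
     * error handling omitted for brevity.
     */
    #include <sched.h>

    static int opt_in_to_rt(void)
    {
            struct sched_param sp = { .sched_priority = 50 };

            /* pid 0 means "the calling thread" */
            return sched_setscheduler(0, SCHED_FIFO, &sp);
    }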
Posted Sep 22, 2023 12:34 UTC (Fri)
by kazer (subscriber, #134462)
[Link] (6 responses)
The RT mode is therefore determined by the task before code is ever deployed, not as a mode of operation "on the fly". RT is normally used for safety reasons, and the worst case is what matters. If safety isn't a consideration (only throughput), preemption isn't needed.
Posted Sep 22, 2023 12:43 UTC (Fri)
by kazer (subscriber, #134462)
[Link]
The main feature of full RT is that it is *deterministic* - things happen precisely when predicted and expected. Latency or performance is not important, and they can be sacrificed for the sake of correctness.
Servers can have long latency for the sake of throughput, and desktop users can prefer low latencies. While these deal with different pre-emption modes, they are not about real-timedness. So even in these cases, boot-time selection is quite enough.
Posted Sep 22, 2023 13:56 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
RT isn't only of importance in safety-critical systems; in audio systems, 3 ms of latency is roughly equivalent to 1 m distance between the output device and the listener. For simple listening cases, this is a non-issue; you just have a big enough buffer that any plausible latency is covered by the buffer (e.g. a 3,000 ms buffer will cover most likely stalls). But for telephony, live effects and other such cases where there's feedback between the microphone and the output, you need to bound latency to a much smaller value.
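(That equivalence comes from the speed of sound: at roughly 343 m/s, sound travels 343 m/s × 0.003 s ≈ 1 m in 3 ms.)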
The usual distinction in the RT world is between hard RT and soft RT; a system is soft RT if it recovers from a missed deadline as soon as it starts meeting deadlines again, and hard RT if further action has to be taken to recover or if recovery from missed deadlines is not possible. In this setup, audio is soft RT - if you miss a deadline, the audio system is in a fault condition, but as soon as you start hitting deadlines again, it recovers automatically; a fly-by-wire system is hard RT, since a failure to meet deadlines can result in the plane needing maintenance work.
Posted Sep 23, 2023 6:17 UTC (Sat)
by donald.buczek (subscriber, #112892)
[Link] (3 responses)
Good example. Your system doesn't need to be in RT mode if the plane is parked on the ground. It may, for example, be in some low-power mode. When you are ready to go, you switch to fly mode, which implies RT on critical systems. It's not critical how long that takes. What you need, though, is an indication that the switch is completed.
Posted Sep 23, 2023 7:23 UTC (Sat)
by mb (subscriber, #50428)
[Link] (1 responses)
If your RT application decides to go non-RT, then it can do that now. Just switch scheduling modes.
And if you decide to put your system into sleep, you can do that now.
Switching the kernel preemption mode at runtime is absolutely not necessary to do all that.
Posted Sep 23, 2023 8:27 UTC (Sat)
by mtodorov (guest, #158788)
[Link]
Just as a curiosity: about a year ago, kernel 6.0 compiled with KASAN produced a number of RCU stall warnings.
It was blamed on the KASAN slowdown, but I felt something was odd if some RCU lock was held for more than 20 milliseconds, enough to trigger the NMI watchdog.
After a year, I get the notion that it was due to the coarse granularity of the locks held while doing some tasks. In the worst case, unlucky spinlock contention can degrade your SMP system's performance to that of a multicore 6502 ... Did somebody make a note on real-time telephony and heart monitor systems?
Don't take this as if I am ungrateful for the kernel - I am just not fond of spinlocks, and I wish I had a new parallel programming paradigm and a better grasp on lockless algorithms and RCU ...
Posted Sep 25, 2023 10:23 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
You need to separate out PREEMPT_RT from real-time mode.
PREEMPT_RT configured kernels have slightly higher overheads than normal kernels, especially under full load, in return for a hard guarantee on the latencies that can be measured by processes in real time scheduling classes (at the time of this comment, that's SCHED_FIFO, SCHED_RR and SCHED_DEADLINE). If you don't use real time scheduling classes, then the overhead of PREEMPT_RT is entirely wasted, because online (SCHED_OTHER) and batch (SCHED_BATCH, SCHED_IDLE) scheduling classes do not get lower latencies in a PREEMPT_RT kernel.
Separately, a system is "in real-time mode" if it has one or more processes in runnable state in a real-time scheduling class. You can "switch" in and out of "real-time mode" by either blocking all processes in real-time scheduling classes on a kernel primitive such as a futex, or by removing all processes from real-time scheduling classes.
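For example, with the util-linux chrt tool, an existing process can be moved into and out of a real-time class (the PID is illustrative):

    # Put pid 1234 into SCHED_FIFO at priority 50:
    chrt --fifo -p 50 1234
    # And move it back into the normal (SCHED_OTHER) class:
    chrt --other -p 0 1234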
You can be "in real-time mode" with a PREEMPT_NONE or PREEMPT_VOLUNTARY kernel; the issue with this is that the kernel does not provide any guarantee that your process will be scheduled in time to meet its deadlines in this situation, but it's useful for (e.g.) providing musicians in a studio with live monitoring of the recording their DAW is making (complete with applied effects), where the worst consequence of missing a deadline is that you'll need to do another take. You'd prefer not to have to do another take, hence using real time scheduling to ensure that only the kernel can cause you to miss a deadline.
You can also use a PREEMPT_RT kernel with no real-time tasks. In this case, the kernel's latency guarantees are worthless, because you have no tasks that get guaranteed latencies, but you pay a small amount of overhead to allow the kernel to reschedule bits of itself to keep the latency guarantees.
The intended case for PREEMPT_RT kernels is to have real-time tasks, at least some of the time, where the system has been analysed to guarantee that the real-time tasks will always meet their deadlines.
You don't need to change the kernel configuration to switch out of "real-time mode" and enter low power states that are incompatible with your deadlines; you just change your scheduling. You can't switch codepaths between PREEMPT and PREEMPT_RT at runtime, because you need to have the system go completely idle in order to change locking, which is also the pre-requisite for kexec. If you do want to switch kernels, you can kexec between kernels at roughly the same time penalty you'd face if you tried to switch at runtime.
Posted Sep 22, 2023 9:46 UTC (Fri)
by joib (subscriber, #8541)
[Link]
> Ie. in the long run - after a careful period of observing performance
> regressions and other dragons - we'd only have *two* preemption modes left:
>
>    !PREEMPT_RT     # regular kernel. Single default behavior.
>    PREEMPT_RT=y    # -rt kernel, because rockets, satellites & cars matter.
>
> Any other application level preemption preferences can be expressed via
> scheduling policies & priorities.
PREEMPT_RT would still be a separate compile time option, due to the difficulty of switching locking primitives on the fly as mentioned in other comments here (and presumably switching to threaded interrupt handlers on the fly could be nontrivial as well). But the default behavior would be something similar to the current PREEMPT_NONE or PREEMPT_VOLUNTARY for non-RT processes, and similar to the current full PREEMPT ("low latency desktop") for RT processes. Best of both worlds.
Revisiting the kernel's preemption models (part 1)
> As an extra complication, another option, PREEMPT_DYNAMIC, was added to the 5.12 kernel by Michal Hocko in 2021. It allows the preemption choice to be deferred until boot time, where any of the modes except PREEMPT_RT can be selected by the preempt= command-line parameter. PREEMPT_DYNAMIC allows distributors to ship a single kernel while letting users pick the preemption mode that works best for their workload.

Even today, rt cannot be selected with the PREEMPT_DYNAMIC boot parameter.