
Revisiting the kernel's preemption models (part 1)

Posted Sep 21, 2023 17:29 UTC (Thu) by mtodorov (guest, #158788)
Parent article: Revisiting the kernel's preemption models (part 1)

As an extra complication, another option, PREEMPT_DYNAMIC, was added to the 5.12 kernel by Michal Hocko in 2021. It allows the preemption choice to be deferred until boot time, where any of the modes except PREEMPT_RT can be selected by the preempt= command-line parameter. PREEMPT_DYNAMIC allows distributors to ship a single kernel while letting users pick the preemption mode that works best for their workload.
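For illustration, a sketch of how that boot-time choice is made (the debugfs path below is how recent PREEMPT_DYNAMIC kernels expose the current mode; exact availability depends on kernel version and config):

```
# Kernel command line, e.g. in the bootloader configuration:
linux /boot/vmlinuz ... preempt=full     # or preempt=none / preempt=voluntary

# On a running PREEMPT_DYNAMIC kernel with debugfs mounted, the active
# mode can be inspected (shown in parentheses) and even changed:
cat /sys/kernel/debug/sched/preempt
```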

I just wondered how much additional cost it would add in the code to make the choice of preemption model changeable at runtime.

This might allow for changing the mode from "normal operation" to "realtime" without rebooting ...



Revisiting the kernel's preemption models (part 1)

Posted Sep 21, 2023 17:34 UTC (Thu) by mb (subscriber, #50428) [Link] (15 responses)

rt is *much* more than just the preempt mode.
Even today rt cannot be selected with the PREEMPT_DYNAMIC boot parameter.

Revisiting the kernel's preemption models (part 1)

Posted Sep 21, 2023 17:58 UTC (Thu) by mtodorov (guest, #158788) [Link] (14 responses)

I see, this is correct.

However, I can see the need for that, e.g. changing a battleship's computer's mode from "normal operation", in which performance and computing efficiency are preferred, to a "battle mode" or "condition red", where latency would be paramount for the ship's survival.

Revisiting the kernel's preemption models (part 1)

Posted Sep 21, 2023 20:49 UTC (Thu) by geofft (subscriber, #59789) [Link] (1 responses)

It sounds like the reason it isn't possible is that code is compiled differently, e.g. "even most spinlocks become preemptible" - in fact they are no longer spinlocks at all, but relatively normal locks where waiting code sleeps (yields to another process) instead of spinning: https://wiki.linuxfoundation.org/realtime/documentation/t...

So you can't really convert this in a running system, because locks can be held at any time. You'd have to acquire the spinlock (which can block), then turn it into a sleeping lock, then figure out who's been spinning on the spinlock while you held it and somehow convert them. For a use case like going into battle, even if this were technically possible, having the mode switch require acquiring every lock in the system isn't acceptable because it could take arbitrarily long. (And even if it weren't for this, there would be an ongoing performance cost in both modes in order to support checking what mode you're in and so forth, meaning that performance will never be as good as a pure real-time or pure non-real-time kernel.)

Revisiting the kernel's preemption models (part 1)

Posted Sep 21, 2023 21:26 UTC (Thu) by mtodorov (guest, #158788) [Link]

I see your point. It could be reduced to checking a single variable telling which mode the system is currently in.

The locks currently held in non-preemptible mode will obviously remain held until released, but the idea behind locking is that locks are held only briefly for performance reasons anyway.

I recall kernel NMI watchdog timeouts on, e.g., RCU locks being held for 20 seconds, but IMHO that is a sign of something deeply wrong in the kernel code ...

However, I have given up on the idea that I will ever fully understand how 32M+ lines of code work and interoperate.

Spinlocks are a busy-waiting idea dating back to the 6502 and Z80 era, and I think they're evil.

Revisiting the kernel's preemption models (part 1)

Posted Sep 22, 2023 3:37 UTC (Fri) by josh (subscriber, #17465) [Link]

> However, I can see the need for that, e.g. changing a battleship's computer's mode from "normal operation", in which performance and computing efficiency are preferred, to a "battle mode" or "condition red", where latency would be paramount for the ship's survival.

Switching modes is a complex and error-prone operation that affects the operation of all other software, so the net effect of doing this on the verge of a critical situation would be to switch into a less-tested mode for all software. Such a mode switch would need to be regularly tested, but even such testing would not be as good as simply running in that mode all the time.

Revisiting the kernel's preemption models (part 1)

Posted Sep 22, 2023 10:38 UTC (Fri) by cloehle (subscriber, #128160) [Link] (3 responses)

> However, I can see the need for that, e.g. changing a battleship's computer's mode from "normal operation", in which performance and computing efficiency are preferred, to a "battle mode" or "condition red", where latency would be paramount for the ship's survival.

Either your application is real-time critical or it's not. Even in your case, "switching to battle mode" would be time-critical, wouldn't it?

Revisiting the kernel's preemption models (part 1)

Posted Sep 22, 2023 12:17 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

> > However, I can see the need for that, e.g. changing a battleship's computer's mode from "normal operation", in which performance and computing efficiency are preferred, to a "battle mode" or "condition red", where latency would be paramount for the ship's survival.

> Either your application is real-time critical or it's not. Even in your case, "switching to battle mode" would be time-critical, wouldn't it?

Actually if you really mean real-time, no it wouldn't.

The term "real-time" is, sadly, horribly abused and often confused with "online".

We basically have three - conflicting - targets we may want to optimise for.

(1) Batch: Do as much work as possible, as efficiently as possible, and if some jobs take forever, well they do.

(2) Online: Don't keep the user waiting. If that means background tasks get held up because we're wasting time checking to see if the user wants anything, well so be it.

(3) Real time: We have deadlines. They MUST be met. We might have a job that only takes 20 minutes, but it needs to be completed AT midnight tomorrow. Or say we're making jam. Finish too early, and the jam will run away off the toast. Finish too late and we won't be able to get it out of the pan and into the jars. Basically, for real-time it doesn't matter when the deadline is, it matters that you can't move it.

So take your battleship. Real-time doesn't mean the transition will happen quickly. Real-time means you know that if you engage real-time mode, the cascade of actions to change OS state will take, say, pretty much exactly five minutes (or whatever that time is).

Cheers,
Wol

Revisiting the kernel's preemption models (part 1)

Posted Sep 22, 2023 18:11 UTC (Fri) by cloehle (subscriber, #128160) [Link] (1 responses)

I'm aware of the distinction, and maybe my imagination is just lacking, but if the system is always used in non-RT mode until RT mode is needed, yet that switch cannot be bounded by a latency, what use case is there?

The critical path for the RT reaction just increased from signal -> RT reaction (with a defined latency upper bound for RT systems) to signal -> switch to RT-mode -> RT reaction (no upper bound anymore).

Or how does the system know when to be in RT-mode?

Revisiting the kernel's preemption models (part 1)

Posted Sep 22, 2023 18:41 UTC (Fri) by mb (subscriber, #50428) [Link]

> Or how does the system know when to be in RT-mode?

There is no such thing as a "system RT mode".

The kernel just provides mechanisms for tasks to be scheduled with various RT properties (e.g. RT-Fifo, RT-RR, deadline sched, locking latency guarantees, etc..).
But that does *not* mean that every task suddenly becomes a RT task. In fact, in a RT system almost all tasks usually still run in normal scheduling mode.
Normal tasks on a RT kernel basically just behave like tasks on a full-preempt kernel, just with a little more overhead here and there. (I'm simplifying for the sake of brevity.)

A task has to opt-in to being RT. (And selecting a RT sched class is only part of that. There are many many many more things to consider.)

And that's why it doesn't make sense at all to switch between full-preempt and RT-preempt during runtime. That would just add even more overhead with no added benefit. If your task doesn't need RT, then don't schedule it in a RT class.

Revisiting the kernel's preemption models (part 1)

Posted Sep 22, 2023 12:34 UTC (Fri) by kazer (subscriber, #134462) [Link] (6 responses)

The full RT is aimed at cases where safety (hazard) analysis determines that there is a critical task that must be made reliable to the nth degree (what would the worst case be if it failed: is it a car's ABS, a heart monitor, or a CD player?). If it were part of, say, a fly-by-wire system, there would be no need to be in any mode other than full RT, always. Less critical systems could have whatever mode suits the task.

The RT mode is therefore determined by the task before the code is ever deployed, not as a mode of operation switched "on the fly". RT is normally there for safety reasons, and the worst case is what matters. If safety isn't a consideration (only throughput), preemption isn't needed.

Revisiting the kernel's preemption models (part 1)

Posted Sep 22, 2023 12:43 UTC (Fri) by kazer (subscriber, #134462) [Link]

Adding to my previous comment ...

The main feature of full RT is that it is *deterministic* - things happen precisely when predicted and expected. Latency or performance is not what matters; both can be sacrificed for the sake of correctness.

Servers can tolerate long latencies for the sake of throughput, and desktop users can prefer low latencies. While these call for different preemption modes, they are not about real-time behavior. So even in these cases boot-time selection is quite enough.

Revisiting the kernel's preemption models (part 1)

Posted Sep 22, 2023 13:56 UTC (Fri) by farnz (subscriber, #17727) [Link]

RT isn't only of importance in safety-critical systems; in audio systems, 3 ms of latency is roughly equivalent to 1 m distance between the output device and the listener. For simple listening cases, this is a non-issue; you just have a big enough buffer that any plausible latency is covered by the buffer (e.g. a 3,000 ms buffer will cover most likely stalls). But for telephony, live effects and other such cases where there's feedback between the microphone and the output, you need to bound latency to a much smaller value.

The usual distinction in the RT world is between hard RT and soft RT; a system is soft RT if it recovers from a missed deadline as soon as it starts meeting deadlines again, and hard RT if further action has to be taken to recover or if recovery from missed deadlines is not possible. In this setup, audio is soft RT - if you miss a deadline, the audio system is in a fault condition, but as soon as you start hitting deadlines again, it recovers automatically; a fly-by-wire system is hard RT, since a failure to meet deadlines can result in the plane needing maintenance work.

Revisiting the kernel's preemption models (part 1)

Posted Sep 23, 2023 6:17 UTC (Sat) by donald.buczek (subscriber, #112892) [Link] (3 responses)

> If it were part of, say, fly-by-wire system there would not be need to be in any other mode than full RT always.

Good example. Your system doesn't need to be in RT mode if the plane is parked on the ground. It may, for example, be in some low-power mode. When you are ready to go, you switch to fly mode, which implies RT on critical systems. It's not critical how long that takes. What you need, though, is an indication that the switch is completed.

Revisiting the kernel's preemption models (part 1)

Posted Sep 23, 2023 7:23 UTC (Sat) by mb (subscriber, #50428) [Link] (1 responses)

But you can do that now with an existing RT kernel.
If your RT application decides to go non-RT, then it can do that now. Just switch scheduling modes.
And if you decide to put your system into sleep, you can do that now.
Switching the kernel preemption mode at runtime is absolutely not necessary to do all that.

Revisiting the kernel's preemption models (part 1)

Posted Sep 23, 2023 8:27 UTC (Sat) by mtodorov (guest, #158788) [Link]

Yet for a number of reasons, the RT preemption model isn't compiled by default ...

Just as a curiosity, about a year ago the 6.0 kernel compiled with KASAN produced a number of RCU stall warnings.

It was blamed on the KASAN slowdown, but I felt something was odd if some RCU lock was held for more than 20 seconds, enough to trigger the NMI watchdog.

After a year, I get the notion that it was due to the coarse granularity of the locks held while doing some tasks. In the worst case, unlucky spinlock contention can degrade your SMP system's performance to that of a multicore 6502 ... Did somebody take note of that for real-time telephony and heart monitor systems?

Don't take this as if I am ungrateful for the kernel - I am just not fond of spinlocks, and I wish I had a new parallel programming paradigm and a better grasp of lockless algorithms and RCU ...

Revisiting the kernel's preemption models (part 1)

Posted Sep 25, 2023 10:23 UTC (Mon) by farnz (subscriber, #17727) [Link]

You need to separate out PREEMPT_RT from real-time mode.

PREEMPT_RT configured kernels have slightly higher overheads than normal kernels, especially under full load, in return for a hard guarantee on the latencies that can be measured by processes in real time scheduling classes (at the time of this comment, that's SCHED_FIFO, SCHED_RR and SCHED_DEADLINE). If you don't use real time scheduling classes, then the overhead of PREEMPT_RT is entirely wasted, because online (SCHED_OTHER) and batch (SCHED_BATCH, SCHED_IDLE) scheduling classes do not get lower latencies in a PREEMPT_RT kernel.

Separately, a system is "in real-time mode" if it has one or more processes in runnable state in a real-time scheduling class. You can "switch" in and out of "real-time mode" by either blocking all processes in real-time scheduling classes on a kernel primitive such as a futex, or by removing all processes from real-time scheduling classes.

You can be "in real-time mode" with a PREEMPT_NONE or PREEMPT_VOLUNTARY kernel; the issue with this is that the kernel does not provide any guarantee that your process will be scheduled in time to meet its deadlines in this situation, but it's useful for (e.g.) providing musicians in a studio with live monitoring of the recording their DAW is making (complete with applied effects), where the worst consequence of missing a deadline is that you'll need to do another take. You'd prefer not to have to do another take, hence using real time scheduling to ensure that only the kernel can cause you to miss a deadline.

You can also use a PREEMPT_RT kernel with no real-time tasks. In this case, the kernel's latency guarantees are worthless, because you have no tasks that get guaranteed latencies, but you pay a small amount of overhead to allow the kernel to reschedule bits of itself to keep the latency guarantees.

The intended case for PREEMPT_RT kernels is to have real-time tasks, at least some of the time, where the system has been analysed to guarantee that the real-time tasks will always meet their deadlines.

You don't need to change the kernel configuration to switch out of "real-time mode" and enter low power states that are incompatible with your deadlines; you just change your scheduling. You can't switch codepaths between PREEMPT and PREEMPT_RT at runtime, because you need to have the system go completely idle in order to change locking, which is also the pre-requisite for kexec. If you do want to switch kernels, you can kexec between kernels at roughly the same time penalty you'd face if you tried to switch at runtime.

Revisiting the kernel's preemption models (part 1)

Posted Sep 22, 2023 9:46 UTC (Fri) by joib (subscriber, #8541) [Link]

From https://lwn.net/ml/linux-kernel/ZQlV5l4pbKunQJug@gmail.com/ it seems one suggestion is:

> Ie. in the long run - after a careful period of observing performance
> regressions and other dragons - we'd only have *two* preemption modes left:

> !PREEMPT_RT # regular kernel. Single default behavior.
> PREEMPT_RT=y # -rt kernel, because rockets, satellites & cars matter.

> Any other application level preemption preferences can be expressed via
> scheduling policies & priorities.

PREEMPT_RT would still be a separate compile time option, due to the difficulty of switching locking primitives on the fly as mentioned in other comments here (and presumably switching to threaded interrupt handlers on the fly could be nontrivial as well). But the default behavior would be something similar to the current PREEMPT_NONE or PREEMPT_VOLUNTARY for non-RT processes, and similar to the current full PREEMPT ("low latency desktop") for RT processes. Best of both worlds.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds