The real realtime preemption end game
The point of realtime preemption is to ensure that the highest-priority
process will always be able to run with a minimum (and predictable) delay.
To that end, it makes the kernel preemptible in as many situations as
possible, with the exceptions being tightly limited in scope. The basic
mechanics of how that works have been established for a long time, but
there have been a lot of details to resolve along the way. The realtime
preemption work has resulted in the rewriting of much of the core kernel
over the years, with benefits that extend far beyond the realtime use case.
Thomas Gleixner started by noting that, while the realtime preemption project has been underway for nearly 20 years, for him it is actually closer to 25 years; he started working on realtime support for Linux in 1999. Once it's done, he said, there will be "a big party". Is that point at hand? The answer, he said, is "yes, kind of". There is one last holdout to be dealt with: printk().
Whenever code in the kernel needs to send something to the system consoles and logs, it calls printk() or one of the numerous functions built on top of it. One might not think that printing a message would be a challenging task, but it is. A call to printk() can come from any context, including in non-maskable-interrupt handlers or other printk() calls. The information being printed may be crucial, especially in the case of a system crash, so printk() calls have to work regardless of the context. As a result, there are a lot of concurrency and locking issues, and lots of driver-related complications.
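The reentrancy part of the problem can be illustrated with a toy model. The Python sketch below is not kernel code. The `Printk` class and its `driver_hook` are hypothetical: the hook stands in for a console driver that logs a message of its own while printing. The model always stores the message first. It then uses a nesting counter so that a printk() call made from inside the output path records its text without recursing into the console. The kernel has to do the equivalent per CPU and in NMI context, which this single-threaded sketch does not attempt.

```python
class Printk:
    """Toy logger that tolerates being re-entered from its own output path."""

    def __init__(self):
        self.buffer = []          # every message is recorded here
        self.console = []         # what actually reached the (simulated) console
        self.depth = 0            # nesting level of console output in progress
        self.driver_hook = None   # hypothetical: a driver that may call printk()

    def printk(self, msg):
        self.buffer.append(msg)   # store first, so nothing is lost
        if self.depth > 0:
            # Re-entered (e.g. the console driver itself logged something):
            # keep the record, but do not recurse into the console again.
            return
        self.depth += 1
        try:
            self.console.append(msg)
            if self.driver_hook is not None:
                self.driver_hook(self)
        finally:
            self.depth -= 1
```

The nested message is not lost. It sits in the buffer and can be emitted once the outer call has finished.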
printk(), Gleixner said, is fully synchronous in current kernels; a call will not return until the message has been sent to all of the configured destinations. That is "stupid"; much of what is printed is simply noise, especially during the boot process, and there is no point to waiting for it all to go out. Beyond being pointless, that waiting introduces latency, which runs counter to the goals of the realtime work, so the realtime developers have long since moved printk() output into separate threads, making it asynchronous. That code is a bunch of hacks rather than a real solution, though. A better job must be done to make this work useful for the rest of the kernel.
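The difference between the two approaches can be shown with a simulation in which each console has a fixed per-message cost in arbitrary time units. This is a sketch under stated assumptions, not the realtime tree's implementation. `printk_sync()`, `AsyncPrintk`, and `run_printer()` are hypothetical names. In the synchronous model, the caller pays the cost of every console. In the asynchronous model, the caller only appends to a shared buffer, and each console's printer tracks its own position in that buffer.

```python
class Console:
    """A simulated console; `cost` is the time one message takes to emit."""

    def __init__(self, name, cost):
        self.name = name
        self.cost = cost
        self.output = []


def printk_sync(consoles, msg):
    """Synchronous model: return the delay the caller sees."""
    delay = 0
    for con in consoles:
        con.output.append(msg)
        delay += con.cost             # the caller waits for every console
    return delay


class AsyncPrintk:
    """Asynchronous model: printk() only stores the message."""

    def __init__(self, consoles):
        self.buffer = []                                  # shared message buffer
        self.consoles = {c.name: c for c in consoles}
        self.next_seq = {c.name: 0 for c in consoles}     # per-console progress

    def printk(self, msg):
        self.buffer.append(msg)
        return 0                                          # caller never waits

    def run_printer(self, name, budget):
        """One printer's work for one console: emit records until the
        simulated time budget runs out. Returns the number printed."""
        con = self.consoles[name]
        printed = 0
        while self.next_seq[name] < len(self.buffer) and budget >= con.cost:
            con.output.append(self.buffer[self.next_seq[name]])
            self.next_seq[name] += 1
            budget -= con.cost
            printed += 1
        return printed
```

Progress is tracked per console, so the fast console can be drained completely while the slow one is still working through the first record. That is the property that motivates giving each console its own thread.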
The printk() problem has been worked on seriously since 2018, resulting in about 300 patches that have either gone upstream or are waiting in linux-next; this work has been covered here at times. There are, he said, three final patch sets currently in the works to finish the job. A few tricky details are still being worked on. One of those is the handover mechanism; if the kernel has an emergency message to put out (it's crashing, for example), it may need to grab control of a console that is currently printing a lower-priority message. Doing that safely from any context is not an easy thing to do.
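The handover rules can be sketched as a small ownership state machine. The priority levels below (normal, emergency, panic) roughly mirror those used by the new console infrastructure in the upstream printk work. The class and method names, however, are hypothetical. A real implementation must also update the owner and request fields with atomic compare-and-exchange operations, because the contenders may be running on other CPUs or in NMI context. In this single-threaded model, a higher-priority context that finds the console busy posts a request. The current owner checks for it at a safe point between records and hands the console over. Only a panic takes the console without waiting.

```python
from enum import IntEnum


class Prio(IntEnum):
    NONE = 0
    NORMAL = 1
    EMERGENCY = 2
    PANIC = 3


class ConsoleOwnership:
    def __init__(self):
        self.owner = Prio.NONE   # priority of the current owner
        self.req = Prio.NONE     # highest pending handover request

    def try_acquire(self, prio):
        if self.owner == Prio.NONE:
            self.owner = prio
            return True
        if prio <= self.owner:
            return False                 # equal or lower priority must wait
        if prio == Prio.PANIC:
            self.owner = prio            # hostile takeover: do not wait
            self.req = Prio.NONE
            return True
        self.req = max(self.req, prio)   # ask the owner to hand over
        return False

    def safe_point(self):
        """Called by the owner between records; False means it must stop."""
        if self.req > self.owner:
            self.owner = self.req        # pass ownership to the requester
            self.req = Prio.NONE
            return False
        return True

    def release(self):
        self.owner = Prio.NONE
```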
Another ongoing task is marking console drivers that are not safe to use in some contexts; if, for example, outputting a message during a non-maskable interrupt requires doing video-mode setting, it's just not going to work.
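For illustration, marking drivers can be reduced to a per-driver capability flag that is checked before printing from a restricted context. The `atomic_safe` flag and the driver names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConsoleDriver:
    name: str
    atomic_safe: bool  # hypothetical: driver can write from any context, NMI included


def usable_consoles(drivers, in_nmi):
    """Return the names of drivers that may be used in the current context."""
    if not in_nmi:
        return [d.name for d in drivers]
    # In NMI context, skip drivers that need locks, sleeping, or mode setting.
    return [d.name for d in drivers if d.atomic_safe]
```

In this model, a driver that needs video-mode setting would leave `atomic_safe` false. Its copy of the message would stay in the buffer for later output, rather than being attempted from the NMI handler.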
Gleixner finished the prepared part of his talk by saying that, even though it's getting close, nobody should ask him when the work will be done. printk() is unpredictable, and he is no longer willing to even try. Even so, he expressed hopes that the rest of the realtime preemption code would be in mainline before the 20th anniversary comes late in 2024.
An audience member asked whether there had been any interesting changes in
the printk() code over the last year; Gleixner answered that there
have been no fundamental conceptual changes. John Ogness, who has done
much of the printk() work, said that the handover code has been
reduced somewhat, but that some work remains; there are 76 console drivers
in the kernel that need to be fixed, and it may take a while until they are
all done. The handover code has been changed to allow drivers to be
updated one at a time rather than requiring that this work all be done at
once. (See this article for more
discussion on the recent printk() work).
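Converting drivers one at a time means that old-style and new-style consoles have to coexist. A minimal sketch, assuming a hypothetical per-driver `converted` flag: a converted console gets the message queued for its own printer, while a legacy console is still written directly by the caller, as before.

```python
from dataclasses import dataclass, field


@dataclass
class ConsoleEntry:
    name: str
    converted: bool                             # hypothetical: uses the new interface
    output: list = field(default_factory=list)


def dispatch(consoles, msg, queues):
    """Hand msg to every console.

    Converted consoles get the message queued for their own printer thread;
    legacy consoles are written immediately by the caller. Returns the names
    of the consoles the caller had to print to itself.
    """
    printed_inline = []
    for con in consoles:
        if con.converted:
            queues.setdefault(con.name, []).append(msg)
        else:
            con.output.append(msg)
            printed_inline.append(con.name)
    return printed_inline
```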
Masami Hiramatsu asked which kernel messages need to be printed synchronously; Gleixner answered that almost everything should be made asynchronous. Beyond reducing latency associated with printk() calls, asynchronous output allows the creation of a separate kernel thread for each console, letting the faster consoles go at full speed rather than waiting for the slowest one. He also said that the code has been changed to ensure that important messages are fully copied into the message buffer before the first line is output, just in case a faulty console driver brings the whole system down in flames. Further safety is obtained by writing to the known-safe consoles first. If, for example, there is a persistent-memory store available, messages are put there before being sent to physical devices, once again preserving the output even if a faulty driver kills the system.
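The two safety measures can be modeled as follows: the whole message is committed before any output, and known-safe destinations are written first. `Sink`, `known_safe`, and `emit_emergency()` are hypothetical names. The "crash" is simulated by an exception raised from a faulty driver's write method.

```python
class Sink:
    """A simulated output destination; `crashes` models a faulty driver."""

    def __init__(self, name, known_safe=False, crashes=False):
        self.name = name
        self.known_safe = known_safe
        self.crashes = crashes
        self.written = []

    def write(self, line):
        if self.crashes:
            raise RuntimeError(f"{self.name}: driver fault")
        self.written.append(line)


def emit_emergency(buffer, sinks, lines):
    # Step 1: commit every line of the message before touching any device.
    buffer.extend(lines)
    # Step 2: known-safe sinks first; sorted() is stable, so the remaining
    # order is preserved.
    for sink in sorted(sinks, key=lambda s: not s.known_safe):
        for line in lines:
            sink.write(line)  # a faulty driver may bring everything down here
```

The faulty driver aborts the emission, but by then the full message is already in the buffer and in the persistent store. That is the reason for the ordering.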
As the session closed, Clark Williams asked whether, once the printk() patches go upstream, Gleixner would try to push the rest of the realtime code (which wasn't discussed in this session) in the same merge window. The answer was a qualified "yes"; he might try if all of the code is staged in linux-next and seems ready to go.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our
travel to this event.]
Index entries for this article:
Kernel: Kernel messages
Kernel: Realtime
Conference: Linux Plumbers Conference/2023
Posted Nov 16, 2023 14:17 UTC (Thu) by grawity (subscriber, #80596)
I remember when I was managing a large Linux-based gateway: I configured a serial console (over IPMI), and later I added some iptables LOG rules. It turned out that just a few matching packets per second would DoS it, because it wasn't processing any packets while waiting for each log message to go out through ttyS1...
Posted Nov 17, 2023 1:30 UTC (Fri) by areilly (subscriber, #87829)
Posted Nov 17, 2023 5:51 UTC (Fri) by donald.buczek (subscriber, #112892)
Posted Nov 17, 2023 6:23 UTC (Fri) by donald.buczek (subscriber, #112892)
Posted Nov 16, 2023 16:17 UTC (Thu) by IanKelling (subscriber, #89418)
Posted Nov 16, 2023 21:44 UTC (Thu) by itsmycpu (guest, #139639)
Posted Nov 16, 2023 22:53 UTC (Thu) by mjg59 (subscriber, #23239)
Posted Nov 17, 2023 0:47 UTC (Fri) by itsmycpu (guest, #139639)
https://en.wikipedia.org/wiki/Reboot#Warm
"The Linux family of operating systems supports an alternative to warm boot; the Linux kernel has optional support for kexec, a system call which transfers execution to a new kernel and skips hardware or firmware reset. The entire process occurs independently of the system firmware. The kernel being executed does not have to be a Linux kernel.[citation needed]"
Posted Nov 17, 2023 5:33 UTC (Fri) by mjg59 (subscriber, #23239)
Posted Nov 18, 2023 5:36 UTC (Sat) by mirabilos (subscriber, #84359)
In practice, that means x86 hardware such as ThinkPads and other assorted PCs and servers whose BIOS will not overwrite the entire memory during a warm reboot, as well as SPARCstations whose OpenBoot will similarly not clear the high-up memory used for the kernel log buffer.
Posted Nov 19, 2023 3:20 UTC (Sun) by Paf (subscriber, #91811)
And surely this is only possible through the retention of data in memory over reboot! What other magic could do this?
Sorry, but I’d lay a lot of money this is done with storage.
Posted Nov 19, 2023 5:15 UTC (Sun) by mirabilos (subscriber, #84359)
https://mbsd.evolvis.org/cvs.cgi/src/sys/kern/subr_log.c?...
It is called for SPARC from:
For i386, the call is at…
And yes, it’s purely memory-based. It helps immensely in copying e.g. the remainder of a ddb(4) session (in-kernel debugger) out if you don’t have a serial console.
Posted Nov 19, 2023 5:17 UTC (Sun) by mirabilos (subscriber, #84359)
Oh well, it’ll be back at some point.
Posted Nov 19, 2023 18:29 UTC (Sun) by kreijack (guest, #43513)
So the problem is not finding a fixed area in which to store the data, but preventing that area from being cleared during a reboot.
The kind of reboot I am talking about is the one that lets you get out of a "crash", so I think we are talking about a hard reboot. And a hard reboot implies that memory is cleared.
On x86, there are mainly two pstore back ends: the first relies on UEFI variable storage; the second relies on ACPI ERST, which behaves like flash memory.
Posted Nov 19, 2023 18:35 UTC (Sun) by mirabilos (subscriber, #84359)
Posted Nov 19, 2023 18:40 UTC (Sun) by mirabilos (subscriber, #84359)
Yes, it’s not a persistent storage like the BIOS (or EFI) settings.
No, a boot does not imply memory cleaning (except for memory used during boot, of course). It usually does imply some kind of memory test, and several kinds of memory amount probing by different places in the boot process, but these are often nōn-intrusive enough to keep the memory contents.
A cold boot does have empty memory simply because the memory had no power and the memory controller likewise did not refresh the memory banks.
A warm reboot does not have a period of such, so the memory is *usually* retained.
A hard reboot can fall into either category, depending on how it is executed and wired. The usual power button long-press will be a poweroff followed by a mostly-cold boot; a watchdog reboot, or if the kernel crashed but is still able to reboot-ish (even if just by causing a triple-fault) can be warm reboots (this mostly depends on the memory controller to continue refreshing the memory during that, and of course the firmware not overwriting it).
Posted Nov 20, 2023 19:57 UTC (Mon) by kreijack (guest, #43513)
I think that the key word is "*usually*". On my UEFI system, I built a UEFI program that dumps the first 4 bytes at each of the following addresses:
Then it sets these bytes to a specific value and dumps them again.
What I saw is:
This proves that UEFI doesn't reset the memory between program invocations.
Then I "warm rebooted" the system, and I saw the "random values" as in 1). So it seems that, on my system, the memory is cleared across the reboot.
What I'm saying is that at least some BIOSes clear the memory. In my case (an ASUS B550 desktop mainboard), it seems that the BIOS clears the memory.
What I found is that it is possible to force the BIOS not to clear the memory after a reset [1]. But again, this is not typically what happens after a crash; after a crash, you push the physical reset button.
[1] https://stackoverflow.com/questions/36608101/does-a-soft-...
Posted Nov 21, 2023 23:31 UTC (Tue) by mirabilos (subscriber, #84359)
The reset button as the only way out of a crash is such a PC thing though. Some machines have watchdogs, and some have something like ddb(4) on BSD or SysRq on Linux that allow for warm reboots even in the face of a crash.
Posted Nov 19, 2023 5:40 UTC (Sun) by mjg59 (subscriber, #23239)
Posted Nov 19, 2023 18:22 UTC (Sun) by mirabilos (subscriber, #84359)
Posted Nov 19, 2023 9:40 UTC (Sun) by DemiMarie (subscriber, #164188)
How does Windows manage to display the BSOD message?
Posted Nov 19, 2023 14:10 UTC (Sun) by Wol (subscriber, #4433)
I guess it just seizes control of the graphics card, or puts it into text mode, or whatever.
Cheers,
Posted Nov 19, 2023 20:38 UTC (Sun) by ballombe (subscriber, #9523)
(just joking, of course)
Posted Nov 19, 2023 21:07 UTC (Sun) by Cyberax (✭ supporter ✭, #52523)
Windows drivers are much more resilient than the drivers in Linux. A surprising amount of functionality remains working in Windows even if half the kernel is going haywire.
In particular, modesetting and simple framebuffer access have always been a part of the kernel driver. And each driver is also responsible for pre-allocating its object pools, so there's much less dependency on memory allocation. The IRQL system also has a side effect of forcing driver writers to avoid putting anything too involved inside the critical pathways.
Posted Dec 8, 2023 17:45 UTC (Fri) by pawel44 (guest, #162008)
Posted Nov 24, 2023 21:06 UTC (Fri) by mtthu (subscriber, #123091)
Following the link at https://wiki.linuxfoundation.org/realtime/start to the latest development-version patch, cloc says:
Language files blank comment code
-------------------------------------------------------------------------------
diff 1 1899 5676 8032
That is pretty small. I've enjoyed reading about this over the years.
(I’m using a somewhat beefier mirror here to not get the main server slashdotted)
Look for initmsgbuf near the beginning of the file, which gets a pointer to the RAM region.
https://mbsd.evolvis.org/cvs.cgi/src/sys/arch/sparc/sparc...
(initmsgbuf called with an almost fixed (only the oldest systems avoid the first page) address…)
https://mbsd.evolvis.org/cvs.cgi/src/sys/arch/i386/i386/m...
… where msgbufp comes from…
https://mbsd.evolvis.org/cvs.cgi/src/sys/arch/i386/i386/p...
(the __OpenBSD__ ifdef) which sets the virtual address. The physical address (MMU mapping) is done somewhere between locore.s and there, and it looks to me like its location depends on the size of the kernel image, so you’d only get the log messages if you boot the same or a very similar-sized kernel after warm reboot.
And this cannot be done in a generic way.
Think what would happen if this didn't exist: it would allow extracting secrets from memory with a simple reboot at the "right time"; it would be a giant security hole.
- 3GB
- 7GB
- 14GB
1) The first time that I ran the program, I saw "random values", like 0 and other non-zero values.
2) The second time that I ran the program, I saw the same values that I had set in the first iteration.
Wol