Multiple kernels on a single system
The patch set as a whole only touches 1,400 lines of code, adding a few basic features; there would clearly need to be a lot more work done to make this feature useful. The first part is a new KEXEC_MULTIKERNEL flag to the kexec_load() system call, requesting that a new kernel be booted on a specific CPU. That CPU must be in the offline state when the call is made, or the call will fail with an EBUSY error. It would appear that it is only possible to assign a single CPU to any given kernel in this mode; the current interface lacks a way to specify more than one CPU. There is a bunch of x86-64 assembly magic to set up the target CPU for the new kernel and to boot it there.
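To make the shape of the interface concrete, here is a rough user-space sketch of how a launch might look: offline a CPU through the normal sysfs interface, then pass the new flag to kexec_load(). Only the KEXEC_MULTIKERNEL name and the offline-or-EBUSY rule come from the posting; the flag's value, and the assumption that the offlined CPU is selected implicitly, are placeholders here.

    /* Hedged sketch, not the posted interface: boot a spare kernel on an
     * offlined CPU.  KEXEC_MULTIKERNEL's value and the way the target CPU
     * is chosen are assumptions; the segment setup is elided entirely. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/kexec.h>

    #ifndef KEXEC_MULTIKERNEL
    #define KEXEC_MULTIKERNEL 0x00000008    /* placeholder value */
    #endif

    static void offline_cpu(int cpu)
    {
        char path[64];
        int fd;

        snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
        fd = open(path, O_WRONLY);
        if (fd >= 0) {
            write(fd, "0", 1);  /* the target CPU must be offline, or kexec_load() fails with EBUSY */
            close(fd);
        }
    }

    int main(void)
    {
        struct kexec_segment segs[1] = {0};   /* the new kernel's image segments, elided */

        offline_cpu(3);
        if (syscall(SYS_kexec_load, 0UL /* entry */, 0UL /* nr_segments */,
                    segs, KEXEC_ARCH_DEFAULT | KEXEC_MULTIKERNEL) < 0)
            perror("kexec_load");
        return 0;
    }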
The other significant piece is a new inter-kernel communication mechanism, based on inter-processor interrupts, that allows the kernels running on different CPUs to talk to each other. Shared memory areas are set aside for the efficient movement of data between the kernels. While the infrastructure is present in the patch set, there are no users of it in this series. A real-world system running in this mode would clearly need to use this communication infrastructure to implement a lot of coordination of resources to keep the kernels from stepping on each other, but that work has not been posted yet.
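The series defines its own structures and entry points, none of which are shown here, so the fragment below only illustrates the general shape of such a channel: a producer kernel writes into a shared page, publishes a length, then kicks the peer with an IPI. The names (mk_mailbox, send_ipi_to_peer()) are invented for the illustration.

    /* Illustrative single-slot mailbox in a shared page; not the series' API. */
    struct mk_mailbox {
        unsigned int len;                       /* 0 means the slot is empty */
        unsigned char msg[4096 - sizeof(unsigned int)];
    };

    static void mk_send(struct mk_mailbox *mb, const void *buf, unsigned int n)
    {
        while (READ_ONCE(mb->len))              /* wait for the peer to drain the slot */
            cpu_relax();
        memcpy(mb->msg, buf, n);
        smp_wmb();                              /* message bytes before the length */
        WRITE_ONCE(mb->len, n);                 /* publish */
        send_ipi_to_peer();                     /* invented stand-in for the series' IPI hook */
    }

The receiving kernel's IPI handler would, symmetrically, read the message and clear len (with a matching read barrier) to hand the slot back.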
The final patches in the series add a new /proc/multikernel file that can be used to monitor the state of the various kernels running in the system.
Why would one want to do this? In the cover letter, Wang mentions a few advantages, including improved fault isolation and security, better efficiency than virtualization, and the ease of zero-downtime updates in conjunction with the kexec handover mechanism. He also mentions the ability to run special-purpose kernels (such as a realtime kernel) for specific workloads.
The work that has been posted is clearly just the beginning:
This patch series represents only the foundational framework for multikernel support. It establishes the basic infrastructure and communication mechanisms. We welcome the community to build upon this foundation and develop their own solutions based on this framework.
The new files in the series carry copyright notices for Multikernel Technologies Inc, which, seemingly, is also developing its own solutions based on this code. In other words, this looks like more than a hobby project; it will be interesting to see where it goes from here. Perhaps this relatively old idea (Larry McVoy was proposing "cache-coherent clusters" for Linux at least as far back as 2002) will finally come to fruition.
Posted Sep 19, 2025 21:12 UTC (Fri)
by tullmann (subscriber, #20149)
[Link] (2 responses)
In a similar space, the Barrelfish OS (https://barrelfish.org/) was a "multikernel" research operating system in the 2010s that was built around separate "kernels" running on each CPU. Its developers, though, worked to make the communication between cores smooth enough to present a single logical system to the applications running on it.
Posted Sep 20, 2025 21:57 UTC (Sat)
by chexo4 (subscriber, #169500)
[Link]
Posted Sep 23, 2025 19:31 UTC (Tue)
by acarno (subscriber, #123476)
[Link]
In addition to running natively (e.g., multiple kernels on a single multi-core system), they also investigated running across different architectures and performing stack transformation to migrate memory between nodes.
> The project is exploring a replicated-kernel OS model for the Linux operating system. In this model, multiple Linux kernel instances running on multiple nodes collaborate each other to provide applications with a single-image operating system over the nodes. The kernels transparently provide a consistent memory view across the machine boundary, so threads in a process can be spread across the nodes without an explicit declaration of memory regions to share nor accessing through a custom memory APIs. The nodes are connected through a modern low-latency interconnect, and each of them might be based on different ISA and/or hardware configuration. In this way, Popcorn Linux utilizes the ISA-affinity in applications and scale out the system performance beyond a single system performance while retaining full POSIX compatibility.
Project Website: https://popcornlinux.org/
2020 LWN Article: https://lwn.net/Articles/819237/
Posted Sep 19, 2025 21:25 UTC (Fri)
by geofft (subscriber, #59789)
[Link] (10 responses)
I can imagine a whole lot of interesting use cases for this. Off the top of my head:
Posted Sep 19, 2025 22:23 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
There was a project to do the inverse: run Linux on Windows NT. It did not isolate individual CPUs, but otherwise it was a similar idea.
Its website is still up: http://www.colinux.org/
Posted Sep 20, 2025 13:43 UTC (Sat)
by jannex (subscriber, #43525)
[Link] (1 responses)
Apple's M2 and later support nested virtualization. The issue is probably that arm64 nested virtualization support in Linux itself is rather fresh. It was just merged in 6.16 (I haven't tried it yet).
This approach will be hard or impossible to support on Apple silicon systems. There is no easy way to support PSCI, so taking CPU cores offline is currently not supported.
Posted Sep 20, 2025 18:25 UTC (Sat)
by geofft (subscriber, #59789)
[Link]
For whatever reason Apple only enables it with the M3 chip and later, as documented for the high-level Virtualization.framework's VZGenericPlatformConfiguration.isNestedVirtualizationSupported. I also get false from the lower-level Hypervisor.framework's hv_vm_config_get_el2_supported() on my machine.
Posted Sep 21, 2025 4:37 UTC (Sun)
by kazer (subscriber, #134462)
[Link] (4 responses)
That second "foreign" kernel would need to understand the "partition" it is allowed to use, so it won't try to take over the rest of the machine, where another kernel may be running. Unless there is a way to make the hardware understand where that other kernel is allowed to run (basically selectively removing supervisor rights from the foreign kernel).
So I can only see that happening if the second kernel understands multikernel situations correctly as well. Otherwise it is back to hypervisor virtualization.
> old kernel
Sorry, but for the reasons mentioned above (supervisor access to hardware) that old kernel would need to be multikernel compliant as well. Otherwise you need a plain old hypervisor for virtualization.
Posted Sep 21, 2025 12:16 UTC (Sun)
by kleptog (subscriber, #1183)
[Link] (3 responses)
It is already the case that a booting kernel asks the underlying system which part of physical memory it is allowed to use. It can then prepare the kernel mapping so it can only access the parts it is allowed to. It can't assume anything about all the other parts.
Now, this only prevents accidental interference. There's nothing that prevents the kernel from modifying its mapping (dynamically adding RAM/devices is a thing) but it would give a very high degree of isolation. Not as good as a hypervisor, but pretty good.
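For a sense of how that looks with today's interfaces: the memory map a kernel boots with can already be constrained from the command line, and something similar, presumably automated, would be needed for each co-resident kernel. The addresses below are made up, and a real configuration needs more care than this (the region must cover the kernel and initramfs, for instance):

    # First kernel restricted to the low 8GB of RAM:
    linux ... mem=8G
    # Second kernel told to treat only one region as usable RAM:
    linux ... memmap=exactmap memmap=8G@0x200000000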
Posted Sep 21, 2025 17:46 UTC (Sun)
by glettieri (subscriber, #15705)
[Link] (2 responses)
However, in this case the underlying system is the hardware, which doesn't know anything about these partitions. A non-multikernel-aware kernel would discover all the memory and all the devices, and think that it owns everything.
Posted Sep 22, 2025 4:50 UTC (Mon)
by skissane (subscriber, #38675)
[Link] (1 responses)
Maybe someone just needs to add a “telling lies facility” to the hardware/firmware, which the multikernel could use to get the hardware/firmware to lie to the non-multikernel-aware kernel? This could be much more lightweight than standard virtualisation since it wouldn’t be involved at runtime, only in configuration discovery.
Posted Sep 22, 2025 22:13 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Sep 22, 2025 8:10 UTC (Mon)
by rhbvkleef (subscriber, #154505)
[Link] (1 responses)
Posted Sep 25, 2025 12:55 UTC (Thu)
by Karellen (subscriber, #67644)
[Link]
Posted Sep 19, 2025 21:30 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (18 responses)
It would be neat to be able to use industry-standard interfaces like libvirt to work with these guest kernels.
> better efficiency than virtualization,
Like I said, I'd consider this cool thing a *kind* of virtualization, one that trades flexibility for performance, not something *distinct* from virtualization.
Posted Sep 20, 2025 0:45 UTC (Sat)
by stephen.pollei (subscriber, #125364)
[Link] (13 responses)
Posted Sep 20, 2025 15:29 UTC (Sat)
by ballombe (subscriber, #9523)
[Link] (9 responses)
Posted Sep 20, 2025 17:10 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link] (8 responses)
Posted Sep 20, 2025 20:22 UTC (Sat)
by roc (subscriber, #30627)
[Link] (7 responses)
If the Linux kernel had been unable to scale well beyond 16 cores, then this cluster idea might have been a viable path forward. But Linux did scale, and any potential competitor that doesn't is simply not viable for these workloads.
Posted Sep 21, 2025 8:07 UTC (Sun)
by quotemstr (subscriber, #45331)
[Link] (6 responses)
Yes, and those programs can keep running. Suppose I'm developing a brand-new system and a cluster on which to run it. My workload is bigger than any single machine no matter how beefy, so I'm going to have to distribute it *anyway*, with all the concomitant complexity. If I can carve up my cluster such that each NUMA domain is a "machine", I can reuse my inter-box work distribution stuff for intra-box distribution too.
Not every workload is like this, but some are, and life can be simpler this way.
Posted Sep 21, 2025 9:17 UTC (Sun)
by ballombe (subscriber, #9523)
[Link] (5 responses)
<https://en.wikipedia.org/wiki/Single_system_image>
... or HPE will sell you NUMAlink systems with coherent memory across 32 sockets.
But more seriously, when using message passing, you still want to share your working set across all cores in the same node to preserve memory. Replacing a 128-core system with 8 16-core systems will require 8 copies of the working set.
Posted Sep 21, 2025 10:15 UTC (Sun)
by willy (subscriber, #9762)
[Link] (4 responses)
Posted Sep 21, 2025 12:42 UTC (Sun)
by ballombe (subscriber, #9523)
[Link] (3 responses)
Posted Sep 21, 2025 20:19 UTC (Sun)
by willy (subscriber, #9762)
[Link] (2 responses)
Posted Sep 21, 2025 20:35 UTC (Sun)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Posted Sep 21, 2025 20:39 UTC (Sun)
by willy (subscriber, #9762)
[Link]
The patches are to do this automatically without library involvement. I think the latest round were called something awful like "Copy On NUMA".
Posted Sep 20, 2025 18:59 UTC (Sat)
by willy (subscriber, #9762)
[Link] (2 responses)
But Solaris didn't have RCU. I would argue that RCU has enabled Linux to scale further than Solaris without falling off "the locking cliff". We also have lockdep to prevent us from creating deadlocks (I believe Solaris eventually had an equivalent, but that was after Larry left Sun). Linux also distinguishes between spinlocks and mutexes, while I believe Solaris only has spinaphores. Whether that's terribly helpful or not for scaling, I'm not sure.
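To illustrate why RCU matters for scaling: read-side critical sections cost essentially nothing and do not bounce a lock's cache line between cores, so read-mostly data structures stop being a bottleneck. A minimal kernel-style sketch (generic RCU usage, nothing specific to the patches being discussed; updaters still need their own serialization):

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct config {
        int value;
        struct rcu_head rcu;
    };

    static struct config __rcu *cur_config;

    int read_value(void)
    {
        struct config *c;
        int v;

        rcu_read_lock();                  /* cheap: no atomics, no shared-cache-line writes */
        c = rcu_dereference(cur_config);
        v = c ? c->value : -1;
        rcu_read_unlock();
        return v;
    }

    void update_value(int v)              /* callers must serialize updates themselves */
    {
        struct config *nc = kmalloc(sizeof(*nc), GFP_KERNEL);
        struct config *old;

        if (!nc)
            return;
        nc->value = v;
        old = rcu_dereference_protected(cur_config, 1);
        rcu_assign_pointer(cur_config, nc);
        if (old)
            kfree_rcu(old, rcu);          /* freed only after all current readers finish */
    }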
Posted Sep 20, 2025 21:31 UTC (Sat)
by stephen.pollei (subscriber, #125364)
[Link] (1 responses)
I don't know enough to have an opinion on how the Linux kernel was able to scale as successfully as it has. There were certainly doubts in the past. If I recall correctly, RCU was being used in other kernels before it was introduced in Linux, but I don't recall which ones.
Posted Sep 21, 2025 6:04 UTC (Sun)
by willy (subscriber, #9762)
[Link]
Posted Sep 21, 2025 10:56 UTC (Sun)
by kazer (subscriber, #134462)
[Link] (1 responses)
Virtualization is an abstraction of the hardware.
A better term for a multi-kernel system would be *partition* (the term has already been used in the mainframe world). In a multi-kernel design, the kernel would still see the whole hardware as it is (not an abstraction), but it would be limited to a subset of its capabilities (a partition).
Linux already has various capabilities to limit certain tasks to run on certain CPUs, so this would be taking that approach further, not adding abstractions.
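For contrast, the existing mechanisms referred to here confine tasks, not kernels; the standard affinity API (unrelated to the multikernel patches) looks like this:

    /* Pin the calling process to CPUs 2 and 3 with the standard Linux API. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(2, &mask);
        CPU_SET(3, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* This process is now confined to CPUs 2 and 3; the multikernel
         * proposal pushes the same kind of boundary down to whole kernels. */
        return 0;
    }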
Posted Sep 21, 2025 11:40 UTC (Sun)
by quotemstr (subscriber, #45331)
[Link]
So VMMs doing PCIe pass-through aren't doing virtualization?
Anyway, the terminology difference is immaterial. In the purest view of virtualization, a guest shouldn't be aware that it's virtualized or that other guests exist. In the purest view of a partition, the whole system is built around multi-instance data structures. In reality, the virtualization is leaky, and deliberately so because the leaks are useful. Likewise, in a partition setup, especially one grafted into an existing system, at some point you arrange data structures such that code running on one partition "thinks" it owns a system --- there's your abstraction.
Besides: lots of people arrange VMs and assign resources such that the net effect ends up being a partition anyway. The multikernel work might be a way to achieve practically the same configuration with more performance and less isolation.
My point is that it would be nice to manage configurations like this using the existing suite of virtualization tools. Even if multikernel is not virtualization under some purist definition of the word, it's close enough, practically speaking, that virtualization tools can be made to work well enough that the configuration stacks can be unified and people don't have to learn a new thing.
Posted Sep 22, 2025 9:46 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
It's certainly an interesting turn of the wheel; one of the selling points of NUMA over clusters back in the 1990s was that a cluster required you to work out what needed to be communicated between partitions of your problem, and pass messages, while a NUMA cluster let any CPU read any data anywhere in the system.
NUMA systems could thus be treated as just a special case of clusters (instead of running an instance per system, passing messages over the network, run an instance per NUMA node, bound to the node, passing messages over shared memory channels), but they benefited hugely for problems where you'd normally stick to your instance's data, but could need to get at data from anywhere to solve the problem, since that was now just "normal" reads instead of message passing.
I'd be interested to see what the final intent behind this work is - is it better RAS (since you can upgrade the kernel NUMA node by NUMA node), is it about sharing a big machine among smaller users (like containers or virtualization, but with different costs), or is it about giving people an incentive to write their programs in terms of "one instance per NUMA node" again?
Posted Sep 22, 2025 10:05 UTC (Mon)
by paulj (subscriber, #341)
[Link]
Similar stuff before has been called "Logical Partitions" (LPARs) by IBM, and "Logical Domains" (LDOMs) by Sun Microsystems (the sun4v stuff introduced in UltraSPARC T1 Niagara).
Posted Sep 19, 2025 23:01 UTC (Fri)
by linusw (subscriber, #40300)
[Link]
It seems to more or less require an in-kernel, inter-kernel IPC mechanism; with kdbus and BUS1 having stalled, here is a new reason to have something like that, because these kernels and their userspaces will really need to talk to each other in structured ways.
In, say, a desktop scenario, the messaging mechanism (if that is, say, systemd over D-Bus) is going to coordinate bringing up not just processes but entire kernels with processes in them.
Posted Sep 20, 2025 1:44 UTC (Sat)
by roc (subscriber, #30627)
[Link] (2 responses)
However, as a way to do host kernel upgrades without interrupting guest VMs, it could definitely be useful.
Posted Sep 21, 2025 9:23 UTC (Sun)
by cyperpunks (subscriber, #39406)
[Link] (1 responses)
Posted Sep 21, 2025 20:10 UTC (Sun)
by Lionel_Debroux (subscriber, #30014)
[Link]
Posted Sep 20, 2025 12:26 UTC (Sat)
by meyert (subscriber, #32097)
[Link] (1 responses)
Posted Sep 20, 2025 17:34 UTC (Sat)
by willy (subscriber, #9762)
[Link]
Posted Sep 20, 2025 14:42 UTC (Sat)
by brchrisman (subscriber, #71769)
[Link] (1 responses)
Bind them together in an old school MOSIX style single-system-image cluster?
Or SR-IOV devices/functions, one per kernel instance?
Posted Sep 20, 2025 18:38 UTC (Sat)
by Lennie (subscriber, #49641)
[Link]
Posted Sep 21, 2025 7:19 UTC (Sun)
by dale.hagglund (subscriber, #48536)
[Link] (2 responses)
Anyway, this new multi-kernel work could be used in many different and useful ways, as others have already noted, but it's always interesting to see how essentially every "new" idea has antecedents in the past.
Posted Sep 21, 2025 23:13 UTC (Sun)
by marcH (subscriber, #57642)
[Link] (1 responses)
There is a gazillion different potential reasons for that: the solution was in search of a problem, it was too expensive, it was not mature yet, it broke backwards compatibility too much, it was mature and successful for a while but displaced by less convenient but much cheaper commodity solutions, etc.
1% inspiration, 99% perspiration. The lone inventor and their eureka moment is probably the least common case, but it makes for the best stories to read or watch, and those stories massively skew our perception. Our tribal brain is hardwired for silver bullets and miracles and "allergic" to slow, global, real-world evolutions. Not just for science and technology; it's the same for economics, war, sociology, climate, etc.
Posted Sep 22, 2025 22:10 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
It wasn't interesting to Universities? (So students never knew about it.)
Cheers,
Wol
Posted Sep 22, 2025 1:10 UTC (Mon)
by SLi (subscriber, #53131)
[Link] (2 responses)
Having said that, I started to wonder. Would it still be possible, and would it make enough sense, to have some kind of a shared memory mechanism between userspace processes running on the different kernels? I don't think it can look like POSIX, but something stripped down.
What I'm basically thinking of: Multikernel gives us some benefits while arguably sacrificing other things as less important. Could we meaningfully claw back some of those lost things where it makes sense?
Posted Sep 22, 2025 3:28 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
You can do this with VMs today. Why would you use this new thing instead of the nice mature virtualization stack?
Posted Sep 22, 2025 8:09 UTC (Mon)
by matthias (subscriber, #94967)
[Link]
Posted Sep 22, 2025 17:55 UTC (Mon)
by marcH (subscriber, #57642)
[Link] (1 responses)
Maybe this could become a standard for communicating with firmwares too, so drivers don't have to keep re-inventing this wheel?
It's quite different because it's heterogeneous (both at the HW and SW levels) but most systems are _already_ "multi-kernels" when you think about it!
Posted Sep 23, 2025 15:04 UTC (Tue)
by linusw (subscriber, #40300)
[Link]
Posted Sep 26, 2025 12:39 UTC (Fri)
by jsakkine (subscriber, #80603)
[Link]
I'd like to point out that, personally, my viewpoint does not come from any opinionated standpoint. Using any form of code generation to get some stuff going is absolutely fine, as far as I'm concerned. The thing is, however, that it is exactly *placeholder code*, and to my eyes the patch set looks like placeholder/stub code as a feature.
Obviously I don't know the facts, this is just a guess, and I absolutely don't enjoy making claims like this, and I hope that I have completely misunderstood the topic.