Multiple kernels on a single system
The Linux kernel generally wants to be in charge of the system as a whole; it runs on all of the available CPUs and controls access to them globally. Cong Wang has just come forward with a different approach: allowing each CPU to run its own kernel. The patch set is in an early form, but it gives a hint for what might be possible.
The patch set as a whole only touches 1,400 lines of code, adding a few basic features; there would clearly need to be a lot more work done to make this feature useful. The first part is a new KEXEC_MULTIKERNEL flag to the kexec_load() system call, requesting that a new kernel be booted on a specific CPU. That CPU must be in the offline state when the call is made, or the call will fail with an EBUSY error. It would appear that it is only possible to assign a single CPU to any given kernel in this mode; the current interface lacks a way to specify more than one CPU. There is a bunch of x86-64 assembly magic to set up the target CPU for the new kernel and to boot it there.
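The offline-CPU precondition can be checked from user space before attempting such a call. The following is a minimal sketch, assuming the standard Linux CPU-hotplug sysfs file /sys/devices/system/cpu/offline; the helper names (parse_cpu_list, offline_cpus, cpu_is_offline) are hypothetical and are not part of the patch set. It deliberately does not call kexec_load(), since neither the value of the new flag nor the way the target CPU is passed is described here.

```python
from pathlib import Path

# Standard sysfs directory for CPU hotplug state on Linux.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def parse_cpu_list(text: str) -> set[int]:
    """Parse a kernel CPU list such as "0-3,8,10-11" into a set of CPU numbers."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means "no CPUs in this list"
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def offline_cpus(sysfs_cpu: Path = SYSFS_CPU) -> set[int]:
    """Return the CPUs that the running kernel currently reports as offline."""
    return parse_cpu_list((sysfs_cpu / "offline").read_text())


def cpu_is_offline(cpu: int, sysfs_cpu: Path = SYSFS_CPU) -> bool:
    """Hypothetical pre-flight check before asking for a kernel on `cpu`."""
    return cpu in offline_cpus(sysfs_cpu)
```

A CPU can normally be taken offline beforehand by writing 0 to /sys/devices/system/cpu/cpuN/online as root. Checking first only avoids an obvious failure; the kernel would still make the final decision and return EBUSY if the CPU is in use.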
The other significant piece is a new inter-kernel communication mechanism, based on inter-processor interrupts, that allows the kernels running on different CPUs to talk to each other. Shared memory areas are set aside for the efficient movement of data between the kernels. While the infrastructure is present in the patch set, there are no users of it in this series. A real-world system running in this mode would clearly need to use this communication infrastructure to implement a lot of coordination of resources to keep the kernels from stepping on each other, but that work has not been posted yet.
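To make the general pattern concrete, here is a minimal user-space sketch of a shared-memory message ring paired with a "doorbell" notification, which is how an interrupt-plus-shared-memory channel is commonly structured. Everything in it is an assumption for illustration: the layout, the slot size, and the names SharedRing and ring_doorbell are hypothetical and are not taken from the patch set. The doorbell callback merely stands in for an inter-processor interrupt, and a real implementation would also need memory barriers.

```python
import struct
from typing import Callable, Optional

SLOT_SIZE = 64                  # bytes per message slot (hypothetical)
HEADER = struct.Struct("<II")   # head (next slot to write), tail (next slot to read)
LEN = struct.Struct("<H")       # per-slot payload length prefix


class SharedRing:
    """Single-producer, single-consumer message ring stored in one buffer.

    The counters only ever increase; wraparound of the 32-bit counters is
    ignored in this sketch.
    """

    def __init__(self, buf: bytearray, slots: int):
        if len(buf) < HEADER.size + slots * SLOT_SIZE:
            raise ValueError("buffer too small for the requested slot count")
        self.buf = buf
        self.slots = slots

    def send(self, payload: bytes, ring_doorbell: Callable[[], None]) -> bool:
        """Copy one message into the ring, then notify the other side."""
        if len(payload) > SLOT_SIZE - LEN.size:
            raise ValueError("payload too large for one slot")
        head, tail = HEADER.unpack_from(self.buf, 0)
        if head - tail == self.slots:
            return False  # ring is full; the caller may retry later
        off = HEADER.size + (head % self.slots) * SLOT_SIZE
        LEN.pack_into(self.buf, off, len(payload))
        start = off + LEN.size
        self.buf[start:start + len(payload)] = payload
        # Publish the message only after its contents are in place.
        HEADER.pack_into(self.buf, 0, head + 1, tail)
        ring_doorbell()  # stand-in for sending an inter-processor interrupt
        return True

    def receive(self) -> Optional[bytes]:
        """Return the oldest pending message, or None if the ring is empty."""
        head, tail = HEADER.unpack_from(self.buf, 0)
        if head == tail:
            return None
        off = HEADER.size + (tail % self.slots) * SLOT_SIZE
        (length,) = LEN.unpack_from(self.buf, off)
        start = off + LEN.size
        data = bytes(self.buf[start:start + length])
        HEADER.pack_into(self.buf, 0, head, tail + 1)
        return data
```

In this sketch the sender would call ring.send(b"hello", notify) and the receiver would call ring.receive() from its notification handler. Keeping the data path in shared memory and using the interrupt only as a wake-up is what keeps the per-message cost low.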
The final patches in the series add a new /proc/multikernel file that can be used to monitor the state of the various kernels running in the system.
Why would one want to do this? In the cover letter, Wang mentions a few advantages, including improved fault isolation and security, better efficiency than virtualization, and the ease of zero-downtime updates in conjunction with the kexec handover mechanism. He also mentions the ability to run special-purpose kernels (such as a realtime kernel) for specific workloads.
The work that has been posted is clearly just the beginning:
> This patch series represents only the foundational framework for multikernel support. It establishes the basic infrastructure and communication mechanisms. We welcome the community to build upon this foundation and develop their own solutions based on this framework.
The new files in the series carry copyright notices for Multikernel Technologies Inc, which, seemingly, is also developing its own solutions based on this code. In other words, this looks like more than a hobby project; it will be interesting to see where it goes from here. Perhaps this relatively old idea (Larry McVoy was proposing "cache-coherent clusters" for Linux at least as far back as 2002) will finally come to fruition.
Posted Sep 19, 2025 21:12 UTC (Fri)
by tullmann (subscriber, #20149)
In a similar space, the Barrelfish OS (https://barrelfish.org/) was a "multikernel" research operating system in the 2010s that was built around separate "kernels" running on each CPU. But they worked to make the communication between cores work smoothly enough to present a single logical system to applications running on it.
Posted Sep 20, 2025 21:57 UTC (Sat)
by chexo4 (subscriber, #169500)
Posted Sep 19, 2025 21:25 UTC (Fri)
by geofft (subscriber, #59789)
Posted Sep 19, 2025 22:23 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
There was a project to do the inverse, run Linux on Windows NT. It did not isolate individual CPUs, but otherwise it was a similar idea.
Its website is still up: http://www.colinux.org/
Posted Sep 20, 2025 13:43 UTC (Sat)
by jannex (subscriber, #43525)
Apple's M2 and later support nested virtualization. The issue is probably that arm64 nested virtualization support in Linux itself is rather fresh. It was just merged in 6.16 (I haven't tried it yet).
This approach will be hard or impossible to support on Apple silicon systems. There is no easy way to support PSCI so taking CPU cores offline is currently not supported.
Posted Sep 20, 2025 18:25 UTC (Sat)
by geofft (subscriber, #59789)
I also get
Posted Sep 21, 2025 4:37 UTC (Sun)
by kazer (subscriber, #134462)
That second "foreign" kernel would need to understand the "partition" it is allowed to use so it won't try to take over the rest of the machine, where another kernel may be running. Unless there is a way to make the hardware understand where that other one is allowed to run (basically selectively removing supervisor rights from the foreign kernel).
So I can only see that happening if the second kernel understands multikernel situations correctly as well. Otherwise it is back to hypervisor virtualization.
> old kernel
Sorry, but for the reasons mentioned above (supervisor access to hardware) that old kernel would need to be multikernel compliant as well. Otherwise you need a plain old hypervisor for virtualization.
Posted Sep 19, 2025 21:30 UTC (Fri)
by quotemstr (subscriber, #45331)
It would be neat to be able to use industry-standard interfaces like libvirt to work with these guest kernels.
> better efficiency than virtualization,
Like I said, I'd consider this cool thing a *kind* of virtualization, one that trades flexibility for performance, not something *distinct* from virtualization.
Posted Sep 20, 2025 0:45 UTC (Sat)
by stephen.pollei (subscriber, #125364)
Posted Sep 20, 2025 15:29 UTC (Sat)
by ballombe (subscriber, #9523)
Posted Sep 20, 2025 17:10 UTC (Sat)
by quotemstr (subscriber, #45331)
Posted Sep 20, 2025 20:22 UTC (Sat)
by roc (subscriber, #30627)
If the Linux kernel had been unable to scale well beyond 16 cores, then this cluster idea might have been a viable path forward. But Linux did, and any potential competitor that doesn't is simply not viable for these workloads.
Posted Sep 21, 2025 8:07 UTC (Sun)
by quotemstr (subscriber, #45331)
Yes, and those programs can keep running. Suppose I'm developing a brand-new system and a cluster on which to run it. My workload is bigger than any single machine no matter how beefy, so I'm going to have to distribute it *anyway*, with all the concomitant complexity. If I can carve up my cluster such that each NUMA domain is a "machine", I can reuse my inter-box work distribution stuff for intra-box distribution too.
Not every workload is like this, but some are, and life can be simpler this way.
Posted Sep 21, 2025 9:17 UTC (Sun)
by ballombe (subscriber, #9523)
But more seriously, when using message passing, you still want to share your working set across all cores in the same node to conserve memory.
Posted Sep 21, 2025 10:15 UTC (Sun)
by willy (subscriber, #9762)
Posted Sep 20, 2025 18:59 UTC (Sat)
by willy (subscriber, #9762)
But Solaris didn't have RCU. I would argue that RCU has enabled Linux to scale further than Solaris without falling off "the locking cliff". We also have lockdep to prevent us from creating deadlocks (I believe Solaris eventually had an equivalent, but that was after Larry left Sun). Linux also distinguishes between spinlocks and mutexes, while I believe Solaris only has spinaphores. Whether that's terribly helpful or not for scaling, I'm not sure.
Posted Sep 20, 2025 21:31 UTC (Sat)
by stephen.pollei (subscriber, #125364)
I don't know enough to have an opinion on how the Linux kernel was able to scale as successfully as it has. There were certainly doubts in the past. If I recall correctly, RCU was being used in other kernels before it was introduced in Linux, but I don't recall which ones.
Posted Sep 21, 2025 6:04 UTC (Sun)
by willy (subscriber, #9762)
Posted Sep 19, 2025 23:01 UTC (Fri)
by linusw (subscriber, #40300)
It seems to more or less require an in-kernel and intra-kernel IPC mechanism, so with kdbus and BUS1 having stalled, here is a new reason to have something like that, because these kernels and userspaces will really need to talk to each other in structured ways.
In, say, a desktop scenario, the messaging mechanism (say, systemd on D-Bus) is going to coordinate bringing up not just processes but entire kernels with processes in them.
Posted Sep 20, 2025 1:44 UTC (Sat)
by roc (subscriber, #30627)
However, as a way to do host kernel upgrades without interrupting guest VMs, it could definitely be useful.
Posted Sep 21, 2025 9:23 UTC (Sun)
by cyperpunks (subscriber, #39406)
Posted Sep 20, 2025 12:26 UTC (Sat)
by meyert (subscriber, #32097)
Posted Sep 20, 2025 17:34 UTC (Sat)
by willy (subscriber, #9762)
Posted Sep 20, 2025 14:42 UTC (Sat)
by brchrisman (subscriber, #71769)
Posted Sep 20, 2025 18:38 UTC (Sat)
by Lennie (subscriber, #49641)
Posted Sep 21, 2025 7:19 UTC (Sun)
by dale.hagglund (subscriber, #48536)
Anyway, this new multi-kernel work could be used in many different and useful ways, as others have already noted, but it's always interesting to see how essentially every "new" idea has antecedents in the past.
Some precedent for this in VMware's ESX kernel (version 5.0 and earlier)
I can imagine a whole lot of interesting use cases for this. Off the top of my head:
Lots of use cases
For whatever reason Apple only enables it with the M3 chip and later, as documented for the high-level Virtualization.framework's VZGenericPlatformConfiguration.isNestedVirtualizationSupported. I also get false from the lower-level Hypervisor.framework's hv_vm_config_get_el2_supported() on my machine.
Neat: but isn't this a type-1 hypervisor?
<https://en.wikipedia.org/wiki/Single_system_image>
... or HPE will sell you NUMAlink systems with coherent memory across 32 sockets.
Replacing a 128-core system with eight 16-core systems will require eight copies of the working set.
Message passing OS
Limited isolation
memory and devices
Shared access?
Bind them together in an old school MOSIX style single-system-image cluster?
Or SR-IOV devices/functions, one per kernel instance?
interesting similarities to "hardware partitioning" of IBM mainframes