
Multiple kernels on a single system


By Jonathan Corbet
September 19, 2025
The Linux kernel generally wants to be in charge of the system as a whole; it runs on all of the available CPUs and controls access to them globally. Cong Wang has just come forward with a different approach: allowing each CPU to run its own kernel. The patch set is in an early form, but it gives a hint of what might be possible.

The patch set as a whole only touches 1,400 lines of code, adding a few basic features; there would clearly need to be a lot more work done to make this feature useful. The first part is a new KEXEC_MULTIKERNEL flag to the kexec_load() system call, requesting that a new kernel be booted on a specific CPU. That CPU must be in the offline state when the call is made, or the call will fail with an EBUSY error. It would appear that it is only possible to assign a single CPU to any given kernel in this mode; the current interface lacks a way to specify more than one CPU. There is a bunch of x86-64 assembly magic to set up the target CPU for the new kernel and to boot it there.
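
As a rough illustration only: a user-space loader for this mode might look something like the sketch below. The KEXEC_MULTIKERNEL flag comes from the patch set, but its numeric value and the way the target CPU is actually named are not described here, so those details are placeholders; the sysfs write is just the usual way of taking a CPU offline.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/kexec.h>

    /* Placeholder only: the real value is defined by the patch set. */
    #ifndef KEXEC_MULTIKERNEL
    #define KEXEC_MULTIKERNEL 0x00000010
    #endif

    /* A CPU must be offline before a new kernel can be booted on it. */
    static int offline_cpu(int cpu)
    {
        char path[64];
        int fd, ret;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/online", cpu);
        fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        ret = (write(fd, "0", 1) == 1) ? 0 : -1;
        close(fd);
        return ret;
    }

    int main(void)
    {
        /* The new kernel's image would be loaded into this segment. */
        struct kexec_segment seg = { 0 };

        if (offline_cpu(3))
            perror("offlining CPU 3");

        /* kexec_load() has no glibc wrapper, so invoke it directly;
           EBUSY comes back if the target CPU is not offline. */
        if (syscall(SYS_kexec_load, 0UL, 1UL, &seg,
                    KEXEC_ARCH_DEFAULT | KEXEC_MULTIKERNEL) < 0)
            perror("kexec_load");
        return 0;
    }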

The other significant piece is a new inter-kernel communication mechanism, based on inter-processor interrupts, that allows the kernels running on different CPUs to talk to each other. Shared memory areas are set aside for the efficient movement of data between the kernels. While the infrastructure is present in the patch set, there are no users of it in this series. A real-world system running in this mode would clearly need to use this communication infrastructure to implement a lot of coordination of resources to keep the kernels from stepping on each other, but that work has not been posted yet.
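
The posting does not show the data structures involved, but one can imagine the sort of single-producer, single-consumer mailbox that might live in one of those shared-memory regions. Everything in this sketch, including the names, is illustrative rather than taken from the patches:

    #include <stdatomic.h>
    #include <stdint.h>

    #define MK_RING_SLOTS   64
    #define MK_MSG_BYTES    240

    struct mk_msg {
        uint32_t type;
        uint32_t len;
        uint8_t  data[MK_MSG_BYTES];
    };

    struct mk_ring {
        _Atomic uint32_t head;          /* written by the sending kernel */
        _Atomic uint32_t tail;          /* written by the receiving kernel */
        struct mk_msg slots[MK_RING_SLOTS];
    };

    /* Runs on the sending side; returns 0 on success, -1 if the ring is full. */
    static int mk_send(struct mk_ring *ring, const struct mk_msg *msg)
    {
        uint32_t head = atomic_load_explicit(&ring->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_acquire);

        if (head - tail == MK_RING_SLOTS)
            return -1;                  /* no free slot */
        ring->slots[head % MK_RING_SLOTS] = *msg;
        /* Publish the slot before making it visible to the other kernel. */
        atomic_store_explicit(&ring->head, head + 1, memory_order_release);
        /* An IPI to the target CPU would serve as the "doorbell" here. */
        return 0;
    }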

The final patches in the series add a new /proc/multikernel file that can be used to monitor the state of the various kernels running in the system.
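
The format of that file is not described in the posting; a trivial monitor would simply dump whatever the kernel exposes there:

    #include <stdio.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/multikernel", "r");

        if (!f) {
            perror("/proc/multikernel");
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }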

Why would one want to do this? In the cover letter, Wang mentions a few advantages, including improved fault isolation and security, better efficiency than virtualization, and the ease of zero-downtime updates in conjunction with the kexec handover mechanism. He also mentions the ability to run special-purpose kernels (such as a realtime kernel) for specific workloads.

The work that has been posted is clearly just the beginning:

This patch series represents only the foundational framework for multikernel support. It establishes the basic infrastructure and communication mechanisms. We welcome the community to build upon this foundation and develop their own solutions based on this framework.

The new files in the series carry copyright notices for Multikernel Technologies Inc, which, seemingly, is also developing its own solutions based on this code. In other words, this looks like more than a hobby project; it will be interesting to see where it goes from here. Perhaps this relatively old idea (Larry McVoy was proposing "cache-coherent clusters" for Linux at least as far back as 2002) will finally come to fruition.

Index entries for this article
Kernel: Multi-kernel




Some precedent for this in VMware's ESX kernel (version 5.0 and earlier)

Posted Sep 19, 2025 21:12 UTC (Fri) by tullmann (subscriber, #20149) [Link] (1 responses)

In the initial versions of VMware's ESX servers (up through version 5.0), a Linux kernel would boot with (standard) command-line options restricting it to CPU 0, the first chunk of physical RAM, and a subset of the PCI devices. A subsequent loader would load the VMware hypervisor and it would manage the remaining memory, CPUs, and the few PCI devices it understood. The hard-partitioning of hardware worked surprisingly well and two very different kernels could exist in tandem on a single machine.

In a similar space, the Barrelfish OS (https://barrelfish.org/) was a "multikernel" research operating system in the 2010s that was built around separate "kernels" running on each CPU. But they worked to make communication between cores smooth enough to present a single logical system to the applications running on it.

Some precedent for this in VMware's ESX kernel (version 5.0 and earlier)

Posted Sep 20, 2025 21:57 UTC (Sat) by chexo4 (subscriber, #169500) [Link]

IIRC this is how multi-core systems under the seL4 microkernel work. At least in some configurations. Something about it being simpler to implement probably.

Lots of use cases

Posted Sep 19, 2025 21:25 UTC (Fri) by geofft (subscriber, #59789) [Link] (4 responses)

I can imagine a whole lot of interesting use cases for this. Off the top of my head:
  • You can run a hard realtime kernel and a more interactive-suitable kernel at the same time, doing e.g. audio processing on the realtime side and a normal desktop on the other side.
  • I wonder if you can extend this to run two different kernels, most attractively NT for the crowd who wants to do Windows gaming but otherwise have a Linux desktop, where the current best answer is something like VFIO. Here you could potentially offline all but one CPU as well as the GPU, then boot up Windows on that hardware, keeping the "host" machine accessible over a virtual network or something.
  • I have an M2 MacBook which supports virtualization but not nested virtualization, meaning I can't use qemu-kvm inside my Linux VM. This might let me get a user experience that is effectively like having the ability to launch VMs from inside Linux. I don't actually need security isolation between those VMs.
  • Sometimes you run into applications that for whatever reason need a specific old kernel. If you're running these applications under containerization (e.g. Kubernetes), this holds back either your entire container fleet or some subset of them. With multikernel you can treat the desired kernel version as a property of the container, provided you're willing to dedicate an integer number of cores to the container (which is a good idea for its own sake), and avoid the overhead of actual virtualization. Containers have their own network identity etc. anyway so it's probably doable to map that model onto this.

Lots of use cases

Posted Sep 19, 2025 22:23 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> I wonder if you can extend this to run two different kernels, most attractively NT for the crowd who wants to do Windows

There was a project to do the inverse, run Linux on Windows NT. It did not isolate individual CPUs, but otherwise it was a similar idea.

Its website is still up: http://www.colinux.org/

Lots of use cases

Posted Sep 20, 2025 13:43 UTC (Sat) by jannex (subscriber, #43525) [Link] (1 responses)

> I have an M2 MacBook which supports virtualization but not nested virtualization, meaning I can't use qemu-kvm inside my Linux VM

Apple's M2 and later support nested virtualization. The issue is probably that arm64 nested virtualization support in Linux itself is rather fresh. It was just merged in 6.16 (I haven't tried it yet).

This approach will be hard or impossible to support on Apple silicon systems. There is no easy way to support PSCI so taking CPU cores offline is currently not supported.

Lots of use cases

Posted Sep 20, 2025 18:25 UTC (Sat) by geofft (subscriber, #59789) [Link]

For whatever reason Apple only enables it with the M3 chip and later, as documented for the high-level Virtualization.framework's VZGenericPlatformConfiguration.isNestedVirtualizationSupported.

I also get false from the lower-level Hypervisor.framework's hv_vm_config_get_el2_supported() on my machine.
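
For the curious, a check along these lines might look like the sketch below. The signature of hv_vm_config_get_el2_supported() is written from memory here, so treat it as an assumption and consult Apple's headers before relying on it.

    #include <stdbool.h>
    #include <stdio.h>
    #include <Hypervisor/Hypervisor.h>

    int main(void)
    {
        bool el2 = false;

        /* Ask the hypervisor whether EL2 (nested virtualization) is available. */
        if (hv_vm_config_get_el2_supported(&el2) != HV_SUCCESS) {
            fprintf(stderr, "query failed\n");
            return 1;
        }
        printf("EL2 (nested virtualization) supported: %s\n",
               el2 ? "yes" : "no");
        return 0;
    }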

Lots of use cases

Posted Sep 21, 2025 4:37 UTC (Sun) by kazer (subscriber, #134462) [Link]

> two different kernels

That second "foreign" kernel would need to understand the "partition" it is allowed to use, so it won't try to take over the rest of the machine, where another kernel may be running. Unless there is a way to make the hardware understand where that other kernel is allowed to run (basically, selectively removing supervisor rights from the foreign kernel).

So I can only see that happening if the second kernel understands multikernel situations correctly as well. Otherwise it is back to hypervisor virtualization.

> old kernel

Sorry, but for the reasons mentioned above (supervisor access to hardware) that old kernel would need to be multikernel compliant as well. Otherwise you need a plain old hypervisor for virtualization.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 19, 2025 21:30 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (10 responses)

I've been wondering for a long time when we're going to start thinking of big heavily-NUMA oriented machines as clusters with fast interconnects instead of as single machines. This work is a step in that direction. That said, on a technical level, it seems to *amount* to a Xen-like hypervisor that just happens to pin guests to cores. A multi-kernel system, like any other hypervisor, has to worry about guest isolation, memory sharing, and device arbitration. Also, it's not clear from the article whether this system enforces guest memory isolation or whether the different guests (i.e. multikernel instances) just cooperatively agree not to stomp on each other.

It would be neat to be able to use industry-standard interfaces like libvirt to work with these guest kernels.

> better efficiency than virtualization,

Like I said, I'd consider this cool thing a *kind* of virtualization, one that trades flexibility for performance, not something *distinct* from virtualization.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 0:45 UTC (Sat) by stephen.pollei (subscriber, #125364) [Link] (9 responses)

I can't seem to find a good source to cite, but I think it was Larry McVoy who thought something very similar: that about 16 CPUs/cores was a good limit, and beyond that you should run multiple independent kernels with fast message passing between them. My memory could be faulty.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 15:29 UTC (Sat) by ballombe (subscriber, #9523) [Link] (5 responses)

This seems to preclude workloads that spawn more than 16 threads.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 17:10 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (4 responses)

No it doesn't. You can have more threads than cores. If you mean that you can't get more than 16-way parallelism this way using threads: that's a feature, not a bug. Use a cross-machine distribution mechanism (e.g. dask) and spread work across an arbitrarily large number of cores on an arbitrarily large number of machines.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 20:22 UTC (Sat) by roc (subscriber, #30627) [Link] (3 responses)

There are plenty of programs that work perfectly well with (e.g.) 200 threads on 200 cores, on hardware that exists today. Asking people to rewrite them to introduce a message-passing layer to get them to scale on your hypothetical cluster is a non-starter. Definitely a bug, not a feature.

If the Linux kernel had been unable to scale well beyond 16 cores, then this cluster idea might have been a viable path forward. But Linux did scale, and any potential competitor that doesn't is simply not viable for these workloads.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 8:07 UTC (Sun) by quotemstr (subscriber, #45331) [Link] (2 responses)

> There are plenty of programs that work perfectly well with (e.g.) 200 threads on 200 cores, on hardware that exists today. Asking people to rewrite them to introduce a message-passing layer to get them to scale on your hypothetical cluster is a non-starter. Definitely a bug, not a feature.

Yes, and those programs can keep running. Suppose I'm developing a brand-new system and a cluster on which to run it. My workload is bigger than any single machine no matter how beefy, so I'm going to have to distribute it *anyway*, with all the concomitant complexity. If I can carve up my cluster such that each NUMA domain is a "machine", I can reuse my inter-box work distribution stuff for intra-box distribution too.

Not every workload is like this, but some are, and life can be simpler this way.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 9:17 UTC (Sun) by ballombe (subscriber, #9523) [Link] (1 responses)

...or you can run an SSI OS that moves the complexity to the OS, where it belongs.
<https://en.wikipedia.org/wiki/Single_system_image>
... or HPE will sell you NUMAlink systems with coherent memory across 32 sockets.

But more seriously, when using message passing, you still want to share your working set across all cores on the same node to conserve memory.
Replacing a 128-core system with eight 16-core systems will require eight copies of the working set.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 10:15 UTC (Sun) by willy (subscriber, #9762) [Link]

Well, there are two schools of thought on that. Some say that NUMA hops are so slow and potentially congested (and therefore have high variability in their latency) that it's worth replicating read-only parts of the working set across nodes. They even have numbers that prove their point. I haven't dug into it enough to know if I believe that these numbers are typical or if they've chosen a particularly skewed example.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 18:59 UTC (Sat) by willy (subscriber, #9762) [Link] (2 responses)

You're right; Larry wanted a cluster of SMPs. Now, part of that was trying to avoid the locking-complexity cliff; he didn't want Solaris to turn into IRIX with "too many" locks (I'm paraphrasing his point of view; IRIX fanboys need not be upset with me).

But Solaris didn't have RCU. I would argue that RCU has enabled Linux to scale further than Solaris without falling off "the locking cliff". We also have lockdep to prevent us from creating deadlocks (I believe Solaris eventually had an equivalent, but that was after Larry left Sun). Linux also distinguishes between spinlocks and mutexes, while I believe Solaris only has spinaphores. Whether that's terribly helpful or not for scaling, I'm not sure.
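
For readers who have not run into RCU, a minimal kernel-style sketch of why it helps read-mostly scaling: readers take no locks at all, so adding CPUs adds no read-side contention. The structure and function names below are invented for illustration.

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct config {
        int value;
        struct rcu_head rcu;
    };

    static struct config __rcu *cur_config;

    /* Read side: no locks, no atomic writes, no cache-line bouncing. */
    int read_config_value(void)
    {
        struct config *c;
        int v;

        rcu_read_lock();
        c = rcu_dereference(cur_config);
        v = c ? c->value : -1;
        rcu_read_unlock();
        return v;
    }

    /* Update side: writers are assumed to serialize among themselves. */
    void update_config(struct config *newc)
    {
        struct config *old;

        old = rcu_dereference_protected(cur_config, 1);
        rcu_assign_pointer(cur_config, newc);
        if (old)
            kfree_rcu(old, rcu);   /* freed only after existing readers finish */
    }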

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 21:31 UTC (Sat) by stephen.pollei (subscriber, #125364) [Link] (1 responses)

I do seem to recall that it was for "locking complexity" reasons. If I recall correctly, around this time there was the BKL and relatively few other locks. Even with just a BKL, it could scale to 2 to 4 cores/CPUs on a lot of typical workloads, but there was too much contention for the kernel to scale effectively to even the 12-to-16-core range and beyond. Several people were of the opinion that Sun's Solaris and others had made their locks too fine-grained. For this reason, I think they tried to be very cautious in breaking up coarse-grained locks into finer-grained ones; they tried to require measurements on realistic workloads showing that a lock had contention or latency problems before accepting patches to break it up. They tried to avoid too much locking complexity and overhead.

I don't know enough to have an opinion on how the Linux kernel was able to scale as successfully as it has. There were certainly doubts in the past. If I recall correctly, RCU was being used in other kernels before it was introduced in Linux, but I don't recall which ones.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 6:04 UTC (Sun) by willy (subscriber, #9762) [Link]

RCU was invented at Sequent (who were bought by IBM) and used in their Dynix/ptx kernel.

Message passing OS

Posted Sep 19, 2025 23:01 UTC (Fri) by linusw (subscriber, #40300) [Link]

To me it seems essentially like a message-passing (mailboxing) operating system with several kernels, which was the basic idea in several operating systems of the past and the ambition behind things like CORBA or DCOM, just with CPUs separated by IPIs and passing shared memory on the same silicon instead of separate computers separated by a network and passing packets.

It seems to more or less require an in-kernel and intra-kernel IPC mechanism, so with kdbus and BUS1 having stalled, here is a new reason to have something like that: these kernels and their userspaces will really need to talk to each other in structured ways.

In, say, a desktop scenario, the messaging mechanism (say, systemd on D-Bus) is going to coordinate bringing up not just processes but entire kernels with processes in them.

Limited isolation

Posted Sep 20, 2025 1:44 UTC (Sat) by roc (subscriber, #30627) [Link] (1 responses)

The obvious downside here compared to running guest VMs is that there is no security boundary between the kernels. Corbet's summary here says "improved fault isolation and security", which is true when you compare this approach to running workloads on the same kernel, but not when you compare it to running workloads in separate guest VMs. Anyone who cares strongly about workload isolation is already using guest VMs, so they're unlikely to move to multikernels.

However, as a way to do host kernel upgrades without interrupting guest VMs, it could definitely be useful.

Limited isolation

Posted Sep 21, 2025 9:23 UTC (Sun) by cyperpunks (subscriber, #39406) [Link]

Would it be possible to mix a vanilla kernel and a GRSecurity kernel on the same system? Such a thing would indeed be very useful, imho.

memory and devices

Posted Sep 20, 2025 12:26 UTC (Sat) by meyert (subscriber, #32097) [Link] (1 responses)

How would memory and devices get managed in such a setup?

memory and devices

Posted Sep 20, 2025 17:34 UTC (Sat) by willy (subscriber, #9762) [Link]

You partition them. Assign various devices and memory to each kernel.

Shared access?

Posted Sep 20, 2025 14:42 UTC (Sat) by brchrisman (subscriber, #71769) [Link] (1 responses)

This is going to be entirely intra-node RDMA?
Bind them together in an old school MOSIX style single-system-image cluster?
Or SR-IOV devices/functions, one per kernel instance?

Shared access?

Posted Sep 20, 2025 18:38 UTC (Sat) by Lennie (subscriber, #49641) [Link]

The old https://en.wikipedia.org/wiki/OpenSSI code is still available too

interesting similarities to "hardware partitioning" of IBM mainframes

Posted Sep 21, 2025 7:19 UTC (Sun) by dale.hagglund (subscriber, #48536) [Link]

IBM mainframes (I won't say for sure about modern ones, but certainly the 370, 390, and compatible Amdahl systems I was aware of in the mid 80s at university) supported a feature where the hardware could be divided into "partitions", each of which could run a fully separate "real mode" OS instance. Again, I don't know this for sure, but I wouldn't be entirely surprised if there was some hardware help for controlling which CPUs, memory, devices, etc., could be discovered by the OS running in a particular partition. As I understand it, partitioning was commonly used for testing new releases of the OS and related software, to separate production from development and test, and no doubt for other reasons.

Anyway, this new multi-kernel work could be used in many different and useful ways, as others have already noted, but it's always interesting to see how essentially every "new" idea has antecedents in the past.

