
Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 0:45 UTC (Sat) by stephen.pollei (subscriber, #125364)
In reply to: Neat: but isn't this a type-1 hypervisor? by quotemstr
Parent article: Multiple kernels on a single system

I can't seem to find a good source to cite, but I think it was Larry McVoy who thought something very similar: that about 16 CPUs/cores was a good limit, and that beyond that you should run multiple independent kernels and just have fast message passing between them. My memory could be faulty.



Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 15:29 UTC (Sat) by ballombe (subscriber, #9523) (9 responses)

This seems to preclude workloads that spawn more than 16 threads.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 17:10 UTC (Sat) by quotemstr (subscriber, #45331) (8 responses)

No it doesn't. You can have more threads than cores. If you mean that you can't get more than 16-way parallelism this way using threads: that's a feature, not a bug. Use a cross-machine distribution mechanism (e.g. dask) and handle work across an arbitrarily large number of cores on an arbitrarily large number of machines.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 20:22 UTC (Sat) by roc (subscriber, #30627) (7 responses)

There are plenty of programs that work perfectly well with (e.g.) 200 threads on 200 cores, on hardware that exists today. Asking people to rewrite them to introduce a message-passing layer to get them to scale on your hypothetical cluster is a non-starter. Definitely a bug, not a feature.

If the Linux kernel had been unable to scale well beyond 16 cores, then this cluster idea might have been a viable path forward. But Linux did, and any potential competitor that doesn't is simply not viable for these workloads.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 8:07 UTC (Sun) by quotemstr (subscriber, #45331) (6 responses)

> There are plenty of programs that work perfectly well with (e.g.) 200 threads on 200 cores, on hardware that exists today. Asking people to rewrite them to introduce a message-passing layer to get them to scale on your hypothetical cluster is a non-starter. Definitely a bug, not a feature.

Yes, and those programs can keep running. Suppose I'm developing a brand-new system and a cluster on which to run it. My workload is bigger than any single machine no matter how beefy, so I'm going to have to distribute it *anyway*, with all the concomitant complexity. If I can carve up my cluster such that each NUMA domain is a "machine", I can reuse my inter-box work distribution stuff for intra-box distribution too.

Not every workload is like this, but some are, and life can be simpler this way.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 9:17 UTC (Sun) by ballombe (subscriber, #9523) (5 responses)

...or you can run an SSI OS that moves the complexity into the OS, where it belongs.
<https://en.wikipedia.org/wiki/Single_system_image>
...or HPE will sell you NUMAlink systems with coherent memory across 32 sockets.

But more seriously, when using message passing, you still want to share your working set across all the cores in the same node to conserve memory.
Replacing a 128-core system with 8 16-core systems will require 8 copies of the working set.
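The replication cost being described can be sketched as simple arithmetic; the figures below (a 64 GiB working set) are hypothetical, chosen only to illustrate the 128-core example above:

```python
def replicated_footprint(working_set_gib: float, total_cores: int,
                         cores_per_kernel: int) -> float:
    """Total memory needed when each independent kernel partition
    holds its own copy of the working set."""
    kernels = total_cores // cores_per_kernel
    return working_set_gib * kernels

# One 128-core kernel: a single shared copy -> 64 GiB.
single = replicated_footprint(64, 128, 128)
# Eight independent 16-core kernels: eight copies -> 512 GiB.
split = replicated_footprint(64, 128, 16)
```

The 8x multiplier is exactly the number of kernel partitions, which is the point of the comment.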

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 10:15 UTC (Sun) by willy (subscriber, #9762) (4 responses)

Well, there are two schools of thought on that. Some say that NUMA hops are so slow and potentially congested (and therefore have high variability in their latency) that it's worth replicating read-only parts of the working set across nodes. They even have numbers that prove their point. I haven't dug into it enough to know whether I believe those numbers are typical or whether they've chosen a particularly skewed example.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 12:42 UTC (Sun) by ballombe (subscriber, #9523) (3 responses)

This is correct. However, NUMA systems come with libraries that give you access to the physical layout, so you can copy the working set only once per coherent NUMA block, and those blocks are much larger than 16 cores nowadays.
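The libraries in question (libnuma on Linux, for instance) largely expose topology information the kernel already publishes under sysfs. As a minimal sketch, assuming a Linux system, the per-node CPU layout can be read directly; `parse_cpulist` handles the kernel's cpulist format (e.g. "0-15,32-47"):

```python
from pathlib import Path

def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as "0-15,32-47" into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def numa_topology() -> dict[int, list[int]]:
    """Map NUMA node id -> CPU ids, read from sysfs (Linux only)."""
    topo: dict[int, list[int]] = {}
    for node in Path("/sys/devices/system/node").glob("node[0-9]*"):
        node_id = int(node.name[len("node"):])
        topo[node_id] = parse_cpulist((node / "cpulist").read_text())
    return topo
```

With a map like this, an application can pin one copy of the working set per node rather than one per core.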

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 20:19 UTC (Sun) by willy (subscriber, #9762) (2 responses)

If those libraries already exist, why do people keep submitting patches to add this functionality to the kernel?

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 20:35 UTC (Sun) by quotemstr (subscriber, #45331) (1 responses)

Because the libraries have to have something to talk to? It's like asking why we add KVM syscalls when we have a kvm command line. Separate jobs.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 20:39 UTC (Sun) by willy (subscriber, #9762)

... no.

The patches are to do this automatically without library involvement. I think the latest round were called something awful like "Copy On NUMA".

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 18:59 UTC (Sat) by willy (subscriber, #9762) (2 responses)

You're right; Larry wanted a cluster of SMPs. Now, part of that was trying to avoid the locking complexity cliff; he didn't want Solaris to turn into IRIX with "too many" locks (I'm paraphrasing his point of view; IRIX fanboys need not be upset with me).

But Solaris didn't have RCU. I would argue that RCU has enabled Linux to scale further than Solaris without falling off "the locking cliff". We also have lockdep to prevent us from creating deadlocks (I believe Solaris eventually had an equivalent, but that was after Larry left Sun). Linux also distinguishes between spinlocks and mutexes, while I believe Solaris only has spinaphores. Whether that's terribly helpful or not for scaling, I'm not sure.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 21:31 UTC (Sat) by stephen.pollei (subscriber, #125364) (1 responses)

I do seem to recall that it was for "locking complexity" reasons. If I recall correctly, around this time there was the BKL and relatively few other locks. Even with just a BKL, the kernel could scale to 2 to 4 CPUs/cores for a lot of typical workloads, but there was too much contention for it to scale effectively into the 12-to-16-core range and beyond. Several people were of the opinion that Sun Solaris and others had made their locks too fine-grained. For this reason, I think the Linux developers tried to be very cautious in breaking up coarse-grained locks into finer-grained ones; they required measurements on realistic workloads showing that a lock had contention or latency issues before they accepted patches to break it up. They tried to avoid too much locking complexity and overhead.

I don't know enough to have an opinion on how the Linux kernel was able to scale as successfully as it has. There were certainly doubts in the past. If I recall correctly, RCU was being used in other kernels before it was introduced in Linux, but I don't recall which ones.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 6:04 UTC (Sun) by willy (subscriber, #9762)

RCU was invented at Sequent (who were bought by IBM) and used in their Dynix/ptx kernel.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds