Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 15:29 UTC (Sat) by ballombe (subscriber, #9523)
In reply to: Neat: but isn't this a type-1 hypervisor? by stephen.pollei
Parent article: Multiple kernels on a single system

This seems to preclude workloads that spawn more than 16 threads.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 17:10 UTC (Sat) by quotemstr (subscriber, #45331)

No, it doesn't. You can have more threads than cores. If you mean that you can't get more than 16-way parallelism this way using threads: that's a feature, not a bug. Use a cross-machine distribution mechanism (e.g. dask) and spread the work across an arbitrarily large number of cores on an arbitrarily large number of machines.
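
A minimal sketch of the "more threads than cores" point, in Python. The function name and the thread counts are made up for illustration. The barrier only releases once every thread is alive at the same time, so a successful run shows that the thread count is not capped by the core count (how much real parallelism you get is a separate question).

```python
import os
import threading

def run_oversubscribed(n_threads):
    """Start n_threads threads that all coexist; return how many distinct ones ran."""
    # The barrier releases only when all n_threads threads are waiting on it,
    # which means they all exist simultaneously regardless of the core count.
    barrier = threading.Barrier(n_threads)
    seen = []
    lock = threading.Lock()

    def worker():
        barrier.wait(timeout=30)
        with lock:
            seen.append(threading.get_ident())

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(set(seen))

if __name__ == "__main__":
    cores = os.cpu_count() or 1
    n = 4 * max(cores, 16)  # deliberately more threads than cores
    print(f"{cores} cores, {run_oversubscribed(n)} concurrent threads")
```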

Neat: but isn't this a type-1 hypervisor?

Posted Sep 20, 2025 20:22 UTC (Sat) by roc (subscriber, #30627)

There are plenty of programs that work perfectly well with (e.g.) 200 threads on 200 cores, on hardware that exists today. Asking people to rewrite them to introduce a message-passing layer to get them to scale on your hypothetical cluster is a non-starter. Definitely a bug, not a feature.

If the Linux kernel had been unable to scale well beyond 16 cores, then this cluster idea might have been a viable path forward. But Linux did, and any potential competitor that doesn't is simply not viable for these workloads.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 8:07 UTC (Sun) by quotemstr (subscriber, #45331)

> There are plenty of programs that work perfectly well with (e.g.) 200 threads on 200 cores, on hardware that exists today. Asking people to rewrite them to introduce a message-passing layer to get them to scale on your hypothetical cluster is a non-starter. Definitely a bug, not a feature.

Yes, and those programs can keep running. Suppose I'm developing a brand-new system and a cluster on which to run it. My workload is bigger than any single machine no matter how beefy, so I'm going to have to distribute it *anyway*, with all the concomitant complexity. If I can carve up my cluster such that each NUMA domain is a "machine", I can reuse my inter-box work distribution stuff for intra-box distribution too.

Not every workload is like this, but some are, and life can be simpler this way.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 9:17 UTC (Sun) by ballombe (subscriber, #9523)

...or you can run an SSI OS that moves the complexity into the OS, where it belongs.
<https://en.wikipedia.org/wiki/Single_system_image>
...or HPE will sell you NUMAlink systems with coherent memory across 32 sockets.

But more seriously, when using message passing, you still want to share your working set across all cores in the same node to save memory.
Replacing one 128-core system with eight 16-core systems will require 8 copies of the working set.
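
The arithmetic behind that, as a rough Python sketch. The 40 GiB working-set size is an invented number for illustration, and the function assumes every instance keeps its own full copy.

```python
def total_working_set_gib(working_set_gib, cores, cores_per_instance):
    """Memory spent on the working set if each instance holds a full private copy."""
    if cores % cores_per_instance:
        raise ValueError("cores must divide evenly into instances")
    instances = cores // cores_per_instance
    return instances * working_set_gib

if __name__ == "__main__":
    # One 128-core system sharing a single copy (hypothetical 40 GiB working set).
    print(total_working_set_gib(40, cores=128, cores_per_instance=128))
    # The same cores split into eight 16-core systems, one copy each.
    print(total_working_set_gib(40, cores=128, cores_per_instance=16))
```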

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 10:15 UTC (Sun) by willy (subscriber, #9762)

Well, there are two schools of thought on that. Some say that NUMA hops are so slow and potentially congested (and therefore have high variability in their latency) that it's worth replicating read-only parts of the working set across nodes. They even have numbers that prove their point. I haven't dug into it enough to know if I believe that these numbers are typical or if they've chosen a particularly skewed example.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 12:42 UTC (Sun) by ballombe (subscriber, #9523)

This is correct. However, NUMA systems come with libraries that give you access to the physical layout, so you can copy the working set only once per coherent NUMA block, and those blocks are much larger than 16 cores nowadays.
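
As a rough sketch of what "access to the physical layout" can look like without any particular library: on Linux the kernel exports per-node CPU lists under /sys/devices/system/node, and the Python below just parses those files. The function names are made up for illustration, and this is not how libnuma or hwloc do it internally; it only shows that the node-to-CPU mapping is available to userspace.

```python
import glob
import os

def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memoryless or CPU-less nodes have an empty list
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus

def numa_nodes(root="/sys/devices/system/node"):
    """Map NUMA node id -> list of CPU ids; empty dict if no node info is found."""
    nodes = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "cpulist")):
        # The directory is named "node<N>"; strip the prefix to get N.
        node_id = int(os.path.basename(os.path.dirname(path))[len("node"):])
        with open(path) as f:
            nodes[node_id] = parse_cpulist(f.read())
    return nodes

if __name__ == "__main__":
    # One copy of the read-only working set per entry here, not per 16 cores.
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node {node_id}: {len(cpus)} CPUs")
```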

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 20:19 UTC (Sun) by willy (subscriber, #9762)

If those libraries already exist, why do people keep submitting patches to add this functionality to the kernel?

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 20:35 UTC (Sun) by quotemstr (subscriber, #45331)

Because the libraries have to have something to talk to? It's like asking why we add KVM syscalls when we have the kvm command line. Separate jobs.

Neat: but isn't this a type-1 hypervisor?

Posted Sep 21, 2025 20:39 UTC (Sun) by willy (subscriber, #9762)

... no.

The patches are to do this automatically, without library involvement. I think the latest round was called something awful like "Copy On NUMA".