Hybrid scheduling gets more complicated

Posted Oct 1, 2022 3:57 UTC (Sat) by jhoblitt (subscriber, #77733)
In reply to: Hybrid scheduling gets more complicated by roc
Parent article: Hybrid scheduling gets more complicated

The use case is more instructions per watt. The choice would be between something like 64 p-cores or 256 e-cores
Where the later could retire twice as many instructions at the as thermal envelop (assuming memory bandwidth isn't a bottleneck). Intel hasn't offered such a beast, so I'm guessing the scaling isn't linear.

Hybrid scheduling gets more complicated

Posted Oct 1, 2022 5:44 UTC (Sat) by roc (subscriber, #30627) [Link] (8 responses)

Right, but you could just use a server with 256 e-cores and no p-cores and hybrid scheduling isn't an issue.

In theory there could be a non-uniform workload that performs really well with 64 p-cores and 256 e-cores *all sharing memory*, but that sounds relatively unusual.

Hybrid scheduling gets more complicated

Posted Oct 1, 2022 7:40 UTC (Sat) by Sesse (subscriber, #53779) [Link] (5 responses)

The question is, does Intel sell those? :-)

Hybrid scheduling gets more complicated

Posted Oct 1, 2022 12:41 UTC (Sat) by jhoblitt (subscriber, #77733) [Link] (1 responses)

No, the question is literally why doesn't Intel sell those.

Hybrid scheduling gets more complicated

Posted Oct 1, 2022 20:36 UTC (Sat) by flussence (guest, #85566) [Link]

They'd be competing unfavourably with existing 2P*128C ARM servers - and Zen4 Epyc when that comes out. Intel would not want to be seen putting out a high specs server CPU with no AVX512 at all while AMD's building chips that can do it within a sane power envelope.

Hybrid scheduling gets more complicated

Posted Oct 1, 2022 21:43 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

The closest available model to this was probably Tilera. They produced a chip for routers and other similar applications where you need to process many concurrent independent streams.

It went down like a brick. It was barely better than Xeons and not really any cheaper, while requiring a completely new architecture.

Hybrid scheduling gets more complicated

Posted Oct 3, 2022 9:24 UTC (Mon) by paulj (subscriber, #341) [Link] (1 responses)

Tilera was a bit weird and niche though, and orientated to network processing. Also buggy and unreliable - underdeveloped, presumably cause it never needed volume.

UltraSPARC T1 is probably more like it. A whole bunch of (relatively) classic, simple, in-order, 5-stage RISC SPARC CPUs, behind a thread-selector stage to rapidly switch threads and a huge memory controller. Low instructions-per-thread, but - if you loaded it up with an embarrassingly parallel IO workload (i.e. uncomplicated HTTP serving), very high aggregate IPC.

[In the initial versions, they neglected to add FPUs (just 1 FPU in the first T1, shared across all cores) and then - once away from Suns' in-house benchmarks and into the real-world - they realised that a lot of real-world applications depend on scripting languages that use FP by default, even for integer math. ;) ]

Hybrid scheduling gets more complicated

Posted Oct 3, 2022 11:15 UTC (Mon) by paulj (subscriber, #341) [Link]

s/never needed volume/never got the needed volume/
s/ SPARC/UltraSPARC/g

Hybrid scheduling gets more complicated

Posted Oct 1, 2022 11:31 UTC (Sat) by tlamp (subscriber, #108540) [Link] (1 responses)

> Right, but you could just use a server with 256 e-cores and no p-cores and hybrid scheduling isn't an issue.

For some server workloads it could be also interesting to have still both, like for example 128 E-cores and 32 P-cores.

For example, if I got an HTTP API with various endpoints, some of them may be quite compute-intensive while others not, a mixed setup may have a higher throughput at a lower total power consumption.
Tasks that, e.g., do lots of number crunching (or profit from advanced CPU core capabilities) can be moved to the P-cores while the relatively big amount of E-cores can serve parallel to that a massive amount of the "cheaper" responses using a low(er) amount of compute resources and potentially avoiding latency spikes compared to the homogeneous setups.

For very specialized workloads it may be better to handle such things explicitly in the application, or user space itself via pinning, but I saw quite some services where such different request work loads can happen, and it'd be nice if the OS scheduler would handle that better in a general way, even if not 100% optimal.

Hybrid scheduling gets more complicated

Posted Oct 2, 2022 8:40 UTC (Sun) by roc (subscriber, #30627) [Link]

I agree such a workload could exist, but it's not enough to just have multiple endpoints with these different properties, they also need to share memory. If the different endpoints don't need to share memory with each other, you can just run two or more separate servers. Plus, endpoints that *require* shared memory are going to have problems with horizontal scaling beyond one server. This use-case starts to sound very contrived.

Hybrid scheduling gets more complicated

Posted Oct 3, 2022 12:51 UTC (Mon) by rbanffy (guest, #103898) [Link] (3 responses)

> Intel hasn't offered such a beast, so I'm guessing the scaling isn't linear.

They did - it was called Xeon Phi and making it run at its theoretical peak performance was difficult - it was better suited for HPC workloads (it borrowed heavily from their Larrabee chip) than general-purpose high-throughput tasks such as web serving. I think it's a shame it ended up abandoned.

Hybrid scheduling gets more complicated

Posted Oct 3, 2022 19:13 UTC (Mon) by Paf (subscriber, #91811) [Link] (2 responses)

It really didn't work very well - The cores were very slow on traditional integer/control flow code, on the order of 1/4 of a similar age Xeon, and they were *fussy*. Lots of options to tweak memory/cache and scheduler behavior, but they only really worked well on a very small number of workloads, even in HPC. (Basically, think like Amdahl's law but substitute "integer and control flow code (like happens in the operating system code)" and "FP code" for non-parallelizable and parallelizable. They were painful to use well, and they really, really hit a wall when doing anything they weren't optimized for. There was definitely a useful space for them, but it wasn't large enough, at least at their level of execution.)

Hybrid scheduling gets more complicated

Posted Oct 3, 2022 20:45 UTC (Mon) by jhoblitt (subscriber, #77733) [Link] (1 responses)

I think that knights landing could have been a successful product if it had actually been possible for mere mortals to purchase them as a stand alone CPU.

Phi as an add on accelerator had all of the same problems as GPU development (E.g. network I/O), without mature (CUDA was a monopoly at the time) tooling, and without a cost / performance advantage over GPUs (or so I've heard). However, for workloads like batch HPC/HTC, the improved flops/watt could have been a attractive and it would have saved the labor investment required to deal with GPU development.

Hybrid scheduling gets more complicated

Posted Oct 3, 2022 23:46 UTC (Mon) by rbanffy (guest, #103898) [Link]

> could have been a successful product if it had actually been possible for mere mortals to purchase them as a stand alone CPU.

Such a shame. I really wanted one.

The first generation (it was 4 threads per core, I think, each running at a quarter speed) was very limited and extracting reasonable performance from it was not trivial. Subsequent generations were much better, but, by then, a lot of the heavy lifting was available on GPUs and the "friendly" programming of the Phi was no longer a huge advantage (if you can hide the GPU behind a library, it's not significantly uglier than using intrinsics to directly play with AVX-512).

Again, a huge shame. I hope Intel makes something with lots of E cores, even if for no other reason than to teach programmers to make more parallel code, because clock x IPCs won't get higher as quickly as core counts.