LWN: Comments on "Hybrid scheduling gets more complicated" https://lwn.net/Articles/909611/ This is a special feed containing comments posted to the individual LWN article titled "Hybrid scheduling gets more complicated". en-us Fri, 17 Oct 2025 01:05:40 +0000 Fri, 17 Oct 2025 01:05:40 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Hybrid scheduling gets more complicated https://lwn.net/Articles/936081/ https://lwn.net/Articles/936081/ strom1 <div class="FormattedComment"> The E-Cores are designed to offload mundane, non-CPU-intensive task handling. Their speed and computational efficiency make them excellent candidates for: IRQ handlers, polling events for drivers, audio, window movement along with most windowing tasks, game main loops and render loops (dispatch and collection); any thread that can complete its task within a millisecond or two and is not dependent on memory timing delays.<br> <p> Many, if not most, user applications are single-threaded, so dedicating them to a P-Core by default makes sense.<br> <p> It would make sense to change the affinity to an E-core when the application is not in window focus, unless it has been niced.<br> <p> There has been a mention of Phi and Knights Landing; the main issue is memory access speed/delay. This is really a local cache size issue, or more importantly a question of where the instructions/data reside. Unless your web server code and served data fit within the shared L3 cache, adding more cores will not help capacity issues. The HBM incorporated in the current Xeon Max series solves the starvation issue by filling more of the caches at one time, at the cost of latency.<br> <p> My suggestion would be to modify the nice/priority system, using odd nice values to indicate an affinity towards E-Cores. 
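[Editor's note: a minimal user-space sketch of this hypothetical odd-nice convention. Nothing like this exists in the mainline scheduler; the CPU-id sets, the parity rule, and the function names are all invented for illustration.]

```python
import os

# Hypothetical hybrid topology: which logical CPU ids are P-cores vs E-cores.
# A real tool would discover these from /sys/devices/cpu_core/cpus and
# /sys/devices/cpu_atom/cpus on Intel hybrid systems rather than hard-coding.
P_CORES = {0, 1, 2, 3}
E_CORES = {4, 5, 6, 7}

def preferred_cores(nice: int) -> set:
    """Map a nice value to a CPU set under the proposed convention:
    an odd nice value signals an affinity towards E-cores."""
    return E_CORES if nice % 2 else P_CORES

def apply_affinity(pid: int, nice: int) -> None:
    # Pin the process according to the convention (Linux-only; the
    # chosen CPUs must actually exist on the running machine).
    os.sched_setaffinity(pid, preferred_cores(nice))
```

Under this sketch, `preferred_cores(5)` selects the E-core set and `preferred_cores(4)` the P-core set; note that in Python negative odd values (e.g. -5, since `-5 % 2 == 1`) also map to E-cores, so a real implementation would need to decide how niced-down vs niced-up tasks should behave.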
This change would allow developers to specify thread priority for affinity.<br> </div> Sat, 24 Jun 2023 13:49:07 +0000 AVX-512 https://lwn.net/Articles/911051/ https://lwn.net/Articles/911051/ roblucid <div class="FormattedComment"> Dr Ian Cutress tested various AVX usages on Zen4, and both power and performance benefited with AVX-512, with one exception that lost about 5%. He was impressed with the lack of downsides.<br> </div> Thu, 13 Oct 2022 13:54:49 +0000 AVX-512 https://lwn.net/Articles/911050/ https://lwn.net/Articles/911050/ roblucid <div class="FormattedComment"> AMD Zen4 *cough cough*<br> <p> Dr Ian Cutress&#x27;s program uses AVX2 or AVX512, and it had a 6x speed-up on Zen3.<br> <p> </div> Thu, 13 Oct 2022 13:48:17 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910473/ https://lwn.net/Articles/910473/ paulj <div class="FormattedComment"> If they can&#x27;t figure out a cheap+fast profiling algorithm to optimally schedule instructions onto the available logic units in a CPU in silicon, where is the reason to think that a software scheduler in the kernel - with a) /far/ less insight into the currently executing code, and b) /far/ less control (both in granularity over those logic units and temporally) over the scheduling of those instructions - can do better?<br> </div> Thu, 06 Oct 2022 12:18:06 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910472/ https://lwn.net/Articles/910472/ maxfragg <div class="FormattedComment"> Current smartphone SOCs are already there. 
You have little cores, typically Cortex A53 or A55, a mid tier of Cortex A76 or newer, and then one or two Cortex X1s or similar at the top end.<br> That is what Samsung, Google and Qualcomm are all using in their high-end chips, and I still don&#x27;t quite understand how scheduling heuristics should handle such a SOC in an optimal way.<br> </div> Thu, 06 Oct 2022 12:02:51 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910464/ https://lwn.net/Articles/910464/ maxfragg <div class="FormattedComment"> What you are describing matches very well what you get with hybrid scheduling as it&#x27;s implemented for some ARM SOCs.<br> The scheduler by default tries to keep everything on the LITTLE cores and only migrates tasks to big cores if they repeatedly use up their timeslice before blocking.<br> The result is that you, as a user, get a noticeably slow period if you interact with the system after some idle period, which really is a bad user experience.<br> Say you are running a web browser (and JavaScript usually suffers quite badly on simple in-order cores); after idling, it takes around 3s until all relevant threads have been migrated to the big cores again and your website is actually scrolling smoothly.<br> Not what I would call an ideal user experience, especially if the system isn&#x27;t so strictly confined by power consumption.<br> Some applications shouldn&#x27;t get periodically shifted down to slow cores, simply for the user&#x27;s sake, while other things, which might be janitor tasks from the operating system, probably should never run on big cores.<br> </div> Thu, 06 Oct 2022 10:50:21 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910229/ https://lwn.net/Articles/910229/ rbanffy <div class="FormattedComment"> <font class="QuotedText">&gt; could have been a successful product if it had actually been possible for mere mortals to purchase them as a stand alone CPU.</font><br> <p> Such a shame. 
I really wanted one.<br> <p> The first generation (it was 4 threads per core, I think, each running at a quarter speed) was very limited, and extracting reasonable performance from it was not trivial. Subsequent generations were much better, but, by then, a lot of the heavy lifting was available on GPUs and the &quot;friendly&quot; programming of the Phi was no longer a huge advantage (if you can hide the GPU behind a library, it&#x27;s not significantly uglier than using intrinsics to directly play with AVX-512).<br> <p> Again, a huge shame. I hope Intel makes something with lots of E cores, even if for no other reason than to teach programmers to write more parallel code, because clock x IPC won&#x27;t get higher as quickly as core counts.<br> </div> Mon, 03 Oct 2022 23:46:34 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910215/ https://lwn.net/Articles/910215/ jhoblitt <div class="FormattedComment"> I think that Knights Landing could have been a successful product if it had actually been possible for mere mortals to purchase it as a stand-alone CPU. <br> <p> Phi as an add-on accelerator had all of the same problems as GPU development (e.g. network I/O), without mature tooling (CUDA was a monopoly at the time), and without a cost/performance advantage over GPUs (or so I&#x27;ve heard). However, for workloads like batch HPC/HTC, the improved flops/watt could have been attractive, and it would have saved the labor investment required to deal with GPU development.<br> </div> Mon, 03 Oct 2022 20:45:16 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910207/ https://lwn.net/Articles/910207/ Paf <div class="FormattedComment"> It really didn&#x27;t work very well - the cores were very slow on traditional integer/control-flow code, on the order of 1/4 of a similar-age Xeon, and they were *fussy*. 
Lots of options to tweak memory/cache and scheduler behavior, but they only really worked well on a very small number of workloads, even in HPC. (Basically, think of Amdahl&#x27;s law, but substitute &quot;integer and control flow code (like what happens in operating system code)&quot; and &quot;FP code&quot; for non-parallelizable and parallelizable. They were painful to use well, and they really, really hit a wall when doing anything they weren&#x27;t optimized for. There was definitely a useful space for them, but it wasn&#x27;t large enough, at least at their level of execution.)<br> </div> Mon, 03 Oct 2022 19:13:17 +0000 AVX-512 https://lwn.net/Articles/910188/ https://lwn.net/Articles/910188/ farnz <p>The critical difference is that with SKX, the maximum permitted clock, assuming thermals allowed it, was massively reduced for "heavy" AVX-512, because it caused thermal hot-spots on the chip that weren't properly accounted for by "normal" thermal monitoring. With ICL and with RKL there's no longer a huge limit - instead of the SKX situation (where a chip could drop from 3.0 GHz "base" to 2.8 GHz "max turbo" if you used AVX-512), you can now always sustain the same "base", but the max turbo is reduced by 100 MHz or so. Mon, 03 Oct 2022 15:33:28 +0000 AVX-512 https://lwn.net/Articles/910187/ https://lwn.net/Articles/910187/ drago01 <div class="FormattedComment"> Well, there is also no downclock on Zen4.<br> But given how CPUs work nowadays, that&#x27;s not entirely true either, because a lighter workload will result in higher clocks and vice versa. CPUs try to maximize performance within the power budget.<br> <p> Clocks don&#x27;t matter much, though; what matters is the performance you are getting. 
And if your workload benefits from wide vectors, it will offset any clock changes.<br> </div> Mon, 03 Oct 2022 15:15:30 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910184/ https://lwn.net/Articles/910184/ excors <div class="FormattedComment"> I suspect it depends on whether you&#x27;re thinking about battery-powered mobile CPUs, where you want to minimise average power consumption over several hours of typical usage, or desktop CPUs, where you want to maximise performance under heavy load within a thermally-constrained maximum power consumption. In the latter case, if the system isn&#x27;t heavily loaded then you don&#x27;t care about power consumption and might as well run everything on P-cores to maximise responsiveness. Once all the P-cores are loaded, then you fall back on the E-cores to keep increasing overall throughput with a relatively small increase in power.<br> <p> Intel advertises the hybrid architecture as good for gaming with lots of low-priority background tasks (like &quot;Discord or antivirus software&quot; - <a href="https://www.intel.com/content/www/us/en/gaming/resources/how-hybrid-design-works.html">https://www.intel.com/content/www/us/en/gaming/resources/...</a>), which sounds plausible. Modern game consoles have 8-core CPUs, so game engines are probably designed to scale effectively to 8 cores but maybe not much further. 
They should fit well on the Intel CPU&#x27;s 8 P-cores, and then Intel stuffs in as many E-cores as they can afford (in terms of power and manufacturing cost) to handle any less-latency-sensitive or more-easily-parallelisable tasks without interrupting the game.<br> </div> Mon, 03 Oct 2022 15:06:08 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910115/ https://lwn.net/Articles/910115/ rbanffy <div class="FormattedComment"> <font class="QuotedText">&gt; Intel hasn&#x27;t offered such a beast, so I&#x27;m guessing the scaling isn&#x27;t linear.</font><br> <p> They did - it was called Xeon Phi, and making it run at its theoretical peak performance was difficult - it was better suited for HPC workloads (it borrowed heavily from their Larrabee chip) than general-purpose high-throughput tasks such as web serving. I think it&#x27;s a shame it ended up abandoned.<br> </div> Mon, 03 Oct 2022 12:51:08 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910113/ https://lwn.net/Articles/910113/ Karellen <p>When I said "until the demand for CPU time rises", I was considering the equivalent of "when load average climbs above the number of awake cores" for a suitably short-duration load average, e.g. 2s or less. If that helps. <p>Anyway, I always thought that higher-performance cores had "deeper" sleep states, and took longer to wake from them. So even for latency-based workloads, if the amount of work to be done was relatively small, a low-power core might still complete the task sooner? Am I mistaken there? <p>But also, it still seems weird on the kind of machine you're discussing, where you want work done as quickly as possible. You're always scheduling work on the fastest core available, but once you get to the point where 16 high-performance cores are all in use and you need even more work done, <em>that</em> is the point at which the system decides "Well, now is the time to wake up those energy-saving, low-power, low-performance cores. 
They'll be a great help here!" I don't see how that strategy makes sense. Mon, 03 Oct 2022 12:31:46 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910111/ https://lwn.net/Articles/910111/ paulj <div class="FormattedComment"> s/never needed volume/never got the needed volume/<br> s/ SPARC/UltraSPARC/g<br> </div> Mon, 03 Oct 2022 11:15:16 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910110/ https://lwn.net/Articles/910110/ timrichardson <div class="FormattedComment"> Laptops often have power profiles which impose thermal limits and possibly cpu frequency limits and I don&#x27;t know what else. I wonder how these profiles interact with the core scheduling, or maybe they operate at such vastly different time scales it hardly matters.<br> </div> Mon, 03 Oct 2022 11:04:47 +0000 AVX-512 https://lwn.net/Articles/910100/ https://lwn.net/Articles/910100/ farnz <p>The difficulty comes in when my workload is scheduled on a single server with other workloads. The right decision for my workload if on a machine by itself is AVX-512 at the lower clocks; however, depending on what the scheduler does, the right decision might become AVX2 if other workloads are more important than mine, and are adversely affected by the core doing AVX-512 downclocking. <p>This is the problem with using local state ("does this OS thread make use of AVX-512") to drive a global decision ("what clock speed should this core run at"). The correct answer depends not only on my workload, but also on all other workloads sharing this CPU core - which is fine for HPC type workloads, where there are no other workloads sharing a CPU core, but more of a problem with general deployment of AVX-512. <p>As a side note, as Intel moves on with process from the 14nm of original AVX-512 CPUs, the downclock becomes less severe, and it's nearly non-existent on the latest designs. 
This, to me, suggests that the downclock is a consequence of backporting AVX-512 to Skylake on 14nm, and thus will become a historic artefact over time. Mon, 03 Oct 2022 09:32:54 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910105/ https://lwn.net/Articles/910105/ paulj <div class="FormattedComment"> Tilera was a bit weird and niche though, and orientated to network processing. Also buggy and unreliable - underdeveloped, presumably cause it never needed volume.<br> <p> UltraSPARC T1 is probably more like it. A whole bunch of (relatively) classic, simple, in-order, 5-stage RISC SPARC CPUs, behind a thread-selector stage to rapidly switch threads and a huge memory controller. Low instructions-per-thread, but - if you loaded it up with an embarrassingly parallel IO workload (i.e. uncomplicated HTTP serving) - very high aggregate IPC.<br> <p> [In the initial versions, they neglected to add FPUs (just one FPU in the first T1, shared across all cores) and then - once away from Sun&#x27;s in-house benchmarks and into the real world - they realised that a lot of real-world applications depend on scripting languages that use FP by default, even for integer math. ;) ]<br> </div> Mon, 03 Oct 2022 09:24:11 +0000 AVX-512 https://lwn.net/Articles/910060/ https://lwn.net/Articles/910060/ khim <p>The biggest problem there is the fact that the decision to use (or not use) SSE, AVX, or AVX-512 is local (you pick these at the level of tiny, elementary functions) while the question of whether AVX-512 is beneficial or not is global.</p> <p>Essentially the same dilemma which killed Itanic, just not as acute.</p> Sun, 02 Oct 2022 12:06:49 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910058/ https://lwn.net/Articles/910058/ roc <div class="FormattedComment"> I agree such a workload could exist, but it&#x27;s not enough to just have multiple endpoints with these different properties; they also need to share memory. 
If the different endpoints don&#x27;t need to share memory with each other, you can just run two or more separate servers. Plus, endpoints that *require* shared memory are going to have problems with horizontal scaling beyond one server. This use case starts to sound very contrived.<br> </div> Sun, 02 Oct 2022 08:40:45 +0000 AVX-512 https://lwn.net/Articles/910057/ https://lwn.net/Articles/910057/ drago01 <div class="FormattedComment"> If the clock offset hurts you more than the gains from AVX-512, then your workload does not really benefit from AVX-512 in a meaningful way.<br> </div> Sun, 02 Oct 2022 07:30:32 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910053/ https://lwn.net/Articles/910053/ neggles <p>Intel are of the opinion that hybrid serves no purpose in servers, as you're typically not going to have a particularly non-uniform workload on a server; they tend to run at fairly steady-state load levels, with a lot of simultaneous tasks, so you're not generally going to have unloaded/unused cores. Heterogeneous compute isn't super useful if you never have an opportunity to clock/power gate some P-cores; AMD and ARM tend to agree with them on this point, too.</p> <p> That said, Intel do have <a href="https://www.servethehome.com/intel-sierra-forest-the-e-core-xeon-intel-needs/">a pure E-core Xeon CPU</a> due out in 2024, probably with some frankly ludicrous number of cores, targeted at the mobile game market. Quite a few mobile games for Android run entirely in the cloud, with a dedicated core per simultaneous user, streaming the result down to the user's device; this prevents cheating, saves device battery usage, and means you get the same experience regardless of your device's performance. (Personally I don't like it, but that's what they're doing...) These chips are targeted at that. 
</p> Sun, 02 Oct 2022 00:03:49 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910051/ https://lwn.net/Articles/910051/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; … assuming that your total load is low enough that you can actually go to sleep.</font><br> <p> Are these general purpose chips? Sounds like they&#x27;re just a mix of Pentium/Arm cores (or whatever the standard x86_64 chip is called nowadays ...)<br> <p> So if that chip is in a desktop (and is it financially profitable for Intel to make two different chips for server and desktop?) then I&#x27;m sure that a lot of cores will spend a lot of time asleep. xosview tells me that a lot of my cores spend a lot of time at 0% utilisation. And it looks like tasks spend a lot of time jumping from core to core. To my untutored glance, it looks like - most of the time - I wouldn&#x27;t notice in the slightest if most of my cores were disabled.<br> <p> What makes sense for a dedicated single-purpose chip may make completely different sense when looked at by an accountant asking &quot;is it worth spending that money?&quot;.<br> <p> Cheers,<br> Wol<br> </div> Sat, 01 Oct 2022 22:37:33 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910050/ https://lwn.net/Articles/910050/ Cyberax <div class="FormattedComment"> The closest available model to this was probably Tilera. They produced a chip for routers and other similar applications where you need to process many concurrent independent streams.<br> <p> It went down like a brick. It was barely better than Xeons and not really any cheaper, while requiring a completely new architecture.<br> </div> Sat, 01 Oct 2022 21:43:33 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910048/ https://lwn.net/Articles/910048/ jhoblitt <div class="FormattedComment"> I often hear this stated by the scheduler folks, yet we see ARM moving towards 3 different classes of core in the same chip. 
Is this all marketing hype or does it make sense in at least some scenarios?<br> </div> Sat, 01 Oct 2022 20:43:53 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910044/ https://lwn.net/Articles/910044/ flussence <div class="FormattedComment"> They&#x27;d be competing unfavourably with existing 2P*128C ARM servers - and Zen4 Epyc when that comes out. Intel would not want to be seen putting out a high specs server CPU with no AVX512 at all while AMD&#x27;s building chips that can do it within a sane power envelope.<br> </div> Sat, 01 Oct 2022 20:36:27 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910046/ https://lwn.net/Articles/910046/ smurf <div class="FormattedComment"> … assuming that your total load is low enough that you can actually go to sleep.<br> <p> If not, then (as the main text states) you need to distribute the load between E- and P-cores somewhat intelligently.<br> </div> Sat, 01 Oct 2022 20:03:39 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910040/ https://lwn.net/Articles/910040/ atnot <div class="FormattedComment"> The thing you are not accounting for is sleep states. Most background processes do not run constantly, but sleep until woken up by some event. The E Cores do use less power than the P cores, but it&#x27;s still many times what the SoC draws when completely idle. Thus counterintuitively, minimizing power draw is not actually about performing the work with the least energy use. It&#x27;s a race to handle whatever woke you up and get back into a sleep state as fast as you can. This is why P cores will actually save you energy.<br> </div> Sat, 01 Oct 2022 15:30:14 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910036/ https://lwn.net/Articles/910036/ epa <div class="FormattedComment"> What you say is true for throughput-oriented workloads. 
But for latency-oriented ones, even if my machine is mostly unloaded with 15 cores standing idle, I still want it to respond as fast as possible. So if there is some single-threaded task running, that has brief bursts of being CPU-bound in between waiting for I/O, I would schedule that task on the fastest core I have. <br> <p> Even today there are applications, like a certain Lisp-based text editor, which are single threaded, interactive, and often need to grind the CPU before they can respond to the user. <br> <p> Or when you say ‘until the demand for CPU time rises…’, are you including this scenario in that condition?<br> </div> Sat, 01 Oct 2022 13:30:41 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910034/ https://lwn.net/Articles/910034/ jhoblitt <div class="FormattedComment"> No, the question is literally why doesn&#x27;t Intel sell those. <br> </div> Sat, 01 Oct 2022 12:41:21 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910031/ https://lwn.net/Articles/910031/ Karellen <blockquote>The Pcores have higher performance and also support simultaneous multi-threading (SMT). The Ecores, instead, are more focused on energy efficiency than performance; [...] As of 4.10, Intel's ITMT (standing for "Intel Turbo Boost Max Technology") support caused the scheduler to prefer Pcores over Ecores, all else being equal. That had the effect of putting processes on the faster CPUs when possible, but it also would load all SMT sibling CPUs before falling back to the Ecores,</blockquote> <p>Isn't that backwards? 
Wouldn't it be better to generally prefer Ecores while the load is low enough that the Ecores can handle the workload, and only start waking up and putting processes on the Pcores once the demand for CPU time rises high enough that they can't?</p> <p>What's the benefit of using the low-power cores as the fall-backs?</p> Sat, 01 Oct 2022 12:21:07 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910024/ https://lwn.net/Articles/910024/ tlamp <div class="FormattedComment"> <font class="QuotedText">&gt; Right, but you could just use a server with 256 e-cores and no p-cores and hybrid scheduling isn&#x27;t an issue.</font><br> <p> For some server workloads it could also be interesting to still have both, for example 128 E-cores and 32 P-cores.<br> <p> For example, if I have an HTTP API with various endpoints, some of them quite compute-intensive and others not, a mixed setup may have higher throughput at lower total power consumption.<br> Tasks that, e.g., do lots of number crunching (or profit from advanced CPU core capabilities) can be moved to the P-cores, while the relatively big number of E-cores can, in parallel, serve a massive amount of the &quot;cheaper&quot; responses using a low(er) amount of compute resources, potentially avoiding latency spikes compared to the homogeneous setups.<br> <p> For very specialized workloads it may be better to handle such things explicitly in the application or user space itself via pinning, but I&#x27;ve seen quite a few services where such different request workloads can happen, and it&#x27;d be nice if the OS scheduler would handle that better in a general way, even if not 100% optimal.<br> </div> Sat, 01 Oct 2022 11:31:29 +0000 AVX-512 https://lwn.net/Articles/910025/ https://lwn.net/Articles/910025/ atnot <div class="FormattedComment"> It is slower in most practical scenarios because it imposes a high frequency penalty in Intel&#x27;s current implementations. 
So you need to have enough AVX-512 instructions lined up in a row to make up for the latency of clocking down into and back out of AVX-512 mode. This gets worse in the common cases where backend bottlenecks mean that AVX-512 is only actually slightly faster. So AVX-512 is usually going to be slower outside of select HPC workloads.<br> </div> Sat, 01 Oct 2022 11:31:01 +0000 AVX-512 https://lwn.net/Articles/910019/ https://lwn.net/Articles/910019/ drago01 <div class="FormattedComment"> Why would it be slower than AVX2?<br> If your workload benefits from AVX-512, it will be significantly faster than AVX2. <br> </div> Sat, 01 Oct 2022 10:24:56 +0000 AVX-512 https://lwn.net/Articles/910016/ https://lwn.net/Articles/910016/ ballombe <div class="FormattedComment"> ...which makes AVX-512 even less dependable.<br> Why write AVX-512 code when almost nobody can run it, and when for half of those who can, it is slower than AVX2?<br> </div> Sat, 01 Oct 2022 08:59:01 +0000 AVX-512 https://lwn.net/Articles/910015/ https://lwn.net/Articles/910015/ Sesse <div class="FormattedComment"> That&#x27;s true, and Intel has “solved” that by permanently disabling AVX-512 on the P-cores. The silicon is physically present, but that&#x27;s just a manufacturing detail (similar to how GPUs often have some defective cores that are disabled). So yes, they do run the same instruction set.<br> </div> Sat, 01 Oct 2022 07:42:55 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910014/ https://lwn.net/Articles/910014/ Sesse <div class="FormattedComment"> The question is, does Intel sell those? :-)<br> </div> Sat, 01 Oct 2022 07:40:35 +0000 AVX-512 https://lwn.net/Articles/910013/ https://lwn.net/Articles/910013/ drago01 <div class="FormattedComment"> Yes, but neither do P-cores. 
They technically could support AVX-512 but it&#x27;s disabled.<br> </div> Sat, 01 Oct 2022 07:38:13 +0000 AVX-512 https://lwn.net/Articles/910012/ https://lwn.net/Articles/910012/ epa <div class="FormattedComment"> The article says “Both types of CPU implement the same instruction set”. But that’s not true, surely? The AVX-512 instructions and perhaps others are not available on E-cores. <br> <p> </div> Sat, 01 Oct 2022 06:23:37 +0000 Hybrid scheduling gets more complicated https://lwn.net/Articles/910011/ https://lwn.net/Articles/910011/ roc <div class="FormattedComment"> Right, but you could just use a server with 256 e-cores and no p-cores and hybrid scheduling isn&#x27;t an issue.<br> <p> In theory there could be a non-uniform workload that performs really well with 64 p-cores and 256 e-cores *all sharing memory*, but that sounds relatively unusual.<br> </div> Sat, 01 Oct 2022 05:44:43 +0000