Hybrid scheduling gets more complicated
Brown started by describing Intel's hybrid CPUs which, he said, have a
combination of "Pcores" and "Ecores". The Pcores have higher performance
and also support simultaneous multi-threading (SMT). The Ecores, instead, are
more focused on energy efficiency than performance; Ecores were once known
as "Atom" CPUs. Both types of CPU implement the same instruction set, so a
process can move freely between the two types.
Kernel releases through 4.9 treated all CPUs on these systems as being equal; that meant that any given process would experience variable performance depending on where the scheduler placed it in the system. As of 4.10, Intel's ITMT (standing for "Intel Turbo Boost Max Technology") support caused the scheduler to prefer Pcores over Ecores, all else being equal. That had the effect of putting processes on the faster CPUs when possible, but it also would load all SMT sibling CPUs before falling back to the Ecores, which led to worse performance overall. That has been fixed as of 5.16; an Ecore will now be preferred over an SMT CPU whose sibling is already busy.
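The resulting preference order is easy to picture with a small user-space sketch; the struct cpu type and the helpers below are invented for illustration and bear no relation to the scheduler's actual data structures.

```c
#include <stdbool.h>
#include <stdio.h>

/* Invented types and helpers, purely to illustrate the ordering described
 * above; this is not the scheduler's actual code. */
struct cpu {
    int id;
    bool is_pcore;
    bool sibling_busy;  /* SMT sibling already has work (Pcores only) */
    bool idle;
};

static int preference(const struct cpu *c)
{
    if (c->is_pcore && !c->sibling_busy)
        return 3;   /* fully idle Pcore: best choice */
    if (!c->is_pcore)
        return 2;   /* idle Ecore: preferred as of 5.16 */
    return 1;       /* Pcore sharing a busy core via SMT: last resort */
}

static const struct cpu *pick_cpu(const struct cpu *cpus, int n)
{
    const struct cpu *best = NULL;

    for (int i = 0; i < n; i++) {
        if (!cpus[i].idle)
            continue;
        if (!best || preference(&cpus[i]) > preference(best))
            best = &cpus[i];
    }
    return best;
}

int main(void)
{
    const struct cpu cpus[] = {
        { 0, true,  true,  true  },  /* idle Pcore whose SMT sibling is busy */
        { 1, true,  true,  false },  /* busy Pcore */
        { 2, false, false, true  },  /* idle Ecore */
    };
    const struct cpu *c = pick_cpu(cpus, 3);

    printf("place the new task on CPU %d\n", c ? c->id : -1);  /* CPU 2 */
    return 0;
}
```

Before 5.16, the equivalent of preference() would have rated any Pcore above any Ecore, busy sibling or not.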
Pcores are faster; they not only run at a higher clock frequency, but are also able to get more work done with each clock cycle. As a result, clock frequencies alone are not sufficient to compare the capacity of two CPUs in a system. To address this problem, the hardware is able to provide both performance and efficiency scores for each CPU; these numbers can change at run time if conditions change, Brown said.
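As a rough, user-space illustration of why frequency comparisons mislead, one can scale each CPU's capacity from clock frequency times per-cycle throughput, in the spirit of the scheduler's 1024-based capacity scale; the numbers here are invented.

```c
#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024  /* the scheduler rates the biggest CPU at 1024 */

/* Invented figures: the Ecore clocks nearly as high here, but retires less
 * work per cycle, so its usable capacity is much lower. */
struct cpu_caps {
    const char *name;
    double freq_ghz;        /* sustained clock frequency */
    double work_per_cycle;  /* relative work retired per cycle */
};

static double raw(const struct cpu_caps *c)
{
    return c->freq_ghz * c->work_per_cycle;
}

int main(void)
{
    const struct cpu_caps cpus[] = {
        { "Pcore", 4.5, 1.00 },
        { "Ecore", 3.8, 0.60 },
    };
    double max = raw(&cpus[0]) > raw(&cpus[1]) ? raw(&cpus[0]) : raw(&cpus[1]);

    for (int i = 0; i < 2; i++)
        printf("%s: capacity %4.0f\n", cpus[i].name,
               raw(&cpus[i]) / max * SCHED_CAPACITY_SCALE);
    return 0;
}
```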
The situation is actually a bit more complex than that, though. The performance difference between the CPU types depends on which instructions are being executed at any given time. Programs using the VNNI instructions (which are intended to accelerate machine-learning applications) may see much more advantage from running on a Pcore than those that are doing nothing special. There are four different classes of performance, determined by the types of instructions being executed, and the ratio of Pcore to Ecore performance is different for each.
To schedule such a system optimally, the kernel should use the Pcores to run the processes that will benefit from them the most. Application developers cannot really be expected to know which of the four performance classes best describes their code, and the appropriate class may change over a program's execution in any case, but the CPU certainly knows which types of instructions are being executed at any given time. So each CPU exposes a register indicating which performance class best describes the currently running process. That allows the kernel to assign a class ID and use it in scheduling decisions.
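A minimal sketch of that flow might look like the following; hw_read_current_class() is a stand-in for the hardware classification register described above, and struct task for the relevant bits of task_struct.

```c
#include <stdio.h>

/* Sketch only: hw_read_current_class() stands in for the hardware
 * classification register, and struct task for the relevant bits of
 * task_struct. */
struct task {
    const char *comm;
    int class_id;   /* last performance class reported for this task */
};

static int hw_read_current_class(void)
{
    return 2;       /* pretend the CPU reports class 2 (say, VNNI-heavy code) */
}

/* called periodically, e.g. from the scheduler tick, for the running task */
static void update_task_class(struct task *curr)
{
    curr->class_id = hw_read_current_class();
}

int main(void)
{
    struct task t = { "inference", -1 };

    update_task_class(&t);
    printf("%s is now in performance class %d\n", t.comm, t.class_id);
    return 0;
}
```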
Neri took over to describe the work that has been done to take advantage of
this information. The class ID of each process is stored in its task_struct
structure. The first use of this information is in the idle load
balancer, which is invoked when a CPU has run out of tasks to execute and
looks to see if a task should be pulled from a more heavily loaded CPU
elsewhere in the system. This code can look at the class ID of each
candidate task to find the one that would benefit the most (or suffer the
least) from being moved. This check works at both ends; a task that is
making heavy use of instructions that are best run on its current CPU
should not be moved if possible.
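The selection logic can be sketched as follows; the per-class speedup table and all of the names are invented for illustration, since the real numbers come from the hardware and can change over time.

```c
#include <stdio.h>

#define NR_CLASSES 4

/* Invented ratios: how much faster each class runs on a Pcore than on an
 * Ecore. */
static const double pcore_ratio[NR_CLASSES] = { 1.3, 1.5, 1.8, 2.6 };

struct task {
    const char *comm;
    int class_id;
};

/* positive: the task gains from moving to the destination CPU type */
static double move_gain(const struct task *t, int src_is_pcore, int dst_is_pcore)
{
    double src = src_is_pcore ? pcore_ratio[t->class_id] : 1.0;
    double dst = dst_is_pcore ? pcore_ratio[t->class_id] : 1.0;

    return dst - src;
}

static const struct task *pick_task(const struct task *rq, int nr,
                                    int src_is_pcore, int dst_is_pcore)
{
    const struct task *best = NULL;
    double best_gain = 0;

    for (int i = 0; i < nr; i++) {
        double gain = move_gain(&rq[i], src_is_pcore, dst_is_pcore);

        if (!best || gain > best_gain) {
            best = &rq[i];
            best_gain = gain;
        }
    }
    return best;
}

int main(void)
{
    /* an idle Ecore considers pulling from a loaded Pcore's queue */
    const struct task rq[] = { { "vnni-worker", 3 }, { "logger", 0 } };
    const struct task *t = pick_task(rq, 2, 1 /* src: Pcore */, 0 /* dst: Ecore */);

    printf("pull %s\n", t->comm);   /* "logger": it suffers the least on an Ecore */
    return 0;
}
```

In this example the idle Ecore pulls the logging task and leaves the VNNI-heavy worker where it benefits most.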
An audience member asked whether the class ID of a running process can be adjusted from user space. Brown answered that this capability exists for debugging purposes, but that nobody had thought about making it available as a supported feature.
Neri continued that the kernel's NUMA-balancing code can also look at the class IDs and exchange tasks between nodes if that would lead to better system performance. Something similar could also be done with busy load balancing, which tries to even out the load across a busy system. This idea made some developers nervous; it would be easy to break load balancing in ways that create performance regressions that don't come to light until long afterward. Neri emphasized that the class ID would only be used in load-balancing decisions if the existing heuristics led to a tie between two options.
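The shape of that tie-breaking rule, again as an illustration rather than actual kernel code:

```c
#include <stdio.h>

/* Illustration only: class information is consulted solely when the
 * existing metric cannot decide between two choices. */
static int prefer_first(long metric_a, long metric_b, int class_a, int class_b)
{
    if (metric_a != metric_b)
        return metric_a > metric_b;   /* existing heuristics decide */
    return class_a > class_b;         /* class ID only breaks the tie */
}

int main(void)
{
    printf("%d\n", prefer_first(100, 100, 3, 1));  /* tie: the class decides -> 1 */
    printf("%d\n", prefer_first(100, 120, 3, 1));  /* no tie: the metric decides -> 0 */
    return 0;
}
```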
The final moments of the session were dedicated to the problem of scheduling on Intel's Alder Lake CPUs (which started shipping earlier this year). Specifically, the kernel's energy-aware scheduling heuristics don't work well on those CPUs. A number of features present there complicate the energy picture; these include SMT, Intel's "turbo boost" mode, and the CPU's internal power-management mechanisms. For many workloads, running on an ostensibly more power-hungry Pcore can be more efficient than using an Ecore. Time for discussion of the problem was lacking, though, and the session came to a close.
[Thanks to LWN subscribers for supporting my travel to this event.]
| Index entries for this article | |
|---|---|
| Kernel | Scheduler |
| Conference | Linux Plumbers Conference/2022 |
Posted Sep 30, 2022 21:18 UTC (Fri)
by jhoblitt (subscriber, #77733)
[Link] (32 responses)

Apparently, four E-cores take up the same physical space as one P-core on Alder Lake.
Posted Sep 30, 2022 21:37 UTC (Fri)
by james (subscriber, #1325)
[Link]
Posted Sep 30, 2022 21:50 UTC (Fri)
by andromeda (guest, #138427)
[Link] (30 responses)
[1] https://en.wikichip.org/wiki/intel/microarchitectures/ald...
[2] https://en.wikipedia.org/wiki/7_nm_process#7_nm_process_n...
Posted Sep 30, 2022 21:58 UTC (Fri)
by jhoblitt (subscriber, #77733)
[Link] (29 responses)
Posted Oct 1, 2022 3:42 UTC (Sat)
by roc (subscriber, #30627)
[Link] (14 responses)
Posted Oct 1, 2022 3:57 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link] (13 responses)

Where the latter could retire twice as many instructions within the same thermal envelope (assuming memory bandwidth isn't a bottleneck). Intel hasn't offered such a beast, so I'm guessing the scaling isn't linear.
Posted Oct 1, 2022 5:44 UTC (Sat)
by roc (subscriber, #30627)
[Link] (8 responses)
In theory there could be a non-uniform workload that performs really well with 64 p-cores and 256 e-cores *all sharing memory*, but that sounds relatively unusual.
Posted Oct 1, 2022 7:40 UTC (Sat)
by Sesse (subscriber, #53779)
[Link] (5 responses)
Posted Oct 1, 2022 12:41 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link] (1 responses)
Posted Oct 1, 2022 20:36 UTC (Sat)
by flussence (guest, #85566)
[Link]
Posted Oct 1, 2022 21:43 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
It went down like a brick. It was barely better than Xeons and not really any cheaper, while requiring a completely new architecture.
Posted Oct 3, 2022 9:24 UTC (Mon)
by paulj (subscriber, #341)
[Link] (1 responses)
UltraSPARC T1 is probably more like it. A whole bunch of (relatively) classic, simple, in-order, 5-stage RISC SPARC CPUs, behind a thread-selector stage to rapidly switch threads and a huge memory controller. Low instructions-per-thread, but - if you loaded it up with an embarrassingly parallel IO workload (i.e. uncomplicated HTTP serving) - very high aggregate IPC.
[In the initial versions, they neglected to add FPUs (just 1 FPU in the first T1, shared across all cores) and then - once away from Sun's in-house benchmarks and into the real world - they realised that a lot of real-world applications depend on scripting languages that use FP by default, even for integer math. ;) ]
Posted Oct 3, 2022 11:15 UTC (Mon)
by paulj (subscriber, #341)
[Link]

s/ SPARC/UltraSPARC/g
Posted Oct 1, 2022 11:31 UTC (Sat)
by tlamp (subscriber, #108540)
[Link] (1 responses)
For some server workloads it could also be interesting to still have both, for example 128 E-cores and 32 P-cores.
For example, if I've got an HTTP API with various endpoints, some of them quite compute-intensive and others not, a mixed setup may have higher throughput at a lower total power consumption.
Tasks that, e.g., do lots of number crunching (or profit from advanced CPU core capabilities) can be moved to the P-cores, while the relatively big number of E-cores can, in parallel, serve a massive amount of the "cheaper" responses using fewer compute resources and potentially avoiding latency spikes compared to homogeneous setups.
For very specialized workloads it may be better to handle such things explicitly in the application, or in user space via pinning, but I've seen quite a few services where such mixed request workloads happen, and it would be nice if the OS scheduler handled that better in a general way, even if not 100% optimally.
Posted Oct 2, 2022 8:40 UTC (Sun)
by roc (subscriber, #30627)
[Link]
Posted Oct 3, 2022 12:51 UTC (Mon)
by rbanffy (guest, #103898)
[Link] (3 responses)
They did - it was called Xeon Phi and making it run at its theoretical peak performance was difficult - it was better suited for HPC workloads (it borrowed heavily from their Larrabee chip) than general-purpose high-throughput tasks such as web serving. I think it's a shame it ended up abandoned.
Posted Oct 3, 2022 19:13 UTC (Mon)
by Paf (subscriber, #91811)
[Link] (2 responses)
Posted Oct 3, 2022 20:45 UTC (Mon)
by jhoblitt (subscriber, #77733)
[Link] (1 responses)
Phi as an add-on accelerator had all of the same problems as GPU development (e.g. network I/O), without mature tooling (CUDA was a monopoly at the time), and without a cost/performance advantage over GPUs (or so I've heard). However, for workloads like batch HPC/HTC, the improved flops/watt could have been attractive, and it would have saved the labor investment required to deal with GPU development.
Posted Oct 3, 2022 23:46 UTC (Mon)
by rbanffy (guest, #103898)
[Link]
Such a shame. I really wanted one.
The first generation (it was 4 threads per core, I think, each running at a quarter speed) was very limited and extracting reasonable performance from it was not trivial. Subsequent generations were much better, but, by then, a lot of the heavy lifting was available on GPUs and the "friendly" programming of the Phi was no longer a huge advantage (if you can hide the GPU behind a library, it's not significantly uglier than using intrinsics to directly play with AVX-512).
Again, a huge shame. I hope Intel makes something with lots of E cores, even if for no other reason than to teach programmers to make more parallel code, because clock x IPCs won't get higher as quickly as core counts.
Posted Oct 1, 2022 6:23 UTC (Sat)
by epa (subscriber, #39769)
[Link] (12 responses)
Posted Oct 1, 2022 7:38 UTC (Sat)
by drago01 (subscriber, #50715)
[Link] (10 responses)
Posted Oct 1, 2022 8:59 UTC (Sat)
by ballombe (subscriber, #9523)
[Link] (9 responses)

Why write AVX512 code when almost nobody can run it and when, for half of those who can, it is slower than AVX2?
Posted Oct 1, 2022 10:24 UTC (Sat)
by drago01 (subscriber, #50715)
[Link] (7 responses)

If your workload benefits from AVX-512 it will be significantly faster than AVX2.
Posted Oct 1, 2022 11:31 UTC (Sat)
by atnot (subscriber, #124910)
[Link] (6 responses)
Posted Oct 2, 2022 7:30 UTC (Sun)
by drago01 (subscriber, #50715)
[Link] (5 responses)
Posted Oct 2, 2022 12:06 UTC (Sun)
by khim (subscriber, #9252)
[Link]
The biggest problem there is the fact that the decision to use (or not use) SSE, AVX, or AVX-512 is local (you pick these at the level of tiny, elementary functions) while the question of whether AVX-512 is beneficial or not is global. Essentially the same dilemma that killed the Itanic, just not as acute.
Posted Oct 3, 2022 9:32 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (3 responses)
The difficulty comes in when my workload is scheduled on a single server with other workloads. The right decision for my workload if on a machine by itself is AVX-512 at the lower clocks; however, depending on what the scheduler does, the right decision might become AVX2 if other workloads are more important than mine, and are adversely affected by the core doing AVX-512 downclocking.
This is the problem with using local state ("does this OS thread make use of AVX-512") to drive a global decision ("what clock speed should this core run at"). The correct answer depends not only on my workload, but also on all other workloads sharing this CPU core - which is fine for HPC type workloads, where there are no other workloads sharing a CPU core, but more of a problem with general deployment of AVX-512.
As a side note, as Intel moves on with process from the 14nm of original AVX-512 CPUs, the downclock becomes less severe, and it's nearly non-existent on the latest designs. This, to me, suggests that the downclock is a consequence of backporting AVX-512 to Skylake on 14nm, and thus will become a historic artefact over time.
Posted Oct 3, 2022 15:15 UTC (Mon)
by drago01 (subscriber, #50715)
[Link] (2 responses)
But given how CPUs work nowadays that's not entirely true either, because a lighter workload will result in higher clocks and vice versa. CPUs try to maximize performance within the power budget.
Clocks don't matter much though; what matters is the performance you are getting. And if your workload benefits from wide vectors it will offset any clock changes.
Posted Oct 3, 2022 15:33 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
The critical difference is that with SKX, the maximum permitted clock assuming that thermals allowed was massively reduced for "heavy" AVX-512, because it caused thermal hot-spots on the chip that weren't properly accounted for by "normal" thermal monitoring. With ICL and with RKL there's no longer a huge limit - instead of the SKX thing (where a chip could drop from 3.0 GHz "base" to 2.8 GHz "max turbo" if you used AVX-512), you now can always sustain the same "base", but the max turbo is reduced by 100 MHz or so.
Posted Oct 13, 2022 13:54 UTC (Thu)
by roblucid (guest, #48964)
[Link]
Posted Oct 13, 2022 13:48 UTC (Thu)
by roblucid (guest, #48964)
[Link]
Dr Ian Cutress's program uses AVX2 or AVX512 and it had a 6x speed up on Zen3.
Posted Oct 1, 2022 7:42 UTC (Sat)
by Sesse (subscriber, #53779)
[Link]
Posted Oct 2, 2022 0:03 UTC (Sun)
by neggles (subscriber, #153254)
[Link]
Intel are of the opinion that hybrid serves no purpose in servers, as you're typically not going to have a particularly non-uniform workload on a server; they tend to run at fairly steady-state load levels, with a lot of simultaneous tasks, so you're not generally going to have unloaded/unused cores. Heterogeneous compute isn't super useful if you never have an opportunity to clock/power gate some P-cores; AMD and ARM tend to agree with them on this point, too.
That said, Intel do have a pure E-core Xeon CPU due out in 2024, probably with some frankly ludicrous number of cores, targeted at the mobile-gaming market. Quite a few mobile games for Android run entirely in the cloud, with a dedicated core per simultaneous user, streaming the result down to the user's device; that prevents cheating, saves device battery, and means you get the same experience regardless of your device's performance. (Personally I don't like it, but that's what they're doing...) These CPUs are aimed at that market.
Posted Oct 1, 2022 12:21 UTC (Sat)
by Karellen (subscriber, #67644)
[Link] (10 responses)
> The Pcores have higher performance and also support simultaneous multi-threading (SMT). The Ecores, instead, are more focused on energy efficiency than performance; [...] As of 4.10, Intel's ITMT (standing for "Intel Turbo Boost Max Technology") support caused the scheduler to prefer Pcores over Ecores, all else being equal. That had the effect of putting processes on the faster CPUs when possible, but it also would load all SMT sibling CPUs before falling back to the Ecores,

Isn't that backwards? Wouldn't it be better to generally prefer Ecores while the load is low enough that the Ecores can handle the workload, and only start waking up and putting processes on the Pcores once the demand for CPU time rises high enough that they can't? What's the benefit of using the low-power cores as the fall-backs?
Posted Oct 1, 2022 13:30 UTC (Sat)
by epa (subscriber, #39769)
[Link] (2 responses)
Even today there are applications, like a certain Lisp-based text editor, which are single threaded, interactive, and often need to grind the CPU before they can respond to the user.
Or when you say ‘until the demand for CPU time rises…’, are you including this scenario in that condition?
Posted Oct 3, 2022 12:31 UTC (Mon)
by Karellen (subscriber, #67644)
[Link]
When I said "until the demand for CPU time rises", I was considering the equivalent of "when load average climbs above the number of awake cores" for a suitably short-duration load average, e.g. 2s or less. If that helps.
Anyway, I always thought that higher-performance cores had "deeper" sleep states, and took longer to wake from them. So even for latency-based workloads, if the amount of work to be done was relatively small, a low-power core might still complete the task sooner? Am I mistaken there?
But also, it still seems weird on the kind of machine you're discussing, where you want work done as quickly as possible. You're always scheduling work on the fastest core available, but once you get to the point where 16 high-performance cores are all in use, and you need even more work done, that is the point at which the system decides "Well, now is the time to wake up those energy-saving, low-power, low-performance cores. They'll be a great help here!" I don't see how that strategy makes sense.
Posted Oct 6, 2022 10:50 UTC (Thu)
by maxfragg (guest, #122266)
[Link]

The scheduler by default tries to keep everything on the LITTLE cores and only migrates tasks to the big cores if they repeatedly use up their timeslice before blocking.
The result is that, as a user, you get a noticeably slow period if you interact with the system after some idle time, which really is a bad user experience.
Say you are running a web browser (and JavaScript usually suffers quite badly on simple in-order cores); after idling, it takes around 3s until all the relevant threads have been migrated to the big cores again and your website is actually scrolling smoothly.
Not what I would call an ideal user experience, especially if the system isn't so strictly confined by power consumption.
Some applications shouldn't get periodically shifted down to slow cores, simply for the user's sake, while other things, which might be janitor tasks from the operating system, probably never should run on the big cores.
Posted Oct 1, 2022 15:30 UTC (Sat)
by atnot (subscriber, #124910)
[Link] (5 responses)
Posted Oct 1, 2022 20:03 UTC (Sat)
by smurf (subscriber, #17840)
[Link] (4 responses)
If not, then (as the main text states) you need to distribute the load between E- and P-cores somewhat intelligently.
Posted Oct 1, 2022 20:43 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link] (2 responses)
Posted Oct 6, 2022 12:02 UTC (Thu)
by maxfragg (guest, #122266)
[Link] (1 responses)

That is what Samsung, Google, and Qualcomm are all using in their high-end chips, and I still don't quite understand how scheduling heuristics should handle such an SoC in an optimal way.
Posted Oct 6, 2022 12:18 UTC (Thu)
by paulj (subscriber, #341)
[Link]
Posted Oct 1, 2022 22:37 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
Are these general purpose chips? Sounds like they're just a mix of Pentium/Arm cores (or whatever the standard x86_64 chip is called nowadays ...)
So if that chip is in a desktop (and is it financially profitable for Intel to make two different chips for server and desktop?) then I'm sure that a lot of chips will spend a lot of time asleep? xosview tells me that a lot of my cores spend a lot of time at 0% utilisation. And it looks like tasks spend a lot of time jumping from core to core. To my untutored glance, it looks like - most of the time - I wouldn't notice in the slightest if most of my cores were disabled.
What makes sense for a dedicated single-purpose chip may make completely different sense when looked at by an accountant asking "is it worth spending that money?".
Cheers,
Wol
Posted Oct 3, 2022 15:06 UTC (Mon)
by excors (subscriber, #95769)
[Link]
Intel advertises the hybrid architecture as good for gaming with lots of low-priority background tasks (like "Discord or antivirus software" - https://www.intel.com/content/www/us/en/gaming/resources/...), which sounds plausible. Modern game consoles have 8-core CPUs, so game engines are probably designed to scale effectively to 8 cores but maybe not much further. They should fit well on the Intel CPU's 8 P-cores, and then Intel stuffs in as many E-cores as they can afford (in terms of power and manufacturing cost) to handle any less-latency-sensitive or more-easily-parallelisable tasks without interrupting the game.
Posted Oct 3, 2022 11:04 UTC (Mon)
by timrichardson (subscriber, #72836)
[Link]
Posted Jun 24, 2023 13:49 UTC (Sat)
by strom1 (guest, #165771)
[Link]
Many, if not most, user applications are single-threaded, so dedicating them to a P-core by default makes sense.
It would make sense to change the affinity to an E-core when the application is not in window focus, unless it has been "nice"d.
There has been a mention of Phi and Knights Landing; the main issue there is memory-access latency. This is really a question of local cache size, or more importantly of where the instructions and data reside. Unless your web-server code and served data fit within the shared L3 cache, adding more cores will not help capacity issues. The HBM incorporated in the current Xeon Max series addresses the starvation issue by filling more of the caches at one time, at the cost of latency.
My suggestion would be to modify the nice/priority system, using odd nice values to indicate an affinity towards E-cores. This change would allow developers to specify both thread priority and affinity.
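A purely hypothetical, user-space sketch of the convention this comment proposes (nothing like it exists in the kernel today) might look like:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical only: no such convention exists in the kernel. */
static bool prefers_ecore(int nice)
{
    return nice % 2 != 0;   /* an odd nice value doubles as an E-core hint */
}

int main(void)
{
    for (int nice = -2; nice <= 2; nice++)
        printf("nice %2d -> %s\n", nice,
               prefers_ecore(nice) ? "E-core hint" : "no hint");
    return 0;
}
```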