Hybrid scheduling gets more complicated
Brown started by describing Intel's hybrid CPUs which, he said, have a
combination of "Pcores" and "Ecores". The Pcores have higher performance
and also support simultaneous multi-threading (SMT). The Ecores, instead, are
more focused on energy efficiency than performance; Ecores were once known
as "Atom" CPUs. Both types of CPU implement the same instruction set, so a
process can move freely between the two types.
Kernel releases through 4.9 treated all CPUs on these systems as being equal; that meant that any given process would experience variable performance depending on where the scheduler placed it in the system. As of 4.10, Intel's ITMT (standing for "Intel Turbo Boost Max Technology") support caused the scheduler to prefer Pcores over Ecores, all else being equal. That had the effect of putting processes on the faster CPUs when possible, but it also would load all SMT sibling CPUs before falling back to the Ecores, which led to worse performance overall. That has been fixed as of 5.16; an Ecore will now be preferred over an SMT CPU whose sibling is already busy.
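The resulting preference order is easy to picture with a small user-space sketch; the struct cpu type and the helpers below are invented for illustration and bear no relation to the scheduler's actual data structures.

```c
#include <stdbool.h>
#include <stdio.h>

/* Invented types and helpers, purely to illustrate the ordering described
 * above; this is not the scheduler's actual code. */
struct cpu {
    int id;
    bool is_pcore;
    bool sibling_busy;  /* SMT sibling already has work (Pcores only) */
    bool idle;
};

static int preference(const struct cpu *c)
{
    if (c->is_pcore && !c->sibling_busy)
        return 3;   /* fully idle Pcore: best choice */
    if (!c->is_pcore)
        return 2;   /* idle Ecore: preferred as of 5.16 */
    return 1;       /* Pcore sharing a busy core via SMT: last resort */
}

static const struct cpu *pick_cpu(const struct cpu *cpus, int n)
{
    const struct cpu *best = NULL;

    for (int i = 0; i < n; i++) {
        if (!cpus[i].idle)
            continue;
        if (!best || preference(&cpus[i]) > preference(best))
            best = &cpus[i];
    }
    return best;
}

int main(void)
{
    const struct cpu cpus[] = {
        { 0, true,  true,  true  },  /* idle Pcore whose SMT sibling is busy */
        { 1, true,  true,  false },  /* busy Pcore */
        { 2, false, false, true  },  /* idle Ecore */
    };
    const struct cpu *c = pick_cpu(cpus, 3);

    printf("place the new task on CPU %d\n", c ? c->id : -1);  /* CPU 2 */
    return 0;
}
```

Before 5.16, the equivalent of preference() would have rated any Pcore above any Ecore, busy sibling or not.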
Pcores are faster; they not only run at a higher clock frequency, but are also able to get more work done with each clock cycle. As a result, clock frequencies alone are not sufficient to compare the capacity of two CPUs in a system. To address this problem, the hardware is able to provide both performance and efficiency scores for each CPU; these numbers can change at run time if conditions change, Brown said.
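As a rough, user-space illustration of why frequency comparisons mislead, one can scale each CPU's capacity from clock frequency times per-cycle throughput, in the spirit of the scheduler's 1024-based capacity scale; the numbers here are invented.

```c
#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024  /* the scheduler rates the biggest CPU at 1024 */

/* Invented figures: the Ecore clocks nearly as high here, but retires less
 * work per cycle, so its usable capacity is much lower. */
struct cpu_caps {
    const char *name;
    double freq_ghz;        /* sustained clock frequency */
    double work_per_cycle;  /* relative work retired per cycle */
};

static double raw(const struct cpu_caps *c)
{
    return c->freq_ghz * c->work_per_cycle;
}

int main(void)
{
    const struct cpu_caps cpus[] = {
        { "Pcore", 4.5, 1.00 },
        { "Ecore", 3.8, 0.60 },
    };
    double max = raw(&cpus[0]) > raw(&cpus[1]) ? raw(&cpus[0]) : raw(&cpus[1]);

    for (int i = 0; i < 2; i++)
        printf("%s: capacity %4.0f\n", cpus[i].name,
               raw(&cpus[i]) / max * SCHED_CAPACITY_SCALE);
    return 0;
}
```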
The situation is actually a bit more complex than that, though. The performance difference between the CPU types depends on which instructions are being executed at any given time. Programs using the VNNI instructions (which are intended to accelerate machine-learning applications) may see much more advantage from running on a Pcore than those that are doing nothing special. There are four different classes of performance, determined by the types of instructions being executed, and the ratio of Pcore to Ecore performance is different for each.
To schedule such a system optimally, the kernel should use the Pcores to run the processes that will benefit from them the most. Application developers cannot really be expected to know which of the four performance classes best describes their code, and the appropriate class may change over a program's execution in any case, but the CPU certainly knows which types of instructions are being executed at any given time. So each CPU exposes a register indicating which performance class best describes the currently running process. That allows the kernel to assign a class ID and use it in scheduling decisions.
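A minimal sketch of that flow might look like the following; hw_read_current_class() is a stand-in for the hardware classification register described above, and struct task for the relevant bits of task_struct.

```c
#include <stdio.h>

/* Sketch only: hw_read_current_class() stands in for the hardware
 * classification register, and struct task for the relevant bits of
 * task_struct. */
struct task {
    const char *comm;
    int class_id;   /* last performance class reported for this task */
};

static int hw_read_current_class(void)
{
    return 2;       /* pretend the CPU reports class 2 (say, VNNI-heavy code) */
}

/* called periodically, e.g. from the scheduler tick, for the running task */
static void update_task_class(struct task *curr)
{
    curr->class_id = hw_read_current_class();
}

int main(void)
{
    struct task t = { "inference", -1 };

    update_task_class(&t);
    printf("%s is now in performance class %d\n", t.comm, t.class_id);
    return 0;
}
```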
Neri took over to describe the work that has been done to take advantage of
this information. The class ID of each process is stored in its task_struct
structure. The first use of this information is in the idle load
balancer, which is invoked when a CPU has run out of tasks to execute and
looks to see if a task should be pulled from a more heavily loaded CPU
elsewhere in the system. This code can look at the class ID of each
candidate task to find the one that would benefit the most (or suffer the
least) from being moved. This check works at both ends; a task that is
making heavy use of instructions that are best run on its current CPU
should not be moved if possible.
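The selection logic can be sketched as follows; the per-class speedup table and all of the names are invented for illustration, since the real numbers come from the hardware and can change over time.

```c
#include <stdio.h>

#define NR_CLASSES 4

/* Invented ratios: how much faster each class runs on a Pcore than on an
 * Ecore. */
static const double pcore_ratio[NR_CLASSES] = { 1.3, 1.5, 1.8, 2.6 };

struct task {
    const char *comm;
    int class_id;
};

/* positive: the task gains from moving to the destination CPU type */
static double move_gain(const struct task *t, int src_is_pcore, int dst_is_pcore)
{
    double src = src_is_pcore ? pcore_ratio[t->class_id] : 1.0;
    double dst = dst_is_pcore ? pcore_ratio[t->class_id] : 1.0;

    return dst - src;
}

static const struct task *pick_task(const struct task *rq, int nr,
                                    int src_is_pcore, int dst_is_pcore)
{
    const struct task *best = NULL;
    double best_gain = 0;

    for (int i = 0; i < nr; i++) {
        double gain = move_gain(&rq[i], src_is_pcore, dst_is_pcore);

        if (!best || gain > best_gain) {
            best = &rq[i];
            best_gain = gain;
        }
    }
    return best;
}

int main(void)
{
    /* an idle Ecore considers pulling from a loaded Pcore's queue */
    const struct task rq[] = { { "vnni-worker", 3 }, { "logger", 0 } };
    const struct task *t = pick_task(rq, 2, 1 /* src: Pcore */, 0 /* dst: Ecore */);

    printf("pull %s\n", t->comm);   /* "logger": it suffers the least on an Ecore */
    return 0;
}
```

In this example the idle Ecore pulls the logging task and leaves the VNNI-heavy worker where it benefits most.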
An audience member asked whether the class ID of a running process can be adjusted from user space. Brown answered that this capability exists for debugging purposes, but that nobody had thought about making it available as a supported feature.
Neri continued that the kernel's NUMA-balancing code can also look at the class IDs and exchange tasks between nodes if that would lead to better system performance. Something similar could also be done with busy load balancing, which tries to even out the load across a busy system. This idea made some developers nervous; it would be easy to break load balancing in ways that create performance regressions that don't come to light until long afterward. Neri emphasized that the class ID would only be used in load-balancing decisions if the existing heuristics led to a tie between two options.
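The shape of that tie-breaking rule, again as an illustration rather than actual kernel code:

```c
#include <stdio.h>

/* Illustration only: class information is consulted solely when the
 * existing metric cannot decide between two choices. */
static int prefer_first(long metric_a, long metric_b, int class_a, int class_b)
{
    if (metric_a != metric_b)
        return metric_a > metric_b;   /* existing heuristics decide */
    return class_a > class_b;         /* class ID only breaks the tie */
}

int main(void)
{
    printf("%d\n", prefer_first(100, 100, 3, 1));  /* tie: the class decides -> 1 */
    printf("%d\n", prefer_first(100, 120, 3, 1));  /* no tie: the metric decides -> 0 */
    return 0;
}
```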
The final moments of the session were dedicated to the problem of scheduling on Intel's Alder Lake CPUs (which started shipping earlier this year). Specifically, the kernel's energy-aware scheduling heuristics don't work well on those CPUs. A number of features present there complicate the energy picture; these include SMT, Intel's "turbo boost" mode, and the CPU's internal power-management mechanisms. For many workloads, running on an ostensibly more power-hungry Pcore can be more efficient than using an Ecore. Time for discussion of the problem was lacking, though, and the session came to a close.
[Thanks to LWN subscribers for supporting my travel to this event.]
| Index entries for this article | |
|---|---|
| Kernel | Scheduler |
| Conference | Linux Plumbers Conference/2022 |
Posted Sep 30, 2022 21:18 UTC (Fri)
by jhoblitt (subscriber, #77733)
[Link] (32 responses)

Apparently, four E-cores take up the same physical space as one P-core on Alder Lake.
Posted Sep 30, 2022 21:37 UTC (Fri)
by james (subscriber, #1325)
[Link]
Posted Sep 30, 2022 21:50 UTC (Fri)
by andromeda (guest, #138427)
[Link] (30 responses)
[1] https://en.wikichip.org/wiki/intel/microarchitectures/ald...
[2] https://en.wikipedia.org/wiki/7_nm_process#7_nm_process_n...
Posted Sep 30, 2022 21:58 UTC (Fri)
by jhoblitt (subscriber, #77733)
[Link] (29 responses)
Posted Oct 1, 2022 3:42 UTC (Sat)
by roc (subscriber, #30627)
[Link] (14 responses)
Posted Oct 1, 2022 3:57 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link] (13 responses)

Where the latter could retire twice as many instructions within the same thermal envelope (assuming memory bandwidth isn't a bottleneck). Intel hasn't offered such a beast, so I'm guessing the scaling isn't linear.
Posted Oct 1, 2022 5:44 UTC (Sat)
by roc (subscriber, #30627)
[Link] (8 responses)
In theory there could be a non-uniform workload that performs really well with 64 p-cores and 256 e-cores *all sharing memory*, but that sounds relatively unusual.
Posted Oct 1, 2022 7:40 UTC (Sat)
by Sesse (subscriber, #53779)
[Link] (5 responses)
Posted Oct 1, 2022 12:41 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link] (1 responses)
Posted Oct 1, 2022 20:36 UTC (Sat)
by flussence (guest, #85566)
[Link]
Posted Oct 1, 2022 21:43 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
It went down like a brick. It was barely better than Xeons and not really any cheaper, while requiring a completely new architecture.
Posted Oct 3, 2022 9:24 UTC (Mon)
by paulj (subscriber, #341)
[Link] (1 responses)
UltraSPARC T1 is probably more like it. A whole bunch of (relatively) classic, simple, in-order, 5-stage RISC SPARC CPUs, behind a thread-selector stage to rapidly switch threads and a huge memory controller. Low instructions-per-thread, but - if you loaded it up with an embarrassingly parallel IO workload (i.e. uncomplicated HTTP serving) - very high aggregate IPC.
[In the initial versions, they neglected to add FPUs (just 1 FPU in the first T1, shared across all cores) and then - once away from Sun's in-house benchmarks and into the real world - they realised that a lot of real-world applications depend on scripting languages that use FP by default, even for integer math. ;) ]
Posted Oct 3, 2022 11:15 UTC (Mon)
by paulj (subscriber, #341)
[Link]

s/ SPARC/UltraSPARC/g
Posted Oct 1, 2022 11:31 UTC (Sat)
by tlamp (subscriber, #108540)
[Link] (1 responses)
For some server workloads it could also be interesting to still have both, for example 128 E-cores and 32 P-cores.
For example, if I've got an HTTP API with various endpoints, some of them quite compute-intensive and others not, a mixed setup may have higher throughput at a lower total power consumption.
Tasks that, e.g., do lots of number crunching (or profit from advanced CPU core capabilities) can be moved to the P-cores, while the relatively big number of E-cores can, in parallel, serve a massive amount of the "cheaper" responses using fewer compute resources and potentially avoiding latency spikes compared to homogeneous setups.
For very specialized workloads it may be better to handle such things explicitly in the application, or in user space via pinning, but I've seen quite a few services where such mixed request workloads happen, and it would be nice if the OS scheduler handled that better in a general way, even if not 100% optimally.
Posted Oct 2, 2022 8:40 UTC (Sun)
by roc (subscriber, #30627)
[Link]
Posted Oct 3, 2022 12:51 UTC (Mon)
by rbanffy (guest, #103898)
[Link] (3 responses)
They did - it was called Xeon Phi and making it run at its theoretical peak performance was difficult - it was better suited for HPC workloads (it borrowed heavily from their Larrabee chip) than general-purpose high-throughput tasks such as web serving. I think it's a shame it ended up abandoned.
Posted Oct 3, 2022 19:13 UTC (Mon)
by Paf (subscriber, #91811)
[Link] (2 responses)
Posted Oct 3, 2022 20:45 UTC (Mon)
by jhoblitt (subscriber, #77733)
[Link] (1 responses)
Phi as an add-on accelerator had all of the same problems as GPU development (e.g. network I/O), without mature tooling (CUDA was a monopoly at the time), and without a cost/performance advantage over GPUs (or so I've heard). However, for workloads like batch HPC/HTC, the improved flops/watt could have been attractive, and it would have saved the labor investment required to deal with GPU development.
Posted Oct 3, 2022 23:46 UTC (Mon)
by rbanffy (guest, #103898)
[Link]
Such a shame. I really wanted one.
The first generation (it was 4 threads per core, I think, each running at a quarter speed) was very limited and extracting reasonable performance from it was not trivial. Subsequent generations were much better, but, by then, a lot of the heavy lifting was available on GPUs and the "friendly" programming of the Phi was no longer a huge advantage (if you can hide the GPU behind a library, it's not significantly uglier than using intrinsics to directly play with AVX-512).
Again, a huge shame. I hope Intel makes something with lots of E cores, even if for no other reason than to teach programmers to make more parallel code, because clock x IPCs won't get higher as quickly as core counts.
Posted Oct 1, 2022 6:23 UTC (Sat)
by epa (subscriber, #39769)
[Link] (12 responses)
Posted Oct 1, 2022 7:38 UTC (Sat)
by drago01 (subscriber, #50715)
[Link] (10 responses)
Posted Oct 1, 2022 8:59 UTC (Sat)
by ballombe (subscriber, #9523)
[Link] (9 responses)

Why write AVX512 code when almost nobody can run it and when, for half of those who can, it is slower than AVX2?
Posted Oct 1, 2022 10:24 UTC (Sat)
by drago01 (subscriber, #50715)
[Link] (7 responses)

If your workload benefits from AVX-512 it will be significantly faster than AVX2.
Posted Oct 1, 2022 11:31 UTC (Sat)
by atnot (subscriber, #124910)
[Link] (6 responses)
Posted Oct 2, 2022 7:30 UTC (Sun)
by drago01 (subscriber, #50715)
[Link] (5 responses)
Posted Oct 2, 2022 12:06 UTC (Sun)
by khim (subscriber, #9252)
[Link]
The biggest problem there is the fact that the decision to use (or not use) SSE, AVX, or AVX-512 is local (you pick these at the level of tiny, elementary functions) while the question of whether AVX-512 is beneficial or not is global. Essentially the same dilemma that killed the Itanic, just not as acute.
Posted Oct 3, 2022 9:32 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (3 responses)
The difficulty comes in when my workload is scheduled on a single server with other workloads. The right decision for my workload if on a machine by itself is AVX-512 at the lower clocks; however, depending on what the scheduler does, the right decision might become AVX2 if other workloads are more important than mine, and are adversely affected by the core doing AVX-512 downclocking.
This is the problem with using local state ("does this OS thread make use of AVX-512") to drive a global decision ("what clock speed should this core run at"). The correct answer depends not only on my workload, but also on all other workloads sharing this CPU core - which is fine for HPC type workloads, where there are no other workloads sharing a CPU core, but more of a problem with general deployment of AVX-512.
As a side note, as Intel moves on with process from the 14nm of original AVX-512 CPUs, the downclock becomes less severe, and it's nearly non-existent on the latest designs. This, to me, suggests that the downclock is a consequence of backporting AVX-512 to Skylake on 14nm, and thus will become a historic artefact over time.
Posted Oct 3, 2022 15:15 UTC (Mon)
by drago01 (subscriber, #50715)
[Link] (2 responses)
But given how CPUs work nowadays that's not entirely true either, because a lighter workload will result in higher clocks and vice versa. CPUs try to maximize performance within the power budget.
Clocks don't matter much though; what matters is the performance you are getting. And if your workload benefits from wide vectors it will offset any clock changes.
Posted Oct 3, 2022 15:33 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
The critical difference is that with SKX, the maximum permitted clock assuming that thermals allowed was massively reduced for "heavy" AVX-512, because it caused thermal hot-spots on the chip that weren't properly accounted for by "normal" thermal monitoring. With ICL and with RKL there's no longer a huge limit - instead of the SKX thing (where a chip could drop from 3.0 GHz "base" to 2.8 GHz "max turbo" if you used AVX-512), you now can always sustain the same "base", but the max turbo is reduced by 100 MHz or so.
Posted Oct 13, 2022 13:54 UTC (Thu)
by roblucid (guest, #48964)
[Link]
Posted Oct 13, 2022 13:48 UTC (Thu)
by roblucid (guest, #48964)
[Link]
Dr Ian Cutress's program uses AVX2 or AVX512 and it had a 6x speed up on Zen3.
Posted Oct 1, 2022 7:42 UTC (Sat)
by Sesse (subscriber, #53779)
[Link]
Posted Oct 2, 2022 0:03 UTC (Sun)
by neggles (subscriber, #153254)
[Link]
Intel are of the opinion that hybrid serves no purpose in servers, as you're typically not going to have a particularly non-uniform workload on a server; they tend to run at fairly steady-state load levels, with a lot of simultaneous tasks, so you're not generally going to have unloaded/unused cores. Heterogeneous compute isn't super useful if you never have an opportunity to clock/power gate some P-cores; AMD and ARM tend to agree with them on this point, too.
That said, Intel do have a pure E-core Xeon CPU due out in 2024, probably with some frankly ludicrous number of cores, targeted at the mobile-gaming market. Quite a few mobile games for Android run entirely in the cloud, with a dedicated core per simultaneous user, streaming the result down to the user's device; that prevents cheating, saves device battery, and means you get the same experience regardless of your device's performance. (Personally I don't like it, but that's what they're doing...) These CPUs are aimed at that market.
Posted Oct 1, 2022 12:21 UTC (Sat)
by Karellen (subscriber, #67644)
[Link] (10 responses)
> The Pcores have higher performance and also support simultaneous multi-threading (SMT). The Ecores, instead, are more focused on energy efficiency than performance; [...] As of 4.10, Intel's ITMT (standing for "Intel Turbo Boost Max Technology") support caused the scheduler to prefer Pcores over Ecores, all else being equal. That had the effect of putting processes on the faster CPUs when possible, but it also would load all SMT sibling CPUs before falling back to the Ecores,

Isn't that backwards? Wouldn't it be better to generally prefer Ecores while the load is low enough that the Ecores can handle the workload, and only start waking up and putting processes on the Pcores once the demand for CPU time rises high enough that they can't? What's the benefit of using the low-power cores as the fall-backs?
Posted Oct 1, 2022 13:30 UTC (Sat)
by epa (subscriber, #39769)
[Link] (2 responses)
Even today there are applications, like a certain Lisp-based text editor, which are single threaded, interactive, and often need to grind the CPU before they can respond to the user.
Or when you say ‘until the demand for CPU time rises…’, are you including this scenario in that condition?
Posted Oct 3, 2022 12:31 UTC (Mon)
by Karellen (subscriber, #67644)
[Link]
When I said "until the demand for CPU time rises", I was considering the equivalent of "when load average climbs above the number of awake cores" for a suitably short-duration load average, e.g. 2s or less. If that helps.
Anyway, I always thought that higher-performance cores had "deeper" sleep states, and took longer to wake from them. So even for latency-based workloads, if the amount of work to be done was relatively small, a low-power core might still complete the task sooner? Am I mistaken there?
But also, it still seems weird on the kind of machine you're discussing, where you want work done as quickly as possible. You're always scheduling work on the fastest core available, but once you get to the point where 16 high-performance cores are all in use, and you need even more work done, that is the point at which the system decides "Well, now is the time to wake up those energy-saving, low-power, low-performance cores. They'll be a great help here!" I don't see how that strategy makes sense.
Posted Oct 6, 2022 10:50 UTC (Thu)
by maxfragg (guest, #122266)
[Link]

The scheduler by default tries to keep everything on the LITTLE cores and only migrates tasks to the big cores if they repeatedly use up their timeslice before blocking.
The result is that, as a user, you get a noticeably slow period if you interact with the system after some idle time, which really is a bad user experience.
Say you are running a web browser (and JavaScript usually suffers quite badly on simple in-order cores); after idling, it takes around 3s until all the relevant threads have been migrated to the big cores again and your website is actually scrolling smoothly.
Not what I would call an ideal user experience, especially if the system isn't so strictly confined by power consumption.
Some applications shouldn't get periodically shifted down to slow cores, simply for the user's sake, while other things, which might be janitor tasks from the operating system, probably never should run on the big cores.
Posted Oct 1, 2022 15:30 UTC (Sat)
by atnot (subscriber, #124910)
[Link] (5 responses)
Posted Oct 1, 2022 20:03 UTC (Sat)
by smurf (subscriber, #17840)
[Link] (4 responses)
If not, then (as the main text states) you need to distribute the load between E- and P-cores somewhat intelligently.
Posted Oct 1, 2022 20:43 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link] (2 responses)
Posted Oct 6, 2022 12:02 UTC (Thu)
by maxfragg (guest, #122266)
[Link] (1 responses)

That is what Samsung, Google, and Qualcomm are all using in their high-end chips, and I still don't quite understand how scheduling heuristics should handle such an SoC in an optimal way.
Posted Oct 6, 2022 12:18 UTC (Thu)
by paulj (subscriber, #341)
[Link]
Posted Oct 1, 2022 22:37 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
Are these general purpose chips? Sounds like they're just a mix of Pentium/Arm cores (or whatever the standard x86_64 chip is called nowadays ...)
So if that chip is in a desktop (and is it financially profitable for Intel to make two different chips for server and desktop?) then I'm sure that a lot of chips will spend a lot of time asleep? xosview tells me that a lot of my cores spend a lot of time at 0% utilisation. And it looks like tasks spend a lot of time jumping from core to core. To my untutored glance, it looks like - most of the time - I wouldn't notice in the slightest if most of my cores were disabled.
What makes sense for a dedicated single-purpose chip may make completely different sense when looked at by an accountant asking "is it worth spending that money?".
Cheers,
Wol
Posted Oct 3, 2022 15:06 UTC (Mon)
by excors (subscriber, #95769)
[Link]
Intel advertises the hybrid architecture as good for gaming with lots of low-priority background tasks (like "Discord or antivirus software" - https://www.intel.com/content/www/us/en/gaming/resources/...), which sounds plausible. Modern game consoles have 8-core CPUs, so game engines are probably designed to scale effectively to 8 cores but maybe not much further. They should fit well on the Intel CPU's 8 P-cores, and then Intel stuffs in as many E-cores as they can afford (in terms of power and manufacturing cost) to handle any less-latency-sensitive or more-easily-parallelisable tasks without interrupting the game.
Posted Oct 3, 2022 11:04 UTC (Mon)
by timrichardson (subscriber, #72836)
[Link]
Posted Jun 24, 2023 13:49 UTC (Sat)
by strom1 (guest, #165771)
[Link]
Many, if not most, user applications are single-threaded, so dedicating them to a P-core by default makes sense.
It would make sense to change the affinity to an E-core when the application is not in window focus, unless it has been "nice"d.
There has been a mention of Phi and Knights Landing; the main issue there is memory-access latency. This is really a question of local cache size, or more importantly of where the instructions and data reside. Unless your web-server code and served data fit within the shared L3 cache, adding more cores will not help capacity issues. The HBM incorporated in the current Xeon Max series addresses the starvation issue by filling more of the caches at one time, at the cost of latency.
My suggestion would be to modify the nice/priority system, using odd nice values to indicate an affinity towards E-cores. This change would allow developers to specify both thread priority and affinity.
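A purely hypothetical, user-space sketch of the convention this comment proposes (nothing like it exists in the kernel today) might look like:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical only: no such convention exists in the kernel. */
static bool prefers_ecore(int nice)
{
    return nice % 2 != 0;   /* an odd nice value doubles as an E-core hint */
}

int main(void)
{
    for (int nice = -2; nice <= 2; nice++)
        printf("nice %2d -> %s\n", nice,
               prefers_ecore(nice) ? "E-core hint" : "no hint");
    return 0;
}
```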