Frequency-invariant utilization tracking for x86
Appropriate CPU-frequency selection is important for a couple of reasons. If a CPU's frequency is set too high, it will consume more power than needed, which is a concern regardless of whether that CPU is running in a phone or a data center. Setting the frequency too low, instead, will hurt the performance of the system; in the worst cases, it could cause the available CPU power to be inadequate for the workload and, perhaps, even increase power consumption by keeping system components awake for longer than necessary. So there are plenty of incentives to get this decision right.
One key input into any CPU-frequency algorithm is the amount of load a given CPU is experiencing. A heavily loaded processor must, naturally, be run at a higher frequency than one which is almost idle. "Load" is generally characterized as the percentage of time that a CPU is actually running the workload; a CPU that is running flat-out is 100% loaded. There is one little detail that should be taken into account, though: the current operating frequency of the CPU. A CPU may be running 100% of the time, but if it is at 50% of its maximum frequency, it is not actually 100% loaded. To deal with this, the kernel's load tracking scales the observed load by the frequency the CPU is running at; this scaled value is used to determine how loaded a CPU truly is and how its frequency should change, if at all.
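As a minimal sketch of that scaling (the identifiers and interface here are illustrative, not the kernel's), the idea is to multiply observed busy time by the ratio of current to maximum frequency, using the scheduler's usual 1024-based fixed-point arithmetic:

```c
#include <stdint.h>

/* Illustrative sketch of frequency-invariant load scaling; names are
 * ours, not the kernel's. Busy time observed at a lower frequency is
 * scaled down, since the same work would have taken less time at full
 * speed. */
#define CAPACITY_SCALE 1024	/* fixed-point unit, as in the scheduler */

static uint64_t scale_busy_time(uint64_t busy_us,
				uint64_t cur_khz, uint64_t max_khz)
{
	uint64_t scale = (cur_khz * CAPACITY_SCALE) / max_khz;

	return (busy_us * scale) / CAPACITY_SCALE;
}
```

A CPU that was busy for 10ms at half of its maximum frequency thus counts as only 5ms of full-speed load.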
At least, that's how it is done on some processors. On x86 processors, though, this frequency-invariant load tracking isn't available; that means that frequency governors like schedutil cannot make the best decisions. It is not entirely surprising that performance (as measured in both power consumption and CPU throughput) suffers.
This would seem like an obvious problem to fix. The catch is that, on contemporary Intel processors, it is not actually possible to know the operating frequency of a CPU. The operating system has some broad control over the operating power point of the CPU and can make polite suggestions as to what it should be, but details like actual running frequency are dealt with deep inside the processor package and the kernel is not supposed to worry its pretty little head about them. Without that information, it's not possible to get the true utilization of an x86 processor.
It turns out that there is a way to approximate this information, though; it was suggested by Len Brown back in 2016 but not adopted at that time. There are two model-specific registers (MSRs) on modern x86 CPUs called APERF and MPERF. Both can be thought of as a sort of time-stamp counter that increments as the CPU executes (though Intel stresses that the contents of those registers don't have any specific timekeeping semantics). MPERF increments at a constant rate proportional to the maximum frequency of the processor, while APERF increments at a variable rate proportional to the actual operating frequency. If aperf_change is the change in APERF over a given time period, and mperf_change is the change in MPERF over that same period, then the operating frequency can be approximated as:
operating_freq = (max_freq*aperf_change)/mperf_change;
Reading those MSRs is relatively expensive, so this calculation cannot be made often, but once per clock tick (every 1-10ms) turns out to be enough.
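The per-tick estimate might look like the following sketch; actually reading the MSRs requires the rdmsr instruction (or the /dev/cpu/N/msr interface), so only the arithmetic on two sampled deltas is shown, and the function name is our own:

```c
#include <stdint.h>

/* Hypothetical helper: estimate the operating frequency from APERF and
 * MPERF deltas sampled over the same interval, per the formula above. */
static uint64_t estimate_freq_khz(uint64_t max_khz,
				  uint64_t aperf_change,
				  uint64_t mperf_change)
{
	/* Both counters only advance while the CPU is executing; a CPU
	 * that idled through the whole interval yields no data. */
	if (mperf_change == 0)
		return 0;

	return (max_khz * aperf_change) / mperf_change;
}
```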
There is one other little detail, though, in the form of Intel's "turbo mode". Old-timers will be thinking of the button on the case of a PC that would let it run at a breathtaking 6MHz, but this is different. When the load in a particular package is concentrated on a small number of CPUs, and the others are idle, the busy CPUs can be run at a frequency higher than the alleged maximum. That makes it hard to know what the true utilization of a CPU is, because its capacity will vary depending on what other CPUs in the system are doing.
The patches (posted by Giovanni Gherdovich) implement the above-described method to calculate the operating frequency, and use the turbo frequency attainable by four processors simultaneously as the maximum possible. The result is a reasonable measure of what the utilization of a given processor is. That lets schedutil make better decisions about what the operating frequency of each CPU should actually be.
As it happens, the algorithm used by schedutil to choose a frequency changes a bit when it knows that the utilization numbers it gets are frequency-invariant. Without invariance, schedutil will move frequencies up or down one step at a time. With invariance, it can better calculate what the frequency should be, so it can make multi-step changes. That allows it to respond more quickly to the actual load.
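The invariant calculation in schedutil (see get_next_freq() in kernel/sched/cpufreq_schedutil.c) boils down to scaling the maximum frequency by utilization, with about 25% of headroom added; a simplified sketch:

```c
#include <stdint.h>

/* Simplified version of schedutil's frequency choice with invariant
 * utilization: next = 1.25 * max_freq * util / capacity. The shift by
 * two adds the 25% headroom without a multiplication. */
static uint64_t next_freq_khz(uint64_t max_khz, uint64_t util, uint64_t cap)
{
	uint64_t freq = max_khz + (max_khz >> 2);	/* 1.25 * max */

	return (freq * util) / cap;
}
```

A CPU at 50% utilization (util = 512 of 1024) on a 3GHz-max processor would be asked for about 1.875GHz in a single step, rather than stepping toward that frequency over several ticks.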
The end result, Gherdovich said in the patch changelog, is performance from schedutil that is "closer to, if not on-par with, the powersave governor from the intel_pstate driver/framework". To back that up, the changelog includes a long series of benchmark results; the changelog is longer than the patch itself. While the effects of the change are not all positive, the improvements (in both performance and power usage) tend to be large, while the regressions happen with more focused benchmarks and are relatively small. One of the workloads that benefits the most is kernel compilation, a result that almost guarantees acceptance of the change in its own right.
The curious can read the patch changelog for the benchmark results in their full detail. For the rest of us, what really matters is that the schedutil CPU-frequency governor should perform much better on x86 machines than it has in the past. Whether that will cause distributions to switch to schedutil remains to be seen; that will depend on how well it works on real-world workloads, which often have a disturbing tendency to not behave the way the benchmarks did.
Index entries for this article:
Kernel: Power management/Frequency scaling
Kernel: Scheduler/Load tracking
Kernel: Schedutil governor
Posted Apr 2, 2020 17:03 UTC (Thu) by jcm (subscriber, #18262)
Posted Apr 3, 2020 23:33 UTC (Fri) by flussence (guest, #85566)
On some of the older CPUs I have, there's also a 100mhzsteps feature flag in /proc/cpuinfo; I don't know if it's just not worth the effort or if nobody ever wrote the code to support it. The ACPI cpufreq driver definitely doesn't have 100MHz granularity.
Posted Apr 2, 2020 22:48 UTC (Thu) by valschneider (subscriber, #120198)
Posted Apr 3, 2020 8:24 UTC (Fri) by marcH (subscriber, #57642)
intel_pstate is, confusingly, two very different things in one: hardware P-states (HWP) or not. I think the main advantage claimed for hardware P-states (a.k.a. "Speed Shift") is lower latency and better efficiency thanks to, among other things, very frequent load sampling, far more frequent than anything software could achieve.
It would have been interesting to prove that software can do better than a "hardware accelerated governor"; unfortunately, the benchmarks seem to treat HWP and software intel_pstate as if they were minor variants of the same thing...
I guess these comparisons can always be done later; it doesn't sound like this series removes anything. No big deal.
Posted Apr 3, 2020 10:13 UTC (Fri) by jan.kara (subscriber, #59161)
Posted Apr 3, 2020 17:07 UTC (Fri) by marcH (subscriber, #57642)
Considering that scheduling and frequency governing are all about predicting the future, it makes sense that a stream of randomly spaced packets is one of the toughest nuts to crack.
There are a gazillion throughput benchmarks; we really need more latency benchmarks, especially for something advertised as "Speed Shift".
I googled "Kolivas" for 5 seconds and instantly found this:
https://bravenewgeek.com/everything-you-know-about-latenc...
At $DAYJOB I've seen test reports bragging about video conferences "scoring" 59.7 FPS average over 1h, much better than the previous 57.9 FPS average. Like the user cared. Zero consideration for freezes, drops, out-of-sync audio,...
Posted Apr 3, 2020 10:58 UTC (Fri) by epa (subscriber, #39769)
Then again, there are also reasons why you might want to pick a lower frequency for a less loaded core. I guess the point is that it's hard to summarize power consumption strategies into simple rules of thumb.
Posted Apr 3, 2020 16:50 UTC (Fri) by shemminger (subscriber, #5739)
Posted Apr 4, 2020 11:30 UTC (Sat) by anton (subscriber, #25547)
Posted Apr 18, 2020 9:35 UTC (Sat) by oak (guest, #2786)
Posted Apr 21, 2020 5:18 UTC (Tue) by flussence (guest, #85566)
Posted Apr 3, 2020 17:40 UTC (Fri) by jcm (subscriber, #18262)
Posted Apr 6, 2020 4:12 UTC (Mon) by florianfainelli (subscriber, #61952)
In principle yes, but if you factor in the voltage scaling needed to support that high frequency, the time to move from idle to a higher frequency plus higher voltage may outweigh choosing a slightly lower frequency/voltage point. You need to ensure the voltage has converged to the higher level before you can raise the frequency; otherwise, well, bad things could happen. A lot of this is clearly magic on the hardware side, or rather the magician spent a lot of time in the lab to come up with something adequate. The key for the OS to make good decisions is almost always the same, though: you need a cost function that is as accurate as possible.
Posted Apr 4, 2020 1:57 UTC (Sat) by vstinner (subscriber, #42675)
If an isolated CPU never gets the scheduler interrupt, its workload is ignored when deciding the CPU's P-state. As a consequence, the performance of isolated CPUs relies only on the workload of the non-isolated CPUs. For benchmarking, that means a benchmark can suddenly become 2x faster or slower...
How I found this issue in practice: https://vstinner.github.io/intel-cpus-part2.html
The maintainer of the intel_pstate driver just told me that he never tested isolated CPUs with NOHZ_FULL. Kernel realtime developers told me that NOHZ_FULL cannot work with intel_pstate by design.
Workaround: don't use NOHZ_FULL or use fixed CPU frequency.
Posted Apr 4, 2020 18:42 UTC (Sat) by jcm (subscriber, #18262)
Posted Apr 6, 2020 4:06 UTC (Mon) by florianfainelli (subscriber, #61952)
Posted Apr 8, 2020 14:20 UTC (Wed) by nix (subscriber, #2304)
Posted Apr 8, 2020 22:26 UTC (Wed) by vstinner (subscriber, #42675)
Sorry, my sentence was wrong: the issue is not NOHZ_FULL alone, but NOHZ_FULL plus isolated CPUs. My understanding is that intel_pstate is not compatible with isolated CPUs using NOHZ_FULL.
Posted Apr 9, 2020 0:44 UTC (Thu) by nix (subscriber, #2304)
Posted Apr 18, 2020 18:20 UTC (Sat) by zlynx (guest, #2285)
Of course they also set performance to maximum, so this wouldn't affect power and frequency management.
Anyway, I believe this is more common than you may think.
Posted Jun 2, 2020 1:36 UTC (Tue) by nix (subscriber, #2304)
(What they are presumably actually looking for here is CPU affinity with QEMU to try to keep a roughly 1:1 mapping between QEMU vCPU cores and real CPU cores. There were patches to do it inside QEMU but they never made it upstream and eventually bitrotted: it looks like libvirt does it from outside QEMU by brute force and cgroups.)
https://lwn.net/Articles/720227/
> The MuQSS scheduler has reportedly better Interbench benchmark scores than CFS. However, ultimately, it is hard to quantify "smoothness" and "responsiveness" and turn them into an automated benchmark, so the best way for interested users to evaluate MuQSS is to try it out themselves.
Frequency, heat, and idleness
A heavily loaded processor must, naturally, be run at a higher frequency than one which is almost idle.
Is that really true? If a processor is grinding away at 100% usage then you have to choose a frequency that can be sustained with a full workload. This might be lower than the maximum frequency the processor is capable of, perhaps for heat reasons. On the other hand if it's almost idle but has a few occasional bursts of activity, you may want to get those bursts done with the highest possible clock frequency. During periods of idleness the processor can go into a sleep state, which also allows heat to dissipate ready for the next burst.
I have read this often, but have never seen any empirical support for it. It certainly was not true for the Athlon 64 system I used when I first read it. I have not measured it recently, but voltage scaling still applies as it did at that time, so I expect that the results will be similar.
NOHZ_FULL, isolated CPUs and reading CPU MSR
Kernel realtime developers told me that NOHZ_FULL cannot work with intel_pstate by design.
Really? This configuration is the common case for every distro kernel I checked. Sounds like we need better communication somewhere...