Frequency-invariant utilization tracking for x86

By Jonathan Corbet
April 2, 2020
The kernel provides a number of CPU-frequency governors to choose from; by most accounts, the most effective of those is "schedutil", which was merged for the 4.7 kernel in 2016. While schedutil is used on mobile devices, it still doesn't see much use on x86 desktops; the intel_pstate governor is generally seen giving better results on those processors as a result of the secret knowledge embodied therein. A set of patches merged for 5.7, though, gives schedutil a better idea of what the true utilization of x86 processors is and, as a result, greatly improves its effectiveness.

Appropriate CPU-frequency selection is important for a couple of reasons. If a CPU's frequency is set too high, it will consume more power than needed, which is a concern regardless of whether that CPU is running in a phone or a data center. Setting the frequency too low, instead, will hurt the performance of the system; in the worst cases, it could cause the available CPU power to be inadequate for the workload and, perhaps, even increase power consumption by keeping system components awake for longer than necessary. So there are plenty of incentives to get this decision right.

One key input into any CPU-frequency algorithm is the amount of load a given CPU is experiencing. A heavily loaded processor must, naturally, be run at a higher frequency than one which is almost idle. "Load" is generally characterized as the percentage of time that a CPU is actually running the workload; a CPU that is running flat-out is 100% loaded. There is one little detail that should be taken into account, though: the current operating frequency of the CPU. A CPU may be running 100% of the time, but if it is at 50% of its maximum frequency, it is not actually 100% loaded. To deal with this, the kernel's load tracking scales the observed load by the frequency the CPU is running at; this scaled value is used to determine how loaded a CPU truly is and how its frequency should change, if at all.
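As a rough sketch of that scaling (illustrative names, not the kernel's actual load-tracking code, though SCHED_CAPACITY_SCALE is the scheduler's real fixed-point unit):

    #include <stdint.h>

    #define SCHED_CAPACITY_SCALE 1024  /* the scheduler's fixed-point unit */

    /*
     * Frequency-invariant utilization: a CPU that is busy 100% of the
     * time at half of its maximum frequency is only 50% utilized.
     * raw_util is the observed busy fraction in [0, SCHED_CAPACITY_SCALE].
     */
    static uint64_t scale_util(uint64_t raw_util,
                               uint64_t freq_curr, uint64_t freq_max)
    {
        return raw_util * freq_curr / freq_max;
    }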

At least, that's how it is done on some processors. On x86 processors, though, this frequency-invariant load tracking isn't available; that means that frequency governors like schedutil cannot make the best decisions. It is not entirely surprising that performance (as measured in both power consumption and CPU throughput) suffers.

This would seem like an obvious problem to fix. The catch is that, on contemporary Intel processors, it is not actually possible to know the operating frequency of a CPU. The operating system has some broad control over the operating power point of the CPU and can make polite suggestions as to what it should be, but details like actual running frequency are dealt with deep inside the processor package and the kernel is not supposed to worry its pretty little head about them. Without that information, it's not possible to get the true utilization of an x86 processor.

It turns out that there is a way to approximate this information, though; it was suggested by Len Brown back in 2016 but not adopted at that time. There are two model-specific registers (MSRs) on modern x86 CPUs called APERF and MPERF. Both can be thought of as a sort of time-stamp counter that increments as the CPU executes (though Intel stresses that the contents of those registers don't have any specific timekeeping semantics). MPERF increments at a constant rate proportional to the maximum frequency of the processor, while APERF increments at a variable rate proportional to the actual operating frequency. If aperf_change is the change in APERF over a given time period, and mperf_change is the change in MPERF over that same period, then the operating frequency can be approximated as:

    operating_freq = (max_freq*aperf_change)/mperf_change;

Reading those MSRs is relatively expensive, so this calculation cannot be made often, but once per clock tick (every 1-10ms) turns out to be enough.
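In sketch form (the real 5.7 implementation lives in the x86 architecture code and is more careful about per-CPU state and edge cases; rdmsr() here is an assumed stand-in for the kernel's MSR accessors, while the MSR numbers are the documented ones), each tick does something like:

    #include <stdint.h>

    #define MSR_IA32_MPERF 0xe7  /* counts at a fixed rate */
    #define MSR_IA32_APERF 0xe8  /* counts at the actual operating rate */

    extern uint64_t rdmsr(uint32_t msr);  /* assumed MSR-read primitive */

    static uint64_t prev_aperf, prev_mperf;

    /* Called once per scheduler tick (every 1-10ms). */
    static uint64_t estimate_freq(uint64_t max_freq)
    {
        uint64_t aperf = rdmsr(MSR_IA32_APERF);
        uint64_t mperf = rdmsr(MSR_IA32_MPERF);
        uint64_t aperf_change = aperf - prev_aperf;
        uint64_t mperf_change = mperf - prev_mperf;

        prev_aperf = aperf;
        prev_mperf = mperf;

        if (mperf_change == 0)  /* MPERF only counts while the CPU runs */
            return 0;

        return max_freq * aperf_change / mperf_change;
    }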

There is one other little detail, though, in the form of Intel's "turbo mode". Old-timers will be thinking of the button on the case of a PC that would let it run at a breathtaking 6MHz, but this is different. When the load in a particular package is concentrated on a small number of CPUs, and the others are idle, the busy CPUs can be run at a frequency higher than the alleged maximum. That makes it hard to know what the true utilization of a CPU is, because its capacity will vary depending on what other CPUs in the system are doing.

The patches (posted by Giovanni Gherdovich) implement the above-mentioned method to calculate the operating frequency, and use the turbo frequency attainable by four processors simultaneously as the maximum possible frequency. The result is a reasonable measure of what the utilization of a given processor is. That lets schedutil make better decisions about what the operating frequency of each CPU should actually be.
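In rough terms (a simplified sketch of the computation, in the scheduler's 1024-based fixed-point units; the clamp reflects the fact that turbo can briefly push a CPU past the chosen "maximum"):

    #include <stdint.h>

    #define SCHED_CAPACITY_SHIFT 10
    #define SCHED_CAPACITY_SCALE (1 << SCHED_CAPACITY_SHIFT)

    /*
     * Ratio of the four-core turbo frequency to the base frequency,
     * in 1/1024 units; determined once at boot.
     */
    static uint64_t max_freq_ratio;

    static void init_freq_ratio(uint64_t base_freq, uint64_t turbo_freq_4c)
    {
        max_freq_ratio = turbo_freq_4c * SCHED_CAPACITY_SCALE / base_freq;
    }

    /*
     * Turn per-tick APERF/MPERF deltas into a 0..1024 frequency scale,
     * where 1024 means "running at the four-core turbo frequency".
     */
    static uint64_t freq_scale(uint64_t acnt, uint64_t mcnt)
    {
        uint64_t scale;

        if (mcnt == 0)  /* nothing counted; assume full capacity */
            return SCHED_CAPACITY_SCALE;

        scale = (acnt << (2 * SCHED_CAPACITY_SHIFT)) /
                (mcnt * max_freq_ratio);
        return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
    }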

As it happens, the algorithm used by schedutil to choose a frequency changes a bit when it knows that the utilization numbers it gets are frequency-invariant. Without invariance, schedutil will move frequencies up or down one step at a time. With invariance, it can better calculate what the frequency should be, so it can make multi-step changes. That allows it to respond more quickly to the actual load.
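A simplified sketch of that choice, modeled on schedutil's get_next_freq() (the 25% factor provides headroom above the measured utilization; util and max are in the scheduler's capacity units):

    #include <stdbool.h>
    #include <stdint.h>

    static unsigned int next_freq(bool invariant, unsigned int cur_freq,
                                  unsigned int max_freq,
                                  uint64_t util, uint64_t max)
    {
        /*
         * With invariant utilization the target frequency can be
         * computed from the maximum in a single jump; without it,
         * utilization is relative to the current frequency, so the
         * result can only move gradually from where it is now.
         */
        uint64_t freq = invariant ? max_freq : cur_freq;

        /* 1.25 * freq * util / max */
        return (unsigned int)((freq + (freq >> 2)) * util / max);
    }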

The end result, Gherdovich said in the patch changelog, is performance from schedutil that is "closer to, if not on-par with, the powersave governor from the intel_pstate driver/framework". To back that up, the changelog includes a long series of benchmark results; the changelog is longer than the patch itself. While the effects of the change are not all positive, the improvements (in both performance and power usage) tend to be large while the regressions happen with more focused benchmarks and are relatively small. One of the workloads that benefits the most is kernel compilation, a result that almost guarantees acceptance of the change in its own right.

The curious can read the patch changelog for the benchmark results in their full detail. For the rest of us, what really matters is that the schedutil CPU-frequency governor should perform much better on x86 machines than it has in the past. Whether that will cause distributions to switch to schedutil remains to be seen; that will depend on how well it works on real-world workloads, which often have a disturbing tendency to not behave the way the benchmarks did.

Index entries for this article
Kernel: Power management/Frequency scaling
Kernel: Scheduler/Load tracking
Kernel: Schedutil governor



Frequency-invariant utilization tracking for x86

Posted Apr 2, 2020 17:03 UTC (Thu) by jcm (subscriber, #18262) [Link] (1 responses)

There is a notion, which has existed for too long, that frequency control is the only thing to be concerned about. In reality, there's a lot more an OS/platform can collaborate on, from frequency to thermal pressure. On Arm servers, the CPPC extension to ACPI allows some of this to be enabled in a generic way, and AMD extended this spec in their Zen2+ designs as well. I would like to see generic CPPC/schedutil integration, and adoption of such specs beyond just Arm platforms, to include thermal pressure and many other aspects. To that end, I'm planning for us to do some work on this and propose standard solutions in the months ahead. I'm very keen to hear from others who would like to discuss gaps in the current generic standards and how to plug those with a cross-industry solution.

Frequency-invariant utilization tracking for x86

Posted Apr 3, 2020 23:33 UTC (Fri) by flussence (guest, #85566) [Link]

There was a patch[1] a few months ago, submitted by an AMD employee, trying to add CPPC2 awareness to the ondemand governor. It was rejected (rightly so IMO) for ignoring schedutil entirely and adding far too many knobs; the existing AMD CPB subdriver (which also ignores schedutil) only needs one boolean.

On some of the older CPUs I have, there's also a 100mhzsteps feature flag in /proc/cpuinfo; I don't know if it's just not worth the effort or if nobody ever wrote the code to support it. The ACPI cpufreq driver definitely doesn't have 100MHz granularity.

[1]: https://lkml.org/lkml/2019/7/10/682

Frequency-invariant utilization tracking for ~x86~ arm64!

Posted Apr 2, 2020 22:48 UTC (Thu) by valschneider (subscriber, #120198) [Link]

We've had frequency invariance on arm64 for some time now, mostly because we don't have to deal with the turbo shenanigans so we can just do the invariance based on curr freq / max freq (from cpufreq's PoV). That said, we aren't safe from thermal throttling and other firmware tricks, so counters are preferable - and we should get just that in 5.7! See https://git.kernel.org/pub/scm/linux/kernel/git/arm64/lin..., it's the same concept as the APERF/MPERF stuff.

Frequency-invariant utilization tracking for x86

Posted Apr 3, 2020 8:24 UTC (Fri) by marcH (subscriber, #57642) [Link] (2 responses)

> Reading those MSRs is relatively expensive, so this calculation cannot be made often, but once per clock tick (every 1-10ms) turns out to be enough.

intel_pstate is confusingly two very different things in one: HardWare P-states vs. not. I think the main advantage claimed for HardWare P-states (a.k.a. "Speed Shift") is lower latency and better efficiency thanks to, among other things, very frequent load sampling, much more frequent than anything software could achieve.

It would have been interesting to prove that software can do better than a "hardware-accelerated governor"; unfortunately, the benchmarks seem to treat HWP and software intel_pstate as if they were minor variants of the same thing...

I guess these comparisons can always be done later; it doesn't sound like this series removes anything. No big deal.

Frequency-invariant utilization tracking for x86

Posted Apr 3, 2020 10:13 UTC (Fri) by jan.kara (subscriber, #59161) [Link] (1 responses)

It depends on what you mean by "software can do better than "hardware accelerated governor"" - e.g. for workloads that are IO bound we have found some cases where HWP was worse than intel_pstate because it never considered CPU load to be high enough to bump up the frequency and so interrupt latency suffered...

Frequency-invariant utilization tracking for x86

Posted Apr 3, 2020 17:07 UTC (Fri) by marcH (subscriber, #57642) [Link]

Fascinating, thanks!

Considering that scheduling and frequency governing are all about predicting the future, it makes sense that a stream of randomly spaced packets is one of the toughest nuts to crack.

There are a gazillion throughput benchmarks; we really need more latency benchmarks - especially for something advertised as "Speed Shift".

I googled "Kolivas" for five seconds and instantly found this:
https://lwn.net/Articles/720227/
> The MuQSS scheduler has reportedly better Interbench benchmark scores than CFS. However, ultimately, it is hard to quantify "smoothness" and "responsiveness" and turn them into an automated benchmark, so the best way for interested users to evaluate MuQSS is to try it out themselves.

At $DAYJOB I've seen test reports bragging about video conferences "scoring" 59.7 FPS average over 1h, much better than the previous 57.9 FPS average. Like the user cared. Zero consideration for freezes, drops, out of sync audio,...

https://bravenewgeek.com/everything-you-know-about-latenc...

Frequency, heat, and idleness

Posted Apr 3, 2020 10:58 UTC (Fri) by epa (subscriber, #39769) [Link] (6 responses)

> A heavily loaded processor must, naturally, be run at a higher frequency than one which is almost idle.
Is that really true? If a processor is grinding away at 100% usage then you have to choose a frequency that can be sustained with a full workload. This might be lower than the maximum frequency the processor is capable of, perhaps for heat reasons. On the other hand if it's almost idle but has a few occasional bursts of activity, you may want to get those bursts done with the highest possible clock frequency. During periods of idleness the processor can go into a sleep state, which also allows heat to dissipate ready for the next burst.

Then again, there are also reasons why you might want to pick a lower frequency for a less loaded core. I guess the point is that it's hard to summarize power consumption strategies into simple rules of thumb.

Frequency, heat, and idleness

Posted Apr 3, 2020 16:50 UTC (Fri) by shemminger (subscriber, #5739) [Link] (3 responses)

For many workloads, go fast and then get to idle should be the optimal strategy.

Frequency, heat, and idleness

Posted Apr 4, 2020 11:30 UTC (Sat) by anton (subscriber, #25547) [Link] (2 responses)

I have read this often, but have never seen any empirical support for it. It certainly was not true for the Athlon 64 system I used when I first read it. I have not measured it recently, but voltage scaling still applies as it did at that time, so I expect that the results will be similar.

Frequency, heat, and idleness

Posted Apr 18, 2020 9:35 UTC (Sat) by oak (guest, #2786) [Link] (1 responses)

The power-usage difference between suspend and the lowest frequency can be significantly larger than that between the lowest and highest frequencies. But suspend transition times are longer, especially with old (x86) processors, which also didn't have as many suspend states.

Frequency, heat, and idleness

Posted Apr 21, 2020 5:18 UTC (Tue) by flussence (guest, #85566) [Link]

There are a few old x86 chips where the lower C-states exist but have to be avoided entirely to avoid nasty bugs, which doesn't help matters.

Frequency, heat, and idleness

Posted Apr 3, 2020 17:40 UTC (Fri) by jcm (subscriber, #18262) [Link]

It was only strictly true in the 1990s, with in-order cores that behaved somewhat linearly as a function of frequency. Today, there's a LOT of room for innovation. I'm currently driving some work that should help in this area, but we'll need greater interaction between the host OS(PM) and the platform to coordinate.

Frequency, heat, and idleness

Posted Apr 6, 2020 4:12 UTC (Mon) by florianfainelli (subscriber, #61952) [Link]

> On the other hand if it's almost idle but has a few occasional bursts of activity, you may want to get those bursts done with the highest possible clock frequency.

In principle yes, but if you factor in the voltage scaling needed to support that high frequency, the time to move from idle to a higher frequency + voltage may outweigh choosing a slightly lower frequency + voltage point. You need to ensure you have converged to the higher voltage before you can raise the frequency; otherwise, well, bad things could happen. A lot of this is clearly magic on the hardware side, or rather the magician spent a lot of time in the lab to come up with something adequate. The key for the OS to make good decisions is almost always the same though: you need a cost function that is as accurate as possible.

NOHZ_FULL, isolated CPUs and reading CPU MSR

Posted Apr 4, 2020 1:57 UTC (Sat) by vstinner (subscriber, #42675) [Link] (7 responses)

Using NOHZ_FULL and isolated CPUs reduces system jitter when running benchmarks. But it is incompatible with CPU-frequency drivers that rely on the scheduler tick callback to read CPU MSRs frequently.

If an isolated CPU never gets the scheduler interrupt, its workload is ignored when deciding the CPU's P-state. As a consequence, the performance of isolated CPUs depends only on the workload of the non-isolated CPUs. For a benchmark, it means that a benchmark can suddenly become 2x faster or slower...

How I found this issue in practice: https://vstinner.github.io/intel-cpus-part2.html

The maintainer of the intel_pstate driver just told me that he never tested isolated CPUs with NOHZ_FULL. Kernel realtime developers told me that NOHZ_FULL cannot work with intel_pstate by design.

Workaround: don't use NOHZ_FULL or use fixed CPU frequency.
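For example (an illustrative recipe, not from the original post; the CPU list and frequency are hypothetical):

    # Either drop nohz_full/isolcpus from the kernel command line, or,
    # if isolation (e.g. isolcpus=2-7 nohz_full=2-7) must stay, pin the
    # frequency so the missing scheduler tick no longer matters:
    cpupower frequency-set --min 2000MHz --max 2000MHz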

NOHZ_FULL, isolated CPUs and reading CPU MSR

Posted Apr 4, 2020 18:42 UTC (Sat) by jcm (subscriber, #18262) [Link] (1 responses)

...Or have another CPU core provide the information from the OS about the isolated cores. I'll be pushing some spec updates for CPPC, etc. that will allow for this scenario in the coming months.

NOHZ_FULL, isolated CPUs and reading CPU MSR

Posted Apr 6, 2020 4:06 UTC (Mon) by florianfainelli (subscriber, #61952) [Link]

I should really read up on CPPC, but in principle what vstinner describes is what we have encountered with systems that use TrustZone, where the trusted OS mandates a specific CPU P-state to complete its duty cycle within the imposed realtime deadline. In that case, though, the trusted OS "wins" it all, as the overall P-state decision is under control of an EL3 monitor which could be "lying" to Linux about the actual CPU cluster frequency. Our systems are multi-core (Cortex-A53), but the whole cluster has to be on the same frequency and voltage point at any given time.

NOHZ_FULL, isolated CPUs and reading CPU MSR

Posted Apr 8, 2020 14:20 UTC (Wed) by nix (subscriber, #2304) [Link] (4 responses)

> Kernel realtime developers told me that NOHZ_FULL cannot work with intel_pstate by design.
Really? This configuration is the common case for every distro kernel I checked. Sounds like we need better communication somewhere...

NOHZ_FULL, isolated CPUs and reading CPU MSR

Posted Apr 8, 2020 22:26 UTC (Wed) by vstinner (subscriber, #42675) [Link] (3 responses)

> Kernel realtime developers told me that NOHZ_FULL cannot work with intel_pstate by design.

Sorry, my sentence is wrong: the issue is not NOHZ_FULL alone, but NOHZ_FULL+isolated CPUs. I understood that intel_pstate is not compatible with isolated CPUs using NOHZ_FULL.

NOHZ_FULL, isolated CPUs and reading CPU MSR

Posted Apr 9, 2020 0:44 UTC (Thu) by nix (subscriber, #2304) [Link] (2 responses)

Oh right, that makes a lot more sense and explains why this hasn't caused more trouble (isolated CPUs are an exceedingly rare use case in the sort of generalist domains where enterprise kernels are used).

NOHZ_FULL, isolated CPUs and reading CPU MSR

Posted Apr 18, 2020 18:20 UTC (Sat) by zlynx (guest, #2285) [Link] (1 responses)

I'm not too sure about "exceedingly rare" because several enthusiast forums I read give advice to use isolated CPUs, NOHZ_FULL and then explicitly assign CPUs to KVM virtual machines in order to get the very best Windows virtual machine performance.

Of course they also set performance to maximum, so this wouldn't affect power and frequency management.

Anyway, I believe this is more common than you may think.

NOHZ_FULL, isolated CPUs and reading CPU MSR

Posted Jun 2, 2020 1:36 UTC (Tue) by nix (subscriber, #2304) [Link]

Yeah: also enthusiast forums and enterprise kernels on stodgy old stability-first enterprise distros seem like things that won't be mixing very often. :)

(What they are presumably actually looking for here is CPU affinity with QEMU to try to keep a roughly 1:1 mapping between QEMU vCPU cores and real CPU cores. There were patches to do it inside QEMU but they never made it upstream and eventually bitrotted: it looks like libvirt does it from outside QEMU by brute force and cgroups.)


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds