Telling the scheduler about thermal pressure
Even with radiators and fans, a system's CPUs can overheat. When that happens, the kernel's thermal governor caps the maximum frequency of the affected CPU to let it cool down. The scheduler, however, is unaware that the CPU's capacity has changed; it may schedule more work than is optimal in the current conditions, degrading performance. Recently, Thara Gopinath did some research into this problem and posted a patch set to address it. The solution adds an interface to inform the scheduler about thermal events so that it can place tasks better and thus improve overall system performance.
The thermal framework in Linux includes a number of elements, including the thermal governor. Its task is to manage the temperature of the system's thermal zones, keeping it within an acceptable range while maintaining good performance (an overview of the thermal framework can be found in this slide set [PDF]). There are a number of thermal governors that can be found in the drivers/thermal/ subdirectory of the kernel tree. If the CPU overheats, the governor may cap the maximum frequency of that CPU, meaning that the processing capacity of the CPU gets reduced too.
The CPU capacity in the scheduler is a value representing the ability of a specific CPU to process tasks (interested readers can find more information in this article). The capacities of the CPUs in a system may vary, especially on architectures like big.LITTLE. The scheduler knows (at least it assumes it knows) how much work can be done on each CPU; it uses that information to balance the task load across the system. If the information the scheduler has on what a given CPU can do is inaccurate because of thermal events (or any other frequency capping), it is likely to put too much work onto that CPU.
Gopinath introduces a useful term for talking about this kind of event: "thermal pressure", which is the difference between the maximum processing capacity of a CPU and its currently available capacity, which may be reduced by overheating events. Gopinath explained in the patch-set cover letter that the instantaneous thermal pressure is hard to observe and that there is a delay between the capping of the frequency and the scheduler taking it into account. Because of this, the proposal uses a weighted average over time, where the weight corresponds to the amount of time the maximum frequency was capped.
Different algorithms and their benchmarks
Gopinath tried multiple algorithms while working on this project (an earlier version of the patch set was posted in October 2018) and presented a comparison with benchmark results.
The first idea was to directly use the instantaneous value of the capped frequency in the scheduler; this algorithm improved performance, but only slightly. The other two algorithms studied use a weighted average. The first of those reused the per-entity load tracking (PELT) algorithm that is used to track the CPU load created by processes (and control groups); this variant incorporates averages of the realtime and deadline load and utilization. The final approach just uses a simple decay-based metric for thermal pressure, with a variable decay period. Both weighted-average algorithms gave better results than the instantaneous value, with throughput improvements on the order of 3-4%. The non-PELT version performed slightly better.
Ingo Molnar reviewed the results and responded positively to the framework, but he would like to see more benchmarks run; he suggested testing more decay periods. Gopinath agreed, adding that tests on different systems-on-chip (SoCs) would be a good idea, since the best decay period may differ between systems. In addition, a configurable decay period is planned.
In parallel, Peter Zijlstra noted that he would prefer a PELT-based approach to mixing different averaging algorithms. Molnar dug into the PELT code looking for ways to obtain better results with the existing algorithm. He found that the PELT decay period is a constant, while Gopinath's work shows that performance depends heavily on its value; it should be possible to get better results from PELT if the code can be suitably modified. It looks like at least one solution has been found that does not require significant changes.
Ionela Voinescu ran some benchmarks in different conditions and found that the thermal pressure is indeed useful, but without a clear conclusion on which averaging algorithm to use. Gopinath and Voinescu agreed that more benchmarking will be needed.
The thermal pressure API
Gopinath's work introduces an API that allows the scheduler to be notified about thermal events. It includes two new functions. The first, sched_update_thermal_pressure(), should be called by any module that caps the maximum CPU frequency; its prototype is:
    void sched_update_thermal_pressure(struct cpumask *cpus,
                                       unsigned long cap_max_freq,
                                       unsigned long max_freq);
The mask of the CPUs whose thermal pressure should be updated is passed in cpus, the new (capped) maximum frequency in cap_max_freq, and the maximum frequency available in the absence of thermal events in max_freq.
The scheduler can also obtain the thermal pressure of a given CPU by calling:
    unsigned long sched_get_thermal_pressure(int cpu);
Internally, the thermal pressure framework uses a per-CPU thermal_pressure structure to keep track of the current and old values of the thermal pressure along with the time it was last updated. Currently, the update happens from a periodic timer. However, during the discussion, Quentin Perret suggested that it be updated at the same time as other statistics. Doing this work during the load-balancing statistics update was proposed first, but Perret later suggested that the thermal-statistics update would be a better time; that would allow shorter decay periods and more accuracy for low-latency tasks.
The developers discussed whether user-space frequency capping should be included in the framework. The user (or a user-space thermal daemon) might change the maximum frequency for thermal reasons. On the other hand, that capping will last for seconds or more — which is different than capping by the thermal framework — and the reason for the change may be something other than thermal concerns. Whether the thermal pressure framework will include frequency capping from user space remains an open question for now.
Molnar asked whether there is a connection between the thermal pressure approach and energy-aware scheduling (EAS). Gopinath replied that the two approaches have different scope: thermal pressure is going to work better in asymmetric configurations where capacities are different and it is more likely to cause the scheduler to move tasks between CPUs. The two approaches should also be independent because thermal pressure should work even if EAS is not compiled in.
Current status and next steps
The kernel developers seem receptive to the proposed idea; it is likely that this framework, or a similar one, will be merged in the future. Before that happens, there is still some work left: settling the details of the algorithm to be included (and whether to reuse the PELT code), choosing the decay period, and, of course, more benchmarking on different systems. Interested readers can find Gopinath's slides from the Linux Plumbers Conference [PDF], which offer additional background on the earlier version of this work.
| Index entries for this article | |
|---|---|
| Kernel | Scheduler |
| Kernel | Thermal management |
| GuestArticles | Rybczynska, Marta |
Posted May 16, 2019 22:14 UTC (Thu) by ScienceMan (subscriber, #122508)
Is there something specific about the Linux scheduler under conditions of thermal pressure that makes it necessary to adjust features of the kernel when cpu frequency is reduced? I don't think any of the above examples pay any attention to this if so.
Here are a couple of more or less randomly chosen references to ways that power capping is used in HPC systems, for example.
http://eethpc.net/wp-content/uploads/2018/06/Slurm-Power-...
https://www.scirp.org/journal/PaperInformation.aspx?Paper...
