Scheduler utilization clamping

By Jonathan Corbet
August 8, 2018

Once upon a time, the only way to control how the kernel's CPU scheduler treated any given process was to adjust that process's priority. Priorities are no longer enough to fully control CPU scheduling, though, especially when power-management concerns are taken into account. The utilization clamping patch set from Patrick Bellasi is the latest in a series of attempts to allow user space to tell the scheduler more about any specific process's needs.

Contemporary CPU schedulers have a number of decisions to make at any given time. They must, of course, pick the process that will be allowed to execute in each CPU on the system, distributing processes across those CPUs to keep the system as a whole in an optimal state of busyness. Increasingly, the scheduler is also involved in power management — ensuring that the CPUs do not burn more energy than they have to. Filling that role requires placing each process on a CPU that is appropriate for that process's needs; modern systems often have more than one type of CPU available. The scheduler must also pick an appropriate operating power point — frequency and voltage — for each CPU to enable it to run the workload in a timely manner while minimizing energy consumption.

One of the scheduler's key tools is load tracking: observing how much CPU time each process actually uses over time and using the result to estimate what its future needs will be. As used in this patch set, loads are expressed in terms of percentages; 0% for a process that is not running at all to 100% for a process that will use the full power of the fastest CPU in the system running at its highest frequency. Using load tracking, the scheduler can distribute processes in a way that avoids overloading any specific processor, put the more resource-intensive processes on the faster processors, and pick an operating power point that is fast enough to handle the total load on each CPU. But, while load tracking tells the scheduler how much CPU any given process is likely to need, it says less about how the process needs to use that time.
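The idea behind load tracking can be illustrated with a short Python sketch. It is not the kernel's implementation. It assumes a geometrically decaying average in which each period's contribution counts for less as it ages. The 32-period half-life and the name track_utilization() are choices made for this example only.

```python
# Simplified load-tracking sketch (not the kernel's code).
# Each period contributes the fraction of time the process ran;
# older periods are decayed geometrically.

HALF_LIFE_PERIODS = 32                      # assumption for this example
DECAY = 0.5 ** (1.0 / HALF_LIFE_PERIODS)    # per-period decay factor


def track_utilization(run_fractions, initial=0.0):
    """Return the estimated utilization (0.0..1.0) after each period.

    run_fractions: iterable of per-period running fractions in [0, 1].
    """
    util = initial
    history = []
    for ran in run_fractions:
        # Exponentially weighted average: the old estimate decays and
        # the newest period fills in the remainder.
        util = util * DECAY + ran * (1.0 - DECAY)
        history.append(util)
    return history


if __name__ == "__main__":
    # A process that runs flat out for 32 periods, then sleeps for 32.
    samples = [1.0] * 32 + [0.0] * 32
    estimates = track_utilization(samples)
    print(f"after 32 busy periods: {estimates[31]:.0%}")
    print(f"after 32 idle periods: {estimates[63]:.0%}")
```

With these assumptions the estimate reaches about half of full load after one half-life of continuous running and falls to about a quarter after an equally long sleep, which is why a process that only recently became busy (or idle) is not immediately treated as such.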

A realtime process, for example, probably does not need large amounts of CPU time, but it is not able to wait to get that time. Current schedulers respond by running the CPU at full speed whenever a realtime process is runnable to ensure that it doesn't miss its deadlines. But it might also make sense to put that process on one of the system's fastest CPUs. Similarly, non-realtime processes may present a small load, but they may do work that other parts of the system depend on; they should be run at high speed even though they demand little of the processor. On the other hand, a background process might be best run at low speed on an efficient processor, even if it could use more CPU power; it does not need to run quickly, and it should not demand too much of the system's battery.

Different tasks can be given different priorities, but that is not a sufficiently useful signal for the processor; priorities only say which process should run first. To fill this gap, Bellasi's patch set adds two more parameters, called the minimum and maximum clamping values; they work by constraining the scheduler's load calculations, essentially fooling the scheduler into treating processes differently than it otherwise would.

The first of those values, the minimum clamp, will, for any given process, place a lower bound on the calculated load for the processor on which that process will run. If process P, running on CPU C, has a minimum clamp value of 30%, then the calculated load for CPU C will never fall below 30% as long as P is runnable, even if the actual load adds up to less than that. The minimum clamp can thus be used to make a CPU appear to be busier than it really is; that, in turn, will affect the frequency that the scheduler chooses for that processor. An important control process might only require 2% of a CPU's capacity; if it's running alone, it will likely be run at a low speed. If its minimum clamp is set to 80%, though, the scheduler will pick a higher frequency and that process will get its time more quickly.

Similarly, the maximum clamp places an upper bound on how busy the processor will look. A background process may present a 99% CPU load, but setting the maximum clamp to a number like 20% will prevent that process from forcing the CPU frequency to a higher value. For both values, the effective value used by the scheduler is the maximum of all of the runnable processes' values. If one process needs a minimum clamp of 50%, for example, the scheduler will not use a value lower than that. The default values are 0% and 100% for the minimum and maximum values, respectively.
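The interaction of the two clamps can be sketched in a few lines of Python. This is an illustrative model rather than code from the patch set. The Task class and the function names are invented for the example, and values are percentages as in the text above. Summing per-process loads (capped at 100%) stands in for the scheduler's real CPU-level load calculation.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    util: float              # tracked load, in percent
    util_min: float = 0.0    # minimum clamp (default 0%)
    util_max: float = 100.0  # maximum clamp (default 100%)


def effective_clamps(runnable):
    """CPU-level clamps: the maximum of each value over runnable tasks."""
    if not runnable:
        return 0.0, 100.0
    return (max(t.util_min for t in runnable),
            max(t.util_max for t in runnable))


def clamped_cpu_load(runnable):
    """Load the CPU appears to have once the clamps are applied."""
    lo, hi = effective_clamps(runnable)
    actual = min(sum(t.util for t in runnable), 100.0)
    return max(lo, min(actual, hi))


if __name__ == "__main__":
    control = Task("control", util=2, util_min=80)
    batch = Task("batch", util=99, util_max=20)
    print(clamped_cpu_load([control]))         # boosted by the minimum clamp
    print(clamped_cpu_load([batch]))           # capped by the maximum clamp
    print(clamped_cpu_load([control, batch]))  # batch's cap no longer applies
```

Because the maximum is taken over all runnable processes, the background task's 20% cap only holds while nothing with a higher maximum clamp shares the CPU; once the control process becomes runnable, its default 100% maximum wins.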

There are a few ways to set these values. The clamp parameters for a specific process can be changed with the sched_setattr() system call; there do not appear to be any special privileges required if a process is changing its own values. Both ordinary and realtime processes can set their clamping values; processes running under the deadline scheduler already provide enough information for the scheduler to make the necessary decisions. Control groups can be used to set these values for all processes running within a group, via the new util.min and util.max knobs added to the CPU controller. Finally, default clamp values for processes running in the root group are controlled by the sched_uclamp_util_min and sched_uclamp_util_max sysctl knobs.
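How such a call might look from user space can be sketched with Python's ctypes. The article does not show the interface, so the example below assumes the struct sched_attr layout, the flag values, and the 0..1024 clamp scale found in later mainline kernels; all of those may differ from the patch set discussed here. The system-call number is the x86-64 one, and build_clamp_attr() and set_own_clamps() are names invented for the example.

```python
import ctypes
import os

# Assumed constants; check them against your kernel headers.
SYS_SCHED_SETATTR = 314            # x86-64 only (assumption)
SCHED_FLAG_KEEP_POLICY = 0x08      # leave the scheduling policy alone
SCHED_FLAG_KEEP_PARAMS = 0x10      # leave nice/priority/deadline alone
SCHED_FLAG_UTIL_CLAMP_MIN = 0x20
SCHED_FLAG_UTIL_CLAMP_MAX = 0x40
CAPACITY_SCALE = 1024              # assumed clamp range is 0..1024


class SchedAttr(ctypes.Structure):
    """Assumed layout of struct sched_attr, including the clamp fields."""
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),
        ("sched_priority", ctypes.c_uint32),
        ("sched_runtime", ctypes.c_uint64),
        ("sched_deadline", ctypes.c_uint64),
        ("sched_period", ctypes.c_uint64),
        ("sched_util_min", ctypes.c_uint32),
        ("sched_util_max", ctypes.c_uint32),
    ]


def build_clamp_attr(min_percent, max_percent):
    """Fill in a SchedAttr that changes only the two clamp values."""
    if not 0 <= min_percent <= max_percent <= 100:
        raise ValueError("need 0 <= min_percent <= max_percent <= 100")
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    attr.sched_flags = (SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS |
                        SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX)
    # Convert percentages to the assumed 0..1024 scale.
    attr.sched_util_min = int(min_percent * CAPACITY_SCALE // 100)
    attr.sched_util_max = int(max_percent * CAPACITY_SCALE // 100)
    return attr


def set_own_clamps(min_percent, max_percent):
    """Ask the kernel to apply the clamps to the calling thread (pid 0)."""
    attr = build_clamp_attr(min_percent, max_percent)
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.syscall(ctypes.c_long(SYS_SCHED_SETATTR), ctypes.c_long(0),
                       ctypes.byref(attr), ctypes.c_ulong(0))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    attr = build_clamp_attr(30, 80)
    print(attr.size, attr.sched_util_min, attr.sched_util_max)
```

The example only builds the structure when run directly; calling set_own_clamps() will fail with an error on kernels that lack the feature or use a different interface.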

In this patch set, the clamp values only affect the operating power point chosen for any given CPU by the scheduler. Future plans include using these values for CPU selection; a process with a low maximum clamp might be relegated to a slow (efficient) processor even if it could consume more CPU time, for example.
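The effect on the operating power point can be sketched as follows. The frequency table is hypothetical and pick_frequency() is an invented name. The rule used here, choosing the lowest frequency that covers the clamped load, is a simplification and not the logic of any actual CPU-frequency governor.

```python
import bisect

# Hypothetical operating points for one CPU, in MHz, sorted ascending.
FREQUENCIES_MHZ = [600, 1000, 1400, 1800, 2200]


def pick_frequency(load_percent, util_min=0.0, util_max=100.0,
                   freqs=FREQUENCIES_MHZ):
    """Pick the lowest frequency able to serve the clamped load."""
    # Apply the clamps to the load the CPU appears to carry.
    clamped = max(util_min, min(load_percent, util_max))
    # Load is relative to the CPU running at its highest frequency.
    needed_mhz = clamped / 100.0 * freqs[-1]
    index = bisect.bisect_left(freqs, needed_mhz)
    return freqs[min(index, len(freqs) - 1)]


if __name__ == "__main__":
    print(pick_frequency(2))                # light control process, no clamp
    print(pick_frequency(2, util_min=80))   # same process, boosted
    print(pick_frequency(99))               # busy background job, no clamp
    print(pick_frequency(99, util_max=20))  # same job, capped
```

With this table, the 2% control process moves from the lowest operating point to 1800MHz once its minimum clamp is set to 80%, while the 99% background job drops from 2200MHz to 600MHz under a 20% maximum clamp.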

The average desktop or server user is unlikely to make much use of this capability; it's probably not worth the trouble to figure out what the clamp values should be. But, in dedicated systems where it is relatively easy to figure out which processes are important — handsets, for example — a user-space daemon can automatically tune the system for better overall performance. So it is not surprising that this work has come out of the Android world, or that it is already in use in Android systems to ensure that processes important to the user run quickly, while keeping low-level background work from overheating the device or draining the battery. The Android developers have been looking for a way to get this sort of functionality upstream for some time; perhaps this patch set will be the one that succeeds and brings the Android kernel that much closer to the mainline.


Mischievous processes

Posted Aug 9, 2018 3:30 UTC (Thu) by gshrikant (guest, #101640)

I'm pretty sure this has already been considered, but if processes don't need special privileges to decide their minimum clamp value - which affects how frequently they get chosen - wouldn't this lead to mischievous processes abusing this feature to prioritize themselves?

Mischievous processes

Posted Aug 9, 2018 12:42 UTC (Thu) by daurnimator (guest, #92358)

How would it have any effect worse than entering a busyloop?

Scheduler utilization clamping

Posted Aug 9, 2018 14:34 UTC (Thu) by chrisr (subscriber, #83108)

I'm not the developer of these patches, but I am familiar with them, so I thought I might try to expand a little for readers here who don't use LKML. Utilization clamping doesn't relate to priority or interact with pick_next_task and friends; it is more about the apparent execution time, which is tracked by the scheduler (in a frequency- and cpu-type-invariant manner, so that signals can be compared across CPUs in a system) and is visible (aggregated at cpu level) to the cpu frequency governors. You cannot change your priority with this API.

This API only allows a task to impose restrictions on itself (i.e. use a lower value for min or max utilization) relative to the default values, which can be controlled hierarchically using the cpu cgroup controller. Whether or not this feature is present, and whether or not any restrictions on apparent utilization are imposed, a task may already execute for as long as it has work to do in accordance with its granted cpu bandwidth, priority and scheduling policy. I think the only risk from a malicious application here is running a busy loop to drain a battery or otherwise disrupt system resources - you already have to use the other tools at your disposal to guard against that if you want to, and this feature doesn't change any of those.

The cover letter talks about using the restrictions in the future to influence task placement along with frequency guidance, but this version only provides frequency guidance, as the other mechanisms required to implement task placement guidance aren't present yet. We've been using SchedTune ( https://lwn.net/Articles/706374/ ) in Android for equivalent functionality for a few years now, but it wasn't really suitable for all Linux kernel users, hence the util-clamp patches. In SchedTune we already use the modified utilization signals to limit the potential pool of CPUs in big.LITTLE systems (i.e. you can configure a task to never look large enough to run on a big CPU or never look small enough to run on a little CPU), but you can't do that unless the scheduler makes use of task utilization when selecting candidate CPUs for placement, and that isn't present upstream yet. The Energy Aware Scheduling patches - which Jonathan wrote about at https://lwn.net/Articles/749900/ - add utilization & capacity aware wakeup placement to CFS, and v5 can be seen at https://lore.kernel.org/lkml/20180724122521.22109-1-quent... if anyone wants to test/review it. Once such mechanisms are available to the scheduler, they can be extended to use the utilization-clamped values to influence task placement as well.

Another part of the solution for improving task placement in a big.LITTLE system that we use in Android is made from a feature called 'Misfit Task Detection', which you can also find on LKML at https://lore.kernel.org/lkml/1530699470-29808-1-git-send-...

These pieces are individually useful but also fit together to provide all the mechanisms required for fine-grained control of CPU placement and frequency selection to an intelligent userspace for battery powered devices and potentially other types of hardware.