Scheduler utilization clamping
Contemporary CPU schedulers have a number of decisions to make at any given time. They must, of course, pick the process that will be allowed to execute in each CPU on the system, distributing processes across those CPUs to keep the system as a whole in an optimal state of busyness. Increasingly, the scheduler is also involved in power management — ensuring that the CPUs do not burn more energy than they have to. Filling that role requires placing each process on a CPU that is appropriate for that process's needs; modern systems often have more than one type of CPU available. The scheduler must also pick an appropriate operating power point — frequency and voltage — for each CPU to enable it to run the workload in a timely manner while minimizing energy consumption.
One of the scheduler's key tools is load tracking: observing how much CPU time each process actually uses over time and using the result to estimate what its future needs will be. As used in this patch set, loads are expressed as percentages, ranging from 0% for a process that is not running at all to 100% for a process that will use the full power of the fastest CPU in the system running at its highest frequency. Using load tracking, the scheduler can distribute processes in a way that avoids overloading any specific processor, put the more resource-intensive processes on the faster processors, and pick an operating power point that is fast enough to handle the total load on each CPU. But, while load tracking tells the scheduler how much CPU time any given process is likely to need, it says less about how the process needs to use that time.
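The idea of turning observed CPU usage into an estimate of future needs can be illustrated with a decaying average. What follows is a minimal sketch, not the kernel's load-tracking code: the function name `track_load()` and the decay factor are assumptions made for illustration, and the sketch ignores the scaling by CPU type and frequency described above.

```c
#include <assert.h>

/*
 * Assumed decay factor of roughly 2^(-1/32): the weight of an old
 * sample halves every 32 periods. Chosen for illustration only.
 */
#define LOAD_DECAY 0.978572062

/*
 * Fold one period's observed busy time (0..100 percent) into the
 * running load estimate and return the new estimate.
 */
double track_load(double load_pct, double busy_pct)
{
    assert(busy_pct >= 0.0 && busy_pct <= 100.0);
    return load_pct * LOAD_DECAY + busy_pct * (1.0 - LOAD_DECAY);
}
```

With this decay factor, a process that starts from zero and runs flat out reaches an estimated load of about 50% after 32 periods; if it then sleeps for 32 periods, the estimate falls back to about 25%.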
A realtime process, for example, probably does not need large amounts of CPU time, but it is not able to wait to get that time. Current schedulers respond by running the CPU at full speed whenever a realtime process is runnable to ensure that it doesn't miss its deadlines. But it might also make sense to put that process on one of the system's fastest CPUs. Similarly, non-realtime processes may present a small load, but they may do work that other parts of the system depend on; they should be run at high speed even though they demand little of the processor. On the other hand, a background process might be best run at low speed on an efficient processor, even if it could use more CPU power; it does not need to run quickly, and it should not demand too much of the system's battery.
Different tasks can be given different priorities, but that is not a sufficiently useful signal for the processor; priorities only say which process should run first. To fill this gap, Bellasi's patch set adds two more parameters, called the minimum and maximum clamping values; they work by constraining the scheduler's load calculations, essentially fooling the scheduler into treating processes differently than it otherwise would.
The first of those values, the minimum clamp, will, for any given process, place a lower bound on the calculated load for the processor on which that process will run. If process P, running on CPU C, has a minimum clamp value of 30%, then the calculated load for CPU C will never fall below 30% as long as P is runnable, even if the actual load adds up to less than that. The minimum clamp can thus be used to make a CPU appear to be busier than it really is; that, in turn, will affect the frequency that the scheduler chooses for that processor. An important control process might only require 2% of a CPU's capacity; if it's running alone, it will likely be run at a low speed. If its minimum clamp is set to 80%, though, the scheduler will pick a higher frequency and that process will get its time more quickly.
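The effect of the minimum clamp on frequency selection can be sketched in a few lines. This is an illustration under stated assumptions rather than the scheduler's actual logic: `apply_min_clamp()` and `pick_freq_khz()` are hypothetical helpers, the frequency table is invented, and the proportional rule used here (pick the lowest frequency whose share of the maximum covers the load) leaves out any headroom a real frequency governor would add.

```c
#include <assert.h>
#include <stddef.h>

/* Raise a calculated load (in percent) to the minimum clamp. */
unsigned int apply_min_clamp(unsigned int util_pct, unsigned int min_clamp_pct)
{
    return util_pct < min_clamp_pct ? min_clamp_pct : util_pct;
}

/*
 * Pick the lowest frequency from an ascending table whose fraction
 * of the maximum frequency is at least util_pct percent.
 */
unsigned int pick_freq_khz(const unsigned int *freqs_khz, size_t n,
                           unsigned int util_pct)
{
    assert(n > 0);
    unsigned long long max_khz = freqs_khz[n - 1];

    for (size_t i = 0; i < n; i++) {
        if ((unsigned long long)freqs_khz[i] * 100 >= max_khz * util_pct)
            return freqs_khz[i];
    }
    return freqs_khz[n - 1];
}
```

Given a table of 500MHz, 1GHz, 1.5GHz, and 2GHz, the 2% control process on its own would select 500MHz; with a minimum clamp of 80%, the same process causes 2GHz to be selected.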
Similarly, the maximum clamp places an upper bound on how busy the processor will look. A background process may present a 99% CPU load, but setting the maximum clamp to a number like 20% will prevent that process from forcing the CPU frequency to a higher value. For both values, the effective value used by the scheduler is the maximum of the values set for all runnable processes on that CPU. If one process needs a minimum clamp of 50%, for example, the scheduler will not use a value lower than that. The default values are 0% and 100% for the minimum and maximum clamps, respectively.
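The aggregation rule can be sketched as follows, assuming that each runnable process carries a pair of clamp values in percent. The `struct clamp` type and both function names are hypothetical, and falling back to the defaults when nothing is runnable is an assumption of this sketch rather than something the patch set is known to do.

```c
#include <assert.h>
#include <stddef.h>

/* Per-process clamp values, in percent (hypothetical type). */
struct clamp {
    unsigned int min_pct;
    unsigned int max_pct;
};

/* Effective CPU clamps: the maximum of each value over all runnable processes. */
struct clamp effective_clamp(const struct clamp *runnable, size_t n)
{
    struct clamp eff = { 0, 100 };   /* defaults if nothing is runnable */

    if (n == 0)
        return eff;
    eff.max_pct = 0;
    for (size_t i = 0; i < n; i++) {
        if (runnable[i].min_pct > eff.min_pct)
            eff.min_pct = runnable[i].min_pct;
        if (runnable[i].max_pct > eff.max_pct)
            eff.max_pct = runnable[i].max_pct;
    }
    return eff;
}

/* Constrain a calculated CPU load to the effective clamps. */
unsigned int clamp_util(unsigned int util_pct, struct clamp c)
{
    assert(c.min_pct <= c.max_pct);
    if (util_pct < c.min_pct)
        return c.min_pct;
    if (util_pct > c.max_pct)
        return c.max_pct;
    return util_pct;
}
```

A background process with a 20% maximum clamp running alone holds a 99% load down to 20%. Once a process with a 50% minimum clamp and the default maximum becomes runnable on the same CPU, the effective clamps become 50% and 100%, so the cap set by the background process no longer applies.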
There are a few ways to set these values. The clamp parameters for a specific process can be changed with the sched_setattr() system call; there do not appear to be any special privileges required if a process is changing its own values. Both ordinary and realtime processes can set their clamping values; processes running under the deadline scheduler already provide enough information for the scheduler to make the necessary decisions. Control groups can be used to set these values for all processes running within a group, via the new util.min and util.max knobs added to the CPU controller. Finally, default clamp values for processes running in the root group are controlled by the sched_uclamp_util_min and sched_uclamp_util_max sysctl knobs.
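For the sched_setattr() route, a call might look like the sketch below. It assumes the interface in the form that later appeared in mainline kernels, which may differ from the patch set discussed here: clamp values there are given on a 0..1024 scale rather than as percentages, and the flag values and structure layout are copied by hand from the kernel's UAPI headers. The names `struct uclamp_attr`, `make_uclamp_attr()`, and `set_own_clamps()` are hypothetical; the structure is a local mirror of the kernel's `struct sched_attr`, renamed to avoid clashing with any libc definition.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Flag values as in mainline's UAPI headers; guarded in case they are already defined. */
#ifndef SCHED_FLAG_KEEP_POLICY
#define SCHED_FLAG_KEEP_POLICY    0x08
#endif
#ifndef SCHED_FLAG_KEEP_PARAMS
#define SCHED_FLAG_KEEP_PARAMS    0x10
#endif
#ifndef SCHED_FLAG_UTIL_CLAMP_MIN
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#endif
#ifndef SCHED_FLAG_UTIL_CLAMP_MAX
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
#endif

/* Local mirror of the kernel's struct sched_attr (hypothetical name). */
struct uclamp_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;
    uint32_t sched_util_max;
};

/* Build a request that changes only the clamps, leaving policy and priority alone. */
struct uclamp_attr make_uclamp_attr(unsigned int min_pct, unsigned int max_pct)
{
    struct uclamp_attr attr;

    assert(min_pct <= max_pct && max_pct <= 100);
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_flags = SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS |
                       SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX;
    /* Convert percentages to the 0..1024 scale assumed here. */
    attr.sched_util_min = min_pct * 1024 / 100;
    attr.sched_util_max = max_pct * 1024 / 100;
    return attr;
}

/* Apply the clamps to the calling thread; returns 0 on success, -1 with errno set. */
int set_own_clamps(unsigned int min_pct, unsigned int max_pct)
{
    struct uclamp_attr attr = make_uclamp_attr(min_pct, max_pct);

    /* A pid of 0 means the caller; the final flags argument must be 0. */
    return (int)syscall(SYS_sched_setattr, 0, &attr, 0);
}
```

A background worker could call `set_own_clamps(0, 20)` at startup to cap its apparent load at roughly 20%. On a kernel without clamping support the call fails and sets errno, so callers should treat failure as "feature unavailable" rather than as a fatal error.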
In this patch set, the clamp values only affect the operating power point chosen for any given CPU by the scheduler. Future plans include using these values for CPU selection; a process with a low maximum clamp might be relegated to a slow (efficient) processor even if it could consume more CPU time, for example.
The average desktop or server user is unlikely to make much use of this
capability; it's probably not worth the trouble to figure out what the
clamp values should be.  But, in dedicated systems where it is relatively easy
to figure out which processes are important — handsets, for example — a
user-space daemon can automatically tune the system for better overall
performance.  So it is not surprising that this work has come out of the
Android world, or that it is already in use in Android systems to
ensure that processes important to the user run quickly, while keeping
low-level background work from overheating the device or draining the
battery.  The Android developers have been looking for a way to get this
sort of functionality upstream for some time; perhaps this patch set will
be the one that succeeds and brings the Android kernel that much closer to
the mainline.
Posted Aug 9, 2018 3:30 UTC (Thu) by gshrikant (guest, #101640)

Posted Aug 9, 2018 12:42 UTC (Thu) by daurnimator (guest, #92358)

Posted Aug 9, 2018 14:34 UTC (Thu) by chrisr (subscriber, #83108)

This API only allows a task to impose restrictions on itself (i.e. use a lower value for min or max utilization) relative to the default values, which can be controlled hierarchically using the cpu cgroup controller. Whether or not this feature is present, and regardless of any restrictions imposed on apparent utilization, a task may already execute for as long as it has work to do, in accordance with its granted CPU bandwidth, priority and scheduling policy. I think the only risk from a malicious application here is running a busy loop to drain a battery or otherwise disrupt system resources - you already have to use the other tools at your disposal to guard against that if you want to, and this feature doesn't change any of those.
The cover letter talks about using the restrictions in the future to influence task placement along with frequency guidance, but this version only implements frequency guidance, as the other mechanisms required to implement task placement guidance aren't present yet. We've been using SchedTune ( https://lwn.net/Articles/706374/ ) in Android for equivalent functionality for a few years now, but it wasn't really suitable for all Linux kernel users, hence the util-clamp patches. In SchedTune we already use the modified utilization signals to limit the potential pool of CPUs in big.LITTLE systems (i.e. you can configure a task to never look large enough to run on a big CPU or never look small enough to run on a little CPU), but you can't do that without the scheduler making use of task utilization when selecting candidate CPUs for placement, which isn't present upstream yet. The Energy Aware Scheduling patches - which Jonathan wrote about at https://lwn.net/Articles/749900/ - add utilization- and capacity-aware wakeup placement to CFS; v5 can be seen at https://lore.kernel.org/lkml/20180724122521.22109-1-quent... if anyone wants to test/review it. Once such mechanisms are available to the scheduler, they can be extended to use the utilization-clamped values to influence task placement as well.
Another part of the solution for improving task placement in a big.LITTLE system that we use in Android is made from a feature called 'Misfit Task Detection', which you can also find on LKML at https://lore.kernel.org/lkml/1530699470-29808-1-git-send-... 
These pieces are individually useful but also fit together to give an intelligent userspace all the mechanisms required for fine-grained control of CPU placement and frequency selection on battery-powered devices and potentially other types of hardware.
     