A power-aware scheduling update
Putting the CPU scheduler in charge of CPU power management decisions has a certain elegance; the scheduler is arguably in the best position to know what the system's needs for processing power will be in the near future. But this idea immediately runs afoul of another trend in the kernel: actual power management decisions are moving away from the scheduler toward low-level hardware driver code. As Arjan van de Ven noted in a May Google+ discussion, power management policies for Intel CPUs are being handled by CPU-specific code in recent kernels:
Arjan suggests that any discussion that is based on control of CPU frequencies and voltages misses an important point: current processors have a more complex notion of power management, and they vary considerably from one hardware generation to the next. The scheduler is not the right place for all that low-level information; instead, it belongs in low-level, hardware-specific code.
There is, however, fairly widespread agreement that passing more information between the scheduler and the low-level power management code would be helpful. In particular, there is a fair amount of interest in better integration of the scheduler's load-balancing code (which decides how to distribute processes across the available CPUs) and the power management logic. The load balancer knows what the current needs are and can make some guesses about the near future; it makes sense that the same code could take part in deciding which CPU resources should be available to handle that load.
Based on these thoughts and more, Morten Rasmussen has posted a design proposal for a reworked, power-aware scheduler. The current scheduler would be split into two separate modules:
- The CPU scheduler, which is charged with making the best use of the CPU resources that are currently available to it.
- The "power scheduler," which takes the responsibility of adjusting the currently available CPU resources to match the load seen by the CPU scheduler.
The CPU scheduler will handle scheduling as it is done now. The power scheduler, instead, takes load information from the CPU scheduler and, if necessary, makes changes to the system's power configuration to better suit that load. These changes can include moving CPUs from one power state to another or idling (or waking) CPUs. The power scheduler would talk with the current frequency and idle drivers, but those drivers would remain as separate, hardware-dependent code. In this design, load balancing would remain with the CPU scheduler; it would not move to the power scheduler.
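The division of labor described above can be illustrated with a small toy model. This is only a sketch, not Morten's design or any kernel interface: the `Cpu` and `PowerScheduler` names, the utilization scale, and the `wake_above`/`idle_below` thresholds are all assumptions made for illustration. The point it tries to show is that the power scheduler only changes which CPUs are available; it never moves tasks, since load balancing stays with the CPU scheduler.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Cpu:
    cpu_id: int
    online: bool = True
    load: float = 0.0  # utilization in [0, 1], as reported by the CPU scheduler


class PowerScheduler:
    """Hypothetical power scheduler: adjusts the set of available CPUs
    to match the load seen by the CPU scheduler. Thresholds are assumed."""

    def __init__(self, cpus: List[Cpu], wake_above: float = 0.8,
                 idle_below: float = 0.3):
        self.cpus = cpus
        self.wake_above = wake_above
        self.idle_below = idle_below

    def update(self) -> Optional[Tuple[str, int]]:
        """Make at most one power-configuration change per call."""
        online = [c for c in self.cpus if c.online]
        offline = [c for c in self.cpus if not c.online]
        # With no CPU online, treat the system as fully loaded.
        avg = sum(c.load for c in online) / len(online) if online else 1.0

        if avg > self.wake_above and offline:
            # Load is high: make one more CPU available.
            offline[0].online = True
            return ("wake", offline[0].cpu_id)
        if avg < self.idle_below and len(online) > 1:
            # Load is low: idle the least-loaded CPU. Moving its tasks
            # elsewhere is left to the CPU scheduler's load balancer.
            victim = min(online, key=lambda c: c.load)
            victim.online = False
            return ("idle", victim.cpu_id)
        return None
```

In a real system the "wake" and "idle" actions would be requests to the hardware-specific frequency and idle drivers, which remain separate code in the proposal.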
Of course, there are plenty of problems to be solved beyond the simple implementation of the power scheduler and the definition of the interface with the CPU scheduler. The CPU scheduler still needs to learn how to deal with processors with varying computing capacities; the big.LITTLE architecture requires this, but more flexible power state management does too. Currently, processes are charged for the amount of time they spend executing on a CPU; that is clearly unfair to processes that are scheduled onto a slower processor. So charging will eventually have to change to a unit other than time, such as instructions executed. The CPU scheduler will need to become more aware of the power management policies in force. Scheduling processes to enable the use of "turbo boost" mode (where a single CPU can be overclocked if all other CPUs are idle) remains an open problem. Thermal limits will throw more variables into the equation. And so on.
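The fairness problem with time-based charging can be made concrete with a small sketch. It assumes, purely for illustration, that each CPU has a relative "capacity" on a scale where the fastest CPU is 1024; the `charge_for_work` function and that scale are hypothetical and do not come from the proposal. A process is then charged for the time it ran, scaled by how fast the CPU it ran on was.

```python
FULL_CAPACITY = 1024  # assumed scale: capacity of the fastest CPU


def charge_for_work(delta_ns: int, capacity: int) -> int:
    """Charge a process for work done rather than wall-clock time.

    delta_ns is how long the process ran; capacity is the relative
    speed of the CPU it ran on (FULL_CAPACITY for the fastest CPU).
    """
    if not 0 < capacity <= FULL_CAPACITY:
        raise ValueError("capacity must be in (0, FULL_CAPACITY]")
    return delta_ns * capacity // FULL_CAPACITY
```

Under these assumptions, a process that runs for 10ms on a half-speed CPU is charged the same as one that runs for 5ms on a full-speed CPU. The hard part, which this sketch does not address, is knowing the capacity value at all when the hardware can change its speed without notice.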
It is also possible that the separation of CPU and power scheduling will not work out; as Morten put it:
Even with these uncertainties, the "power scheduler" approach should prove
to be a useful starting point; Morten and his colleagues plan to post a
preliminary power scheduler implementation in the near future. At that
point we may hear how Ingo feels about this design relative to the
requirements he put forward; he (along with the other core scheduler
developers) has been notably absent from the recent discussion.
Regardless, it seems clear that the development community will be working
on power-aware scheduling for quite some time.
Posted Jun 20, 2013 22:41 UTC (Thu) by dlang (guest, #313)
The problem with saying that the scheduler shouldn't care about this is that if it has no idea how fast a core is, or is going to be, how can it possibly attempt to put the right amount of load on it, or charge the process for the time it spent on that core?
According to Arjan, the only way to find out how fast a core is running is to measure it, and the speed at which a core is running may vary by a factor of 2 without any notice to the OS.
How can any system possibly make reasonable decisions if the hardware is so unpredictable?