By Jonathan Corbet
August 21, 2012
Years of work to improve power utilization in Linux have made one thing
clear: efficient power behavior must be implemented throughout the system.
That certainly includes the CPU scheduler, but the kernel's scheduler
currently has little in the way of logic aimed at minimizing power use. A
recent proposal has started a discussion on how the
scheduler might be made more power-aware. But, as this discussion
shows, there is no single, straightforward answer to the question of how
power-aware scheduling should be done.
Interestingly, the scheduler did have power-aware logic from 2.6.18
through 3.4. There was a sysfs knob (sched_mc_power_savings)
that would cause the scheduler to try to group runnable processes onto the
smallest possible number of cores, allowing others to go idle. That code
was removed in 3.5 because it never worked very well and nobody was putting
any effort into improving it. The result was the removal of some rather
unloved code, but it also left the scheduler with no power awareness at
all. Given the level of interest in power savings in almost every
environment, having a power-unaware scheduler seems less than optimal; it
was only a matter of time until somebody tried to put together a better
solution.
Alex Shi started off the conversation with a
rough proposal on how power awareness might be added back to the
scheduler. This proposal envisions two modes, called "power" and
"performance," that would be used by the scheduler to guide its decisions.
Some of the first debate centered around how that policy would be chosen,
with some developers suggesting that "performance" could be used while on
AC power and "power" when on battery power. But that policy entirely
ignores an important constituency: data centers. Operators of data
centers are becoming increasingly concerned about power usage and its
associated costs; many of them are likely to want to run in a lower-power
mode regardless of where the power is coming from. The obvious conclusion
is that the kernel needs to provide a mechanism by which the mode can be
chosen; the policy can then be decided by the system administrator.
The harder question is: what would that policy decision actually do? The
old power code tried to cause some cores, at least, to go completely idle
so that they could go into a sleep state.
The proposal from Alex takes a different approach; he claims
that trying to idle a subset of the CPUs in the system is not going to save
much power; instead, it is best to spread the runnable processes across the
system as widely as possible and try to get to a point where all
CPUs can go idle. That seems to be the best approach on x86-class
processors, anyway. On that architecture, no processor can go into a deep
sleep state unless all of them do; having even a single
processor running will keep the others in a less efficient sleep state. A
single processor also keeps associated hardware — the memory controller,
for example — in a powered-up state. The first CPU is by far the most
expensive one; bringing in additional CPUs has a much lower incremental cost.
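The arithmetic behind that claim is easy to sketch. The model below is
deliberately simplified, and its numbers are invented for illustration: a
fixed base power drawn whenever any CPU in the package is awake, plus a
small increment for each running CPU. Under those assumptions, spreading
the work and racing to idle wins decisively:

    /*
     * Toy energy model; the numbers are invented for illustration,
     * not measured.  The package draws a fixed base power while any
     * CPU is awake; each running CPU adds a small increment.
     */
    #include <stdio.h>

    #define BASE_POWER_W  10.0    /* package, memory controller, ... */
    #define CPU_POWER_W    2.0    /* incremental cost per running CPU */

    /* Energy used to finish "work" core-seconds on ncpus CPUs. */
    static double energy_joules(double work, int ncpus)
    {
        double runtime = work / ncpus;    /* race to idle */
        return (BASE_POWER_W + ncpus * CPU_POWER_W) * runtime;
    }

    int main(void)
    {
        /* Four core-seconds of work, packed versus spread. */
        printf("packed on 1 CPU: %.0f J\n", energy_joules(4.0, 1)); /* 48 */
        printf("spread over 4:   %.0f J\n", energy_joules(4.0, 4)); /* 18 */
        return 0;
    }

Because the base power dominates, cutting the time the package spends
awake saves far more than the extra running CPUs cost.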
So the general rule seems to be: keep all of the processors busy as long as
there is work to be done. This approach should lead to the quickest
processing and best cache utilization; it also gives the best power
utilization. In other words, the best policy for power savings looks a
lot like the best policy for performance. That conclusion came as a
surprise to some, but it makes some sense; as Arjan van de Ven put it:
So in reality, the very first thing that helps power, is to run
software efficiently. Anything else is completely secondary. If
placement policy leads to a placement that's different from the
most efficient placement, you're already burning extra power...
So why bother with multiple scheduling modes in the first place? Naturally
enough, some complications enter this picture and make it a bit less
neat. The first is that spreading load across
processors only helps if the new processors are actually put to work for a
substantial period of time, for values of "substantial" around 100μs.
For any shorter period, the cost of bringing the CPU out of even a shallow
sleep exceeds the savings gained from running a process there. So extra
CPUs should not be brought into play for short-lived tasks. Properly
implementing that policy is likely to require that the kernel gain a better
understanding of the behavior of the processes running in any given
workload.
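A minimal sketch of the test involved might look like the following. The
energy and power numbers are invented, chosen only so that the break-even
point lands at the 100μs figure mentioned above; a real implementation
would take them from the platform's cpuidle data:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative numbers only; real values would come from the
     * platform's idle-state tables. */
    #define TRANSITION_ENERGY_UJ  1   /* one idle exit and re-entry */
    #define NET_SAVING_MW        10   /* saved while the task runs
                                         on the otherwise-idle CPU */

    /* Break-even residency: 1uJ / 10mW = 100us. */
    static uint64_t break_even_ns(void)
    {
        return (uint64_t)TRANSITION_ENERGY_UJ * 1000000 / NET_SAVING_MW;
    }

    /* Waking a CPU for anything shorter than the break-even time
     * costs more energy than it saves. */
    static bool worth_waking_cpu(uint64_t expected_runtime_ns)
    {
        return expected_runtime_ns >= break_even_ns();
    }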
There is also still scope for some differences of behavior between the two
modes. In a performance-oriented mode, the scheduler might balance tasks
more aggressively, trying to keep the load the same on all processors. In
a power-savings mode, processes might stay a bit more tightly packed onto a
smaller number of CPUs, especially processes that have an observed history
of running for very short periods of time.
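In code, the difference between the two modes might come down to something
like the following sketch. Everything in it — the mode flag, the helper
functions, the 100μs threshold — is hypothetical, meant only to show the
shape of the decision:

    #include <stdint.h>

    enum sched_mode { MODE_PERFORMANCE, MODE_POWER };

    struct task_stats {
        uint64_t avg_runtime_ns;    /* observed run-length history */
    };

    /* Hypothetical helpers, assumed to exist for this sketch. */
    extern int find_idlest_cpu(void);
    extern int find_busiest_nonfull_cpu(void);

    static int pick_cpu(enum sched_mode mode, const struct task_stats *ts)
    {
        /*
         * In power mode, keep tasks with a history of very short
         * runs packed onto already-busy CPUs; waking another CPU
         * for them would cost more than it saves.
         */
        if (mode == MODE_POWER && ts->avg_runtime_ns < 100000)
            return find_busiest_nonfull_cpu();

        /* Otherwise spread the load as evenly as possible. */
        return find_idlest_cpu();
    }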
But the conversation has, arguably, only barely touched on the biggest
complication of all. There was a lot of talk of what the optimal behavior
is for current-generation x86 processors, but that is far from the only
environment in which Linux runs. ARM processors have a complex set of
facilities for power management, allowing much finer control over which
parts of the system have power and clocks at any given time. The ARM world
is also pushing the boundaries with asymmetric architectures like big.LITTLE; figuring out the optimal task
placement for systems with more than one type of CPU is not going to be an
easy task.
The problem is thus architecture-specific; optimal behavior on one
architecture may yield poor results on another. But the eventual solution
needs to work on all
of the important architectures supported by Linux. And, preferably, it
should be easily modifiable to work on future versions of those
architectures, since the way to get the best power utilization is likely to
change over time. That suggests that the mechanism currently used to
describe architecture-specific details to the scheduler (scheduling domains) needs to grow the ability
to describe parameters relevant to power management as well. An
architecture-independent scheduler could then use those parameters to guide
its behavior. That scheduler will also need a better understanding of
process behavior; the almost-ready
per-entity load tracking patch set may help
in this regard.
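What those per-architecture parameters might look like is anybody's guess
at this point; the structure below is purely a sketch of the kind of
information a scheduling domain could carry, and none of its fields exist
in the kernel today:

    /* Hypothetical power-management parameters for a scheduling
     * domain; every field here is invented for illustration. */
    struct sd_power_params {
        unsigned int   shares_sleep_state:1; /* CPUs must idle together */
        unsigned int   asymmetric:1;         /* big.LITTLE-style mix */
        unsigned long  wakeup_energy_uj;     /* cost of idle exit/entry */
        unsigned long  break_even_ns;        /* minimum useful work */
    };

An architecture would fill in one of these structures for each domain
level; the generic scheduler could then consult it when deciding whether
packing or spreading is cheaper.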
Designing and implementing these changes is clearly not going to be a
short-term job. It will require a fair amount of cooperation between the
core scheduler developers and those working on specific architectures.
But, given how long we have been without power management support in
the scheduler, and given that the bulk of the real power savings are to be
had elsewhere (in drivers and in user space, for example), we can wait a
little longer while a proper scheduler solution is
worked out.