
The power-aware scheduling miniconference

By Jonathan Corbet
August 27, 2014

Kernel Summit 2014
For the second year in a row, the annual Kernel Summit included a miniconference on the problem of power-aware scheduling — how can the scheduler place processes so that their execution consumes a minimal amount of power? Morten Rasmussen provided a summary of this year's meeting to the Summit as a whole; it seems clear that a full solution to this problem is still distant, but some progress is slowly being made.

One of the results from the 2013 meeting was that the power-aware scheduling developers needed to come up with a set of metrics and benchmarks so that the results of their work could be judged. Without such metrics, there is no way to know if power-aware scheduling patches are actually achieving their objective. This year, two tools developed by Linaro are being put into use. The first of those is a workload generation tool which can be used to run specific scheduling algorithms and watch the results. A test pattern can be described in JSON and run in the system. There are two workloads described currently: audio playback on an Android system and a generic web-browsing workload.
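As a rough illustration of what a JSON workload description could look like, here is a minimal sketch. The schema (the `tasks`, `period_us`, and `run_us` fields) is invented for this example and is not the format used by the Linaro tool; the code only computes how much of one CPU each described task would occupy.

```python
import json

# Hypothetical workload description; this schema is an assumption made for
# illustration and is NOT the format of the Linaro workload generator.
WORKLOAD_JSON = """
{
  "name": "audio-playback-like",
  "tasks": [
    {"name": "decoder", "period_us": 20000, "run_us": 3000},
    {"name": "mixer",   "period_us": 10000, "run_us": 1000}
  ]
}
"""


def task_utilizations(description):
    """Return {task name: fraction of one CPU the periodic task occupies}."""
    workload = json.loads(description)
    return {
        task["name"]: task["run_us"] / task["period_us"]
        for task in workload["tasks"]
    }


utilizations = task_utilizations(WORKLOAD_JSON)
total_utilization = sum(utilizations.values())
```

A description like this makes a run repeatable: the same periods and run times can be replayed against different scheduler patches and the results compared.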

The other tool is called "idlestat." It works from data obtained with ftrace on running systems to generate statistics on the sleep states entered by the processor and how much time is spent in each. It can accept a power model for a given processor describing the power requirements for each processor state. That model can then be used to generate an estimate of the total amount of power consumed by a given run.
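The kind of estimate described here can be sketched as a sum of time-in-state multiplied by power-in-state. This is not idlestat's actual code or input format; the state names and all of the numbers below are made up for illustration.

```python
def estimate_energy_joules(residency_s, power_w):
    """Sum (seconds spent in a state) * (watts drawn in that state)."""
    missing = set(residency_s) - set(power_w)
    if missing:
        raise KeyError(f"no power figure for states: {sorted(missing)}")
    return sum(residency_s[state] * power_w[state] for state in residency_s)


# Illustrative figures only: seconds spent in each state during a 10 s run,
# and an assumed power model giving watts per state.
residency = {"active": 2.0, "C1": 3.0, "C3": 5.0}
power_model = {"active": 1.5, "C1": 0.4, "C3": 0.05}

energy = estimate_energy_joules(residency, power_model)
```

With these assumed figures the active state dominates the total even though the processor spends most of the run asleep, which is the sort of result such a model is meant to expose.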

These tools are a good start, Morten said, but they are just a start. There is a need for community feedback on how well they describe the scheduling problem, and the developers would very much like to have more workload descriptions. For now, though, Morten cautioned, this work is being limited to CPU power consumption. That is a hard enough problem to solve without also trying to address power consumption in the graphics processor or other peripheral devices.

The load tracking that has been added to the scheduler is helpful, Morten said, but it turns out that power-aware scheduling also needs good CPU-utilization tracking, which is a bit different. With utilization tracking, the scheduler can come up with better estimates of how much CPU time each process will require in the future and use those estimates to make better scheduling decisions. Load tracking is also entirely unaware of CPU frequency scaling, a problem that must be fixed. The next step in that direction is to start to control CPU frequency scaling from the scheduler itself, rather than trying to react to what an independent CPU frequency governor is doing.
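To see why frequency awareness matters for utilization tracking, consider one possible correction: scale the observed busy fraction by the ratio of the current frequency to the maximum frequency. This is a sketch of the idea under the assumption that work scales linearly with clock speed; the function name is hypothetical and this is not presented as the scheduler's implementation.

```python
def scale_invariant_utilization(busy_fraction, cur_freq_khz, max_freq_khz):
    """Express utilization relative to the CPU running at full speed.

    Assumes (for illustration) that the work done per unit time is
    proportional to the clock frequency.
    """
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    if not 0 < cur_freq_khz <= max_freq_khz:
        raise ValueError("need 0 < cur_freq_khz <= max_freq_khz")
    return busy_fraction * cur_freq_khz / max_freq_khz


# A task that keeps the CPU 36% busy at 800 MHz needs only about 12% of
# the same CPU at 2.4 GHz.
util = scale_invariant_utilization(0.36, 800_000, 2_400_000)
```

Without a correction of this kind, the same task looks three times heavier after the governor lowers the frequency, which is exactly the confusion that tracking unaware of frequency scaling produces.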

Morten also noted that energy awareness will always have to be an optional feature for the scheduler. Some hardware wants to manage power awareness internally with a bunch of "magic behind the scenes." If that functionality cannot be turned off, the hardware is simply not going to be easy to cooperate with in this area; it's better to just let it have its way.

Developers are currently working on an energy-model-driven scheduling proposal that tries to improve the scheduler's decision-making. The task has proved to be more challenging than expected; some relatively simple techniques like small-task packing make sense sometimes, but not always. So some way of choosing between scheduling algorithms must be arrived at. One could do it with heuristics, Morten said, but that is painful in the long run. Heuristics are always, at best, an approximation of a complete solution.

The alternative is to put a model of the CPU platform into the scheduler itself. For any given configuration of processes in the system, the model can generate an estimate of what the energy cost will be. That allows the scheduler to contemplate moving processes around and estimate what the resulting power consumption will be. The platform model must be supplied by architecture-specific code; it will be based on processor idle and sleep states. There is a patch set in circulation, and there appears to be a consensus around this approach, with no major objections being expressed.
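A toy version of such a model might look like the following. Everything in it is an assumption made for illustration: the two-CPU platform, the wattages, and the linear interpolation between idle and busy power. The patch set under discussion is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuModel:
    idle_power_w: float  # power drawn when the CPU has nothing to run
    busy_power_w: float  # power drawn when the CPU is fully utilized


def placement_power(placement, models):
    """Estimate total power (W) for a mapping of CPU -> task utilizations."""
    total = 0.0
    for cpu, model in models.items():
        util = sum(placement.get(cpu, []))
        if util > 1.0:
            return float("inf")  # over-committed CPU: treat as infeasible
        total += model.idle_power_w + util * (
            model.busy_power_w - model.idle_power_w
        )
    return total


def cheapest_placement(candidates, models):
    """Pick the candidate placement with the lowest estimated power."""
    return min(candidates, key=lambda p: placement_power(p, models))


# Hypothetical platform: one small, efficient CPU and one big, hungry one.
models = {
    "little": CpuModel(idle_power_w=0.05, busy_power_w=0.5),
    "big": CpuModel(idle_power_w=0.2, busy_power_w=2.0),
}
packed = {"little": [0.3, 0.2]}
spread = {"little": [0.3], "big": [0.2]}
best = cheapest_placement([packed, spread], models)
```

With these made-up numbers, packing both tasks onto the small CPU comes out cheaper than spreading them. Different wattages could reverse that result, which is the argument for consulting a platform model rather than hard-coding a heuristic like small-task packing.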

A future task, Morten said, will be to move CPU idle-state awareness into the scheduler. Like frequency scaling, the CPU idle-management code runs as a separate subsystem with no direct communication with the scheduler. Bringing idle awareness into the power model will allow the scheduler to better manage idle time and to make better predictions of future wakeup events.
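One way to picture what idle awareness buys: if the expected idle duration is known, the deepest sleep state that still pays for itself can be chosen. The sketch below uses the common idea of comparing the predicted idle time against each state's target residency and exit latency; the state table and its numbers are invented, and this is not the kernel's idle-management code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int      # time needed to wake up from this state
    target_residency_us: int  # minimum stay for the state to save energy


def pick_idle_state(states, predicted_idle_us, latency_limit_us):
    """Return the deepest suitable state; `states` is ordered shallow to deep.

    Falls back to the shallowest state when nothing else fits.
    """
    chosen = states[0]
    for state in states:
        if (state.target_residency_us <= predicted_idle_us
                and state.exit_latency_us <= latency_limit_us):
            chosen = state
    return chosen


# Hypothetical state table; real values are platform-specific.
STATES = [
    IdleState("C1", exit_latency_us=2, target_residency_us=2),
    IdleState("C3", exit_latency_us=50, target_residency_us=200),
    IdleState("C6", exit_latency_us=300, target_residency_us=1500),
]
```

For example, `pick_idle_state(STATES, 500, 1000)` settles on C3 because a 500-microsecond nap is too short for C6 to be worthwhile. The better the prediction of the next wakeup, the more often the deep states can be used safely.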

Another future-work area is the management of power policies under virtualization. Guest systems, too, want to run in a power-efficient manner. The consensus seems to be, though, that power management should be handled entirely in the host. Guests can communicate their constraints to the hypervisor, but any attempt to implement those constraints belongs on the host side.

As Morten's report came to a close, a developer asked whether the power-aware scheduling developers were working on thermal awareness as well. That topic came up during the miniconference, Morten said, but it is not being worked on at the moment. The power model is being kept as simple as possible for now; the developers feel like they have enough complexity to deal with as it is. Once it appears that a solution to the simpler problem is in sight, they can consider taking on additional constraints like thermal management.




The power-aware scheduling miniconference

Posted Aug 29, 2014 1:24 UTC (Fri) by patrakov (subscriber, #97174)

Not only scheduling needs to be power-aware.

First, we need to have generally available enterprise-grade monitoring tools that know about power saving modes. I have already wasted money by ordering a new server prematurely, because the current Zabbix installation said "OMG, the peak CPU usage is over 80%". The fact is that the old server was running at 800 MHz. Now, due to this inability to adequately monitor the load (and, ultimately, answer the question "how many extra clients can I afford on this server") with CPU frequency scaling enabled, I always use the "performance" governor on servers.

Second, I have several bug reports from PulseAudio users who also use dcaenc (a rather CPU-hungry DTS encoder intended for use via its ALSA plugin). PulseAudio has a limit on the CPU usage allowed for its real-time threads. For the affected users, the system starts, e.g., with a 2.4 GHz CPU frequency. At that frequency, the encoder eats, say, 12%, which is just fine. Then the scaling governor notices that the CPU is only lightly loaded and decides to reduce its frequency. It goes all the way down to 800 MHz, which brings the CPU usage by the encoder to 36%. This value is still low enough that the governor thinks that the CPU is only lightly loaded. But that is not how it looks to the real-time CPU limit: PulseAudio gets killed, because it is not allowed to use more than 30%. Again, the "performance" governor is a working workaround that does not even increase the power consumption much in this scenario.
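The arithmetic behind that jump from 12% to 36% can be checked with a few lines. The sketch assumes the encoder is purely CPU-bound, so that its run time scales inversely with clock frequency; the function name is made up for this example.

```python
def usage_at_frequency(usage_pct, from_mhz, to_mhz):
    """Rescale a CPU usage percentage from one clock frequency to another.

    Assumes the work is CPU-bound, so run time is inversely proportional
    to the clock frequency.
    """
    return usage_pct * from_mhz / to_mhz


RT_LIMIT_PCT = 30  # the real-time limit quoted above

usage_low = usage_at_frequency(12, 2400, 800)  # encoder usage at 800 MHz
exceeds_limit = usage_low > RT_LIMIT_PCT
```

Under that assumption the encoder crosses the 30% limit somewhere below 960 MHz (12 * 2400 / 30), even though the amount of work it does has not changed.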

Let's hope that both issues will eventually be resolved.

The power-aware scheduling miniconference

Posted Aug 29, 2014 10:29 UTC (Fri) by dgm (subscriber, #49227)

It seems to me that limiting % CPU usage in a world where CPU speeds can change dynamically is intrinsically broken. What one needs to watch for is if your task is holding back others (say, 30% usage _and_ 100% system load).

Additionally, I think PulseAudio should just drop real-time priority instead of simply killing itself, or add a switch to allow such behavior.

The power-aware scheduling miniconference

Posted Sep 5, 2014 11:28 UTC (Fri) by farnz (subscriber, #17727)

The kernel knob you can set easily is RLIMIT_RTTIME (number of microseconds your process runs for between blocking system calls). The problem is that it's difficult to set a good value for this knob, or for a userspace knob implemented on top of it. On the one hand, blocking a core for too long is not nice to other users; on the other hand, if there are no other users, there's no harm done if you claim the core 24/7.
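For reference, this is roughly how a process could set that knob on itself, using Python's standard `resource` module on Linux. The helper name and the 200,000-microsecond value in the usage note are arbitrary choices for illustration, not recommendations.

```python
import resource


def cap_rt_cpu_time(soft_us):
    """Lower this process's soft RLIMIT_RTTIME; return the new (soft, hard).

    The limit only matters while the process runs under a real-time
    scheduling policy. On reaching the soft limit the kernel sends
    SIGXCPU; on reaching the hard limit it sends SIGKILL.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_RTTIME)
    if hard != resource.RLIM_INFINITY:
        soft_us = min(soft_us, hard)  # the soft limit may not exceed the hard one
    resource.setrlimit(resource.RLIMIT_RTTIME, (soft_us, hard))
    return resource.getrlimit(resource.RLIMIT_RTTIME)
```

Calling `cap_rt_cpu_time(200_000)` would allow at most 200 ms of real-time CPU between blocking system calls. Note that the limit is expressed in time, not in work done, so the same value is three times stricter at 800 MHz than at 2.4 GHz.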

cgroups provide a better knob, but it's still not perfect; the cgroups knob is an RT period and RT runtime pair, where RT processes in the CPU cgroup can use the processor for a total of the runtime out of every period. On my system, that's set to 950,000 microseconds of RT runtime out of every 1,000,000 microseconds (or 95% of each second). The trouble here is again releasing the CPU when it's in a low power state due to having nothing else to do: 95% of 800 MHz is very different from 95% of 4,000 MHz.
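To put numbers on that difference, here is a small sketch that converts the runtime and period pair into clock cycles available to RT tasks. It assumes the clock stays at one fixed frequency for the whole period, which real systems do not guarantee; the function names are made up for the example.

```python
def rt_share(runtime_us, period_us):
    """Fraction of each period that RT tasks may occupy."""
    if not 0 <= runtime_us <= period_us:
        raise ValueError("need 0 <= runtime_us <= period_us")
    return runtime_us / period_us


def rt_cycles_per_period(runtime_us, period_us, freq_mhz):
    """Clock cycles RT tasks may consume per period at a fixed frequency."""
    rt_share(runtime_us, period_us)  # reuse the range check
    # 1 MHz is one cycle per microsecond.
    return runtime_us * freq_mhz


share = rt_share(950_000, 1_000_000)
cycles_slow = rt_cycles_per_period(950_000, 1_000_000, 800)
cycles_fast = rt_cycles_per_period(950_000, 1_000_000, 4000)
```

The share is 95% in both cases, but the budget in cycles is five times larger at 4,000 MHz than at 800 MHz, so a time-based limit says little about how much work an RT task is actually allowed to do.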

The other issue that plays into all of this is the turbo modes that modern CPUs have - my laptop CPU is guaranteed to clock at a sustained 2.5 GHz, but where thermals allow can reach 3.1 GHz; I know from using similar CPUs in devices with good cooling that they can sustain that 3.1 GHz turbo speed forever, as long as the CPU cooler is sufficiently overspecced. How do you determine 100% system load, when the upper limit is determined by environmental factors that you can't directly measure?

Virtualization

Posted Aug 31, 2014 20:40 UTC (Sun) by robbe (guest, #16131)

For the guest OS, virtualization can look like one of these "evil" hardware/firmware combinations that vary CPU speed at their own whim.

I know that, for example, VMware takes some pains to never let a single core of a multi-core VM run much more often ("faster" from the vantage point of the guest) than the others. If Linux could better cope with such hardware/firmware, the hypervisor would need to care less, resulting in higher consolidation.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds