
Rethinking power-aware scheduling

By Jonathan Corbet
January 10, 2012
Sometimes it seems that there are few uncontroversial topics in kernel development, but saving power would normally be among them. Whether the concern is keeping a battery from running down too soon or keeping the planet from running down too soon, the ability to use less power per unit of computation is seen as a good thing. So when the kernel's scheduler maintainer threatened to rip out a bunch of power-saving code, it got some people's attention.

The main thing the scheduler can do to reduce power consumption is to allow as many CPUs as possible to stay in a deep sleep state for as long as possible. With contemporary hardware, a comatose CPU draws almost no power at all. If there is a lot of CPU-intensive work to do, there will be obvious limits on how much sleeping the CPUs can get away with. But, if the system is lightly loaded, the way the scheduler distributes running processes can have a significant effect on both performance and power use.

Since there is a bit of a performance tradeoff, the scheduler exports a couple of tuning knobs under /sys/devices/system/cpu. The first, called sched_mc_power_savings, has three possible settings:

  0. The scheduler will not consider power usage when distributing tasks; instead, tasks will be distributed across the system for maximum performance. This is the default value.

  1. One core will be filled with tasks before tasks are moved to other cores. The idea is to concentrate the running tasks on a relatively small number of cores, allowing the others to remain idle.

  2. Like (1), but with the additional tweak that newly awakened tasks will be directed toward "semi-idle" cores rather than started on an idle core.

There is another knob, sched_smt_power_savings, that takes the same set of values, but applies the results to the threads of symmetric multithreading (SMT) processors instead. These threads look a lot like independent processors, but, since they share most of the underlying hardware, they are not truly independent from each other.
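
For concreteness, setting one of these knobs comes down to a single sysfs write. A minimal sketch (in Python, assuming a 2012-era kernel that exports the files described above, and run as root):

    # Minimal sketch: select the "fill one core first" policy (value 1)
    # on the multi-core knob. Assumes the sysfs path described in the
    # article; requires root.
    KNOB = "/sys/devices/system/cpu/sched_mc_power_savings"

    with open(KNOB, "w") as f:
        f.write("1")

Writing 0 or 2 selects the other policies; sched_smt_power_savings is driven the same way.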

Recently, Youquan Song noticed that sched_smt_power_savings did not actually work as advertised; a quick patch followed to fix the problem. Scheduler maintainer Peter Zijlstra objected to the fix, but he also made it clear that he objects to the power-saving machinery in general. Lest there be any doubt, he came back with a patch removing the whole thing and a threat to merge it unless somebody puts some effort into cleaning up the power-saving code.

Peter subsequently made it clear that he sees the value of power-aware scheduling; the real problem is in the implementation. And, within that, the real problem seems to be the control knobs. The two knobs provide similar behavioral controls at two levels of the scheduler domain hierarchy. But, with three possible values for each, the result is nine different modes that the scheduler can run in. That seems like too much complexity for a situation where the real choice comes down to "run as fast as possible," or "use as little power as possible."

In truth, it is not quite that simple. The performance cost of loading up every thread in an SMT processor is likely to be higher than that of concentrating tasks at higher levels. Those threads contend for the actual CPU hardware, so they will slow each other down. So one could conceive of situations where an administrator might want to enable different behavior at different levels, but such situations are likely to be quite rare. It is probably not worth the trouble of maintaining the infrastructure to support nine separate scheduler modes just in case somebody wants to do something special.

For added fun, early versions of the patch adding the "book" scheduling level (used only by the s390 architecture) included a sched_book_power_savings switch, though that switch went away before the patch was merged. There is also the looming possibility that somebody may want to do the same for scheduling at the NUMA node level. There comes a point where the number of possibilities becomes ridiculous. Some people - Peter, for example - think that point has already been reached.

That conclusion leads naturally to talk of what should replace the current mechanism. One solution would be a simple knob with two settings: "performance" or "low power." It could, as Ingo Molnar suggested, default to performance for line-connected systems and low power for systems on battery. That seems like a straightforward solution, but there is also a completely different approach suggested by Indan Zupancic: move that decision making into the CPU governor instead. The governor is charged with deciding which power state a CPU should be in at any given (idle) time. It could be given the additional task of deciding when CPUs should be taken offline entirely; the scheduler could then just do its normal job of distributing tasks among the CPUs that are available to it. Moving this responsibility to the governor is an interesting thought, but one which does not currently have any code to back it up; until somebody rectifies that little problem, a governor-based approach probably will not receive a whole lot more consideration.
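
As a rough illustration of that governor-based idea, such a policy could be prototyped in user space using only the standard CPU-hotplug files. Everything in this sketch - the load metric, the thresholds, the loop itself - is an assumption made for illustration; no such code was posted in the discussion:

    import os
    import time

    CPU_SYSFS = "/sys/devices/system/cpu"

    def set_online(cpu, online):
        """Toggle a CPU via the standard hotplug knob (cpu0 usually has none)."""
        path = os.path.join(CPU_SYSFS, f"cpu{cpu}", "online")
        if os.path.exists(path):
            with open(path, "w") as f:
                f.write("1" if online else "0")

    def governor_loop(ncpus, low=0.25, high=0.75, interval=5):
        """Offline CPUs when utilization is low; bring them back when it
        rises. The scheduler then simply uses whatever is online."""
        online = ncpus                      # assume all CPUs start online
        while True:
            load = os.getloadavg()[0]       # one-minute load average
            if load / online < low and online > 1:
                online -= 1
                set_online(online, False)   # power down the highest CPU
            elif load / online > high and online < ncpus:
                set_online(online, True)    # restore capacity for a burst
                online += 1
            time.sleep(interval)

Even this toy shows the appeal of the division of labor: once some other component decides which CPUs are available, the scheduler needs no power-savings modes of its own.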

Somebody probably will come through with the single-knob approach; whether they will also follow through and clean up the power-saving implementation within the scheduler is harder to say. Even the simpler knob, though, should be enough to avert the threat of seeing that code removed altogether. And that is certainly a good thing; imagine the power that would be uselessly consumed in a flamewar over a regression in the kernel's power-aware scheduling ability.



Rethinking power-aware scheduling

Posted Jan 12, 2012 17:45 UTC (Thu) by jhhaller (subscriber, #56103)

With green legislation and regulatory enforcement continuing to grow, it will become more difficult to run in the "performance" setting and still meet those requirements. There are standards for measuring power consumption, both for some products used and some products sold. For example, for telecommunications systems, ETSI has defined standards for measuring the power consumption of a number of product types, so that systems which power down during low-load intervals receive better evaluations. The companies which install these products are under both regulatory and financial pressure to minimize electricity consumption. I'm sure other industries have similar pressures. One could see a future where Microsoft advertises lower power consumption per workload than Linux if we don't get this right.

I think the question is how we get the right people motivated to make power-saving scheduling algorithms have a minimal performance impact compared to the existing algorithms, while still delivering relatively low power consumption. For those of us who pay others to supply our kernels, I suppose the only way we have of doing this is to tell our suppliers how important the issue is.

Rethinking power-aware scheduling

Posted Jan 13, 2012 10:04 UTC (Fri) by dgm (subscriber, #49227)

> One could see a future where Microsoft advertises lower power consumption per workload than Linux if we don't get this right.

Microsoft isn't advertising it yet, but it has been the norm for a few years, especially on netbooks. The net is full of articles and blog posts about that.

Rethinking power-aware scheduling

Posted Jan 13, 2012 17:52 UTC (Fri) by daglwn (subscriber, #65432)

That's exactly right. Even in the HPC world where I work, power is already at the top of the list of concerns. Performance always matters, but we no longer have the seemingly unlimited resources we could imagine just a few years ago.

One thing that troubles me about the conversation is the idea that one can determine power needs based on whether the machine is running on battery or not. I know that it can be customized. It's the thought process and assumptions made that concern me. We're at the point where EVERYONE needs low power, just various degrees of it.

Rethinking power-aware scheduling

Posted Jan 13, 2012 22:20 UTC (Fri) by dlang (✭ supporter ✭, #313)

everyone needs low power, but not everyone is willing to sacrifice performance to get low power.

that's the key issue here.

the optimal thing for performance is to distribute the work as widely as possible to reduce the impact of shared-resource contention (even if that shared resource is just the cache attached to a particular core)

But that leads to many cores running at a small fraction of their capacity.

the optimal power saving mode is to get as many cores as possible to be completely idle so that they can be powered down, even if this reduces performance.

which one is the right choice depends on what you are trying to do, but if I purchase a machine with 8 cores, I don't want the system slowing my response time by 10% because it thinks that approximately the same performance can be achieved by only using 4 cores. If I was willing to accept that, I would have saved money (and even more power) by only buying 4 cores in the first place.

Rethinking power-aware scheduling

Posted Jan 13, 2012 22:32 UTC (Fri) by mjg59 (subscriber, #23239)

You gain performance by reducing cache contention, but you also potentially gain performance by having multiple threads on the same core running directly out of a shared cache. Bursty workloads may also benefit from being concentrated on one package in order to reduce the likelihood of that package entering deep package C states, while still giving an overall power win because the other packages can enter them.

It's not an either/or scenario. If you care deeply about performance then you need to tune your scheduler for your specific workload, just like you end up having to tune the VM or the I/O scheduler.

Rethinking power-aware scheduling

Posted Jan 13, 2012 23:08 UTC (Fri) by dlang (✭ supporter ✭, #313)

if you have processes/threads sharing something, then you should take that into account when scheduling them to reduce the cost of the sharing.

but if you are talking about separate processes (the more common case), then you don't gain anything by having them share a cache; in fact, they are less likely to be able to run out of cache if they share it, because they will be contending for the space.

if entering C states is reducing your performance, then the answer should be to change the controller that is putting you into those C states so that it hurts you less.

this doesn't require tuning the scheduler for every workload; it simply takes accepting that what is best for power is not going to be best for performance, and therefore not insisting that 'everyone cares about power', which implies that the power-saving mode is the only one that should matter.

going back up the thread a few posts: the heuristic that you are probably willing to sacrifice a bit of performance for significant reductions in power use when on battery power, but probably not when on line power, does represent the real world. It's not perfect, which is why it is a default, not a hard-coded mode, but it's a pretty accurate heuristic.

Rethinking power-aware scheduling

Posted Jan 13, 2012 23:15 UTC (Fri) by mjg59 (subscriber, #23239)

It's not the heuristic our customers ask for, so I'd be interested to know how you're defining it as accurate.

Rethinking power-aware scheduling

Posted Jan 13, 2012 23:23 UTC (Fri) by dlang (✭ supporter ✭, #313)

It's exactly what people are used to.

on their laptops, when they unplug, the screen dims slightly and the system switches to a more aggressive power saving mode.

your customers are not asking for it explicitly because they are used to getting it by default.

most of them won't realise what the problem is if they don't get it; they will just consider their device sluggish (or at least not as fast as the competition) if they don't get peak performance when plugged in, and they will consider the device/OS a power hog if it doesn't last as long on battery power.

Rethinking power-aware scheduling

Posted Jan 13, 2012 23:28 UTC (Fri) by mjg59 (subscriber, #23239)

Since we mostly sell into the enterprise server market, I'm pretty sure that that's not what they're talking about.

Rethinking power-aware scheduling

Posted Jan 13, 2012 23:39 UTC (Fri) by dlang (✭ supporter ✭, #313)

Ok, I don't know who you are or what you market, but I also don't know very many enterprise servers that have battery-powered modes.

Rethinking power-aware scheduling

Posted Jan 13, 2012 23:44 UTC (Fri) by mjg59 (subscriber, #23239)

None. That's the point. They want aggressive power management despite these devices always being plugged in. The assumption that just because you're not running off battery you're not interested in power management is one that's untrue for a huge proportion of Linux users. It's in no way an accurate heuristic.

Rethinking power-aware scheduling

Posted Jan 14, 2012 0:02 UTC (Sat) by dlang (✭ supporter ✭, #313)

so if you want aggressive power management, you set it. nobody is preventing it.

But setting aggressive power management for all cases as the default for everyone is wrong.

Rethinking power-aware scheduling

Posted Jan 14, 2012 22:43 UTC (Sat) by raven667 (subscriber, #5198)

But powering more hardware than the workload requires wastes power for no benefit. Isn't it a good idea to work on the scheduler so that it can run the computer just as hard, with just as much power use, as necessary and no more? You may want a tunable to make the power saving so aggressive that it affects performance, although the conventional wisdom, IIRC, is that making workloads run slowly makes them use more power by running longer. Making the default no power saving at all is probably not reasonable.

Rethinking power-aware scheduling

Posted Jan 14, 2012 22:48 UTC (Sat) by dlang (✭ supporter ✭, #313)

there is no way to have power savings with no performance penalty under any conditions.

it takes time to bring CPUs out of sleep states, and during that time the work that is waiting for them may not be able to get done.

it is not always less power to run at full speed and then sleep; that is frequently the case, but it depends on the ability to move in and out of sleep states, along with the amount of power saved.

In this case, we are talking about the options when you have multiple cores, some sharing components, and have less work than it takes to max out all the cores.

putting all the work on one core and powering off the other cores may save power, but it could make the work take longer (though not enough longer to use more power than the other cores would consume if they were not powered down). for some people, having the work take slightly longer won't matter; for others it will.
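
To make that tradeoff concrete, a toy comparison with invented numbers (the wattages are purely illustrative assumptions):

    # Invented numbers illustrating why "race to idle" is not automatically
    # cheaper: finishing fast and sleeping vs. running slowly the whole time.
    P_FULL, P_SLOW, P_SLEEP = 30.0, 14.0, 1.0    # watts (hypothetical)

    race_to_idle = P_FULL * 1.0 + P_SLEEP * 1.0  # 1s of work, 1s asleep: 31 J
    run_slow = P_SLOW * 2.0                      # same work spread over 2s: 28 J

    # Here the slower mode wins; raise P_SLOW or lower P_FULL and the
    # race-to-idle strategy wins instead. It depends on the numbers.
    print(race_to_idle, run_slow)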

Rethinking power-aware scheduling

Posted Jan 15, 2012 3:53 UTC (Sun) by raven667 (subscriber, #5198)

There is no reason to think that a power-aware scheduler can't be good enough to be the default, is there? Even for latency-sensitive operations, the scheduler could keep some amount of idle capacity available for bursts of work without running the whole machine at full bore. It seems to me that power saving should be the default even for machines on mains power.

Rethinking power-aware scheduling

Posted Jan 15, 2012 13:48 UTC (Sun) by mjg59 (subscriber, #23239)

Nonsense. Turbo mode is an example of aggressive power management resulting in significantly enhanced performance under certain workloads.

Rethinking power-aware scheduling

Posted Jan 15, 2012 11:09 UTC (Sun) by liljencrantz (guest, #28458)

What makes you so sure that your intuition about an accurate heuristic for power management is so much better than Matthew Garrett's? As a kernel developer who seems to work almost full-time on power issues for Red Hat, he presumably has more than a passing familiarity with the needs of the enterprise market. If your answer is along the lines of intuition or personal experience, then perhaps you should consider the possibility that your needs are the atypical ones. If your answer is something entirely different, then please enlighten us, because right now it looks like you're stating opinions as facts.

My somewhat limited personal experience is that most data centers I've worked with have a total power limit per rack that is painfully low, and that reducing power usage by a few watts per system would allow us to stuff one more server into each rack, leading to a significant amount of savings. This resonates well with what Garrett is saying.

Rethinking power-aware scheduling

Posted Jan 12, 2012 21:41 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246)

And then there's the fun AMD's new Bulldozer platform brings with its core/module structure. It's not exactly hyperthreading, but not exactly different either. You also get the excitement of being able to run some modules at a higher clock if others are powered down.

Speaking as an outsider who's not really familiar with the guts here, it does seem like the governor might be a good place to decide which CPUs the scheduler should play with, and maybe what order it should fill them in. It seems like that would potentially allow you to factor the CPU- and system-specific heuristics for hyperthreading, core/module, NUMA, etc. out of the actual process of scheduling.

Rethinking power-aware scheduling

Posted Jan 12, 2012 23:16 UTC (Thu) by dlang (✭ supporter ✭, #313)

you can't abstract all of this out; for several of these things there are performance implications in how you distribute the work across the cores you have, and only the scheduler should be involved in that decision. this is mostly about shared resources (cores for hyperthreading, the FPU for Bulldozer, memory I/O for NUMA, etc.)

but the governor can know that if it shuts down some cores it can increase the clock on the remaining cores, and as a result it may choose to shut down some cores even if the remaining cores couldn't _quite_ handle the expected load at the current clock speed.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds