
Power-aware scheduling meets a line in the sand

By Jonathan Corbet
June 5, 2013
As mobile and embedded processors get more complex — and more numerous — the interest in improving the power efficiency of the scheduler has increased. While a number of power-related scheduler patches exist, none seem all that close to merging into the mainline. Getting something upstream always looked like a daunting task; scheduler changes are hard to make in general, these changes come from a constituency that the scheduler maintainers are not used to serving, and the existence of competing patches muddies the water somewhat. But now it seems that the complexity of the situation has increased again, to the point that the merging of any power-efficiency patches may have gotten even harder.

The current discussion started at the end of May, when Morten Rasmussen posted some performance measurements comparing a few of the existing patch sets. The idea was clearly to push the discussion forward so that a decision could be made regarding which of those patches to push into the mainline. The numbers were useful, showing how the patch sets differ over a small set of workloads, but the apparent final result is unlikely to please any of the developers involved: after Ingo Molnar posted a strongly worded "line in the sand" message on how power-aware scheduling should be designed, it is entirely possible that none of those patch sets will be merged in anything close to their current form.

Ingo's complaint is not really about the current patches; instead, he is unhappy with how CPU power management is implemented in the kernel now. Responsibility for CPU power management is currently divided among three independent components:

  • The scheduler itself clearly has a role in the system's power usage characteristics. Features like deferrable timers and suppressing the timer tick when idle have been added to the scheduler over the years in an attempt to improve the power efficiency of the system.

  • The CPU frequency ("cpufreq") subsystem regulates the clock frequency of the processors in response to each processor's measured idle time. If the processor is idle much of the time, the frequency (and, thus, power consumption) can be lowered; an always-busy processor, instead, should run at a higher frequency if possible. Most systems probably use the "ondemand" cpufreq governor, but others exist; a simplified sketch of the idle-time heuristic appears after this list. The big.LITTLE switcher operates at this level by making the difference between "big" and "little" processors look like a wide range of frequency options.

  • The cpuidle subsystem is charged with managing processor sleep states. One might be tempted to regard sleeping as just another frequency option (0 Hz, to be exact), but sleep is rather more complicated than that. Contemporary processors have a wide range of sleep states, each of which differs in the amount of power consumed, the damage inflicted upon CPU caches, and the time required to enter and leave that state.
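As a rough illustration of the kind of decision an ondemand-style governor makes, consider the following minimal sketch; it is not the kernel's actual code, and the thresholds are invented for the example:

    /* pick a new frequency from the fraction of the sampling period spent busy */
    static unsigned int pick_frequency(unsigned int busy_us, unsigned int period_us,
                                       unsigned int cur_khz, unsigned int min_khz,
                                       unsigned int max_khz)
    {
        unsigned int load = (busy_us * 100) / period_us;

        if (load > 80)          /* nearly saturated: jump to the maximum */
            return max_khz;
        if (load < 20)          /* mostly idle: step the clock down */
            return cur_khz > min_khz ? cur_khz - cur_khz / 4 : min_khz;
        return cur_khz;         /* in between: leave the frequency alone */
    }

The real governor is rather more elaborate, but the essential input is the same: measured idle time, with no direct knowledge of what the scheduler is about to do.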

Ingo's point is that splitting the responsibility for power management decisions among three components leads to a situation where no clear policy can be implemented:

Today the power saving landscape is fragmented and sad: we just randomly interface scheduler task packing changes with some idle policy (and cpufreq policy), which might or might not combine correctly. Even when the numbers improve, it's an entirely random, essentially unmaintainable property: because there's no clear split (possible) between 'scheduler policy' and 'idle policy'.

He would like to see a new design wherein the responsibility for all of these aspects of CPU operation has been moved into the scheduler itself. That, he claims, is where the necessary knowledge about the current workload and CPU topology lives, so that is where the decisions should be made. Any power-related patches, he asserts, must move the system in that direction:

This is a "line in the sand", a 'must have' design property for any scheduler power saving patches to be acceptable - and I'm NAK-ing incomplete approaches that don't solve the root design cause of our power saving troubles.

Needless to say, none of the current patch sets include a fundamental redesign of the scheduler, cpuidle, and cpufreq subsystems. So, for all practical purposes, all of those patches have just been rejected in their current form — probably not the result the developers of those patches were hoping for.

Morten responded with a discussion of the kinds of issues that an integrated power-aware scheduler would have to deal with. It starts with basic challenges like defining scheduling policies for power-efficient operation and defining a mechanism by which a specific policy can be chosen and implemented. There would be a need to represent the system's power topology within the scheduler; that topology might not match the cache hierarchy represented by the existing scheduling domains data structure. Thermal management, which often involves reducing CPU frequencies or powering down processors entirely, would have to be factored in. And so on. In summary, Morten said:

This is not a complete list. My point is that moving all policy to the scheduler will significantly increase the complexity of the scheduler. It is my impression that the general opinion is that the scheduler is already too complicated. Correct me if I'm wrong.

In his view, the existing patch sets are part of an incremental solution to the problem and a step toward the overall goal. Whether Ingo will see things the same way is, as of this writing, unclear. His words were quite firm, but lines in the sand are also relatively easy to relocate. If he holds fast to his expressed position, though, the addition of power-aware scheduling could be delayed indefinitely.

It is not unheard of for subsystem maintainers to insist on improvements to existing code as a precondition to merging a new feature. Such requirements have been criticized as unfair at past kernel summits, but they persist anyway. In this case, Ingo's message, on its face, demands a redesign of one of the most complex core kernel subsystems before (more) power awareness can be added. That is a significant raising of the bar for developers who were already struggling to get their code looked at and merged. A successful redesign on that scale is unlikely to happen unless the current scheduler maintainers put a fair amount of their own time into the effort.

The cynical among us could certainly see this position as an easy way to simply make the power-aware scheduling work go away. That is almost certainly an incorrect interpretation, though. The more straightforward explanation — that the scheduler maintainers want to see the code get better and more maintainable over time — is far more likely. What has to happen now is the identification of a path toward that better scheduler that allows for power management improvements in the short term. The alternative is to see the power-aware scheduler code relegated to vendor and distributor trees, which seems like a suboptimal outcome.



Value

Posted Jun 5, 2013 20:16 UTC (Wed) by ncm (subscriber, #165) [Link]

Once again LWN reminds us why we keep up our subscriptions. Are you really chipping in your share?

Power-aware scheduling meets a line in the sand

Posted Jun 5, 2013 20:40 UTC (Wed) by naptastic (subscriber, #60139) [Link]

*cough* pluggable schedulers *cough*

I want to learn more...

Posted Jun 5, 2013 22:06 UTC (Wed) by pr1268 (subscriber, #24648) [Link]

I'm terribly unenlightened about these—can you provide pointers to their design overview, theory of operation, how a pluggable scheduler would work in the context of the Linux Kernel, etc.? Thanks!

Signed,

Just a little curious

I want to learn more...

Posted Jun 5, 2013 22:39 UTC (Wed) by SiliconSlick (subscriber, #39955) [Link]

Long ago, HP was working on something like this. Here's a link via the Way Back Machine:

http://web.archive.org/web/20031210095042/http://resource...

(see the white paper at the bottom)

Not sure whatever became of it.

I want to learn more...

Posted Jul 14, 2013 0:21 UTC (Sun) by alison (✭ supporter ✭, #63752) [Link]

Thanks for the link, SiliconSlick! The work you cite appears to anticipate cgroups: cool. I worked for HP at the time and regret not knowing Scott Rhine, the author.

I want to learn more...

Posted Jun 5, 2013 22:45 UTC (Wed) by naptastic (subscriber, #60139) [Link]

It's a not-quite perennial topic that has, historically, been shot down every time it's come up. However, the problem of coming up with a single scheduler / power configuration system that performs well on all systems looks less and less soluble with each passing year. Consider, for example, this conversation from 2004:

http://lwn.net/Articles/109458/

NUMA and hyperthreading were much newer and less understood then, and things like big.LITTLE weren't even on the horizon. The objection there was that the effort of maintaining a separate scheduler per situation would outweigh the benefits, and result in many crappy schedulers instead of one good one; I believe the balance has shifted the other way now.

The Brainfuck Scheduler came out in 2009, and pluggable schedulers came up in conversation again. I can't find the reference but I think it was Linus that time that shot them down. He also shot down priority inheritance many times before finally agreeing to it, so things like this can change.

We have NUMA, HT, big.LITTLE, and AMD's somewhat odd two-cores-one-FPU setup now. Who knows what other permutations chip manufacturers will bring us in coming years. (Hopefully many!) The kernel scheduler has improved by leaps and bounds, but there is still a meaningful gap in performance between mainline and (in my case) BFS on specific hardware. BFS is marginally worse on my quad-socket Opteron system, and better enough on my single-socket systems that I use it exclusively. Imagine the performance deltas possible with a custom big.LITTLE scheduler.

I believe Ingo is right in his assertion that we need to look at the scheduler and the power management framework in the same context. I do not think that having one scheduler and power management framework to rule them all is realistic if we want to get the most from our hardware.

Of course, this is largely conjecture and meaningful benchmarks may not be available for a while yet.

I want to learn more...

Posted Jun 5, 2013 22:55 UTC (Wed) by naptastic (subscriber, #60139) [Link]

Oh, also in 2004, Con Kolivas posted the beginnings of a system to make the scheduler pluggable:

https://groups.google.com/forum/?fromgroups#!msg/fa.linux...

Based on my limited understanding of the issues involved, I don't think the approach Con took in this patch set is going in a direction compatible with what Ingo is now suggesting. I could be wrong, of course.

Power-aware scheduling meets a line in the sand

Posted Jun 5, 2013 23:21 UTC (Wed) by aliguori (guest, #30636) [Link]

That's just punting the problem.

I want low power consumption and I want maximum performance. It shouldn't have to be a choice.

In fact, the two are very much related. You want maximum work done while you're awake so you can extend the time that you sleep for.

And since most modern systems have NUMA characteristics, to get maximum performance you really need to be NUMA aware. It's all deeply inter-related.

Power-aware scheduling meets a line in the sand

Posted Jun 6, 2013 0:42 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

> I want low power consumption and I want maximum performance. It shouldn't have to be a choice.

sorry, it doesn't work that way in the real world

if you can get 1% better performance by spreading your work across 4 cores instead of 2 (and powering down the other two), which wins? performance or power?

If tasks were batched, you could do all your work and then go to sleep. But if you don't know the future work that you will be asked to do, should you go to low-power mode, even if it means that you will be slower to do work when it arrives?

It takes time to wake up from deep sleep states, sometimes you would rather be working than waiting to wake up :-)
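To put toy numbers on the first question (every wattage and timing here is invented for illustration; real chips are messier):

    #include <stdio.h>

    int main(void)
    {
        double p_busy = 2.0, p_idle = 0.2;  /* watts per core; invented numbers */

        /* "spread": 4 cores busy for 0.5s, then all 4 sit idle for 0.5s */
        double e_spread = 0.5 * 4 * p_busy + 0.5 * 4 * p_idle;

        /* "pack": 2 cores busy for the full 1s, the other 2 powered off (0 W) */
        double e_pack = 1.0 * 2 * p_busy;

        printf("spread: done in 0.5s, %.1f J\n", e_spread);  /* 4.4 J */
        printf("pack:   done in 1.0s, %.1f J\n", e_pack);    /* 4.0 J */
        return 0;
    }

Packing saves a little energy but doubles the completion time; neither answer is "correct" until somebody picks a policy.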

Power-aware scheduling meets a line in the sand

Posted Jun 6, 2013 2:22 UTC (Thu) by aliguori (guest, #30636) [Link]

> if you can get 1% better performance by spreading your work across 4 cores instead of 2 (and powering down the other two), which wins? performance or power?

If all you care about is performance, never do any power management. If all you care about is power savings, just shut the machine off and take the infinite performance hit.

I don't think anyone really falls at either extreme.

Power-aware scheduling meets a line in the sand

Posted Jun 6, 2013 2:38 UTC (Thu) by naptastic (subscriber, #60139) [Link]

If I were the pointy-haired boss:

1. Tune settings on my device to err on the side of performance, rather than power (battery life)

2. With every update, move just a little more towards battery life, and advertise that you've increased the battery life.

3. When more people are moving away from your device because it's slow than were moving away from it because it had poor battery life, you've reached your balance.

4. Come out with a new product with newer hardware, start the process over again. Maybe do it the other direction just for kicks.

Power-aware scheduling meets a line in the sand

Posted Jun 6, 2013 7:05 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

And the point is that there is no one correct way of doing things. You can't both maximize performance and minimize power use.

You make a trade-off between them.

And since different people will want different points in the trade-off, there needs to be a way of setting the different policies, and then different ways of making decisions based on what the policy is.

not to mention that the trade-offs are going to be very different on different hardware, and you have zero real clue about what the hardware will want in 2-3 years.

Power-aware scheduling meets a line in the sand

Posted Jun 6, 2013 7:29 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

a comment I posted elsewhere is also relevant to this discussion. I'll post a link rather than reposting the longer comment

in https://lwn.net/Articles/553086/ I give an example (disk I/O) where performance can involve 'wasting' effort by doing things that you may be able to avoid if you just wait.

Power-aware scheduling meets a line in the sand

Posted Jun 6, 2013 18:11 UTC (Thu) by sbohrer (subscriber, #61058) [Link]

> I don't think anyone really falls at either extreme.

There are plenty of us who disable all power savings. Having power saving enabled can easily mean a 30-40% performance hit, which is unacceptable for some workloads.

Power-aware scheduling meets a line in the sand

Posted Jun 6, 2013 21:42 UTC (Thu) by naptastic (subscriber, #60139) [Link]

> There are plenty of us who disable all power savings. Having power saving enabled can easily mean a 30-40% performance hit, which is unacceptable for some workloads.

Or a 5ms latency increase at unpredictable times, which ruins my recording and playback.

Power-aware scheduling meets a line in the sand

Posted Jun 5, 2013 21:02 UTC (Wed) by dlang (✭ supporter ✭, #313) [Link]

I read his message as being less about redesigning the core scheduler than about figuring out how to define the appropriate hooks in it so that it can make more decisions.

That could be a redesign of everything, but it doesn't need to be. And it especially should not be a redesign from scratch to get started.

Morten's response listed a lot of the concerns, but I think there is a fairly clear path to get started.

Since the current scheduling domains data structure does not include the power-related information, start by defining a structure that does (a purely hypothetical sketch appears at the end of this comment).

Then people can start talking about how to hook the different policies into the scheduler and have it switch between them (start with the simple 'put things on as few CPUs as possible to save power' vs 'spread across as many cores as possible for performance')

Meanwhile, the mechanism to go into sleep modes can be implemented; then, in the idle slow path, you can insert logic to decide whether you should go into a sleep state and, if so, which one.
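For example, such a structure might carry something like the following (purely illustrative kernel-style pseudo-code; none of these names exist in the kernel):

    enum sched_power_policy {
        POWER_POLICY_PACK,           /* put work on as few CPUs as possible */
        POWER_POLICY_SPREAD,         /* spread work across as many as possible */
    };

    struct power_domain {
        cpumask_t cpus;              /* CPUs sharing these power/speed settings */
        unsigned int min_freq_khz;   /* lowest supported frequency */
        unsigned int max_freq_khz;   /* highest supported frequency */
        unsigned int wakeup_cost_us; /* latency to leave the deepest sleep state */
        bool can_power_gate;         /* can the domain be powered off entirely? */
        enum sched_power_policy policy;
    };

The policy field would give the load balancer something concrete to switch on.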

Power-aware scheduling meets a line in the sand

Posted Jun 5, 2013 22:27 UTC (Wed) by rahvin (subscriber, #16953) [Link]

I read his message exactly as Jon read it. He wants everything power related in the scheduler so it's all in one place. I think it's a stretch to argue he meant hooking the scheduler and putting the code elsewhere.

Personally, I don't get how he can insist it all be in one place. I get it from an organizational point of view, but power management touches all sorts of things that are completely unrelated. If he wants to go down that road, I would agree the solution is to hook the relevant subsystems and create a new power management system that manages all the diverse areas where power management is a concern; but then you go even further from the CPU and scheduler and tuck in disk, screen, and other management so you have centralized power control. (Maybe that is the only real solution to total power management on devices like tablets.)

Such a system would probably decentralize things to an extent in that CPU frequency control would end up outside the CPU code and in the power management code. That sounds like it would greatly complicate the kernel and understanding how it works and would be contrary to the intent.

I don't think there is a clean solution to this problem, as I said, power management is a device wide issue and touches almost every subsystem. I think the modest improvements path is probably the most viable.

Great article!

Power-aware scheduling meets a line in the sand

Posted Jun 5, 2013 22:39 UTC (Wed) by dlang (✭ supporter ✭, #313) [Link]

I don't think that anyone is saying that all power management belongs there.

But having multiple systems trying to control how much power is used by the CPUs just does not make sense.

There are items that the scheduler MUST do (like deciding to concentrate or distribute tasks).

And then there are things that the scheduler is in the best position to guess (predicting future loads for decisions on CPU speed settings).

Having one subsystem (cpufreq or cpuidle) try to guess what the scheduler is going to do just doesn't work well.

Power-aware scheduling meets a line in the sand

Posted Jun 5, 2013 23:59 UTC (Wed) by luto (subscriber, #39314) [Link]

For example, the new intel_idle driver is using fancy MSRs to estimate how hard the CPU has been working lately. The scheduler should be (and probably is) using similar metrics to charge tasks for their CPU usage. The two could be combined.
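The APERF/MPERF MSR pair is the classic example of such a metric: both count only while the CPU is not idle, so the ratio of their deltas shows how fast the CPU actually ran, relative to its base frequency, while it was busy. A minimal userspace sketch (it requires the msr driver and root; error handling is mostly omitted):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MSR_IA32_MPERF 0xE7
    #define MSR_IA32_APERF 0xE8

    static uint64_t rdmsr(int fd, off_t reg)
    {
        uint64_t val = 0;
        pread(fd, &val, sizeof(val), reg);  /* the msr driver uses the MSR number as the offset */
        return val;
    }

    int main(void)
    {
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) {
            perror("open /dev/cpu/0/msr");
            return 1;
        }

        uint64_t m0 = rdmsr(fd, MSR_IA32_MPERF), a0 = rdmsr(fd, MSR_IA32_APERF);
        sleep(1);
        uint64_t m1 = rdmsr(fd, MSR_IA32_MPERF), a1 = rdmsr(fd, MSR_IA32_APERF);

        printf("APERF/MPERF over 1s: %.3f\n", (double)(a1 - a0) / (m1 - m0));
        close(fd);
        return 0;
    }

A ratio above 1.0 means the CPU was running in turbo while busy; below 1.0, it was being throttled down.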

Power-aware scheduling meets a line in the sand

Posted Jun 6, 2013 5:05 UTC (Thu) by amit.kucheria (subscriber, #59246) [Link]

Similar experiments have been done using data from ARM performance counters in decisions about CPU state, e.g. frequency scaling. The results are encouraging, but we need to figure out generic kernel interfaces to make use of these systems.

Power-aware scheduling meets a line in the sand

Posted Jun 6, 2013 18:11 UTC (Thu) by rahvin (subscriber, #16953) [Link]

Personally I think the future is device-specific power management. How we get to that point, or some subset of it, is beyond me, but there is so much different hardware available these days (how many versions of ARM CPUs exist? is there even a count?) that I don't know how you can achieve reliable and efficient power management without going down the device-specific route.

Even if the kernel is only responsible for CPU power management there are going to be so many variations of best practices in the future that I don't know that it's manageable without some abstraction that allows variable management for every different device. The massive variation in CPUs and capabilities is only going to get worse in the near term.

Power-aware scheduling meets a line in the sand

Posted Jun 7, 2013 2:25 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

do you really want to have to have a different kernel for every chip release?

how about having to pick the right kernel for running on battery vs being plugged in?

Yes, different CPUs and systems have different details, but most of these can be abstracted out so that the policy engine can make reasonable decisions based on a description of the system (and the related costs). When new systems come out that can't be described reasonably in the existing structure, the kernel will be adapted to cover the new type of system.

Power-aware scheduling meets a line in the sand

Posted Jun 11, 2013 1:48 UTC (Tue) by pr1268 (subscriber, #24648) [Link]

While I can empathize with Ingo's frustration regarding a lack of clear policy, I still think there needs to be a separation of the scheduler code and the cpufreq/cpuidle subsystems based on my (perhaps naïve) understanding of these:

As our editor mentioned, the scheduler has a distinct role, but my guess is that the scheduler is architecture-independent.

Conversely, the cpufreq and cpuidle subsystems are arch-dependent (although it's rare to find an arch these days that doesn't support one or both of them).

Thus, I propose merging the cpufreq and cpuidle systems, because (like our editor said), cpuidle is just like reducing cpufreq to 0 Hz.

The scheduler would be kept separate from the merged cpufreq/cpuidle, but there could still be a coupling: if the process load is light, then the scheduler could direct the cpufreq system to slow down. If the process load approaches zero, then the scheduler could even tell cpufreq to go to zero (i.e. cpuidle). Conversely, if the system load is high, then the scheduler would be aware of this, and it could tell the cpufreq system to speed up (which, of course, would un-idle the CPU[s]).

Granted, this is merely a very high-level view of a proposed system, but the basic idea is that (1) the arch-independent code is kept separate from the arch-dependent code; (2) the scheduler acts as the single-point of control for power-aware scheduling (and this is where a clear policy can be implemented); and (3) any changes in scheduler state could then drive changing the CPU frequency (which could subsequently control idling the CPU[s]).

Just a thought...
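In rough pseudo-C, the coupling described above might look like this (every name is invented for illustration; no such interfaces exist today):

    struct cpu {
        unsigned int min_khz, max_khz;
    };

    void cpufreq_request(struct cpu *cpu, unsigned int khz);  /* hypothetical */

    void scheduler_load_changed(struct cpu *cpu, unsigned int load_percent)
    {
        if (load_percent == 0)
            cpufreq_request(cpu, 0);             /* 0 Hz, i.e. cpuidle */
        else if (load_percent < 25)
            cpufreq_request(cpu, cpu->min_khz);
        else if (load_percent > 75)
            cpufreq_request(cpu, cpu->max_khz);  /* un-idles the CPU if needed */
        /* in between: leave the current frequency alone */
    }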

Power-aware scheduling meets a line in the sand

Posted Jun 11, 2013 2:05 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

The scheduler is not architecture-independent; it needs to take into account which cores share cache (potentially at several levels), which ones share NUMA nodes, etc.

Right now it doesn't know which ones share power/speed settings, nor does it have the ability to modify its behavior based on this.

For a hard problem, consider a modern CPU with the turbo feature: if the system has just over half its CPU capacity in use, is it better off keeping all cores running, or powering half of them off to speed up the other half?

This requires not just watching the overall load, but watching the utilization of individual processes/threads.

If you have a process that is using every bit of time it can get on a CPU, it may gain significantly from switching into turbo mode, provided all the other threads can fit on the remaining cores.

On the other hand, if every process is sleeping well before its timeslice is up, you are probably better off with more cores, spreading the threads across all of them (utilizing more cache so more can remain hot).

If you are going to pass this level of detailed information out of the scheduler, you are paying a significant amount of overhead; if you don't, how is your speed controller going to be able to figure out what's best?
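In invented pseudo-C, the kind of check being described (names and thresholds are made up; a real implementation would live in the balancer's slow path):

    #include <stdbool.h>

    struct thread_stat {
        double util;  /* fraction of time runnable and on-CPU: 0.0 .. 1.0 */
    };

    /* pack onto half the cores and let turbo speed them up? */
    static bool prefer_turbo(const struct thread_stat *t, int nthreads, int ncores)
    {
        double total = 0.0;
        bool cpu_bound = false;

        for (int i = 0; i < nthreads; i++) {
            total += t[i].util;
            if (t[i].util > 0.9)  /* someone is starved for cycles */
                cpu_bound = true;
        }
        /* boosting only helps if the whole load fits on half the cores */
        return cpu_bound && total <= ncores / 2.0;
    }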

Power-aware scheduling meets a line in the sand

Posted Jun 15, 2013 22:11 UTC (Sat) by mm7323 (guest, #87386) [Link]

Why not just make the scheduler pluggable already and accept that on some systems the scheduler is going to be a bit different in order to make best use of hardware features? One scheduler to rule them all is nice in theory, but it will always be a compromise when considering the vast number of platforms Linux runs on these days.

Power-aware scheduling meets a line in the sand

Posted Jun 15, 2013 22:21 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

Because they don't want to end up with a situation where, to run one type of application, you want one scheduler, and to run another type of application, you need a different scheduler (and if you want to run both, you are just out of luck).

there's also the issue that making the entire scheduler pluggable adds significant overhead to the fastpath of deciding which process to run next

This may end up resulting in some parts of the scheduler being pluggable, but not the fastpath of deciding which process to run next; rather the slowpath, where the scheduler makes decisions on moving processes from one core to another.

Power-aware scheduling meets a line in the sand

Posted Jun 15, 2013 23:07 UTC (Sat) by andresfreund (subscriber, #69562) [Link]

It's hard enough to find talent to write one scheduler...

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds