LWN: Comments on "Power-aware scheduling meets a line in the sand" https://lwn.net/Articles/552885/ This is a special feed containing comments posted to the individual LWN article titled "Power-aware scheduling meets a line in the sand". en-us Sat, 20 Sep 2025 11:02:44 +0000 Sat, 20 Sep 2025 11:02:44 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net

I want to learn more... https://lwn.net/Articles/558776/ https://lwn.net/Articles/558776/ alison <div class="FormattedComment"> Thanks for the link, SiliconSlick! The work you cite appears to anticipate cgroups: cool. I worked for HP at the time and regret not knowing Scott Rhine, the author.<br> </div> Sun, 14 Jul 2013 00:21:10 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/554688/ https://lwn.net/Articles/554688/ andresfreund <div class="FormattedComment"> It's hard enough to find talent to write one scheduler...<br> </div> Sat, 15 Jun 2013 23:07:58 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/554683/ https://lwn.net/Articles/554683/ dlang <div class="FormattedComment"> Because they don't want to end up in the situation where to run one type of application you want one scheduler and to run another type of application you need a different scheduler (and if you want to run both, you are just out of luck).<br> <p> There's also the issue that making the entire scheduler pluggable adds significant overhead to the fast path of deciding which process to run next.<br> <p> This may end up resulting in some parts of the scheduler being pluggable, but those would not be the fast path of picking the next process to run; rather, they would be the slow path where the scheduler makes decisions about moving processes from one core to another.<br> </div> Sat, 15 Jun 2013 22:21:18 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/554681/ https://lwn.net/Articles/554681/ mm7323 <div class="FormattedComment"> Why not just make the scheduler pluggable already and accept that on some systems the scheduler is going to be a bit different in order to make best use of hardware features? One scheduler to rule them all is nice in theory, but it will always be a compromise when considering the vast number of platforms Linux runs on these days.<br> </div> Sat, 15 Jun 2013 22:11:36 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553707/ https://lwn.net/Articles/553707/ dlang <div class="FormattedComment"> The scheduler is not architecture-independent; it needs to take into account which cores share cache (potentially at several levels), which ones share NUMA nodes, etc.<br> <p> Right now it doesn't know which ones share power/speed settings, nor does it have the ability to modify its behavior based on this.<br> <p> For a hard problem, consider a modern CPU with the turbo feature: if the system has just over half of its CPU capacity in use, is it better off keeping all cores running, or powering half of them off to speed up the other half?<br> <p> This requires not just watching the overall load, but watching the utilization of individual processes/threads.<br> <p> If you have a process that is using every bit of time it can get on a CPU, it may gain significantly from switching into turbo mode if all the other threads can fit on the remaining turbo cores.<br> <p> On the other hand, if every process is sleeping well before its timeslice is up, you are probably better off with more cores, spreading the threads across all of them (utilizing more cache so more can remain hot).<br> <p> If you are going to pass this level of detailed information out of the scheduler, you are paying a significant amount of overhead; if you don't, how is your speed controller going to be able to figure out what's best?<br> </div> Tue, 11 Jun 2013 02:05:28 +0000
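The consolidate-or-spread decision described in the comment above can be made concrete with a toy heuristic. The following is a minimal user-space sketch, not kernel code: the per-task utilization figures, the saturation threshold, and the turbo speedup factor are all invented for illustration; a real scheduler would get these inputs from its own load tracking.

<pre>
/* Toy version of the "pack for turbo vs. spread for cache" decision.
 * All numbers here (0.9 saturation threshold, turbo_speedup) are
 * hypothetical and exist only to illustrate the trade-off. */
#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

/* task_util[i]: fraction of its timeslice task i actually used (0.0..1.0) */
static bool consolidate_for_turbo(const double *task_util, size_t ntasks,
                                  int ncores, double turbo_speedup)
{
    double total = 0.0;
    bool any_saturated = false;

    for (size_t i = 0; i < ntasks; i++) {
        total += task_util[i];
        if (task_util[i] > 0.9)      /* task uses nearly all CPU it gets */
            any_saturated = true;
    }

    /* Consolidate (power half the cores off, run the rest in turbo) only
     * if some task is CPU-bound enough to benefit and the remaining
     * cores, sped up by the turbo factor, can still absorb the load. */
    int turbo_cores = ncores / 2;
    return any_saturated && total <= turbo_cores * turbo_speedup;
}

int main(void)
{
    double util[] = { 0.95, 0.2, 0.15, 0.1 };
    printf("consolidate: %d\n", consolidate_for_turbo(util, 4, 4, 1.3));
    return 0;
}
</pre>

Even this toy version shows why the information has to live near the scheduler: the decision needs per-task utilization, not just aggregate load.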
Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553705/ https://lwn.net/Articles/553705/ pr1268 <p>While I can empathize with Ingo's frustration regarding a lack of clear policy, I still think there needs to be a separation of the scheduler code and the cpufreq/cpuidle subsystems, based on my (perhaps na&iuml;ve) understanding of these:</p> <p>As our editor mentioned, the scheduler has a distinct role, but my guess is that the scheduler is architecture-<i>independent</i>.</p> <p>Conversely, the cpufreq and cpuidle subsystems are arch-<i>dependent</i> (although it's rare to find an arch these days that doesn't support one or both of these).</p> <p>Thus, I propose merging the cpufreq and cpuidle systems, because (as our editor said) cpuidle is just like reducing cpufreq to 0 Hz.</p> <p>The scheduler would be kept separate from the merged cpufreq/cpuidle, but there could still be a coupling: if the process load is light, then the scheduler could tell the cpufreq system to slow down. If the process load approaches zero, then the scheduler could even tell cpufreq to go to zero (i.e. cpuidle). Conversely, if the system load is high, then the scheduler would be aware of this, and it could tell the cpufreq system to speed up (which, of course, would un-idle the CPU[s]).</p> <p>Granted, this is merely a very high-level view of a proposed system, but the basic idea is that (1) the arch-independent code is kept separate from the arch-dependent code; (2) the scheduler acts as the single point of control for power-aware scheduling (and this is where a clear policy can be implemented); and (3) any changes in scheduler state could then drive changing the CPU frequency (which could subsequently control idling the CPU[s]).</p> <p>Just a thought...</p> Tue, 11 Jun 2013 01:48:57 +0000
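A rough sketch of the scheduler-to-cpufreq coupling proposed above. Every name in it is hypothetical; this is not an existing kernel interface, only an illustration of the split between an arch-independent policy that reports load and an arch-dependent backend in which a requested frequency of zero maps onto an idle state.

<pre>
/* Hypothetical interface sketch for the coupling described above; none
 * of these names exist in the kernel. */
#include &lt;stdio.h&gt;

struct power_backend {
    /* Arch-dependent: set the operating frequency, or enter an idle
     * state when freq_khz == 0 ("cpufreq of 0 Hz"). */
    void (*set_freq)(int cpu, unsigned int freq_khz);
};

/* Arch-independent policy: called when a CPU's load estimate changes.
 * The thresholds and frequencies are made up for illustration. */
static void sched_load_changed(struct power_backend *pb, int cpu,
                               unsigned int load_pct)
{
    unsigned int freq_khz;

    if (load_pct == 0)
        freq_khz = 0;              /* idle the CPU */
    else if (load_pct < 30)
        freq_khz = 800000;         /* light load: slow down */
    else
        freq_khz = 2400000;        /* heavy load: speed up */

    pb->set_freq(cpu, freq_khz);
}

static void demo_set_freq(int cpu, unsigned int freq_khz)
{
    if (freq_khz == 0)
        printf("cpu%d: enter idle state\n", cpu);
    else
        printf("cpu%d: run at %u kHz\n", cpu, freq_khz);
}

int main(void)
{
    struct power_backend pb = { .set_freq = demo_set_freq };
    sched_load_changed(&pb, 0, 85);
    sched_load_changed(&pb, 1, 0);
    return 0;
}
</pre>

The interesting part is the split: the policy function stays arch-independent, while the set_freq callback is where each architecture's cpufreq/cpuidle details would live.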
Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553254/ https://lwn.net/Articles/553254/ dlang <div class="FormattedComment"> Do you really want to have to have a different kernel for every chip release?<br> <p> How about having to pick the right kernel for running on battery vs being plugged in?<br> <p> Yes, different CPUs and systems have different details, but most of these can be abstracted out so that the policy engine can make reasonable decisions based on a description of the system (and the related costs). When new systems come out that can't be described reasonably in the existing structure, the kernel will be adapted so that it covers the new type of system.<br> </div> Fri, 07 Jun 2013 02:25:55 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553242/ https://lwn.net/Articles/553242/ naptastic <div class="FormattedComment"> <font class="QuotedText">&gt; There are plenty of us who disable all power savings. Having power saving enabled can easily mean a 30-40% performance hit, which is unacceptable for some workloads.</font><br> <p> Or a 5ms latency increase at unpredictable times, which ruins my recording and playback.<br> </div> Thu, 06 Jun 2013 21:42:43 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553204/ https://lwn.net/Articles/553204/ sbohrer <div class="FormattedComment"> <font class="QuotedText">&gt; I don't think anyone really falls on either side of the extreme.</font><br> <p> There are plenty of us who disable all power savings. Having power saving enabled can easily mean a 30-40% performance hit, which is unacceptable for some workloads.<br> </div> Thu, 06 Jun 2013 18:11:51 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553203/ https://lwn.net/Articles/553203/ rahvin <div class="FormattedComment"> Personally I think the future is device-specific power management. How we get to that point, or some subset of it, is beyond me, but there is so much different hardware available these days (how many versions of ARM CPU exist? is there even a count?) that I don't know how you can achieve reliable and efficient power management without going down the device-specific route. <br> <p> Even if the kernel is only responsible for CPU power management, there are going to be so many variations of best practice in the future that I don't know that it's manageable without some abstraction that allows variable management for every different device. The massive variation in CPUs and capabilities is only going to get worse in the near term.<br> </div> Thu, 06 Jun 2013 18:11:37 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553089/ https://lwn.net/Articles/553089/ dlang <div class="FormattedComment"> A comment I posted elsewhere is also relevant to this discussion. I'll post a link rather than reposting the longer comment.<br> <p> In <a rel="nofollow" href="https://lwn.net/Articles/553086/">https://lwn.net/Articles/553086/</a> I give an example (disk I/O) where performance can involve 'wasting' effort by doing things that you may be able to avoid if you just wait.<br> </div> Thu, 06 Jun 2013 07:29:42 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553083/ https://lwn.net/Articles/553083/ dlang <div class="FormattedComment"> And the point is that there is no one correct way of doing things. You can't both maximize performance and minimize power use.<br> <p> You make a trade-off between them.<br> <p> And since different people will want different points in the trade-off, there needs to be a way of setting the different policies, and then different ways of making decisions based on what the policy is.<br> <p> Not to mention that the trade-offs are going to be very different on different hardware, and you have zero real clue about what the hardware will want in 2-3 years.<br> </div> Thu, 06 Jun 2013 07:05:03 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553075/ https://lwn.net/Articles/553075/ amit.kucheria <div class="FormattedComment"> Similar experiments have been done using data from ARM performance counters in decisions about CPU state, e.g. scaling frequency. The results are encouraging, but we need to figure out generic kernel interfaces to use these systems.<br> </div> Thu, 06 Jun 2013 05:05:30 +0000
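As an illustration of the kind of data being discussed, the sketch below reads a hardware cycle counter from user space with the perf_event_open() syscall, counting cycles on one CPU over an interval; a frequency governor could compare that against the cycles available at the current clock. This is only an illustration of using performance counters, not the kernel interface the comment is asking for, and error handling is minimal.

<pre>
/* Count cycles retired on CPU 0 for one second via perf_event_open().
 * Illustrative only; requires sufficient privilege (perf_event_paranoid). */
#include &lt;linux/perf_event.h&gt;
#include &lt;sys/ioctl.h&gt;
#include &lt;sys/syscall.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;

    /* pid = -1, cpu = 0: count every task running on CPU 0. */
    int fd = perf_event_open(&attr, -1, 0, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    sleep(1);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles))
        perror("read");
    printf("cycles used on cpu0 in 1s: %llu\n", (unsigned long long)cycles);

    /* A governor could compare this against the cycles available at the
     * current frequency to decide whether to scale up or down. */
    close(fd);
    return 0;
}
</pre>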
Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553069/ https://lwn.net/Articles/553069/ naptastic <div class="FormattedComment"> If I were the pointy-haired boss:<br> <p> 1. Tune settings on my device to err on the side of performance, rather than power (battery life).<br> <p> 2. With every update, move just a little more towards battery life, and advertise that you've increased the battery life.<br> <p> 3. When more people are moving away from your device because it's slow than were moving away from it because it had poor battery life, you've reached your balance.<br> <p> 4. Come out with a new product with newer hardware and start the process over again. Maybe do it in the other direction just for kicks.<br> </div> Thu, 06 Jun 2013 02:38:00 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553068/ https://lwn.net/Articles/553068/ aliguori <div class="FormattedComment"> <font class="QuotedText">&gt; If you can get 1% better performance by spreading your work across 4 cores instead of 2 (and powering down the other two), which wins? performance or power?</font><br> <p> If all you care about is performance, never do any power management. If all you care about is power savings, just shut the machine off and take the infinite performance hit.<br> <p> I don't think anyone really falls on either side of the extreme.<br> </div> Thu, 06 Jun 2013 02:22:49 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553061/ https://lwn.net/Articles/553061/ dlang <div class="FormattedComment"> <font class="QuotedText">&gt; I want low power consumption and I want maximum performance. It shouldn't have to be a choice.</font><br> <p> Sorry, it doesn't work that way in the real world.<br> <p> If you can get 1% better performance by spreading your work across 4 cores instead of 2 (and powering down the other two), which wins? performance or power?<br> <p> If tasks were batched, you could do all your work and then go to sleep. But if you don't know the future work that you will be asked to do, should you go into low-power mode, even if it means that you will be slower to do work when it arrives?<br> <p> It takes time to wake up from deep sleep states; sometimes you would rather be working than waiting to wake up :-)<br> </div> Thu, 06 Jun 2013 00:42:03 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553062/ https://lwn.net/Articles/553062/ luto <div class="FormattedComment"> For example, the new intel-idle driver is using fancy MSRs to estimate how hard the CPU has been working lately. The scheduler should be (and probably is) using similar metrics to charge tasks for their CPU usage. This could be combined.<br> </div> Wed, 05 Jun 2013 23:59:47 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553060/ https://lwn.net/Articles/553060/ aliguori <div class="FormattedComment"> That's just punting the problem.<br> <p> I want low power consumption and I want maximum performance. It shouldn't have to be a choice.<br> <p> In fact, the two are very much related. You want maximum work done while you're not asleep so you can extend the time that you sleep for.<br> <p> And since most modern systems have NUMA characteristics, to get maximum performance you really need to be NUMA-aware. It's all deeply inter-related.<br> </div> Wed, 05 Jun 2013 23:21:10 +0000
I want to learn more... https://lwn.net/Articles/553059/ https://lwn.net/Articles/553059/ naptastic <div class="FormattedComment"> Oh, also in 2004, Con Kolivas posted the beginnings of a system to make the scheduler pluggable:<br> <p> <a href="https://groups.google.com/forum/?fromgroups#!msg/fa.linux.kernel/THlpt3FaZTo/nVJfPtYVH4kJ">https://groups.google.com/forum/?fromgroups#!msg/fa.linux...</a><br> <p> Based on my limited understanding of the issues involved, I don't think the approach Con took in this patch set is going in a direction compatible with what Ingo is now suggesting. I could be wrong, of course.<br> </div> Wed, 05 Jun 2013 22:55:44 +0000

I want to learn more... https://lwn.net/Articles/553053/ https://lwn.net/Articles/553053/ naptastic <div class="FormattedComment"> It's a not-quite perennial topic that has, historically, been shot down every time it's come up. However, coming up with a single scheduler / power configuration system that performs well on all systems looks less and less soluble with each passing year. For example, this conversation from 2004:<br> <p> <a href="http://lwn.net/Articles/109458/">http://lwn.net/Articles/109458/</a><br> <p> At that time, NUMA and hyperthreading were much newer and less understood, and things like big.LITTLE weren't even on the horizon. The objection there was that the effort of maintaining a separate scheduler per situation would outweigh the benefits, and result in many crappy schedulers instead of one good one; I believe the balance has shifted the other way now.<br> <p> The Brainfuck Scheduler came out in 2009, and pluggable schedulers came up in conversation again. I can't find the reference, but I think it was Linus that time who shot them down. He also shot down priority inheritance many times before finally agreeing to it, so things like this can change.<br> <p> We have NUMA, HT, big.LITTLE, and AMD's somewhat odd two-cores-one-FPU setup now. Who knows what other permutations chip manufacturers will bring us in coming years. (Hopefully many!) The kernel scheduler has improved by leaps and bounds, but there is still a meaningful gap in performance between mainline and (in my case) BFS, for specific hardware. BFS is marginally worse on my quad-socket Opteron system, and better enough for my single-socket systems that I use it exclusively. Imagine the performance deltas possible with a custom big.LITTLE scheduler.<br> <p> I believe Ingo is right in his assertion that we need to look at the scheduler and power management framework in the same context. I do not think that having one scheduler and power management framework to rule them all is realistic if we want to get the most from our hardware.<br> <p> Of course, this is largely conjecture and meaningful benchmarks may not be available for a while yet.<br> </div> Wed, 05 Jun 2013 22:45:04 +0000

I want to learn more... https://lwn.net/Articles/553055/ https://lwn.net/Articles/553055/ SiliconSlick <div class="FormattedComment"> Long ago, HP was working on something like this. Here's a link via the Wayback Machine:<br> <p> <a href="http://web.archive.org/web/20031210095042/http://resourcemanagement.unixsolutions.hp.com/WaRM/prm_linux/schedpolicy.html">http://web.archive.org/web/20031210095042/http://resource...</a><br> <p> (see the white paper at the bottom)<br> <p> Not sure whatever became of it.<br> </div> Wed, 05 Jun 2013 22:39:34 +0000
Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553054/ https://lwn.net/Articles/553054/ dlang <div class="FormattedComment"> I don't think that anyone is saying that all power management belongs there.<br> <p> But having multiple systems trying to control how much power is used by the CPUs just does not make sense.<br> <p> There are things that the scheduler MUST do (like deciding to concentrate or distribute tasks).<br> <p> And then there are things that the scheduler is in the best position to guess (predicting future loads for decisions on CPU speed settings).<br> <p> Having one process (cpufreq or cpuidle) try to guess what the scheduler is going to do just doesn't work well.<br> </div> Wed, 05 Jun 2013 22:39:04 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553051/ https://lwn.net/Articles/553051/ rahvin <div class="FormattedComment"> I read his message exactly as Jon read it. He wants everything power related in the scheduler so it's all in one place. I think it's a stretch to argue he meant hooking the scheduler and putting the code elsewhere. <br> <p> Personally I don't get how he can insist it all be in the one place. I get it from an organizational point of view, but power management can touch all sorts of things that are completely unrelated. If he wants to go down that road, I would agree the solution is to hook the relevant subsystems and create a new power management system that manages all the diverse areas where power management is a concern, but then you go even further from the CPU and scheduler and tuck in disk, screen and other management so you have centralized power control. (Maybe that's the only real solution to total power management on devices like tablets.)<br> <p> Such a system would probably decentralize things to an extent, in that CPU frequency control would end up outside the CPU code and in the power management code. That sounds like it would greatly complicate the kernel and the understanding of how it works, and would be contrary to the intent. <br> <p> I don't think there is a clean solution to this problem; as I said, power management is a device-wide issue and touches almost every subsystem. I think the modest-improvements path is probably the most viable.<br> <p> Great article!<br> </div> Wed, 05 Jun 2013 22:27:52 +0000

I want to learn more... https://lwn.net/Articles/553046/ https://lwn.net/Articles/553046/ pr1268 <p>I'm terribly unenlightened about these&mdash;can you provide pointers to their design overview, theory of operation, how a pluggable scheduler would work in the context of the Linux Kernel, etc.? Thanks!</p> <p>Signed,<br/> <br/> Just a little curious</p> Wed, 05 Jun 2013 22:06:05 +0000

Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553040/ https://lwn.net/Articles/553040/ dlang <div class="FormattedComment"> I read his message as being less about redesigning the core scheduler than about figuring out how to define the appropriate hooks in it so that it can make more decisions.<br> <p> That could be a redesign of everything, but it doesn't need to be. And it especially should not be a redesign from scratch to get started.<br> <p> Morton's response listed a lot of the concerns, but I think there is a fairly clear path to get started.<br> <p> Since the current scheduling-domains data structure does not include the power-related information, start by defining a structure that does.<br> <p> Then people can start talking about how to hook the different policies into the scheduler and have it switch between them (start with the simple 'put things on as few CPUs as possible to save power' vs 'spread across as many cores as possible for performance').<br> <p> Meanwhile the mechanism to go into sleep modes can be implemented, and then in the idle slow path you can insert logic to try to decide whether you should go into a sleep state, and if so which one, etc.<br> </div> Wed, 05 Jun 2013 21:02:29 +0000
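To make that suggested first step a little more concrete, here is a purely hypothetical sketch of what such a power-aware domain description might contain. None of these structures exist in the kernel; the actual scheduling-domain structures (and the later energy-model work) look quite different, so treat this only as an illustration of the kind of data the scheduler currently lacks.

<pre>
/* Hypothetical per-domain power description; not actual kernel structures. */
#include &lt;stdint.h&gt;

struct power_state {              /* one P-state */
    unsigned int freq_khz;        /* operating frequency */
    unsigned int power_mw;        /* estimated draw at full load */
};

struct idle_state {               /* one C-state */
    unsigned int exit_latency_us; /* time needed to wake back up */
    unsigned int power_mw;        /* residual draw while idle */
};

struct power_domain_info {
    uint64_t cpu_mask;            /* CPUs sharing one frequency/power rail */
    unsigned int nr_pstates;
    const struct power_state *pstates;   /* sorted by frequency */
    unsigned int nr_cstates;
    const struct idle_state *cstates;    /* sorted by depth */
};

/* Toy cost estimate: power drawn when the domain runs at its top P-state
 * with the given utilization; a real policy would pick the lowest P-state
 * that still absorbs the load and account for idle residency. */
static inline unsigned int
domain_power_estimate(const struct power_domain_info *d, unsigned int load_pct)
{
    const struct power_state *top = &d->pstates[d->nr_pstates - 1];
    return top->power_mw * load_pct / 100;
}
</pre>

With a description like this available, the 'pack onto few CPUs' versus 'spread for performance' policies mentioned above could be compared numerically inside the scheduler rather than guessed at from outside it.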
Power-aware scheduling meets a line in the sand https://lwn.net/Articles/553036/ https://lwn.net/Articles/553036/ naptastic <div class="FormattedComment"> *cough* pluggable schedulers *cough*<br> </div> Wed, 05 Jun 2013 20:40:23 +0000

Value https://lwn.net/Articles/553028/ https://lwn.net/Articles/553028/ ncm <div class="FormattedComment"> Once again LWN reminds us why we keep up our subscriptions. Are you really chipping in your share?<br> </div> Wed, 05 Jun 2013 20:16:47 +0000