
Making power policy just work

By Jonathan Corbet
June 30, 2008
The sched_mc_power_savings parameter (cleverly hidden under /sys/devices/system/cpu) was introduced in the 2.6.18 kernel. If this parameter is set to one (the default is zero), it changes the scheduler load balancing code in an interesting way: it makes an ongoing effort to gather together processes on the smallest number of CPUs. If the system is not heavily loaded, this policy will result in some processors being entirely idle; those processors can then be put into a deep sleep and left there for some time. And that, of course, results in lower power consumption, which is a good thing.
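
The knob itself is just a sysfs file. As a minimal sketch (assuming a kernel built with CONFIG_SCHED_MC, in which the file exists, and root privileges), it can be flipped from C like this:

    /* Toggle sched_mc_power_savings; a sketch, not a robust tool.
     * Requires CONFIG_SCHED_MC (the file is absent otherwise) and root. */
    #include <stdio.h>
    #include <stdlib.h>

    #define KNOB "/sys/devices/system/cpu/sched_mc_power_savings"

    int main(int argc, char *argv[])
    {
        const char *value = (argc > 1) ? argv[1] : "1";
        FILE *f = fopen(KNOB, "w");

        if (!f) {
            perror(KNOB);
            return EXIT_FAILURE;
        }
        fprintf(f, "%s\n", value);
        if (fclose(f) == EOF) {
            perror(KNOB);
            return EXIT_FAILURE;
        }
        printf("%s = %s\n", KNOB, value);
        return EXIT_SUCCESS;
    }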

Vaidyanathan Srinivasan recently noted that, while this policy works well in a number of situations, there are others where things could be better. The sched_mc_power_savings policy is relatively conservative in how it loads processes onto CPUs, taking care not to overload those CPUs and create excessive latency for applications. As a result, the workload on a large system can still end up spread out more widely than might be optimal, especially if the workload is bursty. In response, Vaidyanathan suggests making the power savings policy more flexible, with the system administrator able to select the combination of power savings and latency which works best for the workload. On systems where power savings matter a great deal, a more aggressive mode (which would pack processes more tightly onto CPUs) could be chosen.

This suggestion was controversial. Nobody disputes that a smarter power savings policy would be a good idea. But there is resistance to the idea of creating more tuning knobs to control this policy; instead, it is felt, the kernel should work out the optimal policy on its own. As Andi Kleen puts it:

Tunables are basically "we give up, let's push the problem to the user" which is not nice. I suspect a lot of users won't even know if their workloads are bursty or not. Or they might have workloads which are both bursty and not bursty.

There are a couple of answers to that objection. One is that the system cannot know, on its own, what priorities the users and/or administrators have. Those priorities could even change over time, with performance being emphasized during peak times and low power usage otherwise. Additionally, not all users see "performance" the same way; some want responsiveness and low latency, while others place a higher priority on throughput. If the system cannot simultaneously optimize all of those parameters, it will need guidance from somewhere to choose the best policy.

And that's where the other answer comes in: that guidance could come from user space. Special-purpose software running on large installations can monitor the performance of important applications and adjust resources (and policies) to get the desired results. Or, in a somewhat different vision, individual applications could register their performance needs and expected behavior. In this case, the kernel is charged with somehow mediating between applications with different expectations and coming up with a reasonable set of policies.
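
As a rough sketch of what such a user-space policy agent might look like (the load threshold and polling interval here are invented purely for illustration), even a trivial loop over the load average could drive the existing knob:

    /* Sketch of a trivial user-space power-policy daemon: pack tasks
     * when the system is lightly loaded, spread them out under load.
     * The half-the-CPUs threshold and ten-second poll are arbitrary. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define KNOB "/sys/devices/system/cpu/sched_mc_power_savings"

    static void set_knob(int value)
    {
        FILE *f = fopen(KNOB, "w");

        if (f) {
            fprintf(f, "%d\n", value);
            fclose(f);
        }
    }

    int main(void)
    {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        double load;

        for (;;) {
            if (getloadavg(&load, 1) == 1)
                set_knob(load < ncpus / 2.0 ? 1 : 0);
            sleep(10);
        }
    }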

In the middle of all this, it was pointed out that a mechanism by which expectations can be communicated to the kernel already exists: the nice level (priority) associated with each process. In a simple view of the world, a process's nice level would tell the kernel how to manage it with regard to power savings; on a system with a number of niced processes, those processes would be gathered onto a subset of processors during periods of relatively low activity. In essence, this policy says that it is not worthwhile to power up more processors just to give better throughput to low-priority processes.
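
For reference, that channel is just the ordinary nice interface; a process can mark itself as low-priority (and, under such a policy, a candidate for packing) with a couple of standard calls:

    /* The existing "expectations" channel: standard nice levels.
     * Under the policy described above, a heavily niced task would be
     * packed onto fewer CPUs when the system is quiet. */
    #include <stdio.h>
    #include <errno.h>
    #include <sys/resource.h>

    int main(void)
    {
        int level;

        /* Ask to be treated as a background task. */
        if (setpriority(PRIO_PROCESS, 0, 10) == -1)
            perror("setpriority");

        errno = 0;                      /* -1 is a valid return value */
        level = getpriority(PRIO_PROCESS, 0);
        if (level == -1 && errno)
            perror("getpriority");
        else
            printf("running at nice %d\n", level);
        return 0;
    }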

It does not take long, though, to come up with situations where the use of nice levels leads to the wrong sort of results. Peter Zijlstra observed that he has niced processes (created with distcc) which should have access to all of the CPU power available, but which should not contend with interactive processes on the same system. In such cases, those processes should have a high nice value with regard to CPU usage, but that should not interfere with their ability to move onto idle CPUs, if any exist. So the answer may take the form of a separate "powernice" command which would regulate a process's priority when it comes to causing the system to draw more power.
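
To be clear, no such interface exists; but a powernice utility might look something like the following sketch, in which the per-process power_nice attribute (and the /proc file exposing it) is entirely invented for illustration:

    /* Hypothetical "powernice" wrapper. Everything kernel-facing here
     * is invented: no kernel exports a per-process power_nice
     * attribute, and the /proc file below does not exist. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        char path[64];
        FILE *f;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> <power-nice>\n", argv[0]);
            return EXIT_FAILURE;
        }
        snprintf(path, sizeof(path), "/proc/%s/power_nice", argv[1]);

        f = fopen(path, "w");           /* would fail on a real system */
        if (!f) {
            perror(path);
            return EXIT_FAILURE;
        }
        fprintf(f, "%s\n", argv[2]);
        fclose(f);
        return EXIT_SUCCESS;
    }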

Nice levels may (or may not) prove to be sufficient information to let the system choose an optimal power policy. But it will be some time before anybody really knows that; work on optimizing power usage - especially on server systems - is not in an advanced state. So pressure to add tuning knobs for power policies may continue, for one simple reason: people want ways of experimenting with different policies and seeing what the results are. Until we really know what the effects of different policies are - on both power usage and system performance - it will be hard to build a system which can choose an optimal policy on its own.

Index entries for this article
Kernel: Power management



Making power policy just work

Posted Jul 3, 2008 8:24 UTC (Thu) by rvfh (guest, #31018) [Link] (2 responses)

> Peter Zijlstra observed that he has niced processes (created with
> distcc) which should have access to all of the CPU power available,
> but which should not contend with interactive processes on the same
> system.

Maybe the interactive task should be niced upwards, rather than distcc
downwards. That looks like what he is really trying to achieve: a
responsive desktop even with distcc running at full blast.

Making power policy just work

Posted Jul 3, 2008 12:54 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

I think that 'priority' and 'interactivity' are two separate things and it's a mistake to use
a single niceness level for both.

My bash process shouldn't have a high priority if it starts chewing CPU time for long periods,
but it needs to respond quickly to user input and then go back to sleep.

Making power policy just work

Posted Jul 3, 2008 15:28 UTC (Thu) by mattdm (subscriber, #18) [Link]

I had a boss a while ago who liked to lecture on the difference between "urgent" and
"important". Not everything which is important needs attention right away, and there are some
things which need to be responded to quickly but which aren't, in the big picture, as
important. In this case, your bash prompt is urgent, but isn't necessarily as important as a
background application.

Making power policy just work

Posted Jul 3, 2008 12:14 UTC (Thu) by dipankar (subscriber, #7820) [Link]

Power nice levels would be nice to have, but I would dread using them to set a global,
system-wide power policy, which is how they would be used in most data centers at the moment. I
don't think it would be a good idea to add another user of tasklist_lock and iterate through
every task in the system to set the power nice value.


Making power policy just work

Posted Jul 3, 2008 13:23 UTC (Thu) by davecb (subscriber, #1574) [Link] (1 responses)

Andi Kleen notes that:
| Tunables are basically "we give up, let's push the problem to the user"
| which is not nice. I suspect a lot of users won't even know if their
| workloads are bursty or not. Or they might have workloads which are 
| both bursty and not bursty. 

Tunables are also something for external daemons to change: the famous
example is the IBM z/OS Workload Manager (WLM), which looks at
application (actually workload) progress and a table of requirements
and adjusts CPU, I/O and memory tunables to speed up or slow down the
workload.

--dave

Making power policy just work

Posted Jul 5, 2008 1:31 UTC (Sat) by IkeTo (subscriber, #2122) [Link]

> | Tunables are basically "we give up, let's push the problem to the user"

> Tunables are also something for external daemons to change

So "user" means "user mode".

Making power policy just work

Posted Jul 3, 2008 18:23 UTC (Thu) by ikm (guest, #493) [Link] (1 responses)

Would the current dual-core notebooks benefit from this, or is this server-only?

Making power policy just work

Posted Jul 7, 2008 9:27 UTC (Mon) by zdzichu (subscriber, #17118) [Link]

If I understand correctly, this is about spreading load across CPU
packages, not cores. So dual-core Intel CPUs won't gain anything from
this, but quad-core Intel laptop CPUs would, as they are still not
native quad-cores but two dual-core dies glued together.

On the other hand, AMD K10-family CPUs have quite power-independent
cores, so even a dual-core part would benefit. AMD's "Griffin" has a
split power plane, so I suppose this scheduling policy can help lower
power consumption on it.

Making power policy just work

Posted Jul 17, 2008 20:19 UTC (Thu) by greened (guest, #52956) [Link]

It strikes me that this problem has been studied already.  For many 
applications, the kernel *can* discover usage patterns and adapt to save 
power.  See for example the Vertigo project:

http://www.arm.com/pdfs/IEMSoftwarePaper-OSDI2002.pdf


Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds