
Idle cycle injection

By Jonathan Corbet
April 14, 2010
When Google's Mike Waychison addressed the 2009 Kernel Summit, one of the goals he laid out was the merging of Google's idle cycle injection code into the mainline. Idle cycle injection is the forced idling of the CPU to avoid overheating; essentially, it is Google's way of running processors to the very edge of their capability without going past that edge and allowing the smoke to escape. This sort of power management is certainly not a Google-specific problem, so it makes sense to get the code upstream. Salman Qazi's recently posted kidled patch series shows the current form of this work.

The core idea is simple: through some new control files under /proc/sys/kernel/kidled, the system administrator can set, on a per-CPU basis, the percentage of time that the CPU should be idle and an interval over which that percentage is calculated. If the end of an interval draws near and the CPU has not been naturally idle for the requisite time, kidled will force the processor to go idle for a while.
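
As a rough sketch of what driving that interface might look like, a small program could simply write the desired values into the control files. The per-CPU file names used below are placeholders invented for the example; the patch posting is the authoritative reference for the real ones.

    /*
     * Sketch only: the controls live under /proc/sys/kernel/kidled, but
     * the per-CPU file names used here ("cpu0/min_idle_percent",
     * "cpu0/interval_ms") are hypothetical.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static void write_knob(const char *path, long value)
    {
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            exit(EXIT_FAILURE);
        }
        fprintf(f, "%ld\n", value);
        fclose(f);
    }

    int main(void)
    {
        /* Keep CPU 0 idle at least 15% of every 100ms interval. */
        write_knob("/proc/sys/kernel/kidled/cpu0/min_idle_percent", 15);
        write_knob("/proc/sys/kernel/kidled/cpu0/interval_ms", 100);
        return 0;
    }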

Naturally enough, there are some complications. The first is that it would be nice to avoid forcing idle cycles when important processes are running. So kidled includes the notion of "eager cycle injection." By way of the control group mechanism, processes can be marked as being "interactive." When so-marked processes are not running, kidled will try to get its forced idle cycles in early. When interactive processes are running, instead, idle cycles will be forced only when strictly necessary. In this way, "interactive" processes will not be impeded by idle cycle injection except when there is no alternative.
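
A minimal sketch of how that marking might be done from user space follows; the "kidled.interactive" attribute name and the cgroup mount point are assumptions made for the example, not names taken from the patches.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f;

        /* Flag an existing control group as interactive (assumed knob name). */
        f = fopen("/dev/cgroup/latency_sensitive/kidled.interactive", "w");
        if (f) {
            fputs("1\n", f);
            fclose(f);
        }

        /* Move the current process into that group via the standard tasks file. */
        f = fopen("/dev/cgroup/latency_sensitive/tasks", "w");
        if (f) {
            fprintf(f, "%d\n", (int)getpid());
            fclose(f);
        }
        return 0;
    }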

The other twist has to do with the accounting of idle CPU time. The injection of idle cycles takes CPU time away from somebody; the kidled code allows the administrator to say who the victims should be. There is another control group parameter which controls the "power capping priority" of each process. When idle cycles are injected, kidled will mess around in the scheduler's data structures, causing processes with lower priorities to be charged for the idle time. That means that, when CPU usage must be throttled, specific processes can be made to suffer more than others.
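
Conceptually, the accounting amounts to something like the fragment below. This is an invented illustration of the idea, not code from the patch series, and all of the names are made up: when an idle period is injected, its duration is billed to the runnable task with the lowest power-capping priority.

    #include <stddef.h>
    #include <stdint.h>

    struct capped_task {
        int      capping_prio;  /* lower value = charged first */
        uint64_t charged_ns;    /* injected idle time billed so far */
    };

    /* Pick the runnable task with the lowest power-capping priority. */
    static struct capped_task *pick_victim(struct capped_task *tasks, size_t n)
    {
        struct capped_task *victim = NULL;
        size_t i;

        for (i = 0; i < n; i++)
            if (!victim || tasks[i].capping_prio < victim->capping_prio)
                victim = &tasks[i];
        return victim;
    }

    /* Bill an injected idle period to the chosen victim. */
    static void charge_injected_idle(struct capped_task *tasks, size_t n,
                                     uint64_t idle_ns)
    {
        struct capped_task *victim = pick_victim(tasks, n);

        if (victim)
            victim->charged_ns += idle_ns;
    }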

As of this writing, there has been little public discussion of the patches. The core concept is not controversial, but it will be interesting to see how the scheduler-related parts of the series are received.



Who pays

Posted Apr 15, 2010 2:48 UTC (Thu) by ncm (subscriber, #165) [Link]

Processes that drive the CPU over the temperature limit are consuming a very expensive resource. It seems to me those are the ones that should be charged extra for the idling needed. Low-priority processes get to run infrequently so probably contribute little to overheating. Instead of charging idling time to another process, the time spent running at high temperature should be charged at a premium to the process that's running. E.g., crudely: within two degrees of the limit, CPU time costs twice as much.
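
A toy rendering of that crude rule, with names invented for the example:

    #include <stdint.h>

    /* Bill runtime at double rate within 2°C (2000 millidegrees) of the limit. */
    static uint64_t charge_for_runtime(uint64_t runtime_ns,
                                       int temp_mdegc, int limit_mdegc)
    {
        if (temp_mdegc >= limit_mdegc - 2000)
            return runtime_ns * 2;
        return runtime_ns;
    }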

The rub is that absolute time is a poor measure of how much a process is heating the CPU. If CPUs had a counter that revealed how much time they spent stalled waiting on cache fills, the kernel could get a better measure. Maybe they do?

I wonder if programs could be encouraged to use less processing-intensive algorithms when temperatures get high -- e.g. degrading frame rate. It seems tricky to communicate that in a useful way.

Who pays

Posted Apr 16, 2010 19:37 UTC (Fri) by intgr (subscriber, #39733) [Link]

> If CPUs had a counter that revealed how much time they spent stalled waiting
> on cache fills, the kernel could get a better measure. Maybe they do?

They're called performance counters. Not sure how accurately they let you model CPU power usage, but they can count the number of cache misses for all cache levels, among other things.
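
For example, counting cache misses across a stretch of code with the kernel's perf_event_open() interface looks roughly like this (error handling kept minimal; the workload being measured is just a placeholder):

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>
    #include <unistd.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* Count cache misses for this process on any CPU. */
        fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... the workload being measured goes here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("cache misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }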

Who pays

Posted Apr 16, 2010 23:25 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

I don't understand why the priorities for allocation of CPU time should be different when the CPU is hot than when it isn't. So why isn't the existing CPU scheduler enough? As far as I can tell, the only issue is that the priority of kidled has to vary with temperature so that it never runs when the CPU is cool, runs in preference to low-priority processes when the temperature is near the limit, and runs in preference to everybody when the temperature is at the limit. I don't see why a whole separate system for charging the idle time to particular processes is necessary.

I like the idea of making processes pay more for CPU cycles when the CPU is hot and CPU cycles are scarce, but it makes a practical difference only if there's an economy in which processes can substitute other things for CPU cycles, and I've never seen such a thing in the Linux world. CPU cycles themselves are the only currency I've ever seen used at a layer below human administration.

Incidentally, the article seems somehow to have completely mischaracterized the Google proposal. It isn't about CPUs overheating, which they won't do even with zero idle cycles inserted by Linux. It's about tripping circuit breakers, which they might do because Google wants to save money on electrical power infrastructure and not have enough power available to feed every server in the datacenter at 100% at the same time.

But that doesn't make any difference to the CPU cycle economy we're talking about.

Who pays

Posted Apr 24, 2010 21:26 UTC (Sat) by efexis (guest, #26355) [Link]

"Low-priority processes get to run infrequently so probably contribute little to overheating"

Yes, but they're less important, as expressed by their lower priority. If you don't want a process to be charged as much for the cycles, because it's more important that it gets those cycles, the answer is to reflect its greater importance and need for cycles by increasing its priority. That is, after all, what priority means.

Idle cycle injection

Posted Apr 15, 2010 6:07 UTC (Thu) by aleXXX (subscriber, #2742) [Link]

"the system administrator can set, on a per-CPU basis, the percentage of time that the CPU should be idle and an interval over which that percentage is calculated."

This sounds like the interval and budget of the coming EDF scheduler (is it called deadline now?).
What magnitude of intervals and idle time are we talking about? 10 cycles, 100 microseconds?
Maybe it could just be a user-space daemon which does nops and has its EDF parameters set accordingly?

Alex

Idle cycle injection

Posted Apr 15, 2010 12:02 UTC (Thu) by fperrin (guest, #61941) [Link]

What is the difference between, on the one hand, running a CPU at full speed (and maybe overclocking it? is that what is implied by "running processors to the very edge of their capability"?) and then stopping it 15% of the time, and on the other hand, running a CPU at 85%?

I see that there is some notion of "CPU burst" with the eager cycle injection, but it seems a bit over-engineered. Wouldn't a CPU governor (ondemand plus a policy like "you can't run at 100% for more than 5s straight") be easier, and less intrusive?

Well, the difference is simple...

Posted Apr 15, 2010 19:29 UTC (Thu) by khim (subscriber, #9252) [Link]

Maybe it's a good idea to read the original proposal before commenting?

> What is the difference between, on the one hand, running a CPU at full speed (and maybe overclocking it? is that what is implied by "running processors to the very edge of their capability"?) and then stopping it 15% of the time, and on the other hand, running a CPU at 85%?

Well, the difference is simple: in the first case your datacenter goes down in flames sooner or later; in the second case it serves users' requests as it should...

Well, the difference is simple...

Posted Apr 15, 2010 20:25 UTC (Thu) by PhracturedBlue (subscriber, #4193) [Link]

Thanks, that article makes the requirement a lot clearer.
I would have expected that the best way to get power efficiency is to couple the CPU temperature with voltage/frequency scaling (as most recent x86 chips do these days), thus running the part right at its threshold while still being able to throttle smoothly to prevent hardware failure.

But that doesn't work on systems that lack the relevant power management, nor does it allow macro-level optimization (in the case of the linked article, keeping total data-center power usage under the limit).

p4-clockmod

Posted Apr 15, 2010 13:38 UTC (Thu) by abatters (✭ supporter ✭, #6932) [Link]

Sounds like a re-engineered p4-clockmod driver.

Idle cycle injection

Posted Apr 15, 2010 19:34 UTC (Thu) by bradfitz (subscriber, #4378) [Link]


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds