|
|
Log in / Subscribe / Register

Toward better CPU idle-time predictions

By Jonathan Corbet
October 29, 2014

Linux Plumbers Conference
The CPU idle ("cpuidle") code has one of those tasks that would be best handled with absolute knowledge of the future: knowing how long the processor will be idle so that the most appropriate sleep state can be chosen. Since that knowledge is hard to come by, the cpuidle code must get by with heuristics. At the 2014 Linux Plumbers Conference (LPC), Daniel Lezcano talked about a scheme he has to improve those heuristics and, in the process, bring about better integration between the scheduler and the cpuidle subsystem.

The current cpuidle code suffers from a number of shortcomings. It is not actually tied to the scheduler, so it has no idea of what the scheduler plans to do next. Even the most sophisticated governors tend to get things wrong, leading to the wrong sleep states being chosen. By focusing on a few relatively predictable parameters, Daniel hopes to come up with an approach to cpuidle that is both simpler and more accurate.

The "menu" cpuidle governor used on many systems looks at the recent past and tries to come up with a good guess as to how long the system will sleep the next time it goes idle. But actual system behavior depends on a wide variety of different events. Some, like timer expirations, are entirely predictable. Others, such as block I/O operations, are reasonably [Daniel Lezcano] predictable, especially if one is watching carefully. Others, including things like keyboard events, are not predictable at all. Including these latter events in the calculation, Daniel said, leads to bad predictions and erratic performance from the cpuidle governor.

Daniel's cpuidle patch set addresses the most predictable wakeup events by having the scheduler pass the next timer expiration time into the cpuidle code. The scheduler also passes in the current latency requirement; cpuidle can use that information to avoid putting the processor into an overly deep sleep. The most unpredictable events, instead, are simply ignored in this version of the cpuidle code.

That leaves the moderately predictable events, primarily block I/O. Daniel's patch set starts by maintaining a simple running average of per-task I/O completion times. All tasks waiting for block I/O on a given CPU are put into a red-black tree; the closest expected completion time is easily obtained from that tree and used to predict the next wakeup time. But the running average is a bit too simple, being overly affected by the occasional operation that takes much longer (or much less time) than expected. So something a bit more complex is called for.

Daniel's response is to divide I/O completion times into buckets; after some investigation, he settled on 200µs as the optimal bucket granularity. Each bucket contains a counter of "hits," being the number of times that an operation has fallen into that bucket's duration. The buckets tracking I/O completion times are stored in a linked list. When a process first starts up, the data structure might look something like this:

[cpuidle bucket data structure]

The "hits" counter is incremented in the appropriate bucket for each I/O operation completion. After a small number of operations, the data structure might look like this:

[cpuidle bucket data structure]

Something interesting happens every time one bucket gains five hits though: it is moved to the beginning of the list. So, if the next operation completes in just over 200µs, the data structure will look like this:

[cpuidle bucket data structure]

The idea is that the buckets that see the most activity will be found toward the beginning of the list. When it comes time to predict when the next operation will complete, Daniel's code iterates through the list, computing a score for each bucket. Essentially, that score is the number of hits in the bucket divided by 1+2p, where p is the bucket's position in the list. So the bucket in the second position must have three times as many hits to get a higher score than the bucket in the first position.

The idea is to try to guess the most likely completion time based on both long-term and recent history. According to Daniel, it works pretty well, yielding far better results than the existing menu governor. Even so, this design did not survive the discussion in the LPC microconference, so any version of this patch set that gets into the mainline kernel is likely to look somewhat different.

The issue was the use of a per-task data structure for I/O completion time tracking. The advantage of this approach is that, when a task moves between CPUs, the tracking information will move with it, so the scheduler on the new CPU has an immediate idea of how long that task's I/O operations will take. But I/O completion times are not really a task-specific parameter; they are, instead, tied to the underlying device. And, as it happens, the kernel is already tracking device performance in the block layer. That information reflects current loads and should be quite accurate; using it might enable the entire bucket data structure to be done without.

So a future version of this patch set will probably be recast along those lines, using the backing store information already maintained by the kernel. But there are other challenges looming for this code. As Peter Zijlstra pointed out, the block layer is increasingly trying to maintain locality in I/O requests, ensuring that the CPU handling an I/O completion is the one that initiated the operation. Better locality makes sense, Peter said, but it also can conflict with the scheduler's attempts at distributing load across a system. It is harder to guess at wakeup times if it's not clear which CPU will wake up to deal with a specific I/O completion. The multiqueue block layer code is going to make this problem worse; some work will have to be done to reconcile these differing approaches to performance.

But, even if it is not a complete solution to the problem, Daniel's patch achieves the goals of better sleep-time predictions and better integration of the cpuidle code with the scheduler. That should be sufficient to get some version of this code into the mainline — someday.

[Your editor would like to thank the Linux Foundation for supporting his travel to this event].

Index entries for this article
KernelPower management
KernelScheduler/and power management
ConferenceLinux Plumbers Conference/2014


to post comments

Toward better CPU idle-time predictions

Posted Oct 30, 2014 15:03 UTC (Thu) by marcH (subscriber, #57642) [Link] (5 responses)

> Others, including things like keyboard events, are not predictable at all.

Errr... really not? Most humans I know seem to have a relatively constant typing speed and fairly simple start/stop activity patterns. I must miss something.

Toward better CPU idle-time predictions

Posted Oct 30, 2014 21:48 UTC (Thu) by robbe (guest, #16131) [Link]

"relatively constant" in human time scales, but not if you are looking at sub-millisecond timings. Try:
 #!/usr/bin/perl

 use Term::ReadKey;
 use Time::HiRes qw(gettimeofday tv_interval);

 ReadMode 'cbreak';
 my $k = ReadKey 0;
 my $t = [gettimeofday];
 while ($k ne "\n" && $k ne "\r") {
     $k = ReadKey 0;
     my $tt = [gettimeofday];
     print tv_interval($t, $tt),"\n";
     $t = $tt;
 }
 ReadMode 0;
For me standard deviation is about 100 ms -- much too jittery to rely on. Remember that the disk events are put in 0.2 ms buckets.

On the other hand, ignoring keypresses may be just the right thing to do. They happen so infrequently! If something more sophisticated is needed, try to find out if user is typing, then err on the side of caution and expect keypresses every 50ms.

Toward better CPU idle-time predictions

Posted Nov 7, 2014 5:53 UTC (Fri) by Russ.Dill@gmail.com (guest, #52805) [Link] (3 responses)

Keyboard events are a poor example. If you delay the processing of a keyboard event by 50ms, there won't be much of an effect on the system. But for other sporadic events, like IO, or network, delaying the processing can significantly effect the throughput of the system.

Toward better CPU idle-time predictions

Posted Nov 7, 2014 17:09 UTC (Fri) by Darkmere (subscriber, #53695) [Link]

If you delay the processing of a keyboard event with 50 ms you suddenly have a 3 frame latency at 60 fps, and a 6 frame latency at the desired 120 fps that is needed for things like the Occulus rift.

And trust me, you notice it when you get a 3 frame higher input latency than before. It's a very noticeable thing when you're in a game.

Toward better CPU idle-time predictions

Posted Nov 8, 2014 17:00 UTC (Sat) by marcH (subscriber, #57642) [Link] (1 responses)

> If you delay the processing of a keyboard event by 50ms,

Are you suggesting this to save power?

I don't think the article was delaying anything, was it?

Toward better CPU idle-time predictions

Posted Nov 13, 2014 5:32 UTC (Thu) by Funcan (guest, #44209) [Link]

Mis-guessing that there is no keyboard even due, and going into a deep sleep state, causes delays

Toward better CPU idle-time predictions

Posted Oct 31, 2014 11:31 UTC (Fri) by ras (subscriber, #33059) [Link] (2 responses)

Another day, another algorithm for predicting the future based on past events.

Surely we would be better off collecting a whole pile of data and feeding it to a neural network.

Toward better CPU idle-time predictions

Posted Oct 31, 2014 12:25 UTC (Fri) by mstone_ (subscriber, #66309) [Link] (1 responses)

That'll certainly save power. :)

Toward better CPU idle-time predictions

Posted Oct 31, 2014 17:48 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

It might actually do so[1]:

> They found that TrueNorth cut energy use by 176,000-fold compared to a traditional processor and by a factor of over 700 compared to specialized hardware designed to host neural networks.

[1]http://arstechnica.com/science/2014/08/ibm-researchers-ma...


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds