|
|
Log in / Subscribe / Register

Improving idle behavior in tickless systems

December 28, 2018

This article was contributed by Marta Rybczyńska

Most processors spend a great deal of their time doing nothing, waiting for devices and timer interrupts. In these cases, they can switch to idle modes that shut down parts of their internal circuitry, especially stopping certain clocks. This lowers power consumption significantly and avoids draining device batteries. There are usually a number of idle modes available; the deeper the mode is, the less power the processor needs. The tradeoff is that the cost of switching to and from deeper modes is higher; it takes more time and the content of some caches is also lost. In the Linux kernel, the cpuidle subsystem has the task of predicting which choice will be the most appropriate. Recently, Rafael Wysocki proposed a new governor for systems with tickless operation enabled that is expected to be more accurate than the existing menu governor.

Cpuidle governors

Predicting the time to the next event is not always an easy task; it is done using a heuristic that depends on the system's recent history. This heuristic can produce incorrect results if the system's behavior changes. Devices cause interrupts at (more or less) predictable intervals that depend on the applications that are running; a cpuidle governor can measure these intervals to make predictions for when the next device interrupt will occur. Also relevant is the regular scheduler timer tick; until a few years ago, kernels always had the timer interrupt running at 100 to 1000 times per second. This picture changed with the introduction of the tickless kernel; periods without interrupts can be longer (as the timer tick may be disabled) and, as a result, the processor can possibly enter deeper idle states.

Linux currently provides two cpuidle governors that reside in drivers/cpuidle/governors; they are called "ladder" and "menu". The basic ideas and interfaces of the cpuidle governors were discussed here back in 2010. The ladder governor chooses the shallowest idle mode first and then moves to the next deeper mode if the observed wait time is long enough. It is considered to be the better choice when running a system with regular clock ticks and when power consumption is not an important factor. The disadvantage of the ladder governor is that it may need a long time to reach a deep idle mode. The menu governor is, until now, the preferred choice for tickless systems. It tries to choose the most appropriate idle mode, not necessarily a shallow one. The user can check the governor they are running in /sys/devices/system/cpu/cpuidle/current_governor_ro.

The critique of the menu governor

The menu governor tries to find the deepest idle state that can be entered in the given conditions. It predicts the duration of the next idle period based on past history, then it correlates the observed recent idle durations with the idle states available to choose the idle state that will most likely match with the next idle interval to come. The menu governor applies different corrective factors for the time until the next predicted wakeup, including the system load and the number of tasks waiting for I/O. The corrective factors have, as their goal, limiting the performance impact of entering the idle states.

Wysocki noticed multiple issues that, according to him, make the menu governor not as accurate as it should be. The first observation is related to the creation of the interrupt history pattern. The menu governor uses all interrupts, including timers, to predict when the next one will happen. On the other hand, it already has the information when the next timer tick will happen, but does not correlate the two. As a result, it may happen that the governor predicts a wakeup (that would be a timer) when it should already know that the next timer event will actually happen later.

The second observation is that the governor uses the number of processes waiting for I/O as a corrective factor. The reason for this was the desire to lower the impact of idle modes for highly loaded systems. Entering deeper idle modes on such systems may have a more visible impact on performance, so the correction steers the governor toward the shallower modes. According to Wysocki, the number of processes waiting for I/O has no impact on the idle modes available, and should not be taken into account. Finally, he argued that the pattern detection used by the menu governor sometimes considers values that are too large to matter in practice. Those values could be omitted and the analysis would then use less resources.

Wysocki was considering a rework of the menu governor to address these issues, but that could worsen the performance of workloads that are tuned to work well with the current implementation of the menu governor. Because of that, he chose to implement a new governor, allowing users to benchmark the impact of the two in their actual workloads and make their own choice.

The timer events oriented governor

The new governor is called the "timer events oriented" (TEO) governor. It uses the same strategy as the menu governor — predicting the next idle duration and then selecting the idle mode that fits best — but the factors it takes into consideration are different. The concept behind TEO is that the most frequent source of CPU wakeups on many systems is timer events, not device interrupts. Wysocki notes that timer events might be even two or more orders of magnitude more frequent than other interrupts. So the time until the next timer event alone provides a strong predictive clue.

Another observation is that it is enough to use the recent past to provide accurate estimations of idle periods. In systems where wakeup sources other than timers are more important, this observation does not apply directly. Still, Wysocki argues, the analysis can be based only on a few idle-time intervals. In particular, only intervals that are shorter than the time to the next timer event need to be considered. This is because the longer durations are likely to belong to patterns that can be approximated to the closest timer, anyway.

TEO is designed around the idea that it is likely that the next wakeup will be the expiration of the next timer event; it chooses the deepest idle mode that corresponds to this interval. Then it verifies if this interval also matches the non-timer events, as seen in the pattern of observed idle times from the recent past. If the idle mode selected matches both the timer and non-timer events, it becomes the final choice; otherwise, TEO tries again with a shallower mode.

The algorithm also covers the case when the pattern is changing; there is a special check to determine whether most of the recent idle durations were too short for the idle mode selected. If this is the case, then TEO uses only those values to calculate the new expected idle duration. Then it selects the idle state again, which will result in selecting a shallower one.

Current state and benchmark results

The patch is in its tenth version at the time of this writing. Different developers have started evaluating the code. Giovanni Gherdovich shared benchmark results from the patch; they show a number of cases when the choice of the cpuidle governor has no importance and others where TEO usually offers a slight improvement in performance compared to menu. The detailed results are available separately for different versions of the patch, illustrating the impact on bandwidth and I/O latency. Doug Smythies provided other benchmark results and reported that performance improves and power usage stays the same.

The TEO governor is in an early stage. As the code is subtle, it will still require more work and benchmarking in different systems and architectures, especially with regard to the impact on the power consumption. In addition, Wysocki has also been working on other aspects of the power consumption and idle modes, presenting the work at Kernel Recipes. The early results are encouraging. The goal of the development — better prediction of the next idle mode to use — seems to be reached.

Index entries for this article
KernelPower management/cpuidle
GuestArticlesRybczynska, Marta


to post comments

Improving idle behavior in tickless systems

Posted Dec 28, 2018 18:49 UTC (Fri) by ejr (subscriber, #51652) [Link]

Is there an archive of system traces for different kinds of systems (laptops, servers, industrial/embedded, etc.)? Sounds like a *great* class project to suggest for machine learning using point processes. (Other twisted idea: BPF for modifying the predictor...)

Improving idle behavior in tickless systems

Posted Dec 28, 2018 19:18 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (9 responses)

Input devices don't generate interrupts at predictable intervals because these happen in response to unpredictable, external events. Timing of keyboard and mouse events is considered a high quality source of randomness. The only reason network input isn't is that some 'external adversary' could control the timing of network input events, at least to a degree.

Improving idle behavior in tickless systems

Posted Dec 28, 2018 20:07 UTC (Fri) by Paf (subscriber, #91811) [Link] (7 responses)

The article does not seem to suggest or mention using input devices. It just says “devices”. I took this to mean non-input devices, which I would expect to have predictable-ish interrupt rates.

Improving idle behavior in tickless systems

Posted Dec 28, 2018 20:20 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (6 responses)

The article said "devices". "Input devices" are a subset of "devices", hence, they weren't excluded. And their interrupt timing is not predictable (actually, the very reason for interrupts is that they notify the CPU of unpredictable, external events).

Improving idle behavior in tickless systems

Posted Dec 29, 2018 0:53 UTC (Sat) by Paf (subscriber, #91811) [Link] (5 responses)

I guess I assumed the author and the large group of kernel developers reviewing these patches were also aware of these facts about user input devices. At least among kernel developers, they are common knowledge.

Improving idle behavior in tickless systems

Posted Dec 31, 2018 21:24 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link] (4 responses)

The article contained the following statement:
Devices cause interrupts at (more or less) predictable intervals that depend on the applications that are running; a cpuidle governor can measure these intervals to make predictions for when the next device interrupt will occur.
And this is simply not true. A NIC will generate interrupts in response to uncontrollable, external events for both sending and receiving. The same is true for any other kind of interrupt-using device not driven by the computer itself. As I already wrote, the very reason interrupts exist is to notify a CPU of some unpredictable change in state which needs to be handled now (or as close to now as possible) regardless of whatever else the CPU happened to be doing at the time of the event.

The code also doesn't even try to "predict" anything. After heuristically selecting some idle state based on the time until the next timer event, it compares the length of the last 8 idle periods against the 'target residency' of the selected state (in a seriously contorted way, BTW) and if more than half of them were smaller, selects a different idle state (simplification).

Improving idle behavior in tickless systems

Posted Dec 31, 2018 21:37 UTC (Mon) by corbet (editor, #1) [Link] (2 responses)

I'm sorry, but I think you're trying a little too hard here.

An individual interrupt is not predictable, but, as with the weather, a prediction that "things will happen in the future much like they have in the recent past" is relatively likely to be true. Interrupts are not quantum events, they are the result of the workload and the environment that the system is in. So there is value in a heuristic that looks at residency times and predicts that something similar will happen next.

Perhaps it's time to give this a rest?

Improving idle behavior in tickless systems

Posted Dec 31, 2018 21:46 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

The code still doesn't and cannot predict device interrupts. That's something the article simply got wrong. Some part of it is based on the assumption that an idle state shouldn't be used if more than 4 of the last 8 measured idle periods were shorter than its so-called target residency. There doesn't seem to be any published rationale for this design and small powers of 2 always raise the suspicion of someone picking a number from a hat.

This may well be a sensible heuristic but a heuristic it remains --- no fortune telling aka "prediction" involved.

Improving idle behavior in tickless systems

Posted Dec 31, 2018 23:48 UTC (Mon) by rahulsundaram (subscriber, #21946) [Link]

> This may well be a sensible heuristic but a heuristic it remains --- no fortune telling aka "prediction" involved.

Prediction != fortune telling. There are multiple meanings for the same word.

https://www.merriam-webster.com/dictionary/prediction

One of them is forecast which is defined as

"to calculate or predict (some future event or condition) usually as a result of study and analysis of available pertinent data"

Improving idle behavior in tickless systems

Posted Jan 2, 2019 7:33 UTC (Wed) by flussence (guest, #85566) [Link]

>And this is simply not true. A NIC will generate interrupts in response to uncontrollable, external events for both sending and receiving. The same is true for any other kind of interrupt-using device not driven by the computer itself.
Input devices are some of the most predictable, actually. A pointer interrupt is statistically almost certain to be followed by another (thanks to Newton's laws of physics), and at regularly spaced intervals (grep MOUSE_DPI /lib/udev/hwdb.d/70-mouse.hwdb). A keyboard key down event is often followed by a corresponding key up a few milliseconds later, or some action on a known timeout.

Strong assumptions can also be made about network interrupts due to the earliest departure time queueing algorithm, wifi packet intervals, characteristics of low-spec consumer ISP hardware, timer quantization everywhere else in the stack. I just left wireshark open for a minute and saw several flows with regularly spaced packets.

Algorithmically generated inputs are really not good sources of entropy, if they were servers wouldn't need hardware RNGs.

Improving idle behavior in tickless systems

Posted Dec 29, 2018 7:38 UTC (Sat) by matthias (subscriber, #94967) [Link]

There are essentially two reasons, why this should not be a big problem for this prediction code:

First, user generated input interrupts are very few. Most people do not generate hundreds of keystrokes per second. And especially the average is really low. Therefore these interrupts hardly make a difference.

Second, while the user is producing input, especially moving the mouse, the interrupts are not that random as suggested. To generate randomness from these interrupts, one uses the least significant bits of the timing, i.e. nano- and microseconds. For power saving, we only need a rough estimate of the length of the idle period. Guessing the first bit that is not zero should be more than enough for this.

For network traffic, this is the same. Either there are few interrupts, or there is an ongoing transmission, which produces interrupts at quite regular intervals (or no interrupts at all, because the kernel switches to polling).

Improving idle behavior in tickless systems

Posted Dec 29, 2018 1:11 UTC (Sat) by bokr (guest, #58369) [Link]

I'm wondering about systems that contain multiple cooperating CPUs and devices (incl GPUs with DMA
capabilities etc) and whose CPUs may vary in power draw (which I'm presuming will weight the total cost
calculation).

Do the prediction algorithms only consider interrupts to a particular core, independently from equivalent
calculations for another core? What of interrupts due to IPC or other software-initiated interrupts?
What about convoying effects between cores doing different ideas of interdependent userland parallelism?

Seems like an overall optimum might not be achieved without a higher level model of activities (e.g, should
file system drivers consider what their scheduling and consolidations might do to cpuidle calculations?).
And should the programmer be able to hint stuff, like known syncings to, say video refresh, or continuous
DMA input circular buffer segment transitions, e.g., for audio samples at a given sample frequency etc?

Can CPU idle state transitions themselves do hardware interrupts accepted by a designated low power core
doing scheduling work? Open invention gift if not ;-)

Improving idle behavior in tickless systems

Posted Jan 1, 2019 21:42 UTC (Tue) by meyert (subscriber, #32097) [Link]

Would it be possible to move this kind of policy into a fork_usermode_blob (elf module)?


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds