Teaching the scheduler about power management

June 18, 2014

This article was contributed by Nicolas Pitre

Power-efficient CPU scheduling is increasingly important in the mobile world, but it has become just as important in large data-center settings, where the electricity bills can be painful indeed. Unfortunately, the kernel's infrastructure for CPU power management lacks integration with the scheduler itself, with the result that scheduling decisions are not as good as they should be. This article reviews the state of the CPU power-management mechanisms in the kernel and looks at what is being done to improve the situation.

A bit of history

A process scheduler is the core component of an operating system responsible for selecting which process to run next. The scheduler implementation in the Linux kernel has been through a couple of iterations — and even complete rewrites — over the years. The Completely Fair Scheduler (CFS), written by Ingo Molnar, was introduced in the 2.6.23 kernel. It replaced the O(1) scheduler which, in turn, was introduced in version 2.5.2 of the kernel, also by Ingo, replacing the scheduler implementation that existed before that. Despite all of the different algorithms, the general goal is always the same: to try to make the most of available CPU resources.

CPU resources have also evolved during this time. Initially, the scheduler's role was simply to properly manage processor time between all runnable processes. Increasing parallelism in hardware due to the emergence of SMP, SMT (or Hyper-threading), and NUMA added more twists to the problem. And, of course, the scheduler had to scale to an ever-increasing number of processes and processors in the same system without consuming too much CPU time on its own. These changes explain why multiple scheduler implementations have been developed over the years and are still being worked on today. In the process, the scheduler has grown in complexity and only a few individuals have become experts in this area.

Initially, task scheduling was all about throughput with no regard for energy consumption; scheduler work was driven by the enterprise space, where everything was plugged into the wall. At the other end of the spectrum, we saw the emergence of battery-operated devices from the embedded and mobile space, where power management is a primary concern. Separate subsystems dealing with power management, such as cpuidle and cpufreq, were introduced and contributed to by a different set of developers with little scheduler experience. In due course, the power-management subsystems grew in complexity as well, acquiring their own set of experts.

This split arrangement worked out reasonably well… at least initially. The isolation between the subsystems allowed for easier development and maintenance. With mobile devices growing in capabilities, as well as ever-increasing data-center electric bills, everyone started caring about energy efficiency. This brought about core kernel changes such as deferrable timers, dyntick, and runtime power management. The rise of multi-core portable devices pushed the need for yet more aggressive power-management tricks such as the controversial use of CPU offlining.

There is a pattern that emerges from these changes: the more complex the scheduler and power management become, the more isolated they are from each other. This turns out to be completely counterproductive since, as we'll see later, one side cannot predict what the other might do in the near future. Because of (or in spite of) that, some chip manufacturers are increasingly implementing dynamic voltage and frequency scaling (DVFS) in hardware, away from the operating system, which exacerbates the problem. Yet support for ARM's big.LITTLE and the increasing influence scheduler decisions have on power consumption in general have made it clear that merging power management with the scheduler is becoming unavoidable.

Scheduler: meet cpuidle

The cpuidle subsystem tries to minimize power consumption by selecting a low-power mode, or idle mode (often referred to as a C-state), when the CPU is idle. However, idling a CPU comes with a price: the more power savings such a mode provides, the longer it will take for the affected CPU to become fully operational again. A good balance between the power actually saved and the time "wasted" entering and exiting a power-saving mode has to be reached. Furthermore, simply transitioning in and out of many modes consumes a non-negligible amount of power, meaning the CPU has to be idle for a sufficiently long period of time for those modes to be worth entering. Most CPUs have multiple idle modes, providing different trade-offs between achievable power savings and latency.

Therefore, the cpuidle code has to gather statistics on actual CPU usage to select the most appropriate idle mode depending on the observed idleness pattern of the CPU. And this statistics-gathering work duplicates what the scheduler already does, albeit through indirect and somewhat imprecise heuristics.
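
To make the trade-off concrete, here is a minimal, self-contained sketch (plain C, not kernel code) of the kind of decision a cpuidle governor has to make: given a predicted idle duration and a latency constraint, pick the deepest idle state whose target residency and exit latency both fit. The state table, numbers, and field names are invented for the example.

    /*
     * Illustrative model of the cpuidle trade-off described above; the
     * state table and its values are made up for the example.
     */
    #include <stdio.h>

    struct idle_state {
        const char *name;
        unsigned int exit_latency_us;     /* time to become operational again */
        unsigned int target_residency_us; /* minimum idle time to be worthwhile */
    };

    static const struct idle_state states[] = {
        { "WFI",       1,    1 },
        { "retention", 50,   200 },
        { "power-off", 500,  2000 },
    };

    /* Deepest state whose residency and latency constraints are both met. */
    static int select_idle_state(unsigned int predicted_idle_us,
                                 unsigned int latency_req_us)
    {
        int i, best = 0;

        for (i = 0; i < (int)(sizeof(states) / sizeof(states[0])); i++) {
            if (states[i].target_residency_us <= predicted_idle_us &&
                states[i].exit_latency_us <= latency_req_us)
                best = i;
        }
        return best;
    }

    int main(void)
    {
        int s = select_idle_state(1500, 300);

        printf("predicted 1500us idle -> %s\n", states[s].name);
        return 0;
    }

The hard part, of course, is producing the predicted idle duration; that prediction is exactly the statistics-gathering work that the scheduler is in a position to do better.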

Idleness patterns are determined by wake-up events bringing the CPU out of idle. Those events can be classified into three categories:

  • Predictable events: This group comprises all timers whose next expiry time can be obtained, from which a minimum idle period can be deduced.

  • Semi-predictable events: These are somewhat repetitive events, like I/O request completions, that often follow a known pattern.

  • Random events: Anything else, such as keystrokes, touchscreen events, network packets, etc.

By directly involving the scheduler in the idle-state selection process, we can do a much better job at considering the semi-predictable events. I/O patterns are mainly a function of those tasks generating them and the device they're directed to. The scheduler can therefore keep track of the average I/O latency on a per-task basis and, possibly, with some input from the I/O scheduler, provide an estimated delay for the next I/O completion to occur according to the list of waiting tasks on a given CPU. And if a task is migrated to a different CPU, its I/O latency statistics are migrated along. The scheduler is therefore in a better position to appreciate the actual idleness of a CPU.
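
As a rough illustration of that bookkeeping, the sketch below (plain C, not kernel code) keeps an exponentially weighted average of each task's I/O completion latency and uses the tasks blocked on a CPU to estimate when the next wake-up is likely to arrive. The structure, the 7/8 weighting, and the helper names are all assumptions made for the example.

    #include <stdio.h>

    struct task_io_stats {
        unsigned long long avg_io_latency_us; /* smoothed completion latency */
    };

    /* Fold a new observation in: 7/8 old value + 1/8 new, a cheap EWMA. */
    static void task_io_completed(struct task_io_stats *t,
                                  unsigned long long latency_us)
    {
        if (t->avg_io_latency_us == 0)
            t->avg_io_latency_us = latency_us;
        else
            t->avg_io_latency_us = (t->avg_io_latency_us * 7 + latency_us) / 8;
    }

    /* Earliest expected I/O completion among tasks waiting on this CPU. */
    static unsigned long long next_expected_wakeup(const struct task_io_stats *w,
                                                   int nr, unsigned long long now)
    {
        unsigned long long earliest = ~0ULL;
        int i;

        for (i = 0; i < nr; i++)
            if (w[i].avg_io_latency_us < earliest)
                earliest = w[i].avg_io_latency_us;
        return nr ? now + earliest : ~0ULL;
    }

    int main(void)
    {
        struct task_io_stats t = { 0 };

        task_io_completed(&t, 400);
        task_io_completed(&t, 800);
        printf("average I/O latency: %lluus\n", t.avg_io_latency_us);
        printf("next expected wakeup: %lluus\n",
               next_expected_wakeup(&t, 1, 1000000ULL));
        return 0;
    }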

It is therefore necessary for the scheduler and cpuidle to become better integrated, to let the scheduler manage the available idle modes and eventually bypass the current cpuidle governors entirely. Moving the main idle loop into the scheduler code will also allow for better accounting of CPU time spent in the servicing of interrupts and their occurrence rate while idle.

Furthermore, the scheduler should be aware of the current idle mode on each CPU to do a better job at load balancing. For instance, let's consider the function find_idlest_cpu() in kernel/sched/fair.c, which looks for the least-loaded CPU by comparing the weighted CPU load value for each CPU. If multiple CPUs are completely idle, their load would be zero, with no distinction for the idle mode they're in. In this case, it would be highly beneficial to choose the CPU whose current idle mode has the shortest exit latency. If idle exit latency is the same for all idle CPU candidates then the last to have entered idle mode is more likely to have a warmer cache (assuming the relevant idle mode preserves cache, of course). An initial patch series to that effect was posted by Daniel Lezcano.
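
A simplified model of that tie-breaking logic might look like the following (plain C, not the actual kernel/sched/fair.c code; the structures are invented for the example): among CPUs with equal load, prefer the one whose idle state is cheapest to leave and, among those, the one that went idle most recently.

    #include <stdio.h>

    struct cpu_state {
        unsigned long load;             /* weighted load, 0 when idle */
        unsigned int exit_latency_us;   /* of the idle state currently entered */
        unsigned long long idle_since;  /* timestamp of idle entry */
    };

    static int find_idlest_cpu(const struct cpu_state *cpu, int nr)
    {
        int i, best = 0;

        for (i = 1; i < nr; i++) {
            if (cpu[i].load != cpu[best].load) {
                if (cpu[i].load < cpu[best].load)
                    best = i;
                continue;
            }
            /* Same load: prefer the cheapest idle state to leave... */
            if (cpu[i].exit_latency_us != cpu[best].exit_latency_us) {
                if (cpu[i].exit_latency_us < cpu[best].exit_latency_us)
                    best = i;
                continue;
            }
            /* ...and, on a further tie, the most recently idled CPU. */
            if (cpu[i].idle_since > cpu[best].idle_since)
                best = i;
        }
        return best;
    }

    int main(void)
    {
        struct cpu_state cpus[] = {
            { 0, 500, 100 },  /* deep idle, idled long ago */
            { 0,   1, 250 },  /* shallow idle, idled recently */
            { 128, 0, 300 },  /* busy */
        };

        printf("idlest CPU: %d\n", find_idlest_cpu(cpus, 3));
        return 0;
    }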

This also highlighted the fact that some definitions for the same expression may differ depending on one's perspective. A function called find_idlest_cpu() in the scheduler context is simply the converse of find_busiest_cpu(), whereas in the cpuidle context this would mean looking for the CPU with the deepest idle state. The deeper an idle state is, the more costly it is to bring a CPU back to operational state — clearly not what we want here. A similar confusion may occur with the word "power". The traditional meaning in the scheduler code is "compute capacity" while it means "energy consumption rate" in a power management context. Patches to clarify this have recently been merged.

Scheduler: meet cpufreq

The scheduler keeps track of the average amount of work being done by scheduled tasks on each CPU in order to give each task fair access to CPU resources and to decide when to perform load balancing. The ondemand cpufreq governor does similar load tracking in order to dynamically set each CPU's clock frequency to optimize battery life. Since energy consumption is proportional to the square of the supply voltage, it is desirable to run at the lowest clock frequency that allows the CPU voltage to be reduced while still being fast enough to perform all of the scheduled work in a given period of time.
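
As a toy illustration of what an ondemand-style governor does with that information (this is not the actual governor code; the 80% target and the frequency table are invented), the sketch below picks the lowest frequency at which the work observed over the last sampling period would still fit under the utilization target:

    #include <stdio.h>

    static const unsigned int freq_table_khz[] = {
        300000, 600000, 900000, 1200000
    };
    #define NR_FREQS (sizeof(freq_table_khz) / sizeof(freq_table_khz[0]))

    /*
     * busy_pct: percentage of the last sampling period spent executing.
     * cur_khz:  frequency the CPU ran at during that period.
     * Returns the lowest frequency at which the same work would stay
     * below an 80% utilization target.
     */
    static unsigned int pick_frequency(unsigned int busy_pct, unsigned int cur_khz)
    {
        /* Work done, expressed in "kHz equivalents" at 80% utilization. */
        unsigned long long demand = (unsigned long long)cur_khz * busy_pct / 80;
        unsigned int i;

        for (i = 0; i < NR_FREQS; i++)
            if (freq_table_khz[i] >= demand)
                return freq_table_khz[i];
        return freq_table_khz[NR_FREQS - 1];
    }

    int main(void)
    {
        printf("%u kHz\n", pick_frequency(95, 600000));   /* ramps up */
        printf("%u kHz\n", pick_frequency(20, 1200000));  /* scales down */
        return 0;
    }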

As with cpuidle, the cpufreq subsystem was developed in isolation from the CPU scheduler. Many problems result from the split between those subsystems:

  • The cpufreq code goes to great lengths trying to evaluate the actual CPU load through indirect means, including heuristics to avoid misprediction, while, once again, the scheduler has all this information available already.

  • The scheduler can determine the load contribution of individual tasks, whereas the cpufreq code has no such ability. When a task is migrated or wakes up, the scheduler can determine in advance what the load on the target CPU is likely to become; the cpufreq code can only notice an average load increase and react to it after the fact.

  • The scheduler records the execution time for each task in order to ensure fairness between all tasks. However, since the scheduler has no awareness of CPU frequency changes, tasks executing on a CPU whose clock has been slowed down will be unfairly charged more execution time than similar tasks running on another CPU with a faster clock for the same amount of work. Fairness is thus compromised.

  • As the CPU clock frequency is reduced, the resulting apparent increase in task load may trigger load balancing toward a less-loaded CPU in order to spread the load, even though this increase in apparent load was exactly what cpufreq intended in the first place.

  • And if that load balancing happens while the target CPU's clock frequency is reduced, then that CPU could end up being oversubscribed. Because there is no coordination between the scheduler and cpufreq, either (or both) of them may react by, respectively, migrating a task back or raising the CPU clock frequency. The CPU may suddenly be underutilized, and the cycle could repeat.

To fix this, the current plan is to integrate cpufreq more tightly with the scheduler. The various platform-specific, low-level cpufreq drivers will remain unchanged and still register with the cpufreq core as usual; however, the governors — the part that decides what clock frequency to ask for and when — could be bypassed entirely. In fact, the scheduler could simply register itself as a new governor with the cpufreq core.

The advantage of a tighter integration of cpufreq with the scheduler is the ability to be proactive with clock frequency changes rather than reactive, and to coordinate better with scheduler activities like load balancing. A CPU clock frequency change could be requested in anticipation of a load change; this could happen in response to a fork() or exit() call, or when a task sleeps or wakes up. The frequency policy could differ depending on the particular scheduler event, the task's historical behavior patterns, and so on.
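
A minimal sketch of that proactive flow, assuming a hypothetical hook called on task enqueue (none of these names exist in the kernel): fold the waking task's expected utilization into the destination CPU's, and request a higher frequency immediately if the result would not fit at the current speed.

    #include <stdio.h>

    /* Hypothetical per-CPU bookkeeping. */
    struct cpu {
        unsigned long util;     /* current utilization, scale of 1024 at max freq */
        unsigned int  max_khz;
        unsigned int  cur_khz;
    };

    /* Stand-in for asking the cpufreq driver for a new frequency. */
    static void request_frequency(struct cpu *c, unsigned int khz)
    {
        c->cur_khz = khz;
        printf("frequency request: %u kHz\n", khz);
    }

    /*
     * Called from a wake-up/enqueue path: add the task's expected
     * utilization and, if the CPU would exceed ~80% at its current
     * speed, raise the frequency right away rather than after the fact.
     */
    static void enqueue_task_util(struct cpu *c, unsigned long task_util)
    {
        unsigned long long needed;

        c->util += task_util;
        needed = (unsigned long long)c->util * c->max_khz * 100 / (1024ULL * 80);
        if (needed > c->cur_khz)
            request_frequency(c, needed > c->max_khz ?
                                 c->max_khz : (unsigned int)needed);
    }

    int main(void)
    {
        struct cpu c = { 400, 1200000, 600000 };

        enqueue_task_util(&c, 300);  /* a sizable task wakes up here */
        return 0;
    }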

However, to be able to perform well in the presence of varying CPU clock frequencies, the notion of scale-invariant task load tracking must be added to the scheduler. This is a correction factor that normalizes load measurements from CPUs executing code at different speeds, so that the load contribution of a task can be predicted when the task is moved. The relative compute capacity of each CPU as seen by the scheduler also has to be adjusted according to its effective clock frequency in order to do proper load balancing. It is still unclear how accurate this correction factor can be, considering that tasks making lots of memory accesses are less likely to be influenced by the CPU clock speed than tasks performing computational work. Still, anything is going to be better than no correction at all, as we have today.
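
A plausible form of that correction factor, sketched below in plain C, scales a task's raw utilization by the ratio of current to maximum frequency and by the relative capacity of the CPU it ran on (the big.LITTLE factor discussed in the next paragraph). The 1024-based unit follows the spirit of the scheduler's capacity scale, but the formula itself is a simplified assumption, not the kernel's implementation.

    #include <stdio.h>

    #define CAPACITY_SCALE 1024

    /*
     * raw_util: fraction of wall-clock time the task was running (0..1024).
     * cur_khz/max_khz: current and maximum frequency of the CPU it ran on.
     * cpu_cap:  relative capacity of that CPU at maximum frequency
     *           (e.g. 1024 for a "big" core, a few hundred for a "little" one).
     */
    static unsigned long scale_invariant_util(unsigned long raw_util,
                                              unsigned int cur_khz,
                                              unsigned int max_khz,
                                              unsigned long cpu_cap)
    {
        unsigned long long u = raw_util;

        u = u * cur_khz / max_khz;          /* frequency invariance (cpufreq) */
        u = u * cpu_cap / CAPACITY_SCALE;   /* CPU invariance (big.LITTLE)    */
        return (unsigned long)u;
    }

    int main(void)
    {
        /* Task busy 50% of the time on a little CPU running at half speed. */
        printf("%lu\n", scale_invariant_util(512, 500000, 1000000, 430));
        return 0;
    }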

Incidentally, scale-invariant load tracking applies to big.LITTLE scheduling as well. Leaving cpufreq aside for a moment, a "little" CPU is permanently slowed down, which translates into a reduced compute capacity as seen by the scheduler; conversely, a "big" CPU has more capacity. With distinct correction factors permanently applied to "big" and "little" CPUs, the scheduler is likely to just work optimally in terms of task throughput, with no further changes. The cpufreq correction factor simply has to be combined with the "big" and "little" factors afterward.

Several developers, including Mike Turquette and Tuukka Tikkanen, are working on the cpufreq integration and initial patches should be posted for public review soon.

Scheduler: may the power be with you

Okay… So we might get to the point where cpuidle and cpufreq are tightly integrated with the scheduler. Are we done? Unlikely. In fact we now have more difficult decisions to make than before and they all relate to the new mechanisms at the scheduler's disposal to perform load balancing. For example:

  • When the system load goes up, should a new CPU be brought out of idle, or should the clock frequency on an already running CPU be increased instead? Or both?

  • Conversely, when the system load goes down, is it best to keep more CPUs alive with a reduced clock frequency or pack more tasks on fewer CPUs in order to send the other CPUs to sleep?

  • Is it best to consolidate loads onto fewer CPUs or to spread the load over more CPUs?

  • When is it time to perform active task packing to let a whole CPU cluster (or package) get into low-power mode?

The latest power-aware scheduler work from Morten Rasmussen provides a framework to evaluate the power cost of the available scheduling scenarios. This, in combination with Vincent Guittot's sched_domain topology and CPU capacity tracking consolidation work, should provide answers to the above questions.
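
To give a flavor of the kind of arithmetic such an energy model performs (a deliberately tiny sketch; the power numbers are invented, and a real model also accounts for idle-state and cluster-level costs), the example below compares packing a given amount of work onto one CPU at a high operating point against spreading it across two CPUs at a lower one:

    #include <stdio.h>

    struct opp {
        unsigned int capacity;    /* work units per period at this operating point */
        unsigned int busy_power;  /* mW while executing */
        unsigned int idle_power;  /* mW while idle at this level */
    };

    /* Average power over a period for one CPU handling `util` units of work. */
    static unsigned int cpu_energy(const struct opp *o, unsigned int util)
    {
        unsigned int busy = util > o->capacity ? o->capacity : util;

        return (o->busy_power * busy + o->idle_power * (o->capacity - busy))
               / o->capacity;
    }

    int main(void)
    {
        struct opp high = { 1024, 800, 50 };  /* one CPU at a high operating point */
        struct opp low  = {  512, 250, 30 };  /* two CPUs at a lower one */
        unsigned int work = 600;

        printf("packed on one fast CPU:  %u mW\n", cpu_energy(&high, work));
        printf("spread on two slow CPUs: %u mW\n",
               cpu_energy(&low, work / 2) + cpu_energy(&low, work / 2));
        return 0;
    }

With these made-up numbers, spreading wins; with a different platform, or once the cost of keeping a second CPU and its cluster out of deep idle is counted, packing might. That is exactly why the decision has to be driven by per-platform data rather than a fixed policy.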

What else?

We desperately need measurement tools to validate proposed solutions. Linaro is working on a tool called idlestat to validate idle-state usage and its effectiveness. Traditional benchmark tools such as sysbench may be combined with energy usage monitoring to provide a way to perform power characterization of a system. Extensions to cyclictest to create various synthetic workloads are being explored as well. This is still unwieldy, though, and more integration and automation are required.

This article hasn't covered thermal management. The Linux kernel implements a thermal-management interface that allows a user-space daemon to control thermal constraints. However, as we've seen, power-related issues are intertwined, and a thermal-control solution that lives separately from the scheduler is likely to be suboptimal. If the scheduler controls power states, it will also have to deal with platform temperature someday, providing thermal "provisioning" or the like. But we can save this for another day.

Thanks

Thanks to Amit Kucheria, Daniel Lezcano, Kevin Hilman, Mike Turquette, and Vincent Guittot for their help in reviewing this article.


Teaching the scheduler about power management

Posted Jun 20, 2014 7:00 UTC (Fri) by liam (subscriber, #84133) [Link]

This was a tremendously interesting article. Nice job.
I've a few comments, and questions, hopefully none of which are naive:)
Shouldn't predictable events, plus timer slack, indicate a maximum, rather than minimum, idle period?
You mentioned thermal management at the end, but what about platform power management? Say, putting USB to sleep, powering down the pcie bus a bit, etc?
All the questions at the end seem as though they should have fairly easy answers assuming the scheduler is made aware of a few package constants (available c/p states, energy consumed transitioning between c/p states (with the average wakeup period being then inferable), this scale invariant correction factor), along with the data it's already gathered for scheduling purposes.
Lastly, given your description of the interplay between the scheduler, cpufreq, and cpuidle, it's astonishing things have worked as well as they have. I'm not saying we should call the Vatican, but is there any other possibility than divine intervention that could explain how cpufreq and the scheduler have managed as well as they have?

Teaching the scheduler about power management

Posted Jun 20, 2014 7:57 UTC (Fri) by amit.kucheria (subscriber, #59246) [Link]

> but what about platform power management? Say, putting USB to sleep, powering down the pcie bus a bit, etc?

The scheduler should only make decisions based on the active subset of available idle states for the CPU. The assumption I'll make here is that peripheral power management is being handled by runtime PM.

If the available idle state of the CPU depends on the state of certain peripherals (e.g. having USB active prevents a CPU OFF state), then IMO the peripheral PM code should activate/deactivate the associated C-state. Then the scheduler/cpuidle just won't select it.

This requires modeling the SoC's separate power domains. There is some work happening in that area but there are probably gaps in what the current pm_domain/genpd/runtime PM/pm-qos abstractions allow us to do that still need fixing.

However, it can happen in parallel to the work of making the scheduler a bit more energy-aware.

Teaching the scheduler about power management

Posted Jun 20, 2014 21:59 UTC (Fri) by liam (subscriber, #84133) [Link]

> The scheduler should only make decisions based on the active subset of available idle states for the CPU. The assumption I'll make here is that peripheral power management is being handled by runtime PM.

Well, also p-states, I'd imagine, but it seems like you'd want to centralize power management SOMEWHERE. By centralize I mean that some single component is aware of all peripherals (including CPU) and can make decisions based upon overall system state (not always needed, but certainly useful at times). Such decisions seem best left up to the scheduler. A trivial example would be a lightning/thunderbolt connected peripheral, say, a GPU. The driver handles the power state of the GPU itself, but the pcie connection would be handled by the platform pm, I believe. Now, for a laptop, that GPU should be seen as just another compute source (not handled in the same way as the CPU cores by the scheduler but still as computational units for at least certain kinds of loads). Given some highly vectorized opencl code, the scheduler would seem to be in the best position to determine if it's worth bringing the GPU out of sleep to pound away at the data.
In more general cases, the scheduler would be the only single component in the system capable of making decisions about the periodic and semi-periodic events that would heavily inform power management of peripherals. For one thing, this can cut down on latency if the scheduler can accurately predict peripheral events some of the time. This is obviously a place where the data gathered would heavily inform the heuristics... and it's possible it may not be all that helpful in general. I don't know, but it appears to be the next needed step in order to increase platform efficiency.
It's for the above reasons that it seems, to me, that the scheduler would be in the best position to make decisions about peripheral power states.

> This requires modeling the SoC's separate power domains. There is some work happening in that area but there are probably gaps in what the current pm_domain/genpd/runtime PM/pm-qos abstractions allow us to do that still need fixing.

I'm sure you're right. Even now, I don't think all of the common drivers support pm-qos properly.
This is clearly an area where Linux has some catching up to do. Has anyone looked at how the OTHERS manage things?

> However, it can happen in parallel to the work of making the scheduler a bit more energy-aware.

Well, yeah, but only if it turns out that it doesn't make sense for the scheduler to be involved. I think I've provided a few cases where having the scheduler involved in platform-level decisions can yield benefits... but I might be completely off-base.

Teaching the scheduler about power management

Posted Jun 20, 2014 8:40 UTC (Fri) by intgr (subscriber, #39733) [Link]

> it's astonishing things have worked as well as they have. I'm not saying we should call the Vatican, but is there any other possibility than divine intervention that could explain how cpufreq and the scheduler have managed as well as they have?

Thanks to heuristics accumulated over the years. And it's not working so well; there are still common cases where cpufreq fails to ramp up frequencies sufficiently with CPU-bound workloads, like configure scripts or two processes synchronously communicating via RPC.

Teaching the scheduler about power management

Posted Jun 23, 2014 21:45 UTC (Mon) by linusw (subscriber, #40300) [Link]

For completeness, I guess I would complement the article with the concept of deep sleep (the lowest C-state in ACPI lingo). This happens when the scheduler calls arch_cpu_idle(), meaning "I have nothing whatsoever to do". So actually the scheduler *is* already doing some kind of power management: the final sledgehammer type.

At this point, if the hardware supports it, everything is powered off in the system, often including all the CPU cores, and the system goes into a state where it is totally dependent on some special always-on hardware to detect an external event (such as a wake-up signal from a keypad or network card) before Cinderella wakes up. In embedded systems these events are very often actually controlled in the pad ring around the chip edge (which we model and set up with the pin control mechanism). Sometimes the chip has a small low-power microcontroller for the sole purpose of waking up the system.

Systems that do not have the hardware to actually power off CPU cores usually have their cores go to a state where the CPU is powered but similarly catatonic, waiting for an IRQ of some sort, a state called "Wait For Interrupt" or WFI.

Another thing that occurred to me was that inputs sometimes need to come into the CPU frequency or idleness selection that relate to the state of the entire electronic or SoC system, and are unknown even to the scheduler. A typical example would involve a system with heavy use of DMA or GPUs or other accelerators, where these system components may require that the CPU be in a very responsive mode because these hardware engines need quick responses to their IRQs, or where the system in some other way needs to take power decisions (such as those related to thermal issues) depending on the load on such CPU-external datapaths.

Teaching the scheduler about power management

Posted Jun 24, 2014 6:20 UTC (Tue) by amit.kucheria (subscriber, #59246) [Link]

> This happens when the scheduler calls arch_cpu_idle() meaning "I have nothing whatsoever to do". So actually the scheduler *is* already doing some kind of power management: the final sledgehammer type.

Right. The closer cpuidle-scheduler integration will encourage this behaviour - if we get the right task spread across the CPUs, we should be able to go to these deep sleep states more often and stay there longer without impacting performance.

> A typical example would involve a system with heavy use of DMA or GPUs or other accelerators, where these system components may require that the CPU be in a very responsive mode due to the fact that these hardware engines need quick responses to their IRQs etc,

I believe this is the bit liam was alluding to earlier - whole-system power/performance management. A combination of runtime PM-enabled drivers, correct PM domain modeling of the SoC and smart use of pm-qos constraints *should* allow us to do this but there aren't many good examples of how to do this today. Definitely something to fix.

Teaching the scheduler about power management

Posted Jun 26, 2014 19:20 UTC (Thu) by facorread (guest, #59578) [Link]

Thanks for posting this excellent article.

Teaching the scheduler about power management

Posted Jun 27, 2014 16:21 UTC (Fri) by richdawe (subscriber, #33805) [Link]

Thank you, this article is excellent.

Teaching the scheduler about power management

Posted Jul 1, 2014 14:53 UTC (Tue) by nye (guest, #51576) [Link]

>As the CPU clock frequency is reduced, the resulting apparent increase in task load may trigger load balancing toward a less-loaded CPU in order to spread the load

Is this something that only happens if the load on that CPU is >=1, or will it always try to keep CPUs balanced even when they're all below capacity? If the latter, what's the reasoning behind that?

I've noticed, on both Linux and Windows, that a task consuming 100% CPU will be shuffled around frequently enough that (on a quad core CPU) each core permanently appears to be 25% loaded.

This seems suboptimal, and I've done some informal benchmarking on Windows with the apparent result that manually setting the processor affinity of such processes to two cores increases overall throughput by several percent, by allowing Turbo Boost to do its thing. I say 'apparent' result, because I've not done a large sampling, though I did do enough runs that I'm reasonably happy with the result.

(Specifically, I was trying to reduce my Civ 5 turn times. Two cores work better than one there because it is in fact multithreaded; it's just that one thread represents at least 90% of the total load. So I'd guess that on a purely single-threaded workload it would probably be even better to force it on to one core.)

