
Toward a more power-efficient scheduler

By Jonathan Corbet
April 10, 2013
When the system is idle, Linux is quite good at putting the processors into a deep sleep state and keeping them there; when we do nothing, we do it as efficiently as anybody. On a heavily loaded system, about the only thing to be done to save power is to get the work done as quickly as possible; again, Linux does well in this situation. In cases where the system is lightly loaded, though, the scheduler does not always make the right decisions; that situation can only get worse as the hardware gets more complicated. A few developers are working to improve the power efficiency of the kernel's scheduler, but it will probably be a while before this work makes it into the mainline.

Small-task packing

Vincent Guittot's "packing small tasks" patch set was examined here in October 2012. Development continues with this patch set; version 3 was posted in late March. The idea behind small-task packing is to sweep processes that only run occasionally "to the left" of each scheduling domain, and, eventually, toward CPU 0. Collecting these tasks onto a relatively small number of processors allows them to be run more efficiently while keeping them from pulling other CPUs out of the idle state. As long as the destination CPUs are not overloaded with small tasks, the result should be better power utilization and no real change in throughput.

The patch set is now able to take advantage of the per-entity load tracking feature, which was merged for the 3.8 kernel. Better load tracking allows the easy identification of the small tasks in the system, and a more precise characterization of just how small they are. So Vincent's patch is now able to pick only those tasks which take no more than 20% of the available CPU time; additionally, statistics are kept so that only tasks which run for a maximum of 10ms at a time are eligible for packing. These heuristics attempt to ensure that the scheduler will not inadvertently pack tasks that will end up overloading the target CPU.
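As an illustration, here is a minimal userspace sketch of that two-threshold test. The struct task_stats type and the is_packable() helper are invented for this example; the real patch derives these numbers from the kernel's per-entity load-tracking data.

    /*
     * Minimal sketch of the small-task test described above.
     * struct task_stats and is_packable() are invented names; the
     * real patch gets these numbers from per-entity load tracking.
     */
    #include <stdbool.h>
    #include <stdio.h>

    struct task_stats {
        unsigned int load_pct;   /* share of available CPU time, 0-100 */
        unsigned int max_run_ms; /* longest observed run, in milliseconds */
    };

    /* Thresholds quoted above: at most 20% CPU, runs of at most 10ms. */
    static bool is_packable(const struct task_stats *ts)
    {
        return ts->load_pct <= 20 && ts->max_run_ms <= 10;
    }

    int main(void)
    {
        struct task_stats daemon  = { .load_pct = 3,  .max_run_ms = 2 };
        struct task_stats encoder = { .load_pct = 85, .max_run_ms = 40 };

        printf("daemon packable:  %d\n", is_packable(&daemon));  /* 1 */
        printf("encoder packable: %d\n", is_packable(&encoder)); /* 0 */
        return 0;
    }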

The actual packing happens when a small task wakes up, since such tasks are likely to be sleeping when the scheduler's periodic load balancing happens. Only one possible destination CPU is checked — the "buddy" assigned to each CPU at system initialization time. If the buddy is lightly loaded, the small task will be moved in that direction. Since the buddy relationship is one-way, tasks will tend to migrate toward lower-numbered CPUs. As a result, packing may happen slowly, but, since only one CPU needs to be checked at wakeup time, the overhead of the actual packing decision is low.
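The buddy mechanism might be sketched roughly as follows. The buddy[] array, the flat capacity model, and the pack_on_wakeup() helper are all hypothetical stand-ins; the real patch picks buddies from the system's topology rather than simple CPU numbering.

    /*
     * Sketch of the wakeup-time packing decision: each CPU has one
     * pre-assigned buddy, and a waking small task moves only if that
     * single buddy has room. Names and the load model are invented.
     */
    #include <stdio.h>

    #define NR_CPUS  4
    #define CAPACITY 100

    static int buddy[NR_CPUS]; /* buddy[cpu] = packing target for cpu */
    static int load[NR_CPUS];  /* current load, 0..CAPACITY */

    static void init_buddies(void)
    {
        /* one-way relation pointing at lower-numbered CPUs */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            buddy[cpu] = cpu ? cpu - 1 : 0;
    }

    /* Return the CPU a waking small task should run on. */
    static int pack_on_wakeup(int prev_cpu, int task_load)
    {
        int target = buddy[prev_cpu];

        /* only one candidate is ever examined, keeping wakeups cheap */
        if (load[target] + task_load <= CAPACITY)
            return target;
        return prev_cpu;
    }

    int main(void)
    {
        init_buddies();
        load[0] = 96; /* CPU 0 is nearly full */
        load[1] = 10;

        printf("wakeup on CPU 2 -> CPU %d\n", pack_on_wakeup(2, 5)); /* 1 */
        printf("wakeup on CPU 1 -> CPU %d\n", pack_on_wakeup(1, 5)); /* stays on 1 */
        return 0;
    }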

Power-aware scheduling

Alex Shi's power-aware scheduling patch set (last looked at in August 2012) takes a more comprehensive approach to the same problem. As with Vincent's patch set, one of the goals is to collect small tasks onto a relatively small set of CPUs. Once again, small tasks are migrated toward common CPUs at wakeup (or fork()) time. In this case, though, there are no "buddy" CPUs; instead, the migration code will examine all CPUs on the system, looking for the busiest CPU that still has some free time available.
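That placement rule could look something like the sketch below, which scans all CPUs for the busiest one that can still absorb the waking task; the array and the simple capacity model are invented for illustration and are not Alex's actual code.

    /*
     * Sketch of the placement rule: scan every CPU and pick the
     * busiest one that still has room for the task.
     */
    #include <stdio.h>

    #define NR_CPUS  4
    #define CAPACITY 100

    static int load[NR_CPUS] = { 60, 85, 20, 0 };

    static int find_packing_target(int task_load)
    {
        int best = -1;

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            if (load[cpu] + task_load > CAPACITY)
                continue; /* would overload this CPU */
            if (best < 0 || load[cpu] > load[best])
                best = cpu; /* busiest CPU with free time wins */
        }
        return best;
    }

    int main(void)
    {
        /* CPU 1 is busiest but still has room for a 10-unit task */
        printf("target: CPU %d\n", find_packing_target(10)); /* 1 */
        return 0;
    }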

This patch set also works at the load-balancing level. When load balancing occurs, the code will look for the non-idle CPUs with the lowest load, on the assumption that those CPUs could be made to go idle if the load were moved elsewhere. If a sufficiently lightly loaded CPU can be found, its processes will be pushed in the direction of more heavily loaded CPUs (though always to those that have some spare capacity available). As with Vincent's patch, some new statistics must be kept to help inform the process migration decisions. One significant difference is that Vincent's buddy scheme explicitly tries to move processes to CPUs driven by a different power line (so that the vacated CPUs can be powered down). Alex's patch, instead, collects processes onto a smaller number of CPUs without worrying about the power-line topology.
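A toy version of that balancing pass follows: pick the least-loaded non-idle CPU as the one to vacate, then find a busier CPU with enough spare capacity to absorb its load. Again, the helpers and the fixed per-CPU capacity are assumptions made for the example.

    /*
     * Toy balancing pass: find the least-loaded non-idle CPU
     * (cheapest to vacate), then a busier CPU with spare capacity
     * to absorb its work.
     */
    #include <stdio.h>

    #define NR_CPUS  4
    #define CAPACITY 100

    static int load[NR_CPUS] = { 70, 5, 0, 40 };

    static int find_vacate_candidate(void)
    {
        int src = -1;

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            if (load[cpu] == 0)
                continue; /* already idle; nothing to move */
            if (src < 0 || load[cpu] < load[src])
                src = cpu;
        }
        return src;
    }

    static int find_destination(int src)
    {
        int dst = -1;

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            if (cpu == src || load[cpu] == 0)
                continue; /* push toward busy CPUs, not idle ones */
            if (load[cpu] + load[src] > CAPACITY)
                continue; /* no spare capacity */
            if (dst < 0 || load[cpu] > load[dst])
                dst = cpu; /* prefer the more heavily loaded CPU */
        }
        return dst;
    }

    int main(void)
    {
        int src = find_vacate_candidate(); /* CPU 1, load 5 */
        int dst = find_destination(src);   /* CPU 0, load 70 */

        printf("move work from CPU %d to CPU %d\n", src, dst);
        return 0;
    }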

Another important difference is that, when the system becomes sufficiently loaded, Alex's patch set gives up on coalescing small tasks and switches over to the "performance" mode. In that mode, the active heuristic tries to spread tasks out to get the work done as quickly as possible. It is, in a sense, an implementation of the old "race to idle" idea; once there is a certain amount of work to do, it is best to simply throw all available resources at getting it done quickly so that the system can go back to sleep. As one might imagine, there is an inherent hazard to toggling between two modes that, respectively, gather tasks together and spread them out. To avoid excessive thrashing, there is some hysteresis built into the algorithm to keep it from switching between modes or moving processes around too often.
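The hysteresis idea can be shown with a small state machine using two different thresholds, so that the mode only changes when the load moves decisively. The 80% and 50% trigger points below are invented; the actual patch uses its own statistics and thresholds.

    /*
     * Small state machine showing hysteresis between the two modes:
     * a load hovering between the two thresholds cannot flip the
     * mode back and forth.
     */
    #include <stdio.h>

    enum sched_mode { MODE_POWERSAVE, MODE_PERFORMANCE };

    static enum sched_mode mode = MODE_POWERSAVE;

    static void update_mode(int system_load_pct)
    {
        /* switch up at 80%, but only switch back down below 50% */
        if (mode == MODE_POWERSAVE && system_load_pct > 80)
            mode = MODE_PERFORMANCE;
        else if (mode == MODE_PERFORMANCE && system_load_pct < 50)
            mode = MODE_POWERSAVE;
    }

    int main(void)
    {
        int samples[] = { 30, 85, 75, 75, 40 };

        for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
            update_mode(samples[i]);
            printf("load %2d%% -> %s\n", samples[i],
                   mode == MODE_POWERSAVE ? "powersave" : "performance");
        }
        return 0; /* note: the 75% samples do not flip the mode back */
    }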

Heterogeneous systems

The above discussion assumes that all CPUs in a system are the same in every respect except their power connections. ARM's "big.LITTLE" architecture violates that assumption by packaging Cortex A15 processors (which are fast and power-hungry) on the same chip as Cortex A7s (which are slow and power-efficient). The Linux CPU scheduler was not written with such systems in mind, so it is not surprising that it does not make the best decisions when running in that environment. The full solution to this problem is involved and, arguably, not fully understood at this time. But one can try to make small steps in the right direction as a way of getting closer to running optimally on heterogeneous systems.

Morten Rasmussen's "task placement on mixed cpu_power systems" patch set represents a couple of those small steps. Rather than concern itself with small tasks, though, this patch set works with the larger processes running in the system. In particular, if a low-powered CPU is running without idle time and there is a high-powered CPU available, the code will attempt to move the most CPU-hungry process over to that high-powered CPU. In this way, the more compute-intensive jobs will run more quickly, which is good for performance; by allowing the system to go idle sooner, it should also provide better power utilization.
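In rough outline, that placement step might look like the sketch below, where a saturated little CPU looks for a big CPU with headroom and nominates its hungriest task for migration. The struct cpu layout and all of the numbers are hypothetical.

    /*
     * Rough outline of the mixed-cpu_power placement step. All
     * structures and values are invented for illustration.
     */
    #include <stdio.h>

    struct cpu {
        int big;            /* 1 = fast A15-class, 0 = efficient A7-class */
        int idle_pct;       /* recent idle time */
        int hungriest_load; /* load of its heaviest task */
    };

    /* Return the index of a big CPU with headroom, or -1. */
    static int find_big_target(const struct cpu *cpus, int n)
    {
        for (int i = 0; i < n; i++)
            if (cpus[i].big && cpus[i].idle_pct > 0)
                return i;
        return -1;
    }

    int main(void)
    {
        struct cpu cpus[] = {
            { .big = 0, .idle_pct = 0,  .hungriest_load = 60 }, /* saturated A7 */
            { .big = 1, .idle_pct = 40, .hungriest_load = 10 },
        };

        for (int i = 0; i < 2; i++) {
            if (cpus[i].big || cpus[i].idle_pct > 0)
                continue; /* only saturated little CPUs need help */
            int dst = find_big_target(cpus, 2);
            if (dst >= 0)
                printf("migrate heaviest task: CPU %d -> CPU %d\n", i, dst);
        }
        return 0;
    }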

Benchmarks provided with the patch set show that, indeed, throughput improves considerably when more CPU-intensive tasks are steered toward the more capable CPUs. Things improve even more when a rather perverse behavior in the current scheduler is fixed. The scheduler already tracks the "power" of each CPU; that value is used to direct tasks toward CPUs (and higher-level scheduling domains) that have the most available power. In the face of CPUs with significantly different capabilities, though, the scheduler may see a big CPU as having more available power than a small one even if the big CPU is already loaded and the small CPU is idle. It could, consequently, add more tasks to the loaded CPU, leaving the idle CPU with no work to do. Morten's patch set addresses this problem by making a point of spreading tasks across the available CPUs before piling more work onto the bigger ones, on the theory that throughput will be maximized in that way.
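The perverse behavior and the fix can be contrasted in a few lines: a naive rule keyed on absolute spare capacity picks the half-loaded big CPU, while a "spread first" rule prefers the idle little one. The capacity numbers are, of course, made up.

    /*
     * Contrast of the two rules on a two-CPU big.LITTLE example.
     */
    #include <stdio.h>

    struct cpu {
        int power; /* nominal capacity: big CPU > little CPU */
        int load;
    };

    /* Naive rule: pick the CPU with the most absolute spare capacity. */
    static int pick_by_spare_power(const struct cpu *c, int n)
    {
        int best = 0;

        for (int i = 1; i < n; i++)
            if (c[i].power - c[i].load > c[best].power - c[best].load)
                best = i;
        return best;
    }

    /* Fixed rule: an idle CPU always wins before piling onto a busy one. */
    static int pick_spread_first(const struct cpu *c, int n)
    {
        for (int i = 0; i < n; i++)
            if (c[i].load == 0)
                return i;
        return pick_by_spare_power(c, n);
    }

    int main(void)
    {
        /* big CPU (power 200) already half loaded; little CPU idle */
        struct cpu cpus[] = { { .power = 200, .load = 100 },
                              { .power = 80,  .load = 0 } };

        printf("naive:  CPU %d\n", pick_by_spare_power(cpus, 2)); /* 0 */
        printf("spread: CPU %d\n", pick_spread_first(cpus, 2));   /* 1 */
        return 0;
    }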

Conclusion

All of the patch sets described above have been circulating for some time, but none seem set to be merged for 3.10. Part of the problem is that changing heuristic-heavy core code is always hard; the maintainers want a high level of assurance that such changes will not cause regressions for other workloads. Some of the trouble is political; there are some big.LITTLE precursor patches that are currently blocked for non-technical reasons that have yet to be worked out. And part of the problem is just to be expected when a new batch of developers from a different part of the industry starts to work on code that has long been managed by developers concerned mostly with enterprise computing.

In a sense, the challenges for scheduling have changed in a subtle but important way. Some years back, the problem came down to deciding which task(s) to run next, and for how long. The completely fair scheduler has, to a first approximation, solved that problem well enough. Now, rather than worrying about which processes to run, scheduler developers are having to concern themselves with where those tasks should run in a large and complex system. In the process, they have to try to maximize both throughput and power efficiency. Often, those two goals are complementary: the fastest way to run a job tends to be the most efficient. But sometimes the two goals conflict. It is going to be interesting to see how the scheduler developers solve those problems in a way that scales across a wide range of workloads.




Toward a more power-efficient scheduler

Posted Apr 11, 2013 10:51 UTC (Thu) by pj (subscriber, #4506)

Am I the only one reminded of openmosix's "economic" scheduler for figuring out which machine to migrate processes to? It seems like something similar might be useful to distribute work among heterogeneous cpus.

Toward a more power-efficient scheduler

Posted Apr 12, 2013 12:37 UTC (Fri) by ortalo (guest, #4654)

I do not remember that specific aspect of openmosix personally, but you are right to point out that several topics successfully approached in the context of distributed systems (or high-performance parallel computers), which should be applicable today to multiprocessors, seem to get reinvented regularly (and sometimes painfully).
Obviously the environment is not the same now; possibly these works are not available online, only less conveniently via libraries and journal articles; and maybe some people are simply getting old and nostalgic ;-). But better bibliographic references do seem worth recommending at times...

Toward a more power-efficient scheduler

Posted Apr 25, 2013 6:19 UTC (Thu) by amit.kucheria (subscriber, #59246)

One thing I'd like to point out about Vincent's patchset: it too will start spreading tasks out after a certain load threshold (like Alex's patchset).


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds