Toward a more power-efficient scheduler
Small-task packing
Vincent Guittot's "packing small tasks" patch set was examined here in October 2012. Development continues with this patch set; version 3 was posted in late March. The idea behind small-task packing is to sweep processes that only run occasionally "to the left" of each scheduling domain, and, eventually, toward CPU 0. Collecting these tasks onto a relatively small number of processors allows them to be run more efficiently while keeping them from pulling other CPUs out of the idle state. As long as the destination CPUs are not overloaded with small tasks, the result should be better power utilization and no real change in throughput.
The patch set is now able to take advantage of the per-entity load tracking feature, which was merged for the 3.8 kernel. Better load tracking allows the easy identification of the small tasks in the system, and a more precise characterization of just how small they are. So Vincent's patch is now able to pick out only those tasks that take no more than 20% of the available CPU time; additionally, statistics are kept so that only tasks that run for at most 10ms at a stretch are eligible for packing. These heuristics attempt to ensure that the scheduler will not inadvertently pack tasks that would end up overloading the target CPU.
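To make those thresholds concrete, here is a minimal userspace sketch of the eligibility test; only the 20% and 10ms figures come from the patch set, while the structure and field names are hypothetical stand-ins for the kernel's per-entity load-tracking statistics:

```c
/* Illustrative model of the small-task test, not the actual patch code. */
#include <stdbool.h>
#include <stdio.h>

struct task_stats {
	unsigned int load_pct;     /* average share of CPU time, 0-100 */
	unsigned int max_run_ms;   /* longest recent run, in milliseconds */
};

static bool is_packable(const struct task_stats *ts)
{
	/* Only tasks using no more than 20% of a CPU... */
	if (ts->load_pct > 20)
		return false;
	/* ...and never running more than 10ms at a stretch qualify. */
	return ts->max_run_ms <= 10;
}

int main(void)
{
	struct task_stats timer_tick = { .load_pct = 2,  .max_run_ms = 1 };
	struct task_stats encoder    = { .load_pct = 85, .max_run_ms = 40 };

	printf("timer_tick packable: %d\n", is_packable(&timer_tick));
	printf("encoder packable:    %d\n", is_packable(&encoder));
	return 0;
}
```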
The actual packing happens when a small task wakes up, since such tasks are likely to be sleeping when the scheduler's periodic load balancing happens. Only one possible destination CPU is checked — the "buddy" assigned to each CPU at system initialization time. If the buddy is lightly loaded, the small task will be moved in that direction. Since the buddy relationship is one-way, tasks will tend to migrate toward lower-numbered CPUs. As a result, packing may happen slowly, but, since only one CPU needs to be checked at wakeup time, the overhead of the actual packing decision is low.
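The wakeup-time decision might be sketched as follows; the buddy table, the load figures, and the 50% "lightly loaded" threshold are all invented for illustration, but the one-hop-toward-CPU-0 structure follows the description above:

```c
/* Illustrative model of buddy-based packing at wakeup time. */
#include <stdio.h>

#define NR_CPUS 4

static int buddy[NR_CPUS]    = { 0, 0, 1, 2 };     /* set up at boot */
static int cpu_load[NR_CPUS] = { 10, 30, 80, 5 };  /* percent busy */

static int select_wakeup_cpu(int prev_cpu, int task_is_small)
{
	int b = buddy[prev_cpu];

	/* Small tasks are nudged one hop toward CPU 0, but only if the
	 * buddy has room; everything else wakes where it last ran. */
	if (task_is_small && b != prev_cpu && cpu_load[b] < 50)
		return b;
	return prev_cpu;
}

int main(void)
{
	/* CPU 2's buddy (CPU 1) is lightly loaded: the task migrates. */
	printf("small task on CPU 2 -> CPU %d\n", select_wakeup_cpu(2, 1));
	/* CPU 3's buddy (CPU 2) is too busy: the task stays put. */
	printf("small task on CPU 3 -> CPU %d\n", select_wakeup_cpu(3, 1));
	return 0;
}
```

Note how a task may take several wakeups to reach CPU 0: each decision moves it at most one hop, which is exactly what keeps the per-wakeup cost low.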
Power-aware scheduling
Alex Shi's power-aware scheduling patch set (last looked at in August 2012) takes a more comprehensive approach to the same problem. As with Vincent's patch set, one of the goals is to collect small tasks onto a relatively small set of CPUs. Once again, small tasks are migrated toward common CPUs at wakeup (or fork()) time. In this case, though, there are no "buddy" CPUs; instead, the migration code will examine all CPUs on the system, looking for the busiest CPU that still has some free time available.
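In sketch form, that search might look like the following; the loads, the capacity scale, and the function names are illustrative rather than drawn from the patch itself:

```c
/* Illustrative model: pick the busiest CPU that still has room. */
#include <stdio.h>

#define NR_CPUS  4
#define CAPACITY 100  /* treat each CPU as 100% */

static int cpu_load[NR_CPUS] = { 15, 70, 40, 0 };

static int select_packing_cpu(int task_load)
{
	int best = -1;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		/* Skip CPUs that this task would overload. */
		if (cpu_load[cpu] + task_load > CAPACITY)
			continue;
		/* Among the rest, prefer the most loaded one. */
		if (best < 0 || cpu_load[cpu] > cpu_load[best])
			best = cpu;
	}
	return best;
}

int main(void)
{
	/* A 10% task lands on CPU 1, the busiest CPU with room to spare. */
	printf("10%% task -> CPU %d\n", select_packing_cpu(10));
	/* A 40% task cannot fit on CPU 1, so CPU 2 is chosen instead. */
	printf("40%% task -> CPU %d\n", select_packing_cpu(40));
	return 0;
}
```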
This patch set also works at the load-balancing level. When load balancing occurs, the code looks for the non-idle CPU with the lowest load, on the assumption that such a CPU could be made to go idle if its load were moved elsewhere. If a sufficiently lightly loaded CPU can be found, its processes will be pushed toward more heavily loaded CPUs (though always to those that have some spare capacity available). As with Vincent's patch, some new statistics must be kept to help inform the process-migration decisions. One significant difference is that Vincent's buddy scheme explicitly tries to move processes to CPUs driven by a different power line (so that the vacated CPUs can be powered down); Alex's patch, instead, collects processes onto a smaller number of CPUs without worrying about the power-line topology.
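A rough model of the balance-time side, again with invented names and numbers, is to find the non-idle CPU with the lowest load and check whether the remaining busy CPUs have enough spare capacity to absorb its work:

```c
/* Illustrative model: can the least-loaded busy CPU be emptied? */
#include <stdio.h>

#define NR_CPUS  4
#define CAPACITY 100

static int cpu_load[NR_CPUS] = { 90, 10, 60, 0 };

static int find_vacatable_cpu(void)
{
	int victim = -1, spare = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu_load[cpu] == 0)
			continue;  /* already idle: nothing to gain */
		if (victim < 0 || cpu_load[cpu] < cpu_load[victim])
			victim = cpu;
	}
	if (victim < 0)
		return -1;

	/* Is there room on the other busy CPUs for the victim's load? */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != victim && cpu_load[cpu] > 0)
			spare += CAPACITY - cpu_load[cpu];

	return spare >= cpu_load[victim] ? victim : -1;
}

int main(void)
{
	int cpu = find_vacatable_cpu();

	if (cpu >= 0)
		printf("CPU %d (load %d%%) can be emptied\n",
		       cpu, cpu_load[cpu]);
	return 0;
}
```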
Another important difference is that, when the system becomes sufficiently loaded, Alex's patch set gives up on coalescing small tasks and switches over to the "performance" mode. In that mode, the active heuristic tries to spread tasks out to get the work done as quickly as possible. It is, in a sense, an implementation of the old "race to idle" idea; once there is a certain amount of work to do, it is best to simply throw all available resources at getting it done quickly so that the system can go back to sleep. As one might imagine, there is an inherent hazard to toggling between two modes that, respectively, gather tasks together and spread them out. To avoid excessive thrashing, there is some hysteresis built into the algorithm to keep it from switching between modes or moving processes around too often.
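A toy version of such a mode switch might look like the following; the 80% and 60% watermarks are invented for the example, but the two-threshold structure is what provides the hysteresis that prevents thrashing around a single cutoff:

```c
/* Illustrative model of a two-mode switch with hysteresis. */
#include <stdio.h>

enum sched_mode { POWERSAVE, PERFORMANCE };

static enum sched_mode mode = POWERSAVE;

static void update_mode(int utilization_pct)
{
	if (mode == POWERSAVE && utilization_pct > 80)
		mode = PERFORMANCE;       /* race to idle */
	else if (mode == PERFORMANCE && utilization_pct < 60)
		mode = POWERSAVE;         /* back to packing */
	/* between 60% and 80%, keep the current mode: hysteresis */
}

int main(void)
{
	int samples[] = { 50, 75, 85, 75, 70, 55 };

	for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		update_mode(samples[i]);
		printf("util %2d%% -> %s\n", samples[i],
		       mode == POWERSAVE ? "powersave" : "performance");
	}
	return 0;
}
```

With a single threshold, a load hovering near the cutoff would flip the mode on every sample; with two, the load must move a full 20 points before the policy reverses itself.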
Heterogeneous systems
The above discussion assumes that all CPUs in a system are the same in every respect except their power connections. ARM's "big.LITTLE" architecture violates that assumption by packaging Cortex A15 processors (which are fast and power-hungry) on the same chip as Cortex A7 processors (which are slower but power-efficient). The Linux CPU scheduler was not written with such systems in mind, so it is not surprising that it does not make the best decisions when running in that environment. The full solution to this problem is involved and, arguably, not fully understood at this time. But one can take small steps in the right direction as a way of getting closer to running optimally on heterogeneous systems.
Morten Rasmussen's "task placement on mixed cpu_power systems" patch set represents a couple of those small steps. Rather than concern itself with small tasks, though, this patch set works with the larger processes running in the system. In particular, if a low-powered CPU is running without idle time while a high-powered CPU sits available, the code will attempt to move the most CPU-hungry process over to the high-powered CPU. In this way, the more compute-intensive jobs will run more quickly, which is good for performance; by allowing the system to go idle sooner, it should also provide better power utilization.
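As a sketch, under the simplifying assumption of one little and one big CPU with percentage-based idle accounting (all names and figures invented), the candidate selection might look like:

```c
/* Illustrative model: pick the hungriest task on a saturated little CPU. */
#include <stdio.h>

struct cpu {
	int is_big;    /* 1 for A15-class, 0 for A7-class */
	int idle_pct;  /* recent idle time */
};

struct task {
	const char *name;
	int load;      /* tracked load; higher means hungrier */
};

static int pick_migration_candidate(const struct cpu *little,
				    const struct cpu *big,
				    const struct task *tasks, int n)
{
	int best = -1;

	/* Only act when the little CPU is saturated and the big one isn't. */
	if (little->idle_pct > 0 || big->idle_pct == 0)
		return -1;

	for (int i = 0; i < n; i++)
		if (best < 0 || tasks[i].load > tasks[best].load)
			best = i;
	return best;
}

int main(void)
{
	struct cpu little = { .is_big = 0, .idle_pct = 0 };
	struct cpu big    = { .is_big = 1, .idle_pct = 100 };
	struct task rq[]  = {
		{ "logger", 5 }, { "codec", 70 }, { "ui", 25 },
	};
	int victim = pick_migration_candidate(&little, &big, rq, 3);

	if (victim >= 0)
		printf("move '%s' to the big CPU\n", rq[victim].name);
	return 0;
}
```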
Benchmarks provided with the patch set show that, indeed, throughput improves considerably when more CPU-intensive tasks are steered toward the more capable CPUs. Things improve even more when a rather perverse behavior in the current scheduler is fixed. The scheduler already tracks the "power" of each CPU; that value is used to direct tasks toward CPUs (and higher-level scheduling domains) that have the most available power. In the face of CPUs with significantly different capabilities, though, the scheduler may see a big CPU as having more available power than a small one even if the big CPU is already loaded and the small CPU is idle. It could, consequently, add more tasks to the loaded CPU, leaving the idle CPU with no work to do. Morten's patch set addresses this problem by making a point of spreading tasks across the available CPUs before piling more work onto the bigger ones on the theory that throughput will be maximized in that way.
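A worked example may help show the misbehavior. Assume the big CPU's capacity is rated at 1024 (matching the scheduler's usual 1024-based power scale) and the little CPU's at 512 (an invented figure); comparing raw capability sends new work to the big CPU even when it is nearly full, while comparing remaining capacity picks the idle little CPU:

```c
/* Illustrative arithmetic only; not the scheduler's actual code. */
#include <stdio.h>

int main(void)
{
	int big_power    = 1024, big_load    = 900;  /* big CPU, nearly full */
	int little_power = 512,  little_load = 0;    /* little CPU, idle */

	/* Naive comparison: raw capability only. */
	int naive = big_power > little_power ? 0 : 1;

	/* Better: compare how much capacity is actually left. */
	int big_free    = big_power - big_load;        /* 124 */
	int little_free = little_power - little_load;  /* 512 */
	int better = big_free >= little_free ? 0 : 1;

	printf("naive choice:  %s CPU\n", naive == 0 ? "big" : "little");
	printf("better choice: %s CPU\n", better == 0 ? "big" : "little");
	return 0;
}
```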
Conclusion
All of the patch sets described above have been circulating for some time, but none seem set to be merged for 3.10. Part of the problem is that changing heuristic-heavy core code is always hard; the maintainers want a high level of assurance that such changes will not cause regressions for other workloads. Some of the trouble is political; there are some big.LITTLE precursor patches that are currently blocked for non-technical reasons that have yet to be worked out. And part of the problem is just to be expected when a new batch of developers from a different part of the industry start to work on code that has long been managed by developers concerned mostly with enterprise computing.
In a sense, the challenges for scheduling have changed in a subtle but important way. Some years back, the problem came down to deciding which task(s) to run next, and for how long. The completely fair scheduler has, to a first approximation, solved that problem well enough. Now, rather than worrying about which processes to run, scheduler developers must concern themselves with where those tasks should run in a large and complex system. In the process, they have to try to maximize both throughput and power efficiency. Often, those two goals are complementary: the fastest way to run a job tends to be the most efficient. But sometimes the two goals conflict. It is going to be interesting to see how the scheduler developers solve those problems in a way that scales across a wide range of workloads.