By Jonathan Corbet
April 10, 2013
When the system is idle, Linux is quite good at putting the processors into
a deep sleep state and keeping them there; when we do nothing, we do it as
efficiently as anybody. On a heavily loaded system, about the only thing
to be done to save power is to get the work done as quickly as possible;
again, Linux does well in this situation. In cases where the system is
lightly loaded, though, the scheduler does not always make the right
decisions; that situation can only get worse as the hardware gets more
complicated. A few developers are working to improve the power efficiency
of the kernel's scheduler, but it will probably be a while before this work
makes it into the mainline.
Small-task packing
Vincent Guittot's "packing small tasks" patch set was examined here in October 2012. Development
continues with this patch set; version 3 was posted in late March.
The idea behind small-task packing is to sweep processes that only run
occasionally "to the left" of each scheduling domain, and, eventually,
toward CPU 0. Collecting these tasks onto a relatively small number
of processors allows them to be run more efficiently while keeping them
from pulling other CPUs out of the idle state. As long as the destination
CPUs are not overloaded with small tasks, the result should be better power
utilization and no real change in throughput.
The patch set is now able to take advantage of the per-entity load tracking
feature, which was merged for the 3.8 kernel. Better load tracking allows the
easy identification of the small tasks in the system, and a more precise
characterization of just how small they are. So Vincent's patch is now able to
pick only those tasks which take no more than 20% of the available CPU
time; additionally, statistics are kept so that only tasks which run for a
maximum of 10ms at a time are eligible for packing. These heuristics
attempt to ensure that the scheduler will not inadvertently pack tasks that
will end up overloading the target CPU.
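In rough, illustrative C, the eligibility test amounts to something like the sketch below; the structure and field names are invented here for clarity and are not the patch's actual interfaces, which derive their numbers from the load-tracking statistics:

    #include <stdbool.h>

    /* Hypothetical per-task statistics, standing in for the real
     * per-entity load-tracking data. */
    struct task_stats {
        unsigned int avg_load_pct;  /* share of available CPU time, in percent */
        unsigned int max_run_ms;    /* longest recent run before sleeping */
    };

    static bool is_packable_small_task(const struct task_stats *ts)
    {
        /* Eligible only if the task uses at most 20% of the CPU... */
        if (ts->avg_load_pct > 20)
            return false;
        /* ...and never runs for more than 10ms at a stretch. */
        if (ts->max_run_ms > 10)
            return false;
        return true;
    }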
The actual packing happens when a small task wakes up, since such tasks are
likely to be sleeping when the scheduler's periodic load balancing
happens. Only one possible destination CPU is checked — the "buddy"
assigned to each CPU at system initialization time. If the buddy is
lightly loaded, the small task will be moved in that direction. Since the
buddy relationship is one-way, tasks will tend to migrate toward
lower-numbered CPUs. As a result, packing may happen slowly, but, since
only one CPU needs to be checked at wakeup
time, the overhead of the actual packing decision is low.
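A schematic version of that wakeup-time decision, with an invented "lightly loaded" cutoff and invented names, might look like this:

    #include <stdbool.h>

    #define NR_CPUS         8
    #define LIGHT_LOAD_PCT  50      /* hypothetical cutoff for a "light" buddy */

    static int buddy_cpu[NR_CPUS];          /* set once at initialization time */
    static unsigned int cpu_load_pct[NR_CPUS];

    static int select_cpu_at_wakeup(int prev_cpu, bool small_task)
    {
        int buddy = buddy_cpu[prev_cpu];

        /* Only small tasks are packed; everything else stays put. */
        if (!small_task || buddy == prev_cpu)
            return prev_cpu;

        /* Move toward the buddy only if it still has room. */
        if (cpu_load_pct[buddy] < LIGHT_LOAD_PCT)
            return buddy;

        return prev_cpu;
    }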
Power-aware scheduling
Alex Shi's power-aware scheduling patch set
(last looked at in August 2012) takes
a more comprehensive approach to the same problem. As with Vincent's patch
set, one of the goals is to collect small tasks onto a relatively small set
of CPUs. Once again, small tasks are migrated toward common CPUs at wakeup
(or fork()) time. In this case, though, there are no "buddy"
CPUs; instead, the migration code will examine all CPUs on the system,
looking for the busiest CPU that still has some free time available.
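As a rough illustration (and not Alex Shi's actual code), that selection could be written as a simple scan; the percentage-based load metric and capacity check are placeholders for the patch's real statistics:

    /* Return the busiest CPU that can still absorb the waking task's
     * load, or -1 if every CPU would be overloaded. */
    static int find_busiest_nonfull_cpu(const unsigned int *load_pct, int nr_cpus,
                                        unsigned int task_load_pct)
    {
        int best = -1;
        unsigned int best_load = 0;

        for (int cpu = 0; cpu < nr_cpus; cpu++) {
            /* Skip CPUs that the new task would overload. */
            if (load_pct[cpu] + task_load_pct > 100)
                continue;
            /* Among those with room, prefer the most heavily loaded. */
            if (best < 0 || load_pct[cpu] > best_load) {
                best = cpu;
                best_load = load_pct[cpu];
            }
        }
        return best;
    }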
This patch set also works at the load-balancing level. When load balancing
occurs, the code will look for the non-idle CPUs with the lowest load, on
the assumption that those CPUs could be made to go idle if the load were
moved elsewhere. If a sufficiently lightly loaded CPU can be found, its
processes will be pushed in the direction of more heavily loaded CPUs
(though always to those that have some spare capacity available). As with
Vincent's patch, some new statistics must be kept to help inform the
process migration decisions. One significant difference is that Vincent's
buddy scheme explicitly tries to move processes to CPUs driven by a
different power line (so that the vacated CPUs can be powered down).
Alex's patch, instead, collects processes onto a
smaller number of CPUs without worrying about the power-line topology.
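The load-balancing side can be sketched, again purely illustratively, as a pass that picks both a lightly loaded source CPU to drain and a busier destination that still has spare room; the 90% capacity cutoff is an invented number:

    struct migration_hint {
        int src;    /* CPU to try to empty, or -1 if none qualifies */
        int dst;    /* busier CPU with spare capacity, or -1 */
    };

    static struct migration_hint pick_migration(const unsigned int *load_pct,
                                                int nr_cpus)
    {
        struct migration_hint h = { .src = -1, .dst = -1 };
        unsigned int src_load = 100, dst_load = 0;

        for (int cpu = 0; cpu < nr_cpus; cpu++) {
            unsigned int load = load_pct[cpu];

            /* Source candidate: lightly loaded, but not already idle. */
            if (load > 0 && load < src_load) {
                h.src = cpu;
                src_load = load;
            }
            /* Destination candidate: busy, yet with capacity to spare. */
            if (load < 90 && load > dst_load) {
                h.dst = cpu;
                dst_load = load;
            }
        }
        if (h.src == h.dst)
            h.dst = -1;     /* nothing useful to migrate */
        return h;
    }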
Another important difference is that, when the system becomes sufficiently
loaded, Alex's patch set gives up on coalescing small tasks and switches
over to the "performance" mode. In that mode, the active heuristic tries
to spread tasks out to get the work done as quickly as possible. It is, in
a sense, an implementation of the old "race to idle" idea; once there is a
certain amount of work to do, it is best to simply throw all available
resources at getting it done quickly so that the system can go back to
sleep. As one might imagine, there is an inherent hazard to toggling
between two modes that, respectively, gather tasks together and spread them
out. To avoid excessive thrashing, there is some hysteresis built into the
algorithm to keep it from switching between modes or moving processes
around too often.
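In essence, the mode switch with hysteresis comes down to using different thresholds in each direction; the numbers below are invented to show the shape of the idea rather than values taken from the patch set:

    enum sched_policy { POWERSAVING, PERFORMANCE };

    static enum sched_policy update_policy(enum sched_policy cur,
                                           unsigned int system_load_pct)
    {
        /* Separate up/down trigger points keep a load hovering near a
         * single cutoff from flipping the mode on every check. */
        if (cur == POWERSAVING && system_load_pct > 80)
            return PERFORMANCE;
        if (cur == PERFORMANCE && system_load_pct < 60)
            return POWERSAVING;
        return cur;
    }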
Heterogeneous systems
The above discussion assumes that all CPUs in a system are the same in
every respect except their power connections. ARM's "big.LITTLE"
architecture violates that assumption by packaging Cortex A15 processors
(which are fast and power-hungry) on the same chip as Cortex A7's (which
are slow and power-efficient). The Linux CPU scheduler was not written
with such systems in mind, so it is not surprising that it does not make
the best decisions when running in that environment. The full solution to this problem is involved and,
arguably, not fully understood at this time. But one can try to make small
steps in the right direction as a way of getting closer to running optimally on
heterogeneous systems.
Morten Rasmussen's patch set for task placement on mixed
cpu_power systems represents a couple of those small steps. Rather
than concern itself with small tasks, though, this patch set works with the
larger processes running in the system. In particular, if a low-powered
CPU is running without idle time and there is a high-powered CPU available,
the code will attempt to move the most CPU-hungry process over to that
high-powered CPU. In this way, the more compute-intensive jobs will run
more quickly, which is good for performance; by allowing the system to go
idle, it should also provide better power utilization.
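A schematic version of that rule, assuming just two classes of CPU and field names invented for the purpose, might be:

    #include <stdbool.h>

    struct cpu_state {
        bool is_big;            /* Cortex-A15-class rather than A7-class */
        bool has_idle_time;
    };

    /* If the given little CPU is saturated, return a big CPU with idle
     * time to which its most CPU-hungry task could be moved; -1 if the
     * situation does not call for a migration. */
    static int pick_big_cpu_for(const struct cpu_state *cpus, int nr_cpus,
                                int little_cpu)
    {
        if (cpus[little_cpu].is_big || cpus[little_cpu].has_idle_time)
            return -1;

        for (int i = 0; i < nr_cpus; i++)
            if (cpus[i].is_big && cpus[i].has_idle_time)
                return i;

        return -1;
    }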
Benchmarks provided with the patch set show that, indeed, throughput
improves considerably when more CPU-intensive tasks are steered toward the
more capable CPUs. Things improve even more when a rather perverse
behavior in the current scheduler is fixed. The scheduler already tracks
the "power" of each CPU; that value is used to direct tasks toward CPUs
(and higher-level scheduling domains) that have the most available power.
In the face of CPUs with significantly different capabilities, though, the
scheduler may see a big CPU as having more available power than a small
one even if the big CPU is already loaded and the small CPU is idle. It
could, consequently, add more tasks to the loaded CPU, leaving the idle CPU
with no work to do.
Morten's patch set addresses this problem by making a point of spreading
tasks across the available CPUs before piling more work onto the bigger
ones, on the theory that throughput will be maximized that way.
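The difference between the two placement rules can be sketched as follows; the capacity numbers and names are made up, but they show how an already-busy big CPU can still look better than an idle little one under the naive spare-power rule:

    struct cpu_info {
        unsigned int cpu_power; /* relative capacity, e.g. big=2048, little=1024 */
        unsigned int load;      /* current load in the same units (<= cpu_power) */
    };

    /* Naive rule: the largest remaining capacity wins, even if that CPU
     * is already busy. */
    static int pick_by_spare_power(const struct cpu_info *c, int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (c[i].cpu_power - c[i].load > c[best].cpu_power - c[best].load)
                best = i;
        return best;
    }

    /* Fixed rule: spread onto an idle CPU first, then fall back to the
     * spare-power comparison. */
    static int pick_spread_first(const struct cpu_info *c, int n)
    {
        for (int i = 0; i < n; i++)
            if (c[i].load == 0)
                return i;
        return pick_by_spare_power(c, n);
    }

With a loaded big CPU (capacity 2048, load 512) and an idle little CPU (capacity 1024, load 0), the first function still chooses the big CPU, while the second puts the new work on the idle one.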
Conclusion
All of the patch sets described above have been circulating for some time,
but none seem set to be merged for 3.10. Part of the problem is that
changing heuristic-heavy core code is always hard; the maintainers want a
high level of assurance that such changes will not cause regressions for
other workloads. Some of the trouble is political; there are some
big.LITTLE precursor patches that are currently blocked for non-technical reasons that have yet to be
worked out. And part of the problem is just to be expected when a new
batch of developers from a different part of the industry starts to work on
code that has long been managed by developers concerned mostly with
enterprise computing.
In a sense, the challenges for scheduling have changed in a subtle but
important way. Some years back, the problem came down to deciding which
task(s) to run next, and for how long. The completely fair scheduler has,
to a first approximation, solved that problem well enough. Now, rather
than worrying about which processes to run, scheduler developers are having
to concern themselves with where those tasks should run in a
large and complex system. In the process, they have to try to maximize
both throughput and power efficiency. Often, those two goals are
complementary: the fastest way to run a job tends to be the most
efficient. But sometimes the two goals conflict. It is going to be
interesting to see how the scheduler developers solve those problems in a
way that scales across a wide range of workloads.