By Jonathan Corbet
January 9, 2013
The Linux kernel's CPU scheduler has a challenging task: it must allocate access
to the system's processors in a way that is fair and responsive while
maximizing system throughput and minimizing power consumption. Users
expect these results regardless of the characteristics of their own
workloads, and regardless of the fact that those objectives are often in
conflict with each other. So it is not surprising that the kernel has been
through a few CPU schedulers over the years. That said, things have
seemed relatively stable in recent times; the current "completely
fair scheduler" (CFS) was merged for 2.6.23 in 2007. But, behind that
apparent stability, a lot has changed, and one of the most significant
changes in some time was merged for the 3.8 release.
Perfect scheduling requires a crystal ball; when the kernel knows exactly
what demands every process will make on the system and when, it can schedule
those processes optimally. Unfortunately, hardware manufacturers continue
to push affordable prediction-offload devices back in their roadmaps, so
the scheduler has to be able to muddle through in their absence. Said
muddling tends to be based on the information that is actually available,
with each process's past performance being at the top of the list. But,
interestingly, while the kernel closely tracks how much time each process
actually spends running, it does not have a clear idea of how much each
process is contributing to the load on the system.
One might well ask whether there is a difference between "CPU time
consumed" and "load." The answer, at least as embodied in Paul Turner's
per-entity load tracking patch set, which
was merged for 3.8, would
appear to be "yes." A process can contribute to load even if it is not
actually running at the moment; a process waiting for its turn in the CPU
is an example. "Load" is also meant to be an instantaneous quantity — how
much is a process loading the system right now? — as opposed to a
cumulative property like CPU usage. A long-running process that consumed
vast amounts of processor time last week may have very modest needs at the
moment; such a process is contributing very little to load now, despite its
rather more demanding behavior in the past.
The CFS scheduler (in 3.7 and prior kernels) tracks load on a per-run-queue
basis. It's worth noting that "the" run queue in CFS is actually a long
list of queues; at a minimum, there is one for each CPU. When group
scheduling is in use, though, each control group has its own per-CPU run
queue array. Occasionally the scheduler will account for how much each run
queue is contributing to the load on the system as a whole. Accounting at
that level is sufficient to help the group scheduler allocate CPU time
between control groups, but it leaves the system as a whole unaware of
exactly where the current load is coming from. Tracking load at the run queue
level also tends to yield widely varying estimates even when the workload is
relatively stable.
Toward better load tracking
Per-entity load tracking addresses these problems by pushing this tracking
down to the level of
individual "scheduling entities" — a process or a control group full of
processes. To that end, (wall clock) time is viewed as a sequence of 1ms
(actually, 1024µs) periods. An entity's contribution to the system load in
a period pi is just the portion of that period that the
entity was runnable — either actually running, or waiting for an available
CPU. The trick, though, is to get an idea of contributed load that covers
more than 1ms of real time; this is managed by adding in a decayed version
of the entity's previous contribution to system load. If we let
Li designate the entity's load contribution in period
pi, then an entity's total contribution can be expressed
as:
L = L0 + L1*y +
L2*y2 + L3*y3 + ...
Where y is the decay factor chosen. This formula gives the most
weight to the most recent load, but allows past load to influence the
calculation in a decreasing manner. The nice thing about this series is
that it is not actually necessary to keep an array of past load
contributions; simply multiplying the previous period's total load
contribution by y and adding the new L0 is
sufficient.
In the current code, y has been chosen so that y32
is equal to 0.5 (though, of course, the calculation is done with integer
arithmetic in the kernel). Thus, an entity's load contribution 32ms in the past is
weighted half as strongly as its current contribution.
Once we have an idea of the load contributed by runnable processes, that
load can be propagated upward to any containing control groups with a
simple sum. But, naturally, there are some complications. Calculating the
load contribution of runnable entities is easy, since the scheduler has to
deal with those entities on a regular basis anyway. But non-runnable
entities can also contribute to load; the fact a password cracker is
currently waiting on a page fault does not change the fact that it may be
loading the system heavily. So there needs to be a way of tracking the
load contribution of processes that, by virtue of being blocked, are not
currently the scheduler's concern.
One could, of course, just iterate through all those processes, decay their
load contribution as usual, and add it to the total. But that would be a
prohibitively expensive thing to do. So, instead, the 3.8 scheduler will
simply maintain a separate sum of the "blocked load" contained in each
cfs_rq (control-group run queue) structure. When a process
blocks, its load is subtracted from the total runnable load value and
added to the blocked load instead. That load can be decayed in the same
manner (by multiplying it by y each period). When a blocked process
becomes runnable again, its (suitably decayed) load is transferred back to
the runnable load. Thus, with a bit of accounting during process state
transitions, the scheduler can track load without having to worry about
walking through a long list of blocked processes.
Another complication is throttled processes — those that are running under
the CFS bandwidth controller and have used
all of the CPU time available to them in the current period. Even if those
processes wish to run, and even if the CPU is otherwise idle, the scheduler
will pass them over. Throttled processes thus
cannot contribute to load, so removing their contribution while they
languish makes sense. But allowing their load contribution to decay while
they are waiting to be allowed to run again would tend to skew their
contribution downward. So, in the throttled case, time simply stops for
the affected processes until they emerge from the throttled state.
What it is good for
The end result of all this work is that the scheduler now has a much
clearer idea of how much each process and scheduler control group is
contributing to the load on the system — and it has all been achieved
without an increase in scheduler overhead. Better statistics are usually
good, but one might wonder whether this information is truly useful for the
scheduler.
It does seem that some useful things can be done with a better idea of an
entity's load contribution. The most obvious target is likely to be load
balancing: distributing the processes on the system so that each CPU is
carrying roughly the same load. If the kernel knows how much each process
is contributing to system load, it can easily calculate the effect of
migrating that process to another CPU. The result should be more accurate,
less error-prone load balancing. There are some patches in circulation that make use of
load tracking to improve the scheduler's load balancer; something will
almost certainly make its way toward the mainline in the near future.
Another feature needing per-entity load tracking is the small-task packing patch. The goal here is to
gather "small" processes onto a small number of CPUs, allowing other
processors in the system to be powered down. Clearly, this kind of
gathering requires a reliable indicator of which processes are "small";
otherwise, the system is likely to end up in a highly unbalanced state.
Other subsystems may also be able to use this information; CPU frequency
and power governors should be able to make better guesses about how much
computing power will be needed in the near future, for example. Now that
the infrastructure is in place, we are likely to see a number of developers
using per-entity load information to optimize the behavior of the system.
It is still not a crystal ball with a view into the future, but, at least,
we now have a better understanding of the present.
(
Log in to post comments)