The Linux CPU scheduler has come a long way since the early 2.6 days, when
it was the cause for quite a bit of worry. Scheduling domains fixed many
of the problems on larger systems, while a whole set of interactivity
heuristics made desktops work better. The interactivity work, in
particular, is based on the notion of a "sleep average." Any process which
spends a significant amount of its time sleeping, relative to the time it
runs, is deemed to be "interactive" and is given a higher priority.
This mechanism works well enough that few people complain about interactive
response with current 2.6 kernels. Every now and then, however, somebody
comes up with a workload which manages to fool the scheduler and bring the
desktop to a halt. Mike Galbraith has been chasing down a few of these,
producing patches in the process which should help to mitigate the
The Linux scheduler maintains two "arrays" of run queues for each
processor. When a process starts out, it is given a time slice and put
onto the "active" array, where it can compete for the CPU. Once the time
slice runs out, that process will move over to the "expired" array, where
it languishes until all other runnable processes have used up their time
slices. Once all processes are on the expired array, the two arrays are
switched and the process begins again.
There is an exception, however, in the 2.6 kernel: a process which is
deemed to be interactive (because it spends enough time in interruptible
sleeps) will, on expiration of its time slice, be put back onto the active
array. As a result, an interactive process should not have to wait while
some long-running batch process cranks through its time slice. To keep
this mechanism from blocking out expired processes altogether, however, the
scheduler checks to see if the processes in the expired array have been
waiting for too long. Once the starvation threshold has been exceeded, all processes
go to the expired array at the end of their slices, allowing the scheduler
to perform the array switch in the relatively near future.
Mike found that, on a system with a heavily-loaded Apache server running,
tasks could find themselves starved for long periods of time; it seems that
the starvation-avoidance logic was not working right. The problem turned
out to be in the wakeup code. That code was always putting
freshly-awakened processes onto the active array, regardless of what was
going on elsewhere in the system. With a large number of server processes
being continually awakened as requests came in, the scheduler was never
able to switch arrays. The fix was to put
the starvation test into __activate_task(); as a result, when
expired processes are starving, processes will be awakened onto the expired
array. That small fix fixed much of the problem.
A fuller fix, however, involves the task throttling patch which Mike
has been working on for some time. There's a number of fixes involved in
this work, but the core observation is this: the "sleep average" code can
be too generous to processes which sleep only part of the time. A process
which manages regular, short sleeps can boost its priority significantly,
to the point that it can force out other processes running on the system.
And once a process obtains an interactivity bonus, it can keep it for some
time. This behavior is all by design; some interactive programs can sit
for a very long time, then perform some serious processing for a while.
Think about the X server with that nice compositing window manager; it
spends quite a bit of time idle, only to pin the CPU when the user starts
dragging windows around. But this behavior can also give an interactive
priority bonus to processes which are not truly interactive.
The solution here involves a few changes. One of them is to simply be a
bit less generous with the interactivity bonuses. But the core of the
patch is a function called refresh_timeslice(). This function
looks at the current sleep average, and compares it to the amount of time
that the process is actually spending in the CPU. Based on this
comparison, a per-process throttle time is adjusted. If more CPU time is
being used than would be suggested by the sleep average, the throttle time
is moved backward; otherwise it moves forward. If a process runs into the
throttle time, its sleep average starts to decay quickly, depriving it of
its interactivity bonus.
The throttle time provides a grace period which allows processes to use
short bursts of CPU time without being penalized. The amount of grace time
can be adjusted by way of a pair of knobs exported by the throttling code.
"Grace 1" is the amount of time new processes get to establish their
averages before being exposed to the throttling mechanism, while
"grace 2" is how long a process can run above its expected CPU usage
before the throttle kicks in. There have been some objections to the
addition of these knobs; they look like another obscure set of kernel
tunables that most administrators will not know how to set properly. So
there has been a push for the knobs to be replaced with a simple on/off
switch. Systems meant for interactive use would leave the throttling on,
while server systems would simply turn it all off. Working this issue out
may delay the acceptance of this patch, though there seems to be little
disagreement with the rest of it.
to post comments)