Solving starvation problems in the scheduler

[Posted March 22, 2006 by corbet]

The Linux CPU scheduler has come a long way since the early 2.6 days, when it was the cause for quite a bit of worry. Scheduling domains fixed many of the problems on larger systems, while a whole set of interactivity heuristics made desktops work better. The interactivity work, in particular, is based on the notion of a "sleep average." Any process which spends a significant amount of its time sleeping, relative to the time it runs, is deemed to be "interactive" and is given a higher priority.

This mechanism works well enough that few people complain about interactive response with current 2.6 kernels. Every now and then, however, somebody comes up with a workload which manages to fool the scheduler and bring the desktop to a halt. Mike Galbraith has been chasing down a few of these, producing patches in the process which should help to mitigate the problems.

The Linux scheduler maintains two "arrays" of run queues for each processor. When a process starts out, it is given a time slice and put onto the "active" array, where it can compete for the CPU. Once the time slice runs out, that process will move over to the "expired" array, where it languishes until all other runnable processes have used up their time slices. Once all processes are on the expired array, the two arrays are switched and the process begins again.

There is an exception, however, in the 2.6 kernel: a process which is deemed to be interactive (because it spends enough time in interruptible sleeps) will, on expiration of its time slice, be put back onto the active array. As a result, an interactive process should not have to wait while some long-running batch process cranks through its time slice. To keep this mechanism from blocking out expired processes altogether, however, the scheduler checks to see if the processes in the expired array have been waiting for too long. Once the starvation threshold has been exceeded, all processes go to the expired array at the end of their slices, allowing the scheduler to perform the array switch in the relatively near future.

Mike found that, on a system with a heavily-loaded Apache server running, tasks could find themselves starved for long periods of time; it seems that the starvation-avoidance logic was not working right. The problem turned out to be in the wakeup code. That code was always putting freshly-awakened processes onto the active array, regardless of what was going on elsewhere in the system. With a large number of server processes being continually awakened as requests came in, the scheduler was never able to switch arrays. The fix was to put the starvation test into __activate_task(); as a result, when expired processes are starving, processes will be awakened onto the expired array. That small fix fixed much of the problem.

A fuller fix, however, involves the task throttling patch which Mike has been working on for some time. There's a number of fixes involved in this work, but the core observation is this: the "sleep average" code can be too generous to processes which sleep only part of the time. A process which manages regular, short sleeps can boost its priority significantly, to the point that it can force out other processes running on the system. And once a process obtains an interactivity bonus, it can keep it for some time. This behavior is all by design; some interactive programs can sit for a very long time, then perform some serious processing for a while. Think about the X server with that nice compositing window manager; it spends quite a bit of time idle, only to pin the CPU when the user starts dragging windows around. But this behavior can also give an interactive priority bonus to processes which are not truly interactive.

The solution here involves a few changes. One of them is to simply be a bit less generous with the interactivity bonuses. But the core of the patch is a function called refresh_timeslice(). This function looks at the current sleep average, and compares it to the amount of time that the process is actually spending in the CPU. Based on this comparison, a per-process throttle time is adjusted. If more CPU time is being used than would be suggested by the sleep average, the throttle time is moved backward; otherwise it moves forward. If a process runs into the throttle time, its sleep average starts to decay quickly, depriving it of its interactivity bonus.

The throttle time provides a grace period which allows processes to use short bursts of CPU time without being penalized. The amount of grace time can be adjusted by way of a pair of knobs exported by the throttling code. "Grace 1" is the amount of time new processes get to establish their averages before being exposed to the throttling mechanism, while "grace 2" is how long a process can run above its expected CPU usage before the throttle kicks in. There have been some objections to the addition of these knobs; they look like another obscure set of kernel tunables that most administrators will not know how to set properly. So there has been a push for the knobs to be replaced with a simple on/off switch. Systems meant for interactive use would leave the throttling on, while server systems would simply turn it all off. Working this issue out may delay the acceptance of this patch, though there seems to be little disagreement with the rest of it.

Index entries for this article
Kernel	Interactivity
Kernel	Scheduler

Solving starvation problems in the scheduler

Posted Mar 24, 2006 0:34 UTC (Fri) by xorbe (guest, #3165) [Link] (2 responses)

I apologize, but the whole Linux scheduler strikes me as so messed up. I have a 2.5GHz A64 w/2GB DDR500 that posts 7.8GB/s bandwidth in RMMA benchmark, with a recent hdd running 64-bit Linux -- and silly things like "ls" in the Mandriva cooker folder or copying to a USB2 drive bring the machine to a halt otherwise. My WinXP partition provides better interactivity, and it's starting to burn. I don't know what the problem is in the 2.6 series, but I wish it would get resolved soon.

Solving starvation problems in the scheduler

Posted Mar 24, 2006 21:09 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link] (1 responses)

Could it be possible that you are running into a distribution specific issue rather than a kernel problem?

Solving starvation problems in the scheduler

Posted Apr 20, 2006 21:02 UTC (Thu) by jospoortvliet (guest, #33164) [Link]

well, there simply ARE problems with the kernel scheduler... tough he
might have more problems with the I/O scheduler (he should try CFQ...), as
i said some time ago (below here), the cpu scheduler needs some work, too.
and imho id needs removal... and replacement.

Solving starvation problems in the scheduler

Posted Mar 26, 2006 20:07 UTC (Sun) by jospoortvliet (guest, #33164) [Link] (3 responses)

as someone above said, this scheduler often comes up kind'a badly in the
news. really feels like a big hack... if you compare the design with dr.
Con Kolivas' [1] staircase scheduler - i really feel the kernel developers
should think about merging it.

its designed with interactivity in mind, not hacked on. all this fiddling
to work around a (seemingly) wrong design - it doesn't sound like the
right thing to do... and if it saves hundreds of lines of code, and
generally seems to perform better [2], wouldn't it be smarter to get it
in, even as an alternative choice, for even wider testing (altough it is
quite mature, after all these years of hard work)?

[1] http://members.optusnet.com.au/ckolivas/kernel/
[2] http://bhhdoa.org.au/pipermail/ck/2006-March/005693.html

Solving starvation problems in the scheduler

Posted Mar 31, 2006 0:06 UTC (Fri) by cyrus (subscriber, #36858) [Link] (2 responses)

What are you talking about? This article refers to the CPU SCHEDULER, which decides when which process is about to run. The Staircase scheduler from Con is an I/O scheduler, which controls a block device. (When to access disk and stuff..)

Staircase scheduler

Posted Mar 31, 2006 0:27 UTC (Fri) by corbet (editor, #1) [Link] (1 responses)

No, actually, the staircase scheduler is a CPU scheduler. See this LWN article from 2004.

Staircase scheduler

Posted Mar 31, 2006 1:09 UTC (Fri) by cyrus (subscriber, #36858) [Link]

Oh.. good to know, sorry for my previous post then, my fault.