
Group scheduling and alternatives

By Jonathan Corbet
December 6, 2010
The TTY-based group scheduling patch set has received a lot of discussion on LWN and elsewhere; some distributors are rushing out kernels with this code added, despite the fact that it has not yet been merged into the mainline. That patch has evolved slightly since it was last discussed here. There have also been some interesting conversations about alternatives; this article will attempt to bring things up to date.

The main change to the TTY-based group scheduling patch set is that it is, in fact, no longer TTY-based. The identity of the controlling terminal was chosen as a heuristic which could be used to group together tasks which should compete with each other for CPU time, but other choices are possible. An obvious possibility is the session ID. This ID is used to identify distinct process groups; a process starts a new session with the setsid() system call. Since sessions are already used to group together related processes, it makes sense to use the session ID as the key when grouping processes for scheduling. More recent versions of the patch do exactly that. The session-based group scheduling mechanism appears to be stabilizing; chances are good that it will be merged in the 2.6.38 merge window.
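
For those who have not run into sessions before: a process starts a new session, and becomes its leader, by calling setsid() (usually after fork(), since a process group leader cannot call it). A minimal sketch of a program putting a job into its own session - and thus, under the session-based patch, its own scheduling group - might look like:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t child = fork();

        if (child < 0) {
            perror("fork");
            exit(EXIT_FAILURE);
        }
        if (child == 0) {
            /* The child detaches into a new session; with session-based
               group scheduling it now competes as its own group. */
            if (setsid() == -1) {
                perror("setsid");
                exit(EXIT_FAILURE);
            }
            execlp("make", "make", "-j8", (char *) NULL);
            perror("execlp");
            exit(EXIT_FAILURE);
        }
        return 0;
    }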

Meanwhile, there have been a couple of discussions led by vocal proponents of other approaches to interactive scheduling. It is fair to say that neither is likely to find its way into the mainline. Both are worth a look, though, as examples of how people are thinking about the problem.

Colin Walters asked whether group scheduling could be tied into the "niceness" priorities which have been implemented by Unix and Linux schedulers for decades. People are used to nice, he said, but they would like it to work better. Creating groups for nice levels would help to make that happen. But Linus was not excited about this idea; he claimed that almost nobody uses nice now and that this is unlikely to change.

More to the point, though: the semantics implemented by nice are very different from those offered by group scheduling. The former is entirely priority-based, making the promise that processes with a higher "niceness" will get less processor time than those with lower values. Group scheduling, instead, is about isolation - keeping groups of processes from interfering with each other. The concept of priorities is poorly handled by group scheduling; it is simply not how that mechanism works. Group scheduling will not cause one set of processes to run in favor of another; it just ensures that the division of CPU time between the groups is fair.

Colin went on to suggest that using groups would improve nice, giving the results that users really want. But changing something as fundamental as the effects of niceness would be, in a very real sense, an ABI change. There may not be many users of nice, but installations which depend on it would not appreciate a change in its semantics. So nice will stay the way it is, and group scheduling will be used to implement different (presumably better) semantics.

The group scheduling discussion also featured a rare appearance by Con Kolivas. Con's view is that the session-based group scheduling patch is another attempt to put interactivity heuristics into the kernel - an approach which has failed in the past:

You want to program more intelligence in to work around these regressions, you'll just get yourself deeper and deeper into the same quagmire. The 'quick fix' you seek now is not something you should be defending so vehemently. The "I have a solution now" just doesn't make sense in this light. I for one do not welcome our new heuristic overlords.

Con's alternative suggestion was to put control of interactivity more directly into the hands of user space. He would attach a parameter to every process describing its latency needs. Applications could then be coded to communicate their needs to the kernel; an audio processing application would request the lowest latency, while make would inform the kernel that latency matters little. Con would also add a global knob controlling whether low-latency processes would also get more CPU time. The result, he says, would be to explicitly favor "foreground" processes (assuming those processes are the ones which request lower latency). Distributors could set up defaults for these parameters; users could change them, if they wanted to.

All of that, Con said, would be a good way to "move away from the fragile heuristic tweaks and find a longer term robust solution." The suggestion has not been particularly well received, though. Group scheduling was defended against the "heuristics" label; it is simply an implementation of the scheduling preferences established by the user or system administrator. The session-based component is just a default for how the groups can be composed; it may well be a better default than "no groups," which is what most systems are using now. More to the point, changing that default is easily done. Lennart Poettering's systemd-driven groups are an example; they are managed entirely from user space. Group scheduling is, in fact, quite easy to manage for anybody who wants to set up a different scheme.
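
As a rough illustration of how little is involved, here is a minimal sketch (assuming the "cpu" control-group hierarchy is mounted at /cgroup, which varies by distribution) of a program that creates a group, gives it a reduced share of the CPU, and moves itself into it; systemd and other user-space policies do essentially this on the user's behalf:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f;

        /* Create the group (a real program would check for EEXIST). */
        mkdir("/cgroup/background", 0755);

        /* Give it half the default weight when competing with other groups. */
        f = fopen("/cgroup/background/cpu.shares", "w");
        if (f) { fprintf(f, "512\n"); fclose(f); }

        /* Move this process (and its future children) into the group. */
        f = fopen("/cgroup/background/tasks", "w");
        if (f) { fprintf(f, "%d\n", (int) getpid()); fclose(f); }
        return 0;
    }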

So we'll probably not see Con's knobs added anytime soon - even if somebody does actually create a patch to implement them. What we might see, though, is a variant on that approach where processes could specify exact latency and CPU requirements. A patch for that does exist - it's called the deadline scheduler. If clever group scheduling turns out not to solve everybody's problem (likely - somebody always has an intractable problem), we might see a new push to get the deadline scheduling patches merged.

Index entries for this article
Kernel: Group scheduling
Kernel: Scheduler/Group scheduling



Interactive versus batch processes

Posted Dec 9, 2010 14:34 UTC (Thu) by epa (subscriber, #39769) [Link] (5 responses)

Con Kolivas's suggestion makes sense, and is mostly orthogonal to group scheduling. Clearly, the requirements for 'gcc' or 'tar' are quite different to those for interactive processes. From gcc's point of view it matters little whether it gets a second of CPU time in a single lump and is then suspended for a whole second. Throughput is important, latency is not. Even a simple flag for 'this is a batch process' would work.

Interactive versus batch processes

Posted Dec 9, 2010 16:16 UTC (Thu) by sync (guest, #39669) [Link] (4 responses)

> Even a simple flag for 'this is a batch process' would work.

Already exists (since 2.6.16): SCHED_BATCH
See sched_setscheduler(2), chrt(1)
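
For reference, a minimal sketch of doing the same thing from C (the static priority must be zero for SCHED_BATCH); from a shell, chrt --batch 0 <command> is roughly equivalent:

    #define _GNU_SOURCE         /* for SCHED_BATCH */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 0 };

        if (sched_setscheduler(0, SCHED_BATCH, &sp) == -1)
            perror("sched_setscheduler");
        /* ... then exec the batch job (compiler, tar, ...) ... */
        return 0;
    }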

Interactive versus batch processes

Posted Dec 9, 2010 18:15 UTC (Thu) by walters (subscriber, #7396) [Link] (3 responses)

Okay, so figuring out the shell-script usage of "chrt" was totally not obvious (the man page desperately needs examples). But here's the answer if you want your compile jobs to not take over the machine:

chrt --idle 0 ionice -c 3 make -j 64

Interactive versus batch processes

Posted Dec 10, 2010 14:51 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

-j 64 seems likely to slow your compiles down due to cache thrashing and/or simple swapping. -j (num of cores + a few) is generally recommended.

Interactive versus batch processes

Posted Dec 10, 2010 15:18 UTC (Fri) by walters (subscriber, #7396) [Link]

Right; actually, what I use is a little build tool "metabuild" here:

http://fedorapeople.org/gitweb?p=walters/public_git/homeg...

I picked a high number to emphasize the point basically, but yes, one needs to pick a good -j value.

make -j level

Posted Dec 10, 2010 20:59 UTC (Fri) by giraffedata (guest, #1954) [Link]

And if you have an exceptionally slow filesystem, also multiply by the expansion factor (single-thread total time / CPU time). On one system I use, with a single CPU, I found -j6 gave minimum elapsed time.

Group scheduling and alternatives

Posted Dec 9, 2010 16:38 UTC (Thu) by jwarnica (subscriber, #27492) [Link] (7 responses)

If we are going to expect applications to properly announce their requirements, we might as well get rid of the pesky OS and code to bare metal. Or for something less extreme, just slide back to cooperative multitasking.

It's like asking users what they want. Everything, now, for free. Of course! Thanks for that.

Group scheduling and alternatives

Posted Dec 9, 2010 22:48 UTC (Thu) by iabervon (subscriber, #722) [Link] (6 responses)

Just because the application asks for something doesn't mean it gets it. Currently, the kernel acts as if applications all want all of the processor time. But some applications actually only want some of the processor time. If an application can only make use of the first 1 us of every 1 ms, and asks to run only then, the kernel may be able to give it 100% of the time it wants without any system impact; if, on the other hand, it can't tell the kernel, it has to busy-wait through a lot more processor time to have any chance of being running then, loading the system much more heavily.

The right design is to assume that programs want everything, and let them say what they don't want. Then you don't give them anything they don't want. Then the usual fairness and best effort goals essentially work again: if you have a batch process and a realtime process of the same priority, it is equally bad to miss the realtime process's window once as to not run the batch process at all for 1 ms; that is, the scheduler should try equally hard to avoid either happening, and fail about equally often under random load. Of course, writing a scheduler that does this optimally is hard, but the theory shows that it is possible to give userspace controls such that a program can benefit by decreasing its demands on the system.

Group scheduling and alternatives

Posted Dec 10, 2010 1:23 UTC (Fri) by dtlin (subscriber, #36537) [Link] (4 responses)

> If an application can only make use of the first 1 us of every 1 ms, and asks to run only then, the kernel may be able to give it 100% of the time it wants without any system impact; if, on the other hand, it can't tell the kernel, it has to busy-wait through a lot more processor time to have any chance of being running then, loading the system much more heavily.

What's wrong with nanosleep?

Group scheduling and alternatives

Posted Dec 10, 2010 2:46 UTC (Fri) by iabervon (subscriber, #722) [Link] (3 responses)

Last time I tried it, if the kernel happened to start running a different process 98 us before your sleep was scheduled to complete, and the other process was getting a 100 us slice, it would wake you 2 us after the start of the millisecond, when your process can't do anything other than record the loss of a sample and go back to sleep for 998 us. nanosleep is fine for telling the kernel you can't do anything until a given time, but it doesn't tell the kernel that, in 999 us, there will be 1 us that you can use, followed by another 999 us during which you're either done or too late to do anything.

The scales in my example are different from what I was actually doing at the time, but I was trying to sample an accelerometer attached to an i2c bus attached to a serial port at 20 Hz; I needed to send a few bytes at the right time, which would cause the accelerometer to take a sample then. (The accelerometer device didn't support automatic periodic sampling.) It turned out that the only way to get data that I could analyze was to sleep until 1 ms before the time I wanted to sample and busy-wait until the right time; that meant I was generally running by the sample time, and generally hadn't used up my time slice. On the other hand, I was burning ~2% of the CPU on a power-limited system busy-waiting.

Group scheduling and alternatives

Posted Dec 10, 2010 18:08 UTC (Fri) by njs (subscriber, #40338) [Link] (2 responses)

I'm surprised -- on modern Linux, if you make use of RT scheduling (and especially if you can use the -rt branch) then I think you should be able to get <1 ms wakeup resolution.

Group scheduling and alternatives

Posted Dec 10, 2010 18:38 UTC (Fri) by iabervon (subscriber, #722) [Link] (1 responses)

This was, admittedly, a while ago, on a vanilla kernel. But I also didn't want to need any special scheduling capabilities. I'd be okay with dropping the occasional sample under heavy load (so it's not really a real-time critical task), but just sleeping was causing me to drop lots of samples under minimal load.

Group scheduling and alternatives

Posted Dec 20, 2010 1:44 UTC (Mon) by kevinm (guest, #69913) [Link]

sched_setscheduler(SCHED_FIFO) combined with nanosleep() should get you there.
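
A minimal sketch of that combination, using clock_nanosleep() with an absolute deadline so the period does not drift; take_sample() is a stand-in for whatever pokes the device, SCHED_FIFO needs root or CAP_SYS_NICE, and older glibc wants -lrt for the clock functions:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    #define PERIOD_NS 50000000L          /* 20 Hz, as in the example above */

    static void take_sample(void)
    {
        /* stand-in for sending the bytes that trigger a sample */
    }

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 50 };
        struct timespec next;

        if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
            perror("sched_setscheduler");   /* needs CAP_SYS_NICE */

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (;;) {
            take_sample();
            next.tv_nsec += PERIOD_NS;
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            /* sleep until the absolute deadline for the next sample */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        }
    }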

Group scheduling and alternatives

Posted Dec 16, 2010 20:36 UTC (Thu) by oak (guest, #2786) [Link]

Doesn't work, as developers are "lazy". I think the default needs to be some middle ground, so that some processes can say they need more and some can say they need less (priority, scheduling accuracy...). The former group because they've been found to have problems, and the latter because they've been found to cause problems. I.e., minimize the number of programs that need to be modified to get the system to behave properly.

Group scheduling and alternatives

Posted Dec 9, 2010 18:48 UTC (Thu) by madscientist (subscriber, #16861) [Link] (1 responses)

Thank goodness they ditched the TTY idea. That was just... not right.

Group scheduling and alternatives

Posted Dec 10, 2010 4:59 UTC (Fri) by jzbiciak (guest, #5246) [Link]

Agreed!

For GUI-heavy users, nothing is bound to a TTY anyway! There aren't any processes bound to TTYs on my wife's machine for example, other than some lonely gettys and the X server itself.

Group scheduling and alternatives

Posted Dec 4, 2016 10:52 UTC (Sun) by mkerrisk (subscriber, #1978) [Link]

"But changing something as fundamental as the effects of niceness would be, in a very real sense, an ABI change. There may not be many users of nice, but installations which depend on it would not appreciate a change in its semantics."
Ironically, changing the traditional semantics of niceness was exactly what the "group scheduling" feature (a.k.a. autogroup) did bring about. When autogrouping is on (which is the default in various distributions), then in many situations (e.g., when applied to one of two CPU-bound jobs running in two different terminal windows), nice(1) becomes a no-op. See the notes on the autogroup feature in the (soon to be released) revised sched(7) manual page. A web search easily finds many users who were surprised by the change.
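
For anybody bitten by this, sched(7) also documents the relevant knobs: autogrouping can be turned off globally by writing 0 to /proc/sys/kernel/sched_autogroup_enabled, and a group's own nice value can be set through /proc/[pid]/autogroup. A minimal sketch of the latter:

    #include <stdio.h>

    int main(void)
    {
        /* Set the nice value of this process's autogroup; unlike
           setpriority(), this changes how the whole group competes
           against other autogroups. */
        FILE *f = fopen("/proc/self/autogroup", "w");

        if (!f) {
            perror("fopen");
            return 1;
        }
        fprintf(f, "10\n");
        fclose(f);
        return 0;
    }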


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds