
A group scheduling demonstration

By Jonathan Corbet
March 16, 2011
There has been much talk of the per-session group scheduling patch which is part of the 2.6.38 kernel, but it can be hard to see that code in action if one isn't doing a 20-process kernel build at the time. Recently, your editor inadvertently got a demonstration of group scheduling thanks to some unexpected results from a Rawhide system upgrade; the scheduler's behavior was shown clearly, in a form that could be captured.

Rawhide users know that surprises often lurk behind the harmless-looking yum upgrade command. In this particular case, something in the upgrade (related to fonts, possibly) caused every graphical process in the system to decide that it was time to do some heavy processing. The result can be seen in this output from the top command:

[Top output]

The per-session heuristic had put most of the offending processes into a single control group, with the effect that they were mostly competing against each other for CPU time. These processes are, in the capture above, each currently getting 5.3% of the available CPU time. Two processes which were not in that control group were left essentially competing for the second core in the system; they each got 46%. The system had a load average of almost 22, and the desktop was entirely unresponsive. But it was possible to log into the system over the net and investigate the situation without really even noticing the load.
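For readers who want to poke at this behavior on their own systems, the autogroup feature exposes a couple of simple knobs; what follows is a minimal sketch, with the PID being just an example:

# Is per-session group scheduling (autogrouping) enabled?  1 = yes, 0 = no.
cat /proc/sys/kernel/sched_autogroup_enabled

# Which autogroup does a given process belong to?  (1234 is an example PID.)
cat /proc/1234/autogroup

# The feature can be toggled at run time (as root):
echo 0 > /proc/sys/kernel/sched_autogroup_enabled
echo 1 > /proc/sys/kernel/sched_autogroup_enabled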

This isolation is one of the nicest features of group scheduling; even when a large number of processes go totally insane, their ability to ruin life for other tasks on the machine is limited. That, alone, justifies the cost of this feature.




What about i/o ?

Posted Mar 17, 2011 8:28 UTC (Thu) by Lennie (subscriber, #49641) [Link] (1 responses)

I know we have nice and ionice. And things running in the idle I/O scheduling class will not impact the I/O of other processes too much.

Does the group scheduling also handle I/O scheduling? Or maybe just when asked to do so?
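As a side note, the nice/ionice combination mentioned above looks roughly like this; a minimal sketch, with the backup command serving only as a placeholder:

# Run a bulk job at the lowest CPU priority and in the idle I/O class,
# so it only gets disk time when nothing else wants it:
nice -n 19 ionice -c 3 tar czf /tmp/backup.tar.gz /home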

What about i/o ?

Posted Mar 17, 2011 12:56 UTC (Thu) by corbet (editor, #1) [Link]

I/O scheduling is a different controller, and it's not quite as far along as group CPU scheduling, but it's getting there. Some recent coverage in this article and this one too.
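For the curious, the group I/O controller of that era (blkio, used with the CFQ I/O scheduler) was driven through the cgroup filesystem; the following is a rough sketch, assuming a cgroup-v1-style mount at /cgroup/blkio:

# Mount the blkio controller and create a group for bulk I/O:
mount -t cgroup -o blkio none /cgroup/blkio
mkdir /cgroup/blkio/bulk

# Proportional weight, in the range 100-1000 (lower = smaller share):
echo 100 > /cgroup/blkio/bulk/blkio.weight

# Move the current shell (and its children) into the group:
echo $$ > /cgroup/blkio/bulk/tasks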

A group scheduling demonstration

Posted Mar 17, 2011 9:36 UTC (Thu) by rvfh (guest, #31018) [Link] (10 responses)

I'd like the same for IOs so my machine can swap like mad because of a runaway process and still let me login and kill that process...

A group scheduling demonstration

Posted Mar 17, 2011 11:08 UTC (Thu) by intgr (subscriber, #39733) [Link]

+1. Linux's low-memory thrashing behavior is a huge problem, much more so than CPU starvation.

I've frequently wondered whether throttling memory allocations for memory-intensive apps would be a better solution than the current system-wide thrashing that takes down *all* processes -- including the X server and ssh.

And now it occurred to me that a per-session memory cgroup could already solve this problem. Not sure how much overhead it adds, though.

group scheduling and I/O

Posted Mar 17, 2011 23:28 UTC (Thu) by giraffedata (guest, #1954) [Link] (5 responses)

I'd like the same for IOs so my machine can swap like mad because of a runaway process and still let me login and kill that process

It is much harder to charge a disk I/O to a process than to charge a CPU second to a process. If I read a page in from swap, do you charge that I/O to me or to the process that stole the page frame from me that used to contain that page, making me read it in again?

But the group scheduling which is the subject of this article is irrelevant to the problem you mention anyway, because you're talking about only one selfish process.

The problem you're talking about is not a distribution-of-resources problem. It's thrashing. Thrashing is not caused by high demand for resources, but by poor scheduling. It's where Linux doesn't have enough memory to meet the demands of all the running processes, but it keeps dispatching all of them anyway, round robin, so that no process ever has memory long enough to get anything done. If the memory is oversubscribed by a factor of two, the processes don't run half as fast, but a thousandth as fast.

The solution to thrashing is to let a process keep its memory long enough to make substantial progress before dispatching a competing one, even if that means the CPU is idle while there are processes ready to run.

How to maintain access to the machine

Posted Mar 18, 2011 7:31 UTC (Fri) by rvfh (guest, #31018) [Link] (4 responses)

Interesting read indeed, thanks for that.
I think this has been discussed before, but we would basically need two things to keep access to the machine then:
* the processes we need to be pinned to physical memory (ssh, bash, ...)
* the other processes to stop I/O when the processes above are being accessed, or better, our processes to not need I/O at all

Correct? If yes, then we could have a process group that we declare unswappable and unstoppable (it had better be very reliable, then!), or would I still be missing a very difficult-to-solve part?

How to maintain access to the machine

Posted Mar 18, 2011 17:02 UTC (Fri) by giraffedata (guest, #1954) [Link] (3 responses)

It doesn't really have to be that hard. It isn't necessary to identify a set of high priority processes; all it takes is a more intelligent scheduler that senses thrashing and makes different choices as to what process to run next. It places some processes on hold for a few seconds at a time, in rotation. That way, the shell from which you're trying to kill that runaway process would have several second response time on some commands instead of several minutes, and that's good enough to get the job done.

This kind of scheduling was common ages ago with batch systems. It's easier there, because if you put a batch job on ice for even ten minutes, nobody is going to complain. In modern interactive computers, it's much less common. Maybe because very slow response is almost as bad as total stoppage due to thrashing in most cases (just not in the one we're talking about).

The key point is that thrash avoidance isn't about giving some processes priority for scarce resources; it's about making every process in the system run faster.

How to maintain access to the machine

Posted Mar 24, 2011 19:49 UTC (Thu) by jospoortvliet (guest, #33164) [Link] (2 responses)

Ok, now I get what the problem is and what (theoretically) the solution would be. Makes me even more sad, considering this issue bites me about twice a day, forcing me to do a hard reset :(

(mostly firefox & kmail eating too much ram)

How to maintain access to the machine

Posted Mar 24, 2011 20:49 UTC (Thu) by giraffedata (guest, #1954) [Link] (1 responses)

this issue bites me about twice a day, forcing me to do a hard reset :( (mostly firefox and kmail eating too much ram)

You at least should have a memory rlimit on those processes, then. That way, when the program tries to grab an impractical amount of memory, it just dies immediately.

Rlimits are full of holes (your quota gets replicated every time you fork; I/O address space and shared memory get counted against it) and are hard to set (you have to set it from inside the process), but they help a lot. I run with an address space rlimit of half my real memory size on every process (with some special exceptions). Most people use the default, which is no limit on anything.

The shell 'ulimit' command is the common way to set an rlimit. I don't know what it takes to set that up for your firefox and kmail processes.
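A minimal sketch of such a wrapper, with the 2GB figure and the firefox path being arbitrary examples:

#!/bin/bash
# Limit the address space of the browser (and anything it forks from this
# shell) to about 2GB; allocations beyond that will simply fail.
ulimit -v $(( 2 * 1024 * 1024 ))   # ulimit -v is expressed in kilobytes
exec /usr/bin/firefox "$@"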

How to maintain access to the machine

Posted Mar 26, 2011 9:40 UTC (Sat) by efexis (guest, #26355) [Link]

Memory cgroups will solve that problem for you; they are the number one thing I have found in *years* that improves system stability. They are very simple to implement. Assume cgroups is mounted under /cgroup with the memory controller enabled (or, for separate control, I mount my memory controller under /cgroup/memory so I can put tasks under memory control groups without also putting them under others).

Create a shell script wrapper for what you want to run:

#!/bin/bash
# Wrapper: put this shell (and everything it execs) into a memory cgroup
# capped at 1200 MiB, creating the group on first use.
CG="/cgroup/browser"
[[ -d "$CG" ]] || {
    mkdir -p "$CG"
    echo $(( 1048576 * 1200 )) > "$CG/memory.limit_in_bytes"   # 1200 MiB
}
echo $$ > "$CG/tasks"        # move this shell into the group
exec /path/to/browser "$@"   # the browser inherits the group membership

That puts it into a 1200MB group; no matter how many processes it forks, the entire lot cannot go over that 1200, and if they do, the OOM killer will kick in within only that group. You can also put similar lines at the top of scripts in /etc/init.d, for example (obviously not needing the 'exec' line if you're adding to an existing startup script).

As long as you don't give any group 100% of memory (I tend to put everything in 80% groups by default), no single runaway process or set of processes can ever bring the entire system down, because there's always that 20% left that it cannot touch.

A group scheduling demonstration

Posted Mar 25, 2011 14:39 UTC (Fri) by kabloom (guest, #59417) [Link] (2 responses)

+1. It's my opinion that swap space is really not worthwhile these days, for precisely that reason. I usually have enough real RAM to do what I want, and on the rare occasions that I don't, it's the result of a bug in some development project of mine that needs to be killed and fixed. Once the system starts swapping, it's hard to get in and kill the process myself, so it would be better to let the OOM killer get to it sooner rather than later.

If this can be fixed with a patch to the scheduler, rather than eliminating swap space, that might be better.

A group scheduling demonstration

Posted Mar 28, 2011 21:56 UTC (Mon) by Duncan (guest, #6647) [Link] (1 responses)

There are two problems with disabling swap entirely, however.

1) Suspend-to-disk (aka hibernate) uses swap-space for the suspend image. No swap-space, no hibernate.

This can be worked around by enabling the desired suspend-image swap space in the hibernate script itself. I use a custom hibernate script here that does that, with the partition it uses, and whether to enable it at all, among the configurables. Of course, if the script enabled the swap, it disables it again after resume.

2) In some cases, disabling swap at least /used/ to disable memory transfer between various memory zones, etc. It's not as critical for 64-bit, but for 32-bit especially, it's quite possible to have a shortage of memory within one zone while the system as a whole still has plenty of memory, so having the flexibility to transfer memory between zones, where the memory ranges overlap and it can be used for either zone, can be very important.

Meanwhile, this won't work for everyone, but on machines with multiple storage devices (normally disks, but with SSD these days...), setting up multiple swap partitions, one per device, and specifically setting equivalent swap priority on all partitions directs the kernel to stripe swap across all partitions at that priority, effectively RAID-0-ing them. I have four SATA drives here with most of the system in RAID-1, but have arranged swap as above, across all four devices.

I can be a half-gig or more into swap before I really start to notice it. At one point I had a runaway process that I let eat all 8 gigs of physical memory plus 15+ gigs of the 16 gigs of swap (4 gigs on each device; they were laid out identically, and that gave me a single partition of 4 gigs, half the size of memory, for the suspend image, as that was the biggest suspend image allowed) before I killed it, just because it was fascinating to watch. After a couple gigs of swap or so, there was a definite drag on responsiveness, but the system never did totally lock up, and when I realized what was happening, I opened a shell and had the kill command all typed in, ready to hit enter, before it hit double-digit gigs of swap. After that it was just a matter of watching it happen and waiting until it got as close as I dared to the limit, before I hit enter and killed the offender.

Also, while everybody else writing about it seems to recommend reducing vm.swappiness to something like 20, from the default 60, on the above striped swap config, I invert that, running vm.swappiness=100, as I'd rather dump into the 4-way-striped swap than dump hard-earned RAID-1 cache. Works great here! =:^)
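For reference, the equal-priority striping described above amounts to giving every swap area the same priority, roughly like this (device names are examples; in /etc/fstab the equivalent is "pri=5" in the options field):

# Same priority on every device => the kernel stripes across them:
swapon -p 5 /dev/sda2
swapon -p 5 /dev/sdb2
swapon -p 5 /dev/sdc2
swapon -p 5 /dev/sdd2

# The swappiness tweak mentioned above:
sysctl -w vm.swappiness=100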

Duncan

A group scheduling demonstration

Posted Mar 29, 2011 7:35 UTC (Tue) by rvfh (guest, #31018) [Link]

> 1) Suspend-to-disk (aka hibernate) uses swap-space for the suspend image. No swap-space, no hibernate.

According to the kernel documentation, you can set up a swap file for this. Not tried yet though...
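A rough sketch of the swap-file route, per the kernel's swsusp documentation (the size and path are examples; the resume_offset value has to be read out of filefrag's output):

# Create and enable a 4GB swap file:
dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# For hibernation, the kernel also needs to be told where the file lives:
#   resume=<device holding /swapfile> resume_offset=<N>
# where N is the first physical block reported by "filefrag -v /swapfile",
# both passed on the kernel command line.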

systemd and autogroup scheduling

Posted Mar 19, 2011 17:47 UTC (Sat) by meyert (subscriber, #32097) [Link] (3 responses)

By the way, why does Rawhide have the autogroup scheduling flag set at all? Doesn't systemd create a control group for each service, e.g. prefdm.service and so on? How do the two interact (given that you cannot see the autogroups in the mounted cgroup filesystem)?

Rawhide

Posted Mar 19, 2011 23:43 UTC (Sat) by corbet (editor, #1) [Link] (2 responses)

I don't know if Rawhide has this feature enabled or not, actually; I run my own kernels most of the time.

Rawhide

Posted Mar 20, 2011 11:23 UTC (Sun) by meyert (subscriber, #32097) [Link] (1 responses)

Okay. What I wanted to say was that you probably didn't see the effect of the autogroup functionality here, but rather the systemd functionality of creating a cpu cgroup for each service. As far as I understand, when autogroup is active and a cpu cgroup for a given task exists, the cpu cgroup takes precedence.
And as a user of Rawhide you probably use systemd as the init daemon.
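One way to check which mechanism is actually in play for a given task is to look at both proc files; a small sketch, with the PID being an example:

# The autogroup (if any) the task has been placed in:
cat /proc/1234/autogroup      # e.g. "/autogroup-42 nice 0"

# The cgroups the task belongs to; a systemd-created cpu cgroup for a
# service would show up here:
cat /proc/1234/cgroup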

Rawhide

Posted Mar 28, 2011 22:09 UTC (Mon) by Duncan (guest, #6647) [Link]

Perhaps, but a cpu cgroup for each /service/ won't affect regular apps and their grouping. The kernel's auto-grouping will.

As can be seen from the top screenshot and as explained in the text, all those X apps were splitting the time available to their group, which should have been the one set by the kernel as they aren't services. At least, if systemd had altered their scheduling, it did so in the same way that the kernel autogrouping would have, because the autogrouping would have put the entire X session in the same group, and that's the scheduling behavior observed.

Duncan (on Gentoo, no systemd, mainline kernel 2.6.38, autogrouping on, so what systemd might or might not do doesn't presently affect me in the slightest)

A group scheduling demonstration

Posted Mar 27, 2011 12:26 UTC (Sun) by poelzi (guest, #14953) [Link] (1 responses)

My conclusion about autogroup scheduling is that it is simply broken. If you look at 'ps xaw -O session' you will see that nearly everything ends up in the same group. The process-group view, 'ps xaw -O pgrp', shows a somewhat better distribution, but it still needs some fixes in at least the major desktop UIs, GNOME and KDE.
As maintainer of ulatencyd (https://github.com/poelzi/ulatencyd) I came to the conclusion that there is no way to optimize a system based on such a simple rule. You need workarounds and heuristics, and also contact with the X server, etc.

A group scheduling demonstration

Posted Mar 28, 2011 22:57 UTC (Mon) by Duncan (guest, #6647) [Link]

Running Gentoo (so no systemd) with mainline 2.6.38, auto-grouping on, kde 4.6.1, here. I hadn't tried tracking sessions and wasn't sure how to (tho I could have RTFMPed), so thanks for the ps commands.

But they show a reasonably logical grouping here. Sure, they aren't perfect, but the grouping is reasonably good given the simplicity of the algorithm. Each tty gets its own session, as does each pts. The kernel threads are all session 0. init is session 1. Various daemons each get their own session, shared with their forks. startx, etc., has a session. kdeinit4 has a session that includes most of X. akonadi gets a session. nepomuk does too.

The only thing that seems "wrong", if you can call it that, is that most of KDE (or more generally X, tho here, X itself has its own session) and its apps share a session. But that's a given with the algorithm used, and as this LWN article demonstrates, even then there are benefits, as it tends to isolate all of X into its own little corner so it can't interfere with the rest of the system and there's still enough interactivity available outside X to take down a runaway process. Sure, splitting the X group further would be nice, especially for "CLI-challenged" individuals, but autogrouping really does vastly improve things.

I do find it comforting that KDE's so-called "semantic desktop" bits are isolated to their own sessions, however. Neither nepomuk nor akonadi can run away with things. Similarly, CLI sessions are just that: each pts or tty is its own session, thus isolating system-update jobs (I DID say Gentoo, after all =:^) into their own corner.

Duncan


Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds