A group scheduling demonstration
Rawhide users know that surprises often lurk behind the harmless-looking yum upgrade command. In this particular case, something in the upgrade (related to fonts, possibly) caused every graphical process in the system to decide that it was time to do some heavy processing. The result can be seen in this output from the top command:
The per-session heuristic had put most of the offending processes into a single control group, with the effect that they were mostly competing against each other for CPU time. These processes are, in the capture above, each currently getting 5.3% of the available CPU time. Two processes which were not in that control group were left essentially competing for the second core in the system; they each got 46%. The system had a load average of almost 22, and the desktop was entirely unresponsive. But it was possible to log into the system over the net and investigate the situation without really even noticing the load.
This isolation is one of the nicest features of group scheduling; even when a large number of processes go totally insane, their ability to ruin life for other tasks on the machine is limited. That, alone, justifies the cost of this feature.
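For readers who want to poke at this themselves, here is a minimal sketch, assuming a 2.6.38-or-later kernel with CONFIG_SCHED_AUTOGROUP and a cgroup-v1 cpu controller; the /cgroup mount point and the "desktop" group name are arbitrary examples:

cat /proc/sys/kernel/sched_autogroup_enabled    # 1 = per-session autogrouping is on
cat /proc/$$/autogroup                          # e.g. "/autogroup-123 nice 0" for this shell's group
# The same kind of isolation can be set up by hand with the cpu controller:
mkdir -p /cgroup/cpu
mount -t cgroup -o cpu none /cgroup/cpu
mkdir /cgroup/cpu/desktop
echo 1024 > /cgroup/cpu/desktop/cpu.shares      # default weight; halve it to halve the group's share
echo $$ > /cgroup/cpu/desktop/tasks             # move this shell (and its children) into the group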
Index entries for this article:
Kernel: Group scheduling
Kernel: Scheduler/Group scheduling
What about i/o ?
Posted Mar 17, 2011 8:28 UTC (Thu) by Lennie (subscriber, #49641)
Does the group scheduling also handle I/O scheduling? Or maybe only when asked to do so?
What about i/o ?
Posted Mar 17, 2011 12:56 UTC (Thu) by corbet (editor, #1)
I/O scheduling is a different controller, and it's not quite as far along as group CPU scheduling, but it's getting there. Some recent coverage in this article and this one too.
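As a rough illustration of that separate controller, a hedged sketch using the v1 blkio controller as it existed around that time; the mount point, group name, and weight value are examples only:

mkdir -p /cgroup/blkio
mount -t cgroup -o blkio none /cgroup/blkio
mkdir /cgroup/blkio/background
echo 100 > /cgroup/blkio/background/blkio.weight   # CFQ weight; roughly the low end of the 100-1000 range
echo $$ > /cgroup/blkio/background/tasks           # this shell's I/O now competes at the lower weight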
A group scheduling demonstration
Posted Mar 17, 2011 9:36 UTC (Thu) by rvfh (guest, #31018)
I'd like the same for I/O, so my machine can swap like mad because of a runaway process and still let me log in and kill that process.
A group scheduling demonstration
Posted Mar 17, 2011 11:08 UTC (Thu) by intgr (subscriber, #39733)
I've frequently wondered whether throttling memory allocations for memory-intensive apps would be a better solution than the current system-wide thrashing that takes down *all* processes -- including the X server and ssh.
And now it occurred to me that a per-session memory cgroup could already solve this problem. Not sure how much overhead it adds, though.
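A sketch of that idea, assuming the v1 memory controller and a hook in a login script; the mount point, group name, and the 1GB cap are illustrative only:

mkdir -p /cgroup/memory
mount -t cgroup -o memory none /cgroup/memory 2>/dev/null     # once per boot; harmless if already mounted
CG=/cgroup/memory/session-$$
mkdir -p "$CG"
echo $((1024 * 1024 * 1024)) > "$CG/memory.limit_in_bytes"    # 1GB cap for this session
echo $$ > "$CG/tasks"                                         # the shell and everything it spawns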
group scheduling and I/O
Posted Mar 17, 2011 23:28 UTC (Thu) by giraffedata (guest, #1954)
It is much harder to charge a disk I/O to a process than to charge a CPU second to a process. If I read a page in from swap, do you charge that I/O to me or to the process that stole the page frame from me that used to contain that page, making me read it in again?
But the group scheduling which is the subject of this article is irrelevant to the problem you mention anyway, because you're talking about only one selfish process.
The problem you're talking about is not a resource-distribution problem. It's thrashing. Thrashing is not caused by high demand for resources, but by poor scheduling. It's where Linux doesn't have enough memory to meet the demands of all the running processes, but it keeps dispatching all of them anyway, round robin, so that no process ever holds memory long enough to get anything done. If memory is oversubscribed by a factor of two, the processes don't run half as fast, but a thousandth as fast.
The solution to thrashing is to let a process keep its memory long enough to make substantial progress before dispatching a competing one, even if that means the CPU is idle while there are processes ready to run.
How to maintain access to the machine
Posted Mar 18, 2011 7:31 UTC (Fri) by rvfh (guest, #31018)
I think this has been discussed before, but we would basically need two things to keep access to the machine:
* the processes we need (ssh, bash, ...) to be pinned to physical memory
* the other processes to stop I/O when the processes above are being accessed, or better, our processes to not need I/O at all
Correct? If yes, then we could have a process group that we declare unswappable and unstoppable (it had better be very reliable, then!). Or would I still be missing a very difficult-to-solve part?
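The strict version of "pinned" would be mlockall() from inside each rescue process; a softer approximation with the cgroup tools discussed in this thread is a memory group that the kernel is told to avoid swapping. A sketch, with the group name and the choice of sshd purely illustrative:

CG=/cgroup/memory/rescue
mkdir -p "$CG"
echo 0 > "$CG/memory.swappiness"                # bias reclaim away from this group (not a hard pin)
echo $(pidof sshd | awk '{print $1}') > "$CG/tasks"
echo $$ > "$CG/tasks"                           # and the current shell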
How to maintain access to the machine
Posted Mar 18, 2011 17:02 UTC (Fri) by giraffedata (guest, #1954)
It doesn't really have to be that hard. It isn't necessary to identify a set of high priority processes; all it takes is a more intelligent scheduler that senses thrashing and makes different choices as to what process to run next. It places some processes on hold for a few seconds at a time, in rotation. That way, the shell from which you're trying to kill that runaway process would have several second response time on some commands instead of several minutes, and that's good enough to get the job done.
This kind of scheduling was common ages ago with batch systems. It's easier there, because if you put a batch job on ice for even ten minutes, nobody is going to complain. On modern interactive computers it's much less common, maybe because in most cases very slow response is almost as bad as the total stoppage caused by thrashing (just not in the case we're talking about).
The key point is that thrash avoidance isn't about giving some processes priority for scarce resources; it's about making every process in the system run faster.
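A userspace caricature of that idea, just to make it concrete: watch swap traffic and, when it gets heavy, briefly suspend the biggest memory consumer. The threshold and the five-second hold below are arbitrary; a real implementation would live in the kernel's reclaim and scheduling code:

while sleep 5; do
    rate=$(vmstat 1 2 | tail -1 | awk '{print $7 + $8}')       # si+so from the one-second sample
    if [ "$rate" -gt 1000 ]; then                              # "heavy swapping", picked arbitrarily
        victim=$(ps -eo pid= --sort=-rss | awk 'NR==1 {print $1}')
        kill -STOP "$victim"                                   # park the biggest RSS consumer...
        sleep 5
        kill -CONT "$victim"                                   # ...then let it continue
    fi
done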
How to maintain access to the machine
Posted Mar 24, 2011 19:49 UTC (Thu) by jospoortvliet (guest, #33164)
This issue bites me about twice a day, forcing me to do a hard reset :( (mostly Firefox and KMail eating too much RAM).
How to maintain access to the machine
Posted Mar 24, 2011 20:49 UTC (Thu) by giraffedata (guest, #1954)
You at least should have a memory rlimit on those processes, then. That way, when the program tries to grab an impractical amount of memory, it just dies immediately.
Rlimits are full of holes (your quota gets replicated every time you fork; I/O address space and shared memory get counted against it) and are hard to set (you have to set them from inside the process), but they help a lot. I run with an address-space rlimit of half my real memory size on every process (with some special exceptions). Most people use the default, which is no limit on anything.
The shell 'ulimit' command is the common way to set an rlimit. I don't know what it takes to set that up for your firefox and kmail processes.
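One common approach is a small wrapper script placed ahead of the real binary in $PATH, or pointed to by the desktop menu entry; the 2GB figure and the firefox path below are examples only:

#!/bin/sh
# Set an address-space rlimit, then hand over to the real program.
ulimit -v $((2 * 1024 * 1024))      # RLIMIT_AS, in kilobytes: 2GB here
exec /usr/bin/firefox "$@"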
How to maintain access to the machine
Posted Mar 26, 2011 9:40 UTC (Sat) by efexis (guest, #26355)
Create a shell script wrapper for what you want to run:
#!/bin/bash
CG="/cgroup/browser"
[[ -d "$CG" ]] || {
    mkdir -p "$CG"
    echo $(( 1048576 * 1200 )) > "$CG/memory.limit_in_bytes"
}
echo $$ > "$CG/tasks"
exec /path/to/browser "$@"
That puts it into a 1200MB group; no matter how many processes it forks, the entire lot cannot go over that 1200MB, and if they do, an OOM kill will happen within that group only. You can also put similar lines at the top of scripts in /etc/init.d, for example (obviously not needing the 'exec' line if you're adding to an existing startup script).
As long as you don't give any group 100% of memory (I tend to put everything in 80% groups by default), no single runaway process or set of processes can ever bring the entire system down, because there's always that 20% left that it cannot touch.
A group scheduling demonstration
Posted Mar 25, 2011 14:39 UTC (Fri) by kabloom (guest, #59417)
If this can be fixed with a patch to the scheduler, rather than eliminating swap space, that might be better.
A group scheduling demonstration
Posted Mar 28, 2011 21:56 UTC (Mon) by Duncan (guest, #6647)
1) Suspend-to-disk (aka hibernate) uses swap-space for the suspend image. No swap-space, no hibernate.
This can be worked around by enabling the desired suspend image swapspace in the hibernate script itself. I use a custom hibernate script here that does that, with the partition it uses and whether it enables it or not part of the configurables. Of course if it enabled it, it disables it after resume, too.
2) In some cases, disabling swap at least /used/ to disable memory transfer between various memory zones, etc. It's not as critical for 64-bit, but for 32-bit especially, it's quite possible to have a shortage of memory within one zone while the system as a whole still has plenty of memory, so having the flexibility to transfer memory between zones, where the memory ranges overlap and it can be used for either zone, can be very important.
Meanwhile, this won't work for everyone, but on machines with multiple storage devices (normally disks, but with SSDs these days...), setting up multiple swap partitions, one per device, and giving them all the same swap priority directs the kernel to stripe swap across all of them at that priority, effectively RAID-0-ing them. I have four SATA drives here with most of the system in RAID-1, but swap is arranged as above, across all four devices. I can be half a gig or more into swap before I really start to notice it. At one point I had a runaway process that I let eat all 8 gigs of physical memory plus 15+ of the 16 gigs of swap (4 gigs on each device; the drives are laid out identically, and that gave me a single 4-gig partition, half the size of memory, for the suspend image, which was the biggest suspend image allowed) before I killed it, just because it was fascinating to watch. After a couple of gigs of swap there was a definite drag on responsiveness, but the system never did totally lock up. When I realized what was happening, I opened a shell and had the kill command all typed in, ready to hit enter, before it hit double-digit gigs of swap. After that it was just a matter of watching it happen and waiting until it got as close as I dared to the limit before I hit enter and killed the offender.
Also, while everybody else writing about it seems to recommend reducing vm.swappiness to something like 20, from the default 60, on the above striped swap config, I invert that, running vm.swappiness=100, as I'd rather dump into the 4-way-striped swap than dump hard-earned RAID-1 cache. Works great here! =:^)
Duncan
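For reference, a sketch of the striped-swap setup described above; the device names are placeholders, and it is the equal priorities that make the kernel spread swap round-robin across the devices:

# /etc/fstab entries, one per device, all at the same priority:
#   /dev/sda4  none  swap  sw,pri=1  0 0
#   /dev/sdb4  none  swap  sw,pri=1  0 0
# or, from the command line:
swapon -p 1 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4
sysctl -w vm.swappiness=100        # the inverted-swappiness tweak mentioned above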
A group scheduling demonstration
Posted Mar 29, 2011 7:35 UTC (Tue) by rvfh (guest, #31018)
According to the kernel documentation, you can set up a swap file for this. Not tried yet though...
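A sketch of what the kernel documentation (Documentation/power/swsusp-and-swap-files.txt) describes; the size, paths, and boot parameters are examples, and the offset has to be read from filefrag's output for your own file:

dd if=/dev/zero of=/swapfile bs=1M count=8192    # 8GB swap file
mkswap /swapfile
swapon /swapfile
filefrag -v /swapfile | head                     # note the physical offset of the file's first extent
# then boot with something like:  resume=/dev/sda1 resume_offset=<that offset>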
systemd and autogroup scheduling
Posted Mar 19, 2011 17:47 UTC (Sat) by meyert (subscriber, #32097)
Rawhide
Posted Mar 19, 2011 23:43 UTC (Sat) by corbet (editor, #1)
I don't know if Rawhide has this feature enabled or not, actually; I run my own kernels most of the time.
Rawhide
Posted Mar 20, 2011 11:23 UTC (Sun) by meyert (subscriber, #32097)
And as a user of Rawhide you probably use systemd as the init daemon.
Rawhide
Posted Mar 28, 2011 22:09 UTC (Mon) by Duncan (guest, #6647)
As can be seen from the top screenshot and as explained in the text, all those X apps were splitting the time available to their group, which should have been the one set by the kernel as they aren't services. At least, if systemd had altered their scheduling, it did so in the same way that the kernel autogrouping would have, because the autogrouping would have put the entire X session in the same group, and that's the scheduling behavior observed.
Duncan (on Gentoo, no systemd, mainline kernel 2.6.38, autogrouping on, so what systemd might or might not do doesn't presently affect me in the slightest)
A group scheduling demonstration
Posted Mar 27, 2011 12:26 UTC (Sun) by poelzi (guest, #14953)
As the maintainer of ulatencyd (https://github.com/poelzi/ulatencyd), I came to the conclusion that there is no way to optimize a system based on such a simple rule. You need workarounds and heuristics, and also contact with the X server, etc.
A group scheduling demonstration
Posted Mar 28, 2011 22:57 UTC (Mon) by Duncan (guest, #6647)
But they show a reasonably logical grouping here. Sure, it isn't perfect, but the grouping is reasonably good given the simplicity of the algorithm. Each tty gets its own session, as does each pts. The kernel threads are all session 0. init is session 1. Various daemons each get their own session, shared with their forks. startx, etc., has a session. kdeinit4 has a session that includes most of X. akonadi gets a session. nepomuk does too.
The only thing that seems "wrong", if you can call it that, is that most of KDE (or more generally X, though here X itself has its own session) and its apps share a session. But that's a given with the algorithm used, and as this LWN article demonstrates, even then there are benefits: it tends to isolate all of X into its own little corner so it can't interfere with the rest of the system, and there's still enough interactivity available outside X to take down a runaway process. Sure, splitting the X group further would be nice, especially for "CLI-challenged" individuals, but autogrouping really does vastly improve things.
I do find it comforting that KDE's so-called "semantic desktop" bits are isolated to their own sessions, however. Neither nepomuk nor akonadi can run away with things. Similarly, CLI sessions are just that, pts or tty, they're in their own session, thus isolating system-update jobs (I DID say Gentoo after all =:^) into their own corner.
Duncan
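To see the grouping described above on a running system, the session IDs and (on an autogroup kernel) the autogroup assignments are easy to inspect:

ps -eo pid,sess,comm | head      # one autogroup per session ID when autogrouping is on
cat /proc/self/autogroup         # e.g. "/autogroup-42 nice 0"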