LWN: Comments on "Another OOM killer rewrite"

Another OOM killer rewrite

jayen — Sun, 09 Jan 2011 08:57:55 +0000

I have repeatedly had the oom killer kill KDE when run on my laptop in the presence of a memory hogging task (firefox) that is restored when the session is restored. :(

Another OOM killer rewrite

biji — Sun, 02 Jan 2011 12:31:07 +0000

The new OOM killer makes projects like zram/compcache unusable. Usually I can run 4 virtual machines on my 2G; now they get killed :( :( I have even used oom_score_adj. Any tips on this?

             total       used       free     shared    buffers     cached
Mem:          1998       1652        346          0         29        272
-/+ buffers/cache:       1350        648
Swap:          999        424        575
Total:        2998       2077        921

thanks

Another OOM killer rewrite: GUI pop-up

oak — Sat, 21 Aug 2010 19:37:05 +0000

That's way too fragile.

Use memory cgroups and their new OOM-kill handler/notifier and put the GUI, X and anything else the GUI program uses to a higher priority cgroup and anything you might want to kill to another cgroup.

Note that normal users don't know what all the (important!) background daemons are, so the GUI should probably offer for killing only the GUI processes that the user has opened himself and whose importance he knows.

Another OOM killer rewrite: GUI pop-up

AnswerGuy — Thu, 19 Aug 2010 20:18:33 +0000

Clearly the way to avoid the memory shortage at the time that you're OOM is to pre-create the GUI dialog, pre-allocating all its memory, and then have it tied into some IPC (even an open FS on a named PIPE, a la /dev/initctl; even a signal handler might work).

Then the OOM code simply posts an event to the IPC or sends the signal.

Now the GUI un-hides itself (this might trigger some memory utilization in the X server's backing store but that's very likely to already be available from X' heap and if any malloc() fails I'd hope that the X server would be robust enough to simply throw away the backing. Backing store is a caching feature that should fail gracefully).

The trick now is for the code filling in the GUI dialog to traverse the process table, displaying entries and allowing the selection of death all within the memory it pre-allocated. It must be prepared to page through the process listing in relatively small (let's say 4KB) fragments.

Another OOM killer rewrite

nix — Sun, 20 Jun 2010 20:15:25 +0000

a OOM killer that pops up a little GUI

Yeah, but doing that while avoiding (or absolutely minimising) memory allocations, that's hard.

More memory - less OOM situations - rlimits

giraffedata — Sun, 20 Jun 2010 17:16:13 +0000

I guess I don't really understand the internals of how rlimits work

But what you're asking about is externals. I don't know much about the internals myself.

Some of this is hard to document because the Linux virtual memory system keeps changing. But I think this particular area has been neglected enough over the years that the answer that I researched about 10 years ago is still valid.

The rlimit is on vmsize (aka vsize). Vmsize is the total amount of virtual address space in the process. That includes shared memory, memory mapped files, memory mapped devices, and pages that use no memory or swap space because they're implied zero. Procps 'ps' tells you this figure with --format=vsize (it's also "sz" in ps -l, but in pages, while "vsize" is in KiB).

A new process inherits all its parent's rlimits, so a parent with a 10M limit can use 5M and fork two children that do likewise and use a total of 15M just fine.

The user-wide process rlimit is an exception to the idea that a process can extend its resource allocation through the use of children, but it's not an exception to the basic idea that a child inherits the parent's full limit -- it's just a bizarre definition of limit intended to hack around the basic weakness of rlimit that we've been talking about. Apparently, fork bombs were a big enough problem at some time to deserve a special hack.
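
A minimal sketch of the vsize rlimit and its inheritance as described above, assuming bash, where `ulimit -v` sets the address-space limit in KiB. The helper name run_limited is made up for illustration:

```shell
#!/bin/bash
# run_limited KIB CMD [ARGS...]  (hypothetical helper, not from the thread)
# Runs CMD with its virtual address space capped at KIB kibibytes.
# The limit is set inside a subshell, so the calling shell is unaffected,
# and CMD inherits the limit across fork/exec.
run_limited() {
	local kib=$1
	shift
	(
		ulimit -v "$kib" || exit 1
		"$@"
	)
}

# Example: the child shell reports the limit it inherited.
# run_limited 1048576 bash -c 'ulimit -v'
```

Each child gets the same limit but its own accounting, which is why several children can together exceed the figure that any one of them is held to.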

More memory - less OOM situations

efexis — Sun, 20 Jun 2010 08:53:19 +0000

Yep I used to set it in my own code to protect the rest of the system from my mistakes, and still do when I am forced to use a system without memory cgroups, but many of the servers I help to maintain host sites for people developing in php, who 'coincidentally' (seeing as this is respectable lwn not slashdot!) have no clue about the underlying system or scalability, and they do all kinds of crazy things!

Also I guess I don't really understand the internals of how rlimits work, what they apply to, things like the allocating of costs of shared memory etc, whereas I followed the development of cgroups so have a better understanding. So do processes inherit the parent's limits but get their own counters? So you set say 10M data limit, you can end up with three processes with 5M each just fine? I think that's what you meant by new set of limits, or do you mean forked procs get reset to the hard limit or no limit? I guess this is with the exception of the user-wide process limit?

It's hard to tell the answers to these just from the man pages which is all the looking I've done so far, 'tho a few mins experimenting I could figure it out, but any main/quick details you wouldn't mind sharing off the top of your head I would be grateful, there're still a couple of older webservers without cgroup support I'm trying to help family with so rlimits would no doubt be helpful!

Cheers :-)

More memory - less OOM situations

giraffedata — Sat, 19 Jun 2010 01:52:32 +0000

The old way to do this is with rlimits. I've always set rlimits on vsize of every process -- my default is half of real memory (there's a bunch of swap space in reserve too). Before Linux, rlimits (under a different name) were the norm, but on Linux the default is unlimited and I think I'm the only one who changes it.

Rlimits have a severe weakness in that a process just has to fork to get a whole fresh set of limits, but they do catch the common case of the single runaway process.
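
A rough sketch of the "half of real memory" default, assuming bash on Linux with /proc/meminfo available. The helper name half_mem_kib and the choice of a login script are assumptions for illustration:

```shell
#!/bin/bash
# half_mem_kib [MEMINFO_FILE]  (hypothetical helper)
# Prints half of MemTotal in KiB, read from /proc/meminfo by default.
half_mem_kib() {
	awk '/^MemTotal:/ { printf "%.0f\n", int($2 / 2) }' "${1:-/proc/meminfo}"
}

# Example for a login script such as ~/.profile. Only the soft limit is
# set here, so it can still be raised later, up to the hard limit:
# ulimit -S -v "$(half_mem_kib)"
```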

More memory - less OOM situations

efexis — Thu, 17 Jun 2010 23:38:52 +0000

Is it definitely leaky, as opposed to perhaps coinciding with backups or something I/O-intensive that can fill your page cache, pushing stuff out to swap?

One horrible but quick fix for that is to have something like this set to run in the morning so everything's loaded back into memory ready for you to use it:

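#!/bin/bash
# Print the numeric field (in kB) of a /proc/meminfo line such as "Cached: 272 kB".
getnum() { echo "$2"; }
cached=$( getnum $(grep ^Cached: /proc/meminfo) )
memFree=$( getnum $(grep ^MemFree: /proc/meminfo) )
swapTotal=$( getnum $(grep ^SwapTotal: /proc/meminfo) )
swapFree=$( getnum $(grep ^SwapFree: /proc/meminfo) )
swapped=$(( swapTotal - swapFree ))
available=$(( cached + memFree ))
# Only cycle swap if everything swapped out fits into free + cached memory.
# (( )) compares numerically; [[ a > b ]] would compare the values as strings.
(( available > swapped && swapped > 0 )) && {
	echo "Loading everything back from swap"
	swapoff -a
	echo "Enabling swap again"
	swapon -a
}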

And of course if it isn't that then it won't bother doing anything so at least there should be nothing to lose! More elegant fixes depend on your setup.

More memory - less OOM situations

efexis — Thu, 17 Jun 2010 22:47:42 +0000

I do, either disable or severely limit swap. The only time I ever tend to hit OOM is when something's misbehaving (infinite loop + memory leak, fork bomb etc) so I get my system back once swap's exhausted, no matter how much swap there is, it'll never be bigger than infinite, so making it big seems to have the single effect of delaying how quickly I can recover the system.

But the best thing I've found to do is just put everything in separate cgroups with memory limits set at around 80% by default, so no single thing can take out the whole system. OOM killer works just great then, killing where needs to be killed. Example cgroup report:

$ cgroup_report
CGroup                           Mem: Used       %Full
apache                            244,396K      24.43%
compile                               524K       0.03%
mysql51                            81,340K       8.32%
mysql55                            85,756K       8.78%
netserv/courier/authdaemond        16,444K       3.28%
netserv/courier/courier            52,452K       5.24%
netserv/courier/esmtpd-msa            260K  (no limit)
netserv/courier/esmtpd              2,768K       0.55%
netserv/courier/imapd             785,568K      78.55%
netserv/courier/pop3d               8,492K       0.84%
netserv/courier                          0
netserv/courier/webmaild                 0
netserv                           176,320K      12.03%
network                               300K       0.03%
rsyncd                                264K       0.10%
sshd                              282,904K      28.29%
system/incron                            0
system                             15,588K       1.55%
system/watchers/sif                 1,144K       3.66%
system/watchers                          0
.                                   6,936K  (no limit)

Systems without cgroups though, yikes, when they go wrong they make my head hurt. I'm sure cgroups are the answer to everything :-p
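
A minimal sketch of setting up one such group with an 80% limit, assuming the cgroup v1 memory controller and its memory.limit_in_bytes and tasks files. The mount point /sys/fs/cgroup/memory and both helper names are assumptions, not details taken from the report above:

```shell
#!/bin/bash
# percent_of_mem_bytes PERCENT [MEMINFO_FILE]  (hypothetical helper)
# Prints PERCENT% of MemTotal, converted from kB to bytes.
percent_of_mem_bytes() {
	awk -v pct="$1" '/^MemTotal:/ { printf "%.0f\n", $2 * 1024 * pct / 100 }' \
		"${2:-/proc/meminfo}"
}

# make_limited_cgroup ROOT NAME LIMIT_BYTES  (hypothetical helper)
# Creates the group ROOT/NAME and caps its memory usage at LIMIT_BYTES.
make_limited_cgroup() {
	mkdir -p "$1/$2" && echo "$3" > "$1/$2/memory.limit_in_bytes"
}

# Example (as root): an "apache" group capped at 80% of RAM, then move the
# current shell into it so that anything it starts is accounted there.
# make_limited_cgroup /sys/fs/cgroup/memory apache "$(percent_of_mem_bytes 80)"
# echo $$ > /sys/fs/cgroup/memory/apache/tasks
```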

Another OOM killer rewrite

Zizzle — Thu, 17 Jun 2010 16:24:48 +0000

In the age of Kernel Mode Setting, I was just thinking that the desktop distros should consider a OOM killer that pops up a little GUI that gives a chance to kill off an offender or two. Often the user has context and preferences that are quite dynamic.

Have the GUI time out and default to the existing OOM code so that the machine stays functional if the user is not around.

Or maybe something along the lines of GNOME's low disk space warning. A userspace monitor that either notices, or gets hints from the kernel that memory+swap is running low and pops up a window to allow clean shutdown of applications.
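
A rough sketch of the polling half of such a userspace monitor, assuming /proc/meminfo as the only data source. Counting MemFree + Cached + SwapFree as "what is left" is an assumption, and the helper names and threshold are made up for illustration:

```shell
#!/bin/bash
# free_plus_swap_kib [MEMINFO_FILE]  (hypothetical helper)
# Prints MemFree + Cached + SwapFree in KiB.
free_plus_swap_kib() {
	awk '/^(MemFree|Cached|SwapFree):/ { sum += $2 }
	     END { printf "%.0f\n", sum }' "${1:-/proc/meminfo}"
}

# memory_is_low THRESHOLD_KIB [MEMINFO_FILE]  (hypothetical helper)
# Succeeds (exit status 0) when less than THRESHOLD_KIB remains.
memory_is_low() {
	(( $(free_plus_swap_kib "$2") < $1 ))
}

# Example polling loop; the 100 MiB threshold and the notify-send call are
# placeholders for whatever the desktop actually provides:
# while sleep 10; do
# 	memory_is_low 102400 && notify-send "Memory is running low"
# done
```

Polling like this only covers the "notices" case; hints from the kernel would need a separate notification mechanism.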

At the very least it would be great if the desktop distros could display a pop up or something letting the user know that the OOM killer has run, what it selected and why.

I'm wondering how many apps wrongly get the blame for crashing because they have been selected by the OOM.

OOM, This toilet cleaner smells good

aigarius — Mon, 14 Jun 2010 21:13:15 +0000

As commented above, the OOM killer is not designed to fix the swap-thrashing problem you describe. The OOM killer is designed to fix the problem when your system is so active that you run out of *both* RAM and swap. Only then will the OOM killer be activated at all.

For your problem, the solution is to greatly reduce your swap size, reduce the 'swappiness' value of your kernel and also implement the "don't cache disc access by this process" feature and mark your overnight cron jobs with that bit.
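
A minimal sketch of the swappiness part only, assuming the /proc/sys/vm/swappiness knob and a 0 to 100 range. The helper name and the value 10 are illustrative assumptions, not a recommendation:

```shell
#!/bin/bash
# set_swappiness VALUE [FILE]  (hypothetical helper)
# Writes VALUE to the swappiness knob after a basic sanity check.
# FILE defaults to /proc/sys/vm/swappiness; writing there needs root.
set_swappiness() {
	local value=$1 file=${2:-/proc/sys/vm/swappiness}
	# Reject non-numbers first; 10# forces base 10 so "08" is not read as octal.
	if ! [[ $value =~ ^[0-9]+$ ]] || (( 10#$value > 100 )); then
		echo "swappiness must be an integer from 0 to 100" >&2
		return 1
	fi
	echo "$value" > "$file"
}

# Example (as root):
# set_swappiness 10
```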

Another OOM killer rewrite

cuboci — Mon, 14 Jun 2010 13:13:44 +0000

Highly amusing read :)

More memory - less OOM situations

alsuren — Mon, 14 Jun 2010 10:58:27 +0000

I've always blindly followed the old recommendation of having twice as much swap as RAM, so I don't think I've ever hit the OOM condition. Swappy-swappy-death is no fun though. Is there a way to make the OOM killer detect swappy-swappy-death and kick in then?

Maybe we should all just disable swap entirely, and then forkbomb our machines periodically to see how good the OOM killer really is.

Are there any unit tests for this?

Oddscurity — Sun, 13 Jun 2010 23:22:06 +0000

I'm thinking the unit to be tested suggested by GP is the aforementioned badness() function.

Are there any unit tests for this?

bronson — Sat, 12 Jun 2010 17:42:50 +0000

* outputs the PID to kill first.

Sounds good.

* takes as input the system state

Is that all? :) Still having a hard time picturing a discrete unit to test here.

More memory - less OOM situations

giraffedata — Fri, 11 Jun 2010 20:45:29 +0000

This would suggest that the reason people are happier in general is that margin between where swap gets used at all and where the system is thrashing is bigger, and the size of the one-time delay on waking up is big enough that people notice and respond.

It doesn't look to me like that margin is the difference. I think it's just the level at which swap gets used at all. In the old days, you operated normally in the aforementioned margin, so never knew you were close to the upper edge of it, and also had no separate reason to go shoot memory hogs manually. Today, as soon as you enter the margin, you're both bothered and warned of memory excesses, so you do the manual OOMK.

OOM, This toilet cleaner smells good

midg3t — Fri, 11 Jun 2010 15:35:19 +0000

This sounds like one toilet cleaner that I'll be happy to receive! I just look forward to an OOM killer that will do its job in less time than it takes me to log in to a swap-thrashing system, get frustrated and hit the reset button.

Kudos to Andrew Morton for ignoring "nacks" with no supporting comments.

Are there any unit tests for this?

walles — Fri, 11 Jun 2010 14:15:48 +0000

The unit that would be nice to have tests for would be whatever it is that:
* takes as input the system state
* outputs the PID to kill first.

Regards /Johan
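
To make the shape of such a test concrete, here is a toy sketch. The selector below is NOT the kernel's badness() heuristic; it is a made-up stand-in that picks the largest memory user among processes that are not exempt, and the input format "PID NAME RSS_KB ADJ" is likewise an assumption:

```shell
#!/bin/bash
# pick_victim  (hypothetical stand-in for the real selection code)
# Reads lines of "PID NAME RSS_KB ADJ" on stdin and prints the PID of the
# largest process whose ADJ is not -1000 (treated here as "never kill").
pick_victim() {
	awk '$4 != -1000 && (victim == "" || $3 > best) { best = $3; victim = $1 }
	     END { if (victim != "") print victim }'
}

# A bug report turned into a test case: with these processes, the choice
# must be the memory hog, not KDE and not the exempt database.
state='100 kde 300000 0
200 firefox 900000 0
300 postgres 1200000 -1000'
[ "$(echo "$state" | pick_victim)" = "200" ] && echo "ok: firefox selected"
```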

Are there any unit tests for this?

bronson — Thu, 10 Jun 2010 18:37:46 +0000

Not quite sure what you're picturing as the unit that would be tested in this case...?

More memory - less OOM situations

iabervon — Thu, 10 Jun 2010 18:17:44 +0000

Isn't that situation actually more like that you've got bad processes which have used up nearly all of RAM, but they're asleep, and other processes that use some memory have run overnight, and pushed parts of sleeping processes into swap? That is, you're noticing at a point in memory consumption when the total usage is more than RAM and less than swap, but the system's active set fits in RAM, so the only symptom is that things have to be swapped back in, and you notice and fix the problem. This would suggest that the reason people are happier in general is that margin between where swap gets used at all and where the system is thrashing is bigger, and the size of the one-time delay on waking up is big enough that people notice and respond.

More memory - less OOM situations

NAR — Thu, 10 Jun 2010 14:33:27 +0000

Actually yes. Usually it happens one morning when I turn off the screen lock and notice that it takes too long to change workspaces. CPU usage is practically 0 and the system has just started to swap, so it's not that hard to kill firefox and opera. I don't like to lose the shell histories, the 100+ opened files in vim, etc., so I try to avoid logging out, even though I have to do it every other month when the desktop environment has leaked too much memory.

Another OOM killer rewrite

sean.hunter — Thu, 10 Jun 2010 14:24:25 +0000

The way it's written it's trivial to wrap your postgresql startup in a shellscript that pokes /proc and thus makes the oom killer not whack it. This flexibility of local configuration is always going to be better than a bunch of in-kernel heuristics.
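
A minimal sketch of such a wrapper, assuming the oom_score_adj knob (range -1000 to 1000, where -1000 exempts the process); older kernels expose oom_adj instead. The helper name, the OOM_ADJ_FILE override and the postgres command line are assumptions for illustration:

```shell
#!/bin/bash
# run_with_oom_score_adj ADJ CMD [ARGS...]  (hypothetical helper)
# Sets the OOM adjustment of the current process, then replaces it with
# CMD, which keeps the setting. Call it from a wrapper script or a
# subshell, because exec does not return on success.
# OOM_ADJ_FILE is an override for testing; the default is the process's
# own /proc entry.
run_with_oom_score_adj() {
	local adj=$1
	shift
	echo "$adj" > "${OOM_ADJ_FILE:-/proc/self/oom_score_adj}" || return 1
	exec "$@"
}

# Example wrapper body (lowering the value needs root):
# run_with_oom_score_adj -1000 postgres -D /var/lib/postgresql/data
```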

More memory - less OOM situations

zlynx — Thu, 10 Jun 2010 13:38:31 +0000

Have you ever tried to operate a Linux system that is out of memory and swapping?

It isn't impossible but it can easily take 15 seconds and more between commands. It can take a minute to switch to a text console, another minute to log in, then more minutes to locate and kill the offending process.

Usually it's easier to reboot the machine.

More memory - less OOM situations

NAR — Thu, 10 Jun 2010 13:12:45 +0000

complaints about poor choices are less common than they once were

I think the main reason for this is that nowadays we have more memory in our desktops than the KDE/GNOME applications plus firefox use, and even when gnome-panel manages to allocate half a GB and the system starts swapping, the user has a good chance to kill the process manually instead of relying on the OOM killer.

Another OOM killer rewrite

ringerc — Thu, 10 Jun 2010 11:15:12 +0000

Naaargh! That implementation appears to still fail to consider shared memory, so it's *still* going to clobber poor innocent PostgreSQL backends instead of nailing the true culprits like Firefox.

Are there any unit tests for this?

walles — Thu, 10 Jun 2010 09:25:56 +0000

I don't know what kind of unit testing frameworks are in place in the kernel, but the OOMK's victim selection process sounds like something that should be *really* suitable for unit testing.

With unit testing in place bug reports could be converted into unit tests ("on a system with these processes using these amounts of memory, don't start with KDE") and be run on future generations of OOMKs as well.

What kind of unit testing frameworks *are* available in the kernel?

Another OOM killer rewrite

evgeny — Thu, 10 Jun 2010 07:43:26 +0000

Great article, Jon! Makes me almost forget that the Grumpy editor hasn't surfaced for a year ;-).