Delaying the OOM killer

By Jonathan Corbet
March 9, 2011
The out-of-memory (OOM) killer is charged with killing off processes in response to a severe memory shortage. It has been the source of considerable discussion and numerous rewrites over the years. Perhaps that is inevitable given its purpose; choosing the right process to kill at the right time is never going to be an easy thing to code. The extension of the OOM killer into control groups has added to its flexibility, but has also raised some interesting issues of its own.

Normally, the OOM killer is invoked when the system as a whole is catastrophically out of memory. In the control group context, the OOM killer comes into play when the memory usage by the processes within that group exceeds the configured maximum and attempts to reclaim memory from those processes have failed. An out-of-memory situation which is contained to a control group is bad for the processes involved, but it should not threaten the rest of the system. That allows for a little more flexibility in how out-of-memory situations are handled.

In particular, it is possible for user space to take over OOM-killer duties in the control group context. Each group has a control file called oom_control which can be used in a couple of interesting ways:

  • Writing "1" to that file will disable the OOM killer within that group. Should an out-of-memory situation come about, the processes in the affected group will simply block when attempting to allocate memory until the situation improves somehow.

  • Through the use of a special eventfd() file descriptor, a process can use the oom_control file to sign up for notifications of out-of-memory events (see Documentation/cgroups/memory.txt for the details on how that is done). That process will be informed whenever the control group runs out of memory; it can then respond to address the problem. A minimal sketch of that registration appears below.
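
As a rough illustration, here is a minimal sketch of how a user-space handler might register for those notifications using the mechanism described in memory.txt: create an eventfd, open the group's oom_control file, and write both file descriptors to the group's cgroup.event_control file. The cgroup mount point, group name, and "memory." prefix used here are assumptions about the local setup, and error handling is kept to the bare minimum.

    /* Minimal sketch: register for OOM notifications from one memory
     * control group.  Paths below are assumptions; adjust for the
     * local cgroup mount point and group name. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        int efd = eventfd(0, 0);
        int ofd = open("/dev/cgroup/memory/mygroup/memory.oom_control", O_RDONLY);
        int cfd = open("/dev/cgroup/memory/mygroup/cgroup.event_control", O_WRONLY);
        char buf[64];
        uint64_t count;

        if (efd < 0 || ofd < 0 || cfd < 0) {
            perror("setup");
            return 1;
        }

        /* "<eventfd fd> <oom_control fd>" asks the kernel to signal the
         * eventfd whenever this group hits an out-of-memory condition. */
        snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
        if (write(cfd, buf, strlen(buf)) < 0) {
            perror("cgroup.event_control");
            return 1;
        }

        /* Block until an OOM event occurs in the group, then react:
         * raise the limit, kill something, move processes elsewhere... */
        if (read(efd, &count, sizeof(count)) == sizeof(count))
            fprintf(stderr, "OOM event in mygroup; handle it here\n");
        return 0;
    }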

There are a number of ways that this user-space OOM killer can fix a memory issue that affects a control group. It could simply raise the limit for that group, for example. Alternatives include killing processes or moving some processes to a different control group. All told, it's a reasonably flexible way of allowing user space to take over the responsibility of recovering from out-of-memory disasters.

At Google, though, it seems that it's not quite flexible enough. As has been widely reported, Google does not have very many machines to work with, so the company has a tendency to cram large numbers of tasks onto each host. That has led to an interesting problem: what happens if the user-space OOM killer is, itself, so starved for memory that it is unable to respond to an out-of-memory condition? What happens, it turns out, is that things just come to an unpleasant halt.

Google operations is not overly fond of unpleasant halts, so an attempt has been made to find another solution. The outcome was a patch from David Rientjes adding another control file to the control group called oom_delay_millisecs. Like oom_control, it holds off the kernel's OOM killer in favor of a user-space alternative. The difference is that the administrator can provide a time limit for the kernel OOM killer's patience; if the out-of-memory situation persists after that much time, the kernel's OOM killer will step in and resolve the situation with as much prejudice as necessary.
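
As a hedged sketch of how that knob would be used, assuming the control file name as described here and millisecond units as the name suggests (the cgroup path is a placeholder, and the interface exists only in the proposed patch):

    /* Give a user-space OOM handler ten seconds to act before the
     * kernel's OOM killer steps in.  File name and path are assumptions
     * based on the patch described above. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/dev/cgroup/memory/mygroup/oom_delay_millisecs", "w");

        if (!f) {
            perror("oom_delay_millisecs");
            return 1;
        }
        fprintf(f, "%d\n", 10000);   /* 10000 ms = ten seconds of patience */
        fclose(f);
        return 0;
    }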

To David, this delay looks like a useful new feature for the memory control group mechanism. To Andrew Morton, instead, it looks like a kernel hack intended to work around user-space bugs, and he is not that thrilled by it. In Andrew's view, if user space has set itself up as the OOM handler for a control group, it needs to ensure that it is able to follow through. Adding the delay looks like a way to avoid that responsibility which could have long-term effects:

My issue with this patch is that it extends the userspace API. This means we're committed to maintaining that interface *and its behaviour* for evermore. But the oom-killer and memcg are both areas of intense development and the former has a habit of getting ripped out and rewritten. Committing ourselves to maintaining an extension to the userspace interface is a big thing, especially as that extension is somewhat tied to internal implementation details and is most definitely tied to short-term inadequacies in userspace and in the kernel implementation.

Andrew would rather see development effort put into fixing any kernel problems which might be preventing a user-space OOM killer from doing its job. David, though, doesn't see a way to work without this feature. If it doesn't get in, Google may have to carry it separately; he predicted, though, that other users will start asking for it as usage of the memory controller increases. As of this writing, that's where the discussion stands.



Blocking on allocation failure - WTF?

Posted Mar 10, 2011 8:27 UTC (Thu) by epa (subscriber, #39769) [Link]

Writing "1" to that file will disable the OOM killer within that group. Should an out-of-memory situation come about, the processes in the affected group will simply block when attempting to allocate memory until the situation improves somehow.
I understand some of the reasons for overcommitting memory when it's not known how much is really available or needed, but blocking on a definite out-of-memory seems just plain daft. If the kernel knows that no more memory is available why can't it pass that information on to user space?

Blocking on allocation failure - WTF?

Posted Mar 10, 2011 8:28 UTC (Thu) by epa (subscriber, #39769) [Link]

why can't it pass that information on to user space?
I should have said, why can't it pass the info back to the process that requested more memory? (rather than a third userspace OOM-killer process)

Blocking on allocation failure - WTF?

Posted Mar 10, 2011 11:06 UTC (Thu) by dgm (subscriber, #49227) [Link]

I think much the same. I see value in having a higher-level control over which processes get dumped, for instance in systems composed of many processes where some are more expendable than others.

But you can "easily" simulate this with conventional means: in precious programs, use a custom version of malloc() that, on error, sends a kill signal to some of the processes on the expendable list.

Or am I missing something?
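
A rough sketch of the wrapper idea from the comment above, for what it's worth; the PID file location and signal choice are assumptions, and, as the replies below point out, with overcommit enabled malloc() rarely fails, so this hook would rarely fire:

    /* Sketch: a malloc() wrapper for "precious" programs that, on
     * failure, signals an expendable process whose PID was recorded
     * earlier (the path is an assumption). */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    static pid_t read_expendable_pid(void)
    {
        FILE *f = fopen("/run/expendable.pid", "r");
        long pid = 0;

        if (f) {
            if (fscanf(f, "%ld", &pid) != 1)
                pid = 0;
            fclose(f);
        }
        return (pid_t)pid;
    }

    void *precious_malloc(size_t size)
    {
        void *p = malloc(size);

        if (!p) {
            pid_t victim = read_expendable_pid();

            if (victim > 0)
                kill(victim, SIGTERM);   /* ask an expendable process to exit */
            p = malloc(size);            /* one retry; may still fail */
        }
        return p;
    }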

Blocking on allocation failure - WTF?

Posted Mar 10, 2011 11:56 UTC (Thu) by alonz (subscriber, #815) [Link]

Yes, you are missing something.

What you miss is that in modern Linux, malloc() practically never fails. It only allocates virtual space, which is not yet backed by physical memory.

Memory is only actually allocated when the process first touches it—which can be arbitrarily late. (Plus there are many other kinds of memory allocation: breaking of COW pages created by fork(), allocation of page structures in the kernel, skbs, …)
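
The point is easy to demonstrate. The sketch below assumes a 64-bit system with overcommit enabled; the allocation size is arbitrary:

    /* With overcommit on, a large malloc() succeeds immediately; only
     * touching the pages consumes physical memory, and that is where an
     * out-of-memory condition would actually bite. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t size = (size_t)8 << 30;   /* 8GB of address space */
        char *p = malloc(size);

        if (!p) {
            /* Far more likely with vm.overcommit_memory=2 (no overcommit). */
            fprintf(stderr, "malloc() failed up front\n");
            return 1;
        }
        printf("malloc() of 8GB succeeded; little physical memory used yet\n");

        memset(p, 0, size);              /* faulting in pages uses real RAM/swap */
        printf("all pages touched\n");
        free(p);
        return 0;
    }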

Blocking on allocation failure - WTF?

Posted Mar 10, 2011 14:39 UTC (Thu) by epa (subscriber, #39769) [Link]

What you miss is that in modern Linux, malloc() practically never fails.
Right, but here is an instance where it does fail to allocate memory. It is known that the memory is not available right now (although of course it might become available at some point in the future). The malloc() API has provision for letting that be known to the application, by returning null. If the application wants to just hang until memory becomes available, that can easily be implemented in user-space (perhaps at the cost of a little busy-waiting or sleeping); but on the other hand if the application would like to find out when no more memory is available and do something, it's impossible to implement that on top of a malloc() interface that just blocks indefinitely.

Blocking on allocation failure - WTF?

Posted Mar 11, 2011 14:22 UTC (Fri) by droundy (subscriber, #4559) [Link]

The point is that malloc never does block indefinitely with overcommit enabled. It (essentially) always succeeds, and what blocks indefinitely is when you try to write to that shiny new virtual address space that was provided to you by malloc. At this point, the kernel realizes that there aren't any pages to provide you with, but it's too late for malloc to do anything to help you out, since malloc already succeeded.

Blocking on allocation failure - WTF?

Posted Mar 10, 2011 16:15 UTC (Thu) by dgm (subscriber, #49227) [Link]

I see. So much complexity... but is it worth it? Is anyone really using overcommit?

In a production environment it has to be a maintenance nightmare to have random tasks killed without warning. I would disable it right away, and the same on desktops. I just looked, and Ubuntu disables it by default.

Memory overcommit

Posted Mar 10, 2011 16:48 UTC (Thu) by rvfh (subscriber, #31018) [Link]

It's not just worth it, it's practically impossible to live without it. If you needed to have the total amount of real memory you malloc, most systems would need twice as much RAM.

Think about memory pools for example...

Memory overcommit

Posted Mar 10, 2011 21:49 UTC (Thu) by epa (subscriber, #39769) [Link]

If you needed to have the total amount of real memory you malloc, most systems would need twice as much RAM.
Or just twice as much swap space - which is not an issue on a typical desktop or laptop hard disk these days. (On an SSD or mobile phone it may be different.)

If you have a huge swapfile but no overcommit, and if some applications allocate lots of memory and then unexpectedly start using all the memory they asked for, then your system will start swapping and running slowly. If you just overcommit memory, then when apps start using all the memory they allocated the system will become unstable and processes will be killed without warning by the OOM killer. It's clear which is preferable.

Memory overcommit

Posted Mar 11, 2011 0:37 UTC (Fri) by droundy (subscriber, #4559) [Link]

If you just overcommit memory, then when apps start using all the memory they allocated the system will become unstable and processes will be killed without warning by the OOM killer. It's clear which is preferable.

Surely you've run into the situation where you've been unable to log into a machine because it's swapping like crazy, and are thus unable to kill the offending process? Is that really preferable to being able to go in and fix things immediately? Of course, things are easier when you've got a desktop and are already logged in, but even then I've seen situations where just switching to a terminal took many, many minutes, never mind opening a new one.

The large majority of OOM-killer experiences I've had have been situations where there was a memory leak involved. In such cases, the OOM killer is usually quite good at identifying the culprit and killing it. If you add enough swap, then the system freezes up indefinitely (or until you're able to get a killall typed into a terminal). Not a huge improvement in my book. In any case, it's not clear which is preferable.

Memory overcommit

Posted Mar 16, 2011 12:17 UTC (Wed) by epa (subscriber, #39769) [Link]

You're right, it is sometimes preferable to have applications be killed rather than being unable to log into the machine. But even here the OOM killer seems like a useful sticking plaster rather than a fix for the real problem. It would be better to have 5% of physical memory reserved for root (or for an interactive 'task manager' that lets you kill misbehaving apps), in the same way that 5% of disk space was traditionally reserved.

If the I/O scheduler were a bit smarter, then the swapping activity of processes would count against their I/O usage, so a single bloated Firefox process would not be able to monopolize the disk to the exclusion of everything else. Similarly, there could be more fairness in physical RAM allocation, so it wouldn't be possible for one app to consume all the physical memory, pushing everything else into swap; it would be limited to, say, 80%. (This is reasonable for desktop systems; of course, for servers or number-crunching you don't care so much about interactive performance, so you'd increase that figure.)

Memory overcommit

Posted Mar 16, 2011 17:43 UTC (Wed) by nix (subscriber, #2304) [Link]

I'd say that you want to reserve that 5% (or whatever) for root-owned apps *that belong to a terminal*. This stops a maddened root-owned daemon from bloating up and leaving you unable to log in again.

Memory overcommit

Posted Mar 11, 2011 7:10 UTC (Fri) by rvfh (subscriber, #31018) [Link]

Why use loads of swap space when I have 3GB of RAM to run a bunch of services?

A quick look at top:
Mem: 3095608k total, 2405948k used, 689660k free, 292560k buffers
Swap: 722920k total, 43508k used, 679412k free, 1034276k cached

So there are 689 MB of unused memory! Not to mention 1 GB+ of cached stuff for the system to grab in case of need.

Now let's see the memory usage (top two):
238m 126m 26m /usr/bin/systemsettings
294m 95m 23m /usr/lib/firefox-3.6.14/firefox-bin

So the two main memory users (systemsettings!?!) have committed more than half a gig of RAM, but use less than a quarter (I know these numbers are not perfect, but the idea remains).

And you want me to get rid of overcommit???

Memory overcommit

Posted Mar 16, 2011 12:19 UTC (Wed) by epa (subscriber, #39769) [Link]

Isn't that the point? You have three gigabytes of RAM, more than enough to give every application all it needs and do so in a guaranteed way - not 'you can probably have this but you might be randomly killed later on depending on what else happens'.

Blocking on allocation failure - WTF?

Posted Mar 10, 2011 21:50 UTC (Thu) by nix (subscriber, #2304) [Link]

My firewall is a swapless embedded system with 512MB of RAM. Do I want overcommit turned on? I thought for perhaps as long as a microsecond before deciding 'hell, yes'.

Blocking on allocation failure - WTF?

Posted Mar 11, 2011 1:05 UTC (Fri) by dgm (subscriber, #49227) [Link]

I hate to reply to myself, but I just got it wrong: overcommit in Ubuntu uses the kernel default, that is, it does overcommit. As an experiment I turned it off (setting /proc/sys/vm/overcommit_memory to 2), and after that I couldn't start a new instance of Chromium on a not-very-loaded system (a couple of terminals and Nautilus).

So apparently it's not only useful but critical.

Blocking on allocation failure - WTF?

Posted Mar 11, 2011 7:19 UTC (Fri) by Darkmere (subscriber, #53695) [Link]

Blame Chrome there. They allocate a fair chunk of RAM to start with for their JavaScript memory pool (that wants to be contiguous, IIRC).

Basically, Chrome is designed around overcommit. As are a few other apps.

Blocking on allocation failure - WTF?

Posted Mar 11, 2011 23:57 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

So apparently it's not only useful but critical.

Isn't the problem just that you adjusted /proc/sys/vm/overcommit_memory without making the corresponding adjustment in the amount of swap space?

Blocking on allocation failure - WTF?

Posted Mar 10, 2011 14:17 UTC (Thu) by Tuna-Fish (subscriber, #61751) [Link]

How would you pass that information? Remember that memory is not allocated on malloc, but when you access a page. Any memory access can fail due to lack of memory -- how do you handle that in userspace?

Blocking on allocation failure - WTF?

Posted Mar 10, 2011 14:42 UTC (Thu) by epa (subscriber, #39769) [Link]

How would you pass that information?
By returning null, as is the documented interface for malloc().
Any memory access can fail due to lack of memory -- how do you handle that in userspace?
This is true, you can get errors on accessing memory that was previously 'allocated'. But that's not a reason for doing the wrong thing in this specific case. If it is quite certain that the memory isn't available, why not return that to the application using the documented interface? If the app wants to just sleep waiting for memory to become available, it can do so at its own choice.

Blocking on allocation failure - WTF?

Posted Mar 12, 2011 0:15 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

This is true, you can get errors on accessing memory that was previously 'allocated'. But that's not a reason for doing the wrong thing in this specific case.

Which specific case are you referring to? The article is about the OOM killer, which comes into play when a process accesses memory (for example, executes a STORE instruction), not when a process does malloc(). In that case, how would you have the kernel notify the application program?

And even if we're talking about a case where the user space program could be told that the system is out of memory and given the option to do something other than block, that wouldn't be acceptable, because the user space programs have already been written. We want to do the best possible thing given existing programs. And even if we were talking about programs not yet written, there's something to be said for freeing the coder from worrying about these tedious, extremely rare situations.

Blocking on allocation failure - WTF?

Posted Mar 16, 2011 12:22 UTC (Wed) by epa (subscriber, #39769) [Link]

Which specific case are you referring to?
I was referring to this from the article (my italics):
Should an out-of-memory situation come about, the processes in the affected group will simply block when attempting to allocate memory until the situation improves somehow.
Rather than blocking indefinitely on malloc(), it would make more sense to just return null when there is no new memory available. The application can then decide whether it wants to keep retrying indefinitely, report the error to the user, just die in a big cloud of smoke, or do something else like freeing some of its own data (e.g. the JVM could do a garbage collection pass). If malloc() just blocks forever, the app doesn't have that choice.

Blocking on allocation failure - WTF?

Posted Mar 16, 2011 15:22 UTC (Wed) by giraffedata (subscriber, #1954) [Link]

Should an out-of-memory situation come about, the processes in the affected group will simply block when attempting to allocate memory until the situation improves somehow.
Right. The process here does not block in malloc(). It blocks typically on a store instruction, but also on any of various system calls, such as open(). A malloc() at this time would succeed.

The process is attempting to allocate memory, as it is the process that is doing the system call or triggering the page fault in which kernel code attempts to allocate physical memory. malloc(), in contrast, doesn't, from the kernel's point of view, allocate memory — just addresses for it.

The article probably should have made a clearer distinction between memory as seen by user space code and memory as seen by the kernel.

Blocking on allocation failure - WTF?

Posted Mar 17, 2011 11:51 UTC (Thu) by epa (subscriber, #39769) [Link]

Thanks for the explanation.

Blocking on allocation failure - WTF?

Posted Mar 12, 2011 0:22 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

I understand some of the reasons for overcommitting memory when it's not known how much is really available or needed, but blocking on a definite out-of-memory seems just plain daft. If the kernel knows that no more memory is available why can't it pass that information on to user space?

The idea is that the out of memory situation is only temporary. The OOM killer or one of its user space henchmen will make more memory available eventually, probably by killing some process the administrator didn't say should be immune.

Delaying the OOM killer

Posted Mar 10, 2011 12:41 UTC (Thu) by misiu_mp (guest, #41936) [Link]

"To Andrew Morton, instead, it looks like a kernel hack intended to work around user-space bugs, and he is not that thrilled by it."

I tend to agree. Wouldn't it be possible for a process that implements an OOM killer to physically allocate the needed memory on startup? A proper out-of-memory handler should be limited to some known (conservative) memory requirement.

The simplest way is to run malloc() followed by a memset() (on systems with overcommit). That would guarantee that the pages are physically allocated, or terminate/OOM at an early stage.
Now if such a handler were to use a kernel call or a system library routine that itself requires memory (which is out of the handler programmer's control), that call should simply not be made, or the call should be fixed in the kernel/library, possibly also by reserving memory for it. This behavior should also be documented; free(), for example, should never fail.
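
A sketch of that startup-time reservation, assuming an arbitrary worst-case figure and adding the mlockall() call suggested in the reply below, so the handler's pages cannot be reclaimed or swapped out later:

    /* Pre-fault and pin the handler's working memory at startup, so it
     * either really has its memory or fails early, long before any OOM
     * event it is supposed to handle.  The size is illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define RESERVE_BYTES (16 * 1024 * 1024)   /* assumed worst-case need */

    static char *reserve;

    int main(void)
    {
        reserve = malloc(RESERVE_BYTES);
        if (!reserve) {
            fprintf(stderr, "cannot reserve handler memory, giving up early\n");
            return 1;
        }

        /* Touch every page so the allocation is physically backed even
         * with overcommit enabled. */
        memset(reserve, 0, RESERVE_BYTES);

        /* Pin current and future pages so the handler never has to
         * allocate or swap while resolving an OOM situation. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* ... register for OOM notifications and handle them here ... */
        return 0;
    }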

Delaying the OOM killer

Posted Mar 10, 2011 12:52 UTC (Thu) by johill (subscriber, #25196) [Link]

I might actually need mlockall(), but even that seems feasible.

Delaying the OOM killer

Posted Mar 10, 2011 13:07 UTC (Thu) by nix (subscriber, #2304) [Link]

It seems very dangerous to have anything whose purpose is to respond to out-of-memory conditions and *not* lock it into memory. (What we really need is a way to say 'make this process less overcommitted'. If you do that and then do all the work in a thread, or do a bit of mprotect() work, you can even avoid stack expansions causing OOM.)

Delaying the OOM killer

Posted Mar 10, 2011 15:18 UTC (Thu) by nowster (subscriber, #67) [Link]

> As has been widely reported, Google does not have very many machines to work with...

Irony not marked?

Delaying the OOM killer

Posted Mar 10, 2011 21:49 UTC (Thu) by nix (subscriber, #2304) [Link]

It's the trademark dry Corbet wit.

Delaying the OOM killer

Posted Mar 11, 2011 17:53 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246) [Link]

I'm sure it must be rough for Google, with their rack of aging 386SX's populated almost entirely with SIMMs ganked from retired LaserJets from the local Uni.

;-)

Delaying the OOM killer

Posted Mar 12, 2011 18:15 UTC (Sat) by Per_Bothner (subscriber, #7375) [Link]

It's the trademark dry Corbet wit.

Imagine 100 years from now somebody updating the HyperPedia article on Google, to note that Google ran on a small number of machines, and using this sentence as a reference.

History could be re-written because of careless sarcasm.

Delaying the OOM killer

Posted Mar 13, 2011 4:23 UTC (Sun) by dmag (subscriber, #17775) [Link]

> Imagine 100 years from now [..] using this sentence as a reference.

Riiight. A future in which LWN is the only known text of this time? And just this page so future historians don't notice his penchant for witty comments like this?

Besides, Google hasn't even hit 10M machines yet, so I think Corbet is spot on.

Delaying the OOM killer

Posted Mar 13, 2011 5:39 UTC (Sun) by neilbrown (subscriber, #359) [Link]

Agreed.

Further, it is widely reported that Google go to some effort to squeeze as much work as they possibly can out of the computers that they do have, which is the real background to this article.

So it certainly appears that they do not have spare computing resources... but is that the same as "not many computers"? It's hard to say. Many is a relative term:

I do have many computers at home - 7 or 8.
Google doesn't have many computers in their data centers - less than 10 million.

Delaying the OOM killer

Posted Mar 14, 2011 13:14 UTC (Mon) by alextingle (guest, #20593) [Link]

*fewer* than 10 million.

;-)

Delaying the OOM killer

Posted Mar 14, 2011 13:43 UTC (Mon) by nix (subscriber, #2304) [Link]

The future historians would also have to be unaware of the existence of Google, its size, and probably also of the existence of sarcasm. Given that we can spot sarcasm in the works of Chaucer and in Beowulf, I'd say that these future historians are remarkably implausible.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds