
Toward reliable user-space OOM handling

By Jonathan Corbet
June 5, 2013
A visit from the kernel's out-of-memory (OOM) killer is usually about as welcome as a surprise encounter with the tax collector. The OOM killer is called in when the system runs out of memory and cannot progress without killing off one or more processes; it is the embodiment of a frequently-changing set of heuristics describing which processes can be killed for maximum memory-freeing effect and minimal damage to the system as a whole. One would not think that this would be a job that is amenable to handling in user space, but there are some users who try to do exactly that, with some success. That said, user-space OOM handling is not as safe as some users would like, but there is not much consensus on how to make it more robust.

User-space OOM handling

The heaviest user of user-space OOM handling, perhaps, is Google. Due to the company's desire to get the most out of its hardware, Google's internal users tend to be packed tightly into their servers. Memory control groups (memcgs) are used to keep those users from stepping on each other's toes. Like the system as a whole, a memcg can go into an OOM condition, and the kernel responds in the same way: the OOM killer wakes up and starts killing processes in the affected group. But, since an OOM situation in a memcg does not threaten the stability of the system as a whole, the kernel allows a bit of flexibility in how those situations are handled. Memcg-level OOM killing can be disabled altogether, and there is a mechanism by which a user-space process can request notification when a memcg hits the OOM wall.
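
Requesting such a notification with the current (cgroup v1) interface is done through an eventfd, roughly as in the sketch below; the mount point and group name are examples only. Disabling the group's kernel OOM killer is a matter of writing "1" to the same memory.oom_control file.

    /* Sketch: register for OOM notifications on a cgroup-v1 memcg.
     * The mount point and group name below are examples only. */
    #include <sys/eventfd.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define MEMCG "/sys/fs/cgroup/memory/mygroup"

    int main(void)
    {
        int oom_fd = open(MEMCG "/memory.oom_control", O_RDONLY);
        int ctl_fd = open(MEMCG "/cgroup.event_control", O_WRONLY);
        int efd = eventfd(0, 0);
        char buf[64];
        uint64_t count;

        if (oom_fd < 0 || ctl_fd < 0 || efd < 0) {
            perror("setup");
            return 1;
        }

        /* "<eventfd fd> <memory.oom_control fd>" asks for OOM events */
        snprintf(buf, sizeof(buf), "%d %d", efd, oom_fd);
        if (write(ctl_fd, buf, strlen(buf)) < 0) {
            perror("cgroup.event_control");
            return 1;
        }

        for (;;) {
            if (read(efd, &count, sizeof(count)) != sizeof(count))
                break;
            printf("memcg is out of memory; handle it here\n");
        }
        return 0;
    }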

Said notification mechanism is designed around the needs of a global, presumably privileged process that manages a bunch of memcgs on the system; that process can respond by raising memory limits, moving processes to different groups, or doing some targeted process killing of its own. But Google's use case turns out to be a little different: each internal Google user is given the ability (and responsibility) to handle OOM conditions within that user's groups. This approach can work, but there are a couple of traps that make it less reliable than some might like.

One of those is that, since users are doing their own OOM handling, the OOM handler process itself will be running within the affected memcg and will be subject to the same memory allocation constraints. So if the handler needs to allocate memory while responding to an OOM problem, it will block and be put on the list of processes waiting for the OOM situation to be resolved; this is, essentially, a deadlocking of the entire memcg. One can try to avoid this problem by locking pages into memory and such, but, in the end, it is quite hard to write a user-space program that is guaranteed not to cause memory allocations in the kernel. Simply reading a /proc file to get a handle on the situation can be enough to bring things to a halt.
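
The usual precautions look something like the sketch below: lock the handler's pages into memory and preallocate (and touch) whatever working space it expects to need before trouble starts. As noted above, though, this reduces the risk rather than eliminating it; the scratch-buffer size here is purely illustrative.

    /* Sketch: startup-time precautions for a user-space OOM handler.
     * They reduce, but cannot eliminate, the chance that the handler
     * itself will need memory while an OOM condition is in progress. */
    #include <sys/mman.h>
    #include <stdlib.h>
    #include <string.h>

    #define SCRATCH_SIZE (1 << 20)    /* illustrative 1MB of working space */

    static char *scratch;

    int oom_handler_init(void)
    {
        /* Lock current and future pages so they cannot be reclaimed. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            return -1;

        /* Preallocate and touch a scratch buffer so its pages are
         * instantiated now rather than at OOM time. */
        scratch = malloc(SCRATCH_SIZE);
        if (scratch == NULL)
            return -1;
        memset(scratch, 0, SCRATCH_SIZE);
        return 0;
    }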

The other problem is that the process whose allocation puts the memcg into an OOM condition in the first place may be running fairly deeply within the kernel and may hold any number of locks when it is made to wait. The mmap_sem semaphore seems to be especially problematic, since it is often held in situations where memory is being allocated — page fault handling, for example. If the OOM handler process needs to do anything that might acquire any of the same locks, it will block waiting for exactly the wrong process, once again creating a deadlock.

The end result is that user-space OOM killing is not 100% reliable and, arguably, can never be. As far as Google is concerned, somewhat unreliable OOM handling is acceptable, but deadlocks when OOM killing goes bad are not. So, back in 2011, David Rientjes posted a patch establishing a user-configurable OOM killer delay. With that delay set, the (kernel) OOM killer will wait for the specified time for an OOM situation to be resolved by the user-space handler before it steps in and starts killing off processes. This mechanism gives the user-space handler a window within which it can try to work things out; should it deadlock or otherwise fail to get the job done in time, the kernel will take over.

David's patch was not merged at that time; the general sentiment seemed to be that it was just a workaround for user-space bugs that would be better fixed at the source. At the time, David said that Google would carry the patch internally if need be, but that he thought others would want the same functionality as the use of memcgs increased. More than two years later, he is trying again, but the response is not necessarily any friendlier this time around.

Alternatives to delays

Some developers responded that running the OOM handler within the control group it manages is a case of "don't do that," but, once David explained that users are doing their own OOM handling, they seemed to back down a bit on that one. There does still seem to be a sentiment that the OOM handler should be locked into memory and should avoid performing memory allocations. In particular, OOM time seems a bit late to be reading /proc files to get a picture of which processes are running in the system. The alternative, though, is to trace process creation in each memcg, which has performance issues of its own.

Some constructive thoughts came from Johannes Weiner, who had a couple of ideas for improving the current situation. One of those was a patch intended to solve the problem of processes waiting for OOM resolution while holding an arbitrary set of locks. This patch makes two changes, the first of which comes into play when a problematic allocation is the direct result of a system call. In this case, the allocating process will not be placed in the OOM wait queue at all; instead, the system call will simply fail with an ENOMEM error. This solves most of the problem, but at a cost: system calls that might once have worked will start returning an error code that applications might not be expecting. That could cause strange behavior, and, given that the OOM situation is rare, such behavior could be hard to uncover with testing.

The other part of the patch changes the page fault path. In this case, just failing with ENOMEM is not really an option; that would result in the death of the faulting process. So the page fault code is changed to make a note of the fact that it hit an OOM situation and return; once the call stack has been unwound and any locks are released, it will wait for resolution of the OOM problem. With these changes in place, most (or all) of the lock-related deadlock problems should hopefully go away.

That doesn't solve the other problem, though: if the OOM handler itself tries to allocate memory, it will be put on the waiting list with everybody else and the memcg will still deadlock. To address this issue, Johannes suggested that the user-space OOM handler could more formally declare its role to the kernel. Then, when a process runs into an OOM problem, the kernel can check if it's the OOM handler process; in that case, the kernel OOM handler would be invoked immediately to deal with the situation. The end result in this case would be the same as with the timeout, but it would happen immediately, with no need to wait.

Michal Hocko favors Johannes's changes, but has an additional suggestion: implement a global watchdog process. This process would receive OOM notifications at the same time the user's handler does; it would then start a timer and wait for the OOM situation to be resolved. If time runs out, the watchdog would kill the user's handler and re-enable kernel-provided OOM handling in the affected memcg. In his view, the problem can be handled in user space, so that's where the solution should be.
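
A rough sketch of such a watchdog might look like the following; the grace period is arbitrary, and the check relies on the "under_oom" field reported by the cgroup v1 memory.oom_control file (the file descriptors are assumed to have been set up as in the earlier example, with memory.oom_control opened read-write).

    /* Sketch of such a watchdog: wait for an OOM notification, give the
     * user's handler a grace period, then check the group's "under_oom"
     * state; if the situation persists, kill the handler and re-enable
     * the kernel's OOM killer for the group. */
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define OOM_GRACE_SECONDS 10      /* arbitrary grace period */

    static int still_under_oom(int oom_control_fd)
    {
        char buf[128];
        ssize_t n = pread(oom_control_fd, buf, sizeof(buf) - 1, 0);

        if (n <= 0)
            return 0;
        buf[n] = '\0';
        return strstr(buf, "under_oom 1") != NULL;
    }

    void watchdog_loop(int efd, int oom_control_fd, pid_t handler_pid)
    {
        uint64_t count;

        for (;;) {
            /* Blocks until the memcg reports an OOM condition. */
            if (read(efd, &count, sizeof(count)) != sizeof(count))
                break;

            sleep(OOM_GRACE_SECONDS);

            if (still_under_oom(oom_control_fd)) {
                /* The handler failed to resolve things in time: kill it
                 * and clear oom_kill_disable so the kernel takes over. */
                kill(handler_pid, SIGKILL);
                write(oom_control_fd, "0", 1);
            }
        }
    }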

With some combination of these changes, it is possible that the problems with user-space OOM-handler deadlocks will be solved. In that case, perhaps, Google's delay mechanism will no longer be needed. Of course, that will not be the end of the OOM-handling discussion; as far as your editor can tell, that particular debate is endless.



Toward reliable user-space OOM handling

Posted Jun 6, 2013 3:35 UTC (Thu) by butlerm (subscriber, #13312) [Link]

Why can't the kernel notify user space OOM handlers when a certain threshold is reached rather than when it has actually run out of memory? The latter sounds far too late for any sane design. You naturally want to know about the problem while you have some reasonable headroom to do something about it.

Toward reliable user-space OOM handling

Posted Jun 6, 2013 7:44 UTC (Thu) by ibukanov (subscriber, #3942) [Link]

If so many problems come from an OOM handler having to allocate memory, why not reserve some memory for it in advance? It worked nicely 15 years ago in a Java application on a tight memory budget. The application preallocated a buffer that was released on OutOfMemoryException, which also put the whole application into release-memory mode. When the memory situation recovered, the buffer was taken back.
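
In C, that trick might look something like the sketch below; the reserve size and the release-memory-mode hook are invented for illustration.

    /* Sketch of the "emergency reserve" pattern: preallocate a cushion
     * at startup and give it back when an allocation fails, so the
     * application has room to shed load gracefully. */
    #include <stdlib.h>

    #define RESERVE_SIZE (4 << 20)    /* illustrative 4MB cushion */

    static void *reserve;

    static void enter_release_memory_mode(void)
    {
        /* Application-specific: drop caches, stop accepting work, ... */
    }

    void reserve_init(void)
    {
        reserve = malloc(RESERVE_SIZE);
    }

    void *careful_alloc(size_t size)
    {
        void *p = malloc(size);

        if (p == NULL && reserve != NULL) {
            /* Out of memory: release the cushion, switch the
             * application into release-memory mode, then retry. */
            free(reserve);
            reserve = NULL;
            enter_release_memory_mode();
            p = malloc(size);
        }
        return p;
    }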

Toward reliable user-space OOM handling

Posted Jun 6, 2013 14:13 UTC (Thu) by nix (subscriber, #2304) [Link]

What's more, there's already similar code for 'emergency memory pools' to handle swap-over-NFS (which has similar problems with allocations of potentially unbounded size at times of high memory pressure). It seems like something could be done with those, perhaps allowing kernel allocations on behalf of a userspace OOM killer to dip into the emergency pool. However, that sort of infectious behaviour has a tendency to be rather invasive to implement...

Toward reliable user-space OOM handling

Posted Jun 6, 2013 7:59 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Seems like a nice case for hard vs. soft limits.

I'm working with cluster computing and have exactly the same problem, except that I also have to support outdated garbage like RHEL6. So modifying the kernel was not really a good solution for me.

What I've done is to allow the memcgs to use some swap (about 10G each) and dial swappiness way back, so it's not actually used until main RAM is exhausted. I also have a privileged watchdog process that checks all the memcgs for their current RAM usage and, if one comes close to its memcg limit, notifies the OOM handler process within that memcg.

Swap allows individual memcgs to temporarily exceed their limit while the OOM handler does its job. It's a bit racy, because processes can fill the swap up faster than the OOM handler can kill them, but since swap is so slow it doesn't really pose a problem in practice. Also, the central watchdog monitors swap usage, and if it spikes too much after the OOM notification the whole memcg is summarily killed.

Toward reliable user-space OOM handling

Posted Jun 6, 2013 13:44 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

Why not just mark candidates for OOM takedown in advance? If the OOM killer takes down all marked processes and memory is still short, it can fall back to its usual tricks.

Toward reliable user-space OOM handling

Posted Jun 6, 2013 18:33 UTC (Thu) by alankila (subscriber, #47141) [Link]

This OOM killer thing is nuts. Maybe we should never overcommit memory and just fail allocations that can't be backed at allocation time.

Toward reliable user-space OOM handling

Posted Jun 6, 2013 18:52 UTC (Thu) by tstover (subscriber, #56283) [Link]

Agreed. Setting /proc/sys/vm/overcommit_memory to 2 will supposedly go back to the traditional malloc() == NULL behavior. I have to ask the obvious questions: What is wrong with swap? Why is a user-space OOM killer better than just not letting user-space programs overcommit, and actually doing memory management? ...

Oh! I see .... java ... sad sad sad

How much do you think this is really about the android platform - ie no swap, java, etc?

Toward reliable user-space OOM handling

Posted Jun 7, 2013 4:22 UTC (Fri) by Ben_P (subscriber, #74247) [Link]

In the standard Java library, you need to explicitly request mmap'd memory that isn't mlock()'d (MappedByteBuffer). Otherwise, the JVM will be moving chunks of your heap around during compactions and such. Perhaps it's difficult to maintain reasonable performance across all the JVM platforms if you allow parts of the heap to get paged out?

On a related note, until the API changes or the Java spec changes to 64 bit indexed arrays you'll be unable to use mmap64() from that class.

Toward reliable user-space OOM handling

Posted Jun 9, 2013 0:13 UTC (Sun) by nix (subscriber, #2304) [Link]

The problem is not swap. The problem is fork(). On Solaris I used to routinely have problems running tiny git instances under my Emacs because the Emacs (VSZ 860Mb) wanted to fork() and exec() a 15Mb git instance, and Solaris stopped it, because there was only 700Mb swap free, and you never know but that Emacs might want to COW *all* its pages after that fork()! It never ever does (and almost no programs ever do), but the no-overcommit policy stopped me doing useful work nonetheless, just in case.

No-overcommit sucks.

(If you want proper no-overcommit you'd better also think about interactions with mmap(), and about what to do with stack allocations that run out of memory. You can't return NULL for *those*... so the program has to be ready to be OOM-killed on any function call or any new basic block in any case. So I don't see what no-overcommit gives you, other than the really annoying behaviour described above. It certainly doesn't buy you safety, just an insane proliferation of NULL checks and error paths that never get tested. Of course I write them anyway, but I know they'll never be executed so they nearly all just exit(), faking an OOM killer in case some idiot turned it off...)

Toward reliable user-space OOM handling

Posted Jun 9, 2013 12:05 UTC (Sun) by cortana (subscriber, #24596) [Link]

Forgive me if the answer is obvious, but why didn't Emacs (and other programs) use the vfork system call?

Toward reliable user-space OOM handling

Posted Jun 9, 2013 12:27 UTC (Sun) by mpr22 (subscriber, #60784) [Link]

Because generally when emacs forks, it's likely to want to capture the output and/or control the input of whatever program the child execs. This is nearly painless with fork(), and a hideous mess with vfork().

Toward reliable user-space OOM handling

Posted Jun 18, 2013 12:16 UTC (Tue) by nix (subscriber, #2304) [Link]

I suppose it could in theory have used posix_spawn*(), which were added to make no-MMU POSIX implementations happy (which must implement fork() via immediate copying) and are in themselves an indictment of the non-fork()/exec() model, with a huge proliferation of functions for every possible thing that you might want to do between fork() and exec(). Except of course that it's not every possible thing, it's just a few things, and if you want to add a new thing you need to change the libc implementation (at least; quite possibly the kernel implementation on some platforms) and jam it through the Austin Group and wait for years, while in the fork()/exec() model it's just a few lines of code and a few minutes' work.
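
For illustration, capturing a child's output through posix_spawn() looks roughly like the sketch below; "git --version" is just an example command, and error handling is mostly omitted.

    /* Sketch: run a child and capture its stdout using posix_spawn().
     * Each thing one would do between fork() and exec() needs its own
     * posix_spawn_file_actions_*() call. */
    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    int main(void)
    {
        char *argv[] = { "git", "--version", NULL };
        posix_spawn_file_actions_t fa;
        int pipefd[2];
        pid_t pid;
        char buf[256];
        ssize_t n;

        if (pipe(pipefd) != 0)
            return 1;

        posix_spawn_file_actions_init(&fa);
        posix_spawn_file_actions_adddup2(&fa, pipefd[1], STDOUT_FILENO);
        posix_spawn_file_actions_addclose(&fa, pipefd[0]);
        posix_spawn_file_actions_addclose(&fa, pipefd[1]);

        if (posix_spawnp(&pid, "git", &fa, NULL, argv, environ) != 0)
            return 1;
        close(pipefd[1]);

        while ((n = read(pipefd[0], buf, sizeof(buf))) > 0)
            fwrite(buf, 1, n, stdout);
        waitpid(pid, NULL, 0);
        posix_spawn_file_actions_destroy(&fa);
        return 0;
    }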

fork()/exec() is so clearly the Right Thing in this respect that I can't imagine why anyone ever argues against it. But people still do! (Though not here. This has been your out-of-place rant for the day.)

Toward reliable user-space OOM handling

Posted Jun 10, 2013 15:40 UTC (Mon) by alankila (subscriber, #47141) [Link]

I don't think you have any idea how java works.

Toward reliable user-space OOM handling

Posted Jun 6, 2013 20:41 UTC (Thu) by khim (subscriber, #9252) [Link]

Overcommit and OOM killer are totally orthogonal issues. You can exhaust memory on a system without overcommit, too.

Toward reliable user-space OOM handling

Posted Jun 8, 2013 3:32 UTC (Sat) by tstover (subscriber, #56283) [Link]

yes, but the reason those issues overlap is as follows:

-Without overcommitted memory, userspace programs know when they are out of memory with malloc() == NULL, or say mmap() failure.

-With overcommitted memory, userspace does not know about the condition until after memory is already exhausted, thus the need for a userspace OOM killer scheme.

My point is that either way is still user-space memory management, so why not use the less complex model? My guess is that higher-level userspace can't (i.e. Java).

Toward reliable user-space OOM handling

Posted Jun 8, 2013 4:04 UTC (Sat) by mjg59 (subscriber, #23239) [Link]

It's massively easier to deal with reducing your existing memory consumption than it is to deal with introducing an error path for every single malloc() you make.

Toward reliable user-space OOM handling

Posted Jun 8, 2013 10:13 UTC (Sat) by smcv (subscriber, #53363) [Link]

Indeed. See also <http://blog.ometer.com/2008/02/04/out-of-memory-handling-...> and <http://developer.pardus.org.tr/people/ozan/blog/?p=14>.

I'm sure libdbus still has bugs in OOM situations (that are, in practice, never reached), despite our best efforts. I certainly fixed several, years after Havoc's blog post.

Toward reliable user-space OOM handling

Posted Jun 8, 2013 20:34 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Also, the OOM killer does a better job at picking a process to kill. malloc()==NULL just hits whoever is the first to ask for memory after it has been exhausted.

Another approach is rlimits. It is traditional in Linux to give every process unlimited virtual address space, but you can limit it per-process. I massively overcommit, but have never seen the OOM killer because I use rlimits, mostly set at half of the system's virtual memory. I normally have just one process at a time running amok, and that process gets malloc()==NULL and dies.

It's worth noting that at least half the time, the death is not graceful but a segfault. And half of that time it's because effort wasn't put into handling the out of memory situation and the other half it's because the programmer thought malloc() cannot fail in Linux.

Seems like there ought to be an option to have rlimit violation just generate a signal.
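
Imposing that kind of per-process cap is a single setrlimit() call; the sketch below uses an arbitrary 2GB address-space limit.

    /* Sketch: cap a process's address space so allocations fail with
     * NULL (or ENOMEM) instead of summoning the OOM killer. The 2GB
     * figure is arbitrary. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl = {
            .rlim_cur = 2UL << 30,    /* 2GB soft limit */
            .rlim_max = 2UL << 30,    /* 2GB hard limit */
        };
        void *p;

        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }

        /* An allocation past the limit now fails rather than running
         * the system out of memory. */
        p = malloc((size_t)3 << 30);  /* 3GB: should return NULL */
        printf("malloc(3GB) %s\n", p ? "succeeded" : "failed as expected");
        free(p);
        return 0;
    }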

Toward reliable user-space OOM handling

Posted Jun 9, 2013 0:17 UTC (Sun) by nix (subscriber, #2304) [Link]

IIRC, if you exceed rlimits, you don't get malloc()==NULL, you get SIGKILLed. This is, of course, safer, because SIGKILL has predictable effects, while malloc()==NULL depends on a huge pile of never-tested error paths working right. -- or, at least, for memory rlimits it's safer. It's always struck me as a crazy thing to do for e.g. cpu-time rlimits, but, that's the historical behaviour, I guess...

(I use rlimits, too, and have also never seen OOM, despite all that 'git gc' could do.)

OOM and rlimits

Posted Jun 9, 2013 1:57 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

IIRC, if you exceed rlimits, you don't get malloc()==NULL

I can assure you that at least sometimes, you do. A few years back, I was having inexplicable SIGSEGVs when trying to play an audio CD with Mplayer. I finally traced through the code of Cdparanoia (used by Mplayer) and found 1) it was using far more memory than it ought to have used; and 2) it was ignoring malloc()==NULL and dereferencing the null pointer.

I fixed it, to get on with my project, and then sent the patch to the maintainer of Cdparanoia, who rejected it saying that malloc()==NULL is a myth because of infinite overcommit. (I told him I was actually experiencing the problem, but I don't know if he believed me).

It's always struck me as a crazy thing to do for e.g. cpu-time rlimits,

You mean SIGKILL is a crazy thing to do for cpu-time rlimits? If so, what would you do instead?

A lot of times the problem isn't just that the error paths weren't tested, but they were never written because it's too tedious.

For the rlimits other than CPU time, I guess the ideal design would be that the program can declare its ability to deal with an "out of resource" system call failure, and if it doesn't so declare, SIGKILL.

OOM and rlimits

Posted Jun 9, 2013 3:31 UTC (Sun) by dlang (✭ supporter ✭, #313) [Link]

if you exceed CPU limits, I would expect that the result would be that you get scheduled less frequently so that you drop back below the limits, not that you get killed.

OOM and rlimits

Posted Jun 9, 2013 21:32 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

if you exceed CPU limits, I would expect that the result would be that you get scheduled less frequently so that you drop back below the limits, not that you get killed.

CPU time rlimits as we know them have always been total CPU time, as opposed to CPU utilization, so getting killed is pretty much the only option. But I've often thought that a more useful resource limitation for some processes would be a rate limitation. And I usually dream about having rate limits for other things too (e.g. you get 5 gigabyte-minutes per minute of real memory).

One way to implement CPU rate limit would be to keep the existing CPU time rlimit but just have the kernel stop scheduling the process indefinitely when the process goes over. Then some resource allocator process could raise the process' CPU time limit on a regular basis.

OOM and rlimits

Posted Jun 9, 2013 22:48 UTC (Sun) by anselm (subscriber, #2796) [Link]

It seems to me that control groups (cgroups) rather than resource limits may be what you are looking for.

Toward reliable user-space OOM handling

Posted Jun 10, 2013 15:52 UTC (Mon) by alankila (subscriber, #47141) [Link]

Hmh. I guess I'd just like programs to do some kind of forward declaration of their usage intent, like "I will never need more than 50 MB of memory". That sort of thing -- you could codify it as a limit on the virtual memory space and be content with earmarking 50 MB of the combined physical RAM and swap. If you can't make this earmark, then you don't have the memory to run this program. malloc() failing randomly because of what other programs do is indeed ugly business, but having a guaranteed working space of whatever size you specified could avoid it.

The reason why this wouldn't be completely horrible from a user-experience point of view is that you would generally put a decent amount of swap on your system, say 100% of the size of your physical memory. If these supposed per-process limits are even reasonably accurate, the system will be severely paging before it hits any issue. The upshot of this exercise would be the elimination of the OOM killer, which I really don't like. It is a dynamic answer to a problem that I think only has "static" real answers: ahead-of-time contemplation of the maximum usage possible, and limiting of the resources for individual programs according to their expected maximum needs.

I don't expect this to happen, though. I get the sense that this sort of thing was what people did before and it was horrible, so now we just throw enough hardware at the task, and the most advanced of us set some ulimits, and that's how the problem is solved.

Toward reliable user-space OOM handling

Posted Jun 10, 2013 15:57 UTC (Mon) by mjg59 (subscriber, #23239) [Link]

Say I want to view a 20000x20000 image by clicking a link in Firefox. That's going to be a 1.5GB bitmap. My system has enough RAM to do that, so I'd expect it to succeed. Except that that means that Firefox would have had to declare that it needs at least 1.5GB of memory up front, which would prevent it from starting on systems with less RAM than that.

Toward reliable user-space OOM handling

Posted Jun 10, 2013 20:14 UTC (Mon) by alankila (subscriber, #47141) [Link]

There are classes of programs that can't determine in advance how much memory they will need. I know that. But I would prefer that only those kinds of tasks be the ones that can get NULL out of malloc().

Toward reliable user-space OOM handling

Posted Jun 10, 2013 16:57 UTC (Mon) by giraffedata (subscriber, #1954) [Link]

Yes, if people aren't even willing to earmark the virtual memory at malloc time, they certainly wouldn't be willing to earmark it at program startup time.

Of course, some (rare) people are willing to earmark at malloc time (they turn off overcommit), and some of them would probably appreciate being able to earmark it at startup time. I have seen that done with user space memory allocators: the program at startup time (or other convenient decision point) gets from the kernel a large enough allocation of virtual memory for its future needs, then suballocates from that.
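
A minimal sketch of that reserve-and-suballocate pattern follows; the pool size is arbitrary. With overcommit disabled, the mmap() either secures the whole amount against the commit limit at startup or fails cleanly right there.

    /* Sketch: reserve a fixed amount of virtual memory at startup and
     * suballocate from it with a trivial bump allocator, so the
     * program's memory needs are earmarked in one place up front. */
    #include <stddef.h>
    #include <sys/mman.h>

    #define POOL_SIZE (256UL << 20)   /* illustrative 256MB pool */

    static char *pool;
    static size_t pool_used;

    int pool_init(void)
    {
        pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return pool == MAP_FAILED ? -1 : 0;
    }

    void *pool_alloc(size_t size)
    {
        void *p;

        size = (size + 15) & ~(size_t)15;    /* 16-byte alignment */
        if (pool_used + size > POOL_SIZE)
            return NULL;                     /* the pool is the hard cap */
        p = pool + pool_used;
        pool_used += size;
        return p;
    }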

This is orthogonal to eliminating the OOM killer. You do that by turning off overcommit. What this adds is that when your program fails randomly because of other processes' memory usage, it fails at startup in code designed to handle that eventuality instead of at some less convenient point where you're trying to allocate 80 bytes to build a string.

Toward reliable user-space OOM handling

Posted Jun 10, 2013 20:17 UTC (Mon) by alankila (subscriber, #47141) [Link]

Yeah, preallocating the heap ahead of time and then touching it (to make sure it is backed by the kernel) seems like a way to escape a later OOM, and it could be coalesced into a startup routine, though with the downside that those pages could be mostly zero but nevertheless become anonymous memory that the kernel must track. Oh well.

Toward reliable user-space OOM handling

Posted Jun 10, 2013 21:43 UTC (Mon) by giraffedata (subscriber, #1954) [Link]

Touching pages of the preallocated heap ahead of time to prevent OOM later is probably a poor way to go about it, because you're just trading a later necessary OOM kill for a sooner possibly unnecessary OOM kill, and that doesn't seem better. Also, unless every process in the system does this preallocation, even the ones doing it could get killed later by the OOM killer. No one is safe from the OOM killer.

Better just to disable overcommit (/proc/sys/vm/...). Then there's no OOM kill ever, there are no unnecessary page faults, and when the swap space runs out, the programs that preallocate the heap just get an allocation failure as they start up and can terminate gracefully.
