Taming the OOM killer
Posted Feb 5, 2009 10:17 UTC (Thu) by epa (subscriber, #39769). In reply to: Taming the OOM killer by dlang.
Parent article: Taming the OOM killer
"any process that forks and execs allocates more memory than it needs."

Quite. Which is why a single fork-and-exec-child-process system call is needed. With that, there would be much less need to overcommit memory and so a better chance of avoiding hacks like the OOM killer.
The classical Unix design of separate fork() and exec() is elegant at first glance, but in practice it has caused various unpleasant kludges to cope with memory overcommit. (Another was vfork(), which IIRC was a fork call that used less memory but only worked as long as you promised to call exec() immediately afterwards. Why they made this crufty interface rather than a single fork-plus-exec primitive eludes me.)
Posted Feb 5, 2009 11:16 UTC (Thu) by iq-0 (subscriber, #36655)
I don't know if closing file handles is allowed after vfork(), but that would also be a great help in ensuring the right file handles are passed (something that would be tedious, or really complex and verbose, with a single fork-and-exec call).
Posted Feb 5, 2009 14:08 UTC (Thu) by epa (subscriber, #39769)
But this is a small minority of the cases where an external command is run, and running an external command accounts for a large proportion of total fork()s. I'm just suggesting making the common case more robust, avoiding the need to overcommit memory.
Posted Feb 5, 2009 17:03 UTC (Thu) by nix (subscriber, #2304)
Posted Feb 5, 2009 22:05 UTC (Thu) by epa (subscriber, #39769)
Perhaps another way to avoid the need for memory allocation would be to use a new vfork-like call (heck, it could even be called vfork) that has a fixed memory budget set as a matter of policy system-wide. So when you vfork(), the memory is set up as copy-on-write, but the child process has a budget of at most 1000 pages it can scribble on. That should be enough to set up the necessary file descriptors, but if it tries to dirty more than its allowance it is summarily killed.
That way, there is some upper limit to the amount of memory that needs to be allocated - when vfork()ing the kernel just needs to ensure 1000 free pages - and the kernel doesn't have to make a (possibly untrustworthy) promise that the whole process address space is available for normal use.
Posted Feb 5, 2009 11:49 UTC (Thu) by alonz (subscriber, #815)
This isn't really simpler (except on block diagrams…)
Posted Feb 5, 2009 12:52 UTC (Thu) by mjthayer (guest, #39183)
Posted Feb 5, 2009 20:00 UTC (Thu) by quotemstr (subscriber, #45331)
Posted Feb 5, 2009 16:55 UTC (Thu) by nix (subscriber, #2304)
Posted Feb 5, 2009 17:40 UTC (Thu) by martinfick (subscriber, #4455)
Posted Feb 5, 2009 22:22 UTC (Thu) by epa (subscriber, #39769)
For those making embedded or high-availability systems who want to try harder and turn off overcommit altogether, fork-then-exec could be replaced in user space with posix_spawn or vfork-then-exec or similar.
Posted Feb 5, 2009 23:11 UTC (Thu) by martinfick (subscriber, #4455)
The OOM killer kicks in when memory has been overcommitted through COW: two processes are sharing the same memory region, and one of them writes to a shared COW page, requiring the page to be copied. There is no memory allocation happening from the process's point of view, simply a write to a page that is already allocated to a process (to two of them, actually).

Again, the fork-then-exec shortcut is not really the big deal; the problem is processes that fork, do not exec, and then eventually write to a COW page.
Posted Feb 6, 2009 0:51 UTC (Fri) by nix (subscriber, #2304)
The OOM killer is needed whenever a page must be allocated but none is available, and the request is not failable. Several such allocations spring to mind:

- when a process stack is grown
- when a fork()ed process COWs
- when a page in a private file-backed mapping is written to for the first time
- when a nonswappable kernel resource needs to be allocated (other than a cache) which cannot be discarded when memory pressure is high
- if overcommit_memory is set, if a page from the heap or an anonymous mmap() is requested for the first time

So the OOM killer is *always* needed, even if overcommitting were disabled as much as possible. (You can overcommit disk space, too: thanks to sparse files, you can run out of disk space writing to the middle of a file. With some filesystems, e.g. NTFS, you can run out of disk space by renaming a file, triggering a tree rebalance and node allocation when there's not enough disk space left. NTFS maintains an emergency pool for this situation, but it's only so large...)
Posted Feb 6, 2009 0:58 UTC (Fri) by martinfick (subscriber, #4455)
Posted Feb 6, 2009 1:26 UTC (Fri) by dlang (guest, #313)
How would you propose that programmers handle an error when they allocate a local variable (which is one way to grow the stack)?
Posted Feb 6, 2009 1:38 UTC (Fri) by brouhaha (subscriber, #1698)
At no point should the OOM killer become involved, because there is no reason to propagate the error outside the process (other than by another process noticing that the process in question has exited). A principle of reliable systems is confining the consequences of an error to the minimum area necessary, and killing some other randomly-selected (or even heuristically-selected) process violates that principle.
Posted Feb 6, 2009 5:26 UTC (Fri) by njs (subscriber, #40338)
This makes sense on the surface, but memory being a shared resource means that everything is horribly coupled no matter what and life isn't that simple.
You have 2 gigs of memory.
Process 1 and process 2 are each using 50 megabytes of RAM.
Then Process 1 allocates another 1948 megabytes.
Then Process 2 attempts to grow its stack by 1 page, but there is no memory.
The reason the OOM killer exists is that it makes no sense to blame Process 2 for this situation. And if you did blame Process 2, the system would still be hosed, and a few minutes later you'd have to kill off Process 3, Process 4, etc., until you got lucky and hit Process 1.
Posted Feb 7, 2009 17:52 UTC (Sat) by oak (guest, #2786)
Posted Feb 8, 2009 15:26 UTC (Sun) by nix (subscriber, #2304)
Posted Feb 12, 2009 14:32 UTC (Thu) by epa (subscriber, #39769)
Posted Feb 12, 2009 15:23 UTC (Thu) by tialaramex (subscriber, #21167)
Posted Feb 12, 2009 16:06 UTC (Thu) by dlang (guest, #313)
There are a growing number of such functions in C nowadays, as people go back, figure out where programmers commonly get it wrong, and provide functions that are much harder to misuse (the case that springs to mind is the string manipulation routines).
Taming the OOM killer
This way you can even implement a simple 'if exec(a) fails try exec(b) or otherwise flag error with (possibly program specific) meaningful data'...
Oh sure, sometimes you will want to do something more complex like that. In these cases vfork() (as in classic BSD) doesn't work, because the child process is using memory belonging to the parent. The traditional fork() then exec() is best.
Taming the OOM killer
for which you need at least dup()s and filehandle manipulation between
fork() and exec(): and the nature of such manipulation differs for each
invoker...
Taming the OOM killer
<sarcasm>
So you would prefer the VMS/Win32 style “CreateProcess” system call, with its 30+ arguments, just in order to accommodate all possible behaviors expected from the parent process?
</sarcasm>
(From SUSv2 / POSIX draft.)
The vfork() function has the same effect as fork(2), except that the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions.
On the other hand, pid_t child = clone(run_child_func, run_child_stack, CLONE_VM, run_child_data)
would do the trick. The child would share memory with the parent, making overcommit unnecessary, but would have a different file descriptor table, allowing pipelines to be set up easily.
Taming the OOM killer
and piping. The call you want exists in POSIX, with the inelegant name
posix_spawn*(), but it's a nightmare of overdesign and a huge family of
functions precisely because it has to model all the things people usually
want to do between fork() and exec(). Its only real use is in embedded
MMUless systems in which fork() is impractical or impossible to implement.
Taming the OOM killer
Taming the OOM killer
The process should get a segfault or equivalent signal. If there is a handler for the signal, but the handler can't be invoked due to lack of stack space, the process should be killed. If the mechanism to signal the process in a potential out-of-stack situation is too complex to be practically implemented in the kernel, then the process should be killed without attempting to signal it.
Taming the OOM killer
Kernels map by default 8MB of stack for each thread (and usually threads use only something like 4-8KB of that). Without overcommit, a process with 16 threads couldn't run in 128MB RAM unless you change this limit. I think you can change it only from kernel source, and it applies to all processes/threads in the system?
Taming the OOM killer
The OOM killer does not come into play when malloc is called. If malloc is called when there is no memory, there is no need to kill any processes: malloc simply fails and returns the appropriate error code.
Ah, I didn't realize that. From the way people talk it sounded as though malloc() would always succeed and then the process would just blow up trying to use the memory. If the only memory overcommit is COW due to fork() then it's not so bad (though I still think some kind of vfork() would be a more hygienic practice).