Applications absolutely need to be notified if they are running out of RAM.
They need to know it and need to be able to react to it.
Now whether or not applications actually do respond is something else
entirely. But you _MUST_ give application developers the chance to "do the
right thing", even if you don't expect them to do it.
So give out-of-memory errors to applications and hope that they use them, and you
have the OOM killer if they don't.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 4, 2009 22:30 UTC (Wed) by vadim (subscriber, #35271)
The OOM killer is never needed. Here's how things work:
With overcommit:
1. Application mallocs 200MB when there are 150MB available. Kernel hopes
it doesn't actually use it, and lets it happen anyway.
2. Applications work for a while.
3. At some point (not necessarily while running the app that malloced the
200MB), kernel realizes: crap, I'm out of memory. Process needs a page,
but there's no memory that can be freed and swap is full. Got to kill
something to make room. It picks a process and kills it. With SIGKILL.
Some process dies with no chance to react to it. It can't react because it
can get killed in the middle of absolutely anything, even something like a
for(;;); which doesn't allocate any memory. There's no way for it to react
sanely.
Without overcommit:
1. Application mallocs 200MB with 150MB free. Kernel says "nope, there
isn't that much" and malloc returns NULL.
2. At that point application can decide what to do about that. It may
abort, or refuse to open a document too large for memory but keep running,
decide it can work with a smaller internal cache, etc. If it's badly
written it doesn't check the malloc return value and crashes on the null
pointer, but even then, the application that goes is precisely the one
that wanted too much memory.
3. OOM killer isn't needed, because the kernel never gets to the "crap,
I'm out of memory" stage.
Done this way there's no need to hope for anything. You simply don't allow
the situation overcommit gets into to happen in the first place.
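A minimal sketch of step 2 in the no-overcommit case, where the application reacts to a NULL from malloc() by settling for a smaller internal cache. The helper name alloc_cache() and the halving policy are assumptions made for illustration, not something from the comment above.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Try to allocate `want` bytes for a cache; on failure keep halving the
 * request until it succeeds or drops below `min`.
 * Returns the buffer (caller frees it) and stores its size in *got,
 * or returns NULL with *got == 0 if not even `min` bytes are available.
 * Hypothetical helper, for illustration only.
 */
void *alloc_cache(size_t want, size_t min, size_t *got)
{
    if (min == 0)
        min = 1;                 /* avoid the malloc(0) corner case */

    for (size_t n = want; n >= min; n /= 2) {
        void *p = malloc(n);
        if (p != NULL) {         /* request was granted */
            *got = n;
            return p;
        }
        /* malloc returned NULL: degrade instead of crashing */
    }

    *got = 0;
    return NULL;
}
```

With overcommit enabled the first malloc() will usually succeed whatever the real memory situation is, so the fallback path only becomes meaningful under strict accounting.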
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 0:17 UTC (Thu) by madscientist (subscriber, #16861)
Unfortunately you're simplifying things a lot by only considering malloced memory. The system uses memory for many more things than an explicit malloc. The traditional reason that disabling overcommit is so painful is not that it causes malloc() to fail, but that it causes fork/exec problems. Consider what happens on fork: you get a complete copy of the parent process INCLUDING ALL ITS HEAP. Obviously that's painful, especially since 98% of the time you'll just turn around and call exec anyway, and then you won't need all that RAM. So what the system does is provide COW behavior... but what happens if you DON'T exec, and you start writing memory? Now the kernel has to come up with all that RAM it pretended that the child process had, that wasn't really committed. There's no way to catch that failure either.
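What the program sees of that COW behaviour can be sketched as follows: after fork() the child has its own logical copy of the heap, and a write in the child never shows up in the parent. The copy-on-write mechanism itself stays invisible; the first write in the child is the point where the kernel has to find a real page. The function name is made up for the example.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a write made by the child after fork() is invisible to the
 * parent, 0 if not, -1 on error. Hypothetical demo function.
 */
int child_write_is_private(void)
{
    size_t len = (size_t)1 << 20;
    char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    memset(buf, 'P', len);          /* parent touches every page */

    pid_t pid = fork();             /* heap is now shared copy-on-write */
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        buf[0] = 'C';               /* first write: a private page is needed */
        _exit(buf[0] == 'C' ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        free(buf);
        return -1;
    }
    /* The parent must still see its own data. */
    int ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 && buf[0] == 'P';
    free(buf);
    return ok;
}
```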
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 9:54 UTC (Thu) by epa (subscriber, #39769)
Which is why fork/exec, while having a nice conceptual simplicity and a long Unix tradition, really needs to be replaced by a 'run child process' call that wouldn't double the memory usage for an instant only to throw it away again on exec.
(vfork() as in classical BSD is one answer, but still a bit crufty IMHO: rather than a special kind of fork that you can only use before exec, better to just say what you mean and have fork+exec be a single call.)
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 9:55 UTC (Thu) by epa (subscriber, #39769)
Ah... and I see that posix_spawn(3) does exist. Now we just need to fix userspace to use it.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 18:25 UTC (Thu) by nix (subscriber, #2304)
posix_spawn*() is a bloody abomination. Nobody uses it because *despite*
being horrendously complicated it is *still* not flexible enough for
things that regular applications do all the time. And it never will be:
you'd have to implement a Turing-complete interpreter in there to approach
the flexibility of the fork/exec model...
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 8:55 UTC (Fri) by epa (subscriber, #39769)
At the moment if a 500Mbyte process forks itself the kernel has no idea whether it's about to exec() something else, in which case almost all those pages in the child process will be discarded, or if the child is going to continue on its way, in which case the pages are going to be needed and may well be written to. That ambiguity leads to a default policy of allowing the fork to succeed, but when that turns out to be the wrong judgement the OOM killer has to run.
It would be better for applications to give the kernel more clues about their intention, so the kernel can make better decisions on memory management.
I agree that posix_spawn, like almost anything that comes out of a committee, is a complicated monster. Perhaps a better answer would be to refine the distinction between fork() and vfork(), or to introduce a new fork-like call fork_intend_to_exec_soon(). Then the kernel could know that for an ordinary fork() it has to be cautious and check all the required memory is available, while fork_intend_to_exec_soon() has the current optimistic behaviour.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 13:43 UTC (Fri) by nix (subscriber, #2304)
fork_intend_to_exec_soon() should be the default, because *most* forks are
rapidly followed by exec()s. Whatever you choose, getting it used by much
software would be a long slow slog :/
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 19:14 UTC (Fri) by dlang (✭ supporter ✭, #313)
and if an application misuses this and does fork_intend_to_exec_soon() and then doesn't exec soon, what would the penalty be?
if applications can misuse this without a penalty they will never get it right (especially when using it wrong will let their app keep running in cases where the fork would otherwise fail)
but forget the fork then exec situation for a moment and consider the real fork situation. for a large app, most of the memory will never get modified by either process, and so even there it will almost never use the 2x memory that has been reserved for it.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 8, 2009 21:21 UTC (Sun) by epa (subscriber, #39769)
I don't think it matters much if a few slightly-buggy applications use the wrong variant. If 90% of userspace including the most important programs such as shells passes the right hint to the kernel, the kernel can make better decisions than it does now, and the need for the OOM killer will be reduced. It's a similar situation with raw I/O, for example: a disk-heavy program such as a database server might know that it will scan through a large file just once. Ordinarily this file's contents might clog up the page cache and evict more useful things. To help get more consistent performance, apps can be coded to hint to the kernel that it needn't bother to cache a particular I/O request. The default is still to cache it, and it's not catastrophic if one or two userspace programs haven't been tuned to use the new fancy hinting mechanism.
> but forget the fork then exec situation for a moment and consider the real fork situation. for a large app, most of the memory will never get modified by either process, and so even there it will almost never use the 2x memory that has been reserved for it.
Very true, but of course there's no way for the kernel to know this. I expect most apps would prefer the fork to either succeed for sure, or fail at once if not enough memory can be guaranteed. There may be a few where optimistically hoping for the best and perhaps killing a random process later is the ideal behaviour.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 8, 2009 23:32 UTC (Sun) by nix (subscriber, #2304)
Yeah, but forking to exec immediately afterwards is the common case. If
you make that some weird nonportable new variant, 90% of programs are
never going to use it, and none of the rest will until some considerable
time has passed (time for this call to percolate down into the kernel and
glibc --- and try getting this call past Ulrich, ho ho.)
(Anyway, we *have* fork_to_exec_soon(). It's called vfork().)
posix_spawn is stupid as a system call
Posted Nov 5, 2009 13:10 UTC (Thu) by helge.bahmann (subscriber, #56804)
how many parameters do you want to expend for a "unified fork+exec" system
call?
not to mention the various clone flags (fs namespace etc.) and any wild
combination of the above
I think I have needed all of the above in various circumstances, sometimes
two or three things between fork+exec at a time
bonus question: how many more parameters do you want to add to a "combined
fork/exec" syscall to make it future proof for other things that might
need to be done before the new process image is executed?
posix_spawn is stupid as a system call
Posted Nov 5, 2009 13:45 UTC (Thu) by madscientist (subscriber, #16861)
Exactly. The number of useful and important things you can, want, and need to do between a fork and an exec is far too great to make it one call, in all but the simplest cases. I had this _exact_ argument with a Windows API proponent who felt that fork+exec was crazy (not related to OOM issues, just having more than one call). But the number of flags and arguments you'd need to replicate that behavior is insane and you'd STILL run into situations where you can't do what you want.
Fork+exec definitely has its downsides in some of the technical implementation requirements, but from a higher level language perspective it's brilliant.
posix_spawn is stupid as a system call
Posted Nov 6, 2009 9:07 UTC (Fri) by epa (subscriber, #39769)
You're right, others pointed out the same thing; no single system call can handle all the things you might want to set up in the child process before exec()ing.
But that said, why does the whole child process (including, potentially, a complete copy of its parent's core pages, all ready to be written to) need to be created just to set a few uids or open some files? Perhaps it would work better to first prepare a new process structure, then set uids and open files for it, and as the last stage breathe life into it by giving a file to exec(). For example
pid_t child = new_waiting_process();
// Now child is an entry in the process table, but it is not running.
// Use the p_ variants of some system calls to set things up for
// this child process.
p_setuid(child, uid);
p_close(child, 0);
p_open(child, "infile");
// Finished setup, start it running.
p_exec_and_start(child, "/bin/cat");
wait(child);
This would give almost the same flexibility, but without the need to overcommit memory. The kernel would just need to create a new process in a not-runnable state, and the p_whatever system calls allow performing operations on another process rather than yourself. (Of course they would only allow manipulating your own not-yet-started child process, except perhaps for root.)
A process created with new_waiting_process() would inherit its parent's file descriptors, current directory, environment and so on as for fork(), but it would not inherit the parent's core.
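For comparison, the file-descriptor part of that setup can already be written with the existing posix_spawn file actions. This is only a sketch: spawn_with_stdin() is a made-up helper, the setuid step is left out, and on Linux posix_spawn may itself be built on fork or vfork internally, so it says nothing about memory accounting.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Run argv[0] (looked up in PATH) with `infile` opened read-only as its
 * standard input. Returns the child's exit status, or -1 on failure.
 * Hypothetical helper, for illustration only.
 */
int spawn_with_stdin(const char *infile, char *const argv[])
{
    posix_spawn_file_actions_t fa;
    pid_t child;
    int status = -1;

    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;

    /* Rough counterpart of p_close(child, 0) + p_open(child, "infile"):
     * fd 0 is closed in the child if open, then the file is opened on it. */
    if (posix_spawn_file_actions_addopen(&fa, 0, infile, O_RDONLY, 0) != 0) {
        posix_spawn_file_actions_destroy(&fa);
        return -1;
    }

    /* Rough counterpart of p_exec_and_start(child, ...). */
    int rc = posix_spawnp(&child, argv[0], &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    if (rc != 0)
        return -1;

    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling it with "infile" and an argv of {"cat", NULL} gives roughly the effect of the example above, minus the uid change.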
posix_spawn is stupid as a system call
Posted Nov 6, 2009 10:07 UTC (Fri) by helge.bahmann (subscriber, #56804)
The idea in itself is workable, but the number of system calls you have to
duplicate is _huge_. It would perhaps be easier to create an "almost
empty" process image (with at least one stack and executable code page set
up) in suspended state, and then use ptrace or something similar to inject
system calls into the new process image -- this is tricky, but at least
the kernel is not burdened with an exploding number of system calls.
Alternatively, you could also provide a "fork" variant that explicitly
declares which pages of the address space are to be COWed into the new
process (if you are extra-smart, all you ever need to COW are the stack
pages, but calling library functions before execve is probably going to
spoil that -- but then, finding out which pages a library requires is by
no means easier, so you have to exercise a lot of discipline).
Might be an interesting research project to attempt any of the above in
Linux :)
posix_spawn is stupid as a system call
Posted Nov 6, 2009 13:51 UTC (Fri) by nix (subscriber, #2304)
You could reduce the set of necessary syscalls to one:
int masquerade_as (pid_t pid)
which issues syscalls in 'pid' instead of the current process. ('pid' is a
process you'd be allowed to ptrace, so immediate children are permitted).
This is a per-thread attribute, and passing a pid of 0 flips back to the
parent again.
Then all you need is this (ignoring error checking just as the OP did,
what a horrible name that new_waiting_process() has got, vvfork() would
surely be better):
Note the subtleties here: execution always continues after execve()
because the execve() was done to another process image. Non-syscalls are
very dangerous to run because they might update userspace storage in the
wrong process: we'd really need support for this in libc for it to be
usable.
(In practice this latter constraint destroys the whole idea no matter how
good it might be: Ulrich would say no, as he does to every idea anyone
else originates. Personally I suspect this idea sucks in any case :) )
posix_spawn is stupid as a system call
Posted Nov 8, 2009 21:26 UTC (Sun) by epa (subscriber, #39769)
From a purist point of view, all these 'new' calls are generalizations of the existing ones taking an extra pid argument, so they can just replace them, with the old ones provided by the C library; of course in the real world there is such a thing as backward compatibility :-p.
posix_spawn is stupid as a system call
Posted Nov 8, 2009 23:34 UTC (Sun) by nix (subscriber, #2304)
Yeah, breaking the entire installed base of Linux apps would probably be a
*bad* move :) I think, if you wanted to do this, you'd have to introduce a
huge pile of new syscalls and reimplement the old ones as thin wrappers
(inside the kernel so as not to force everyone to upgrade glibc) calling
the new ones.
posix_spawn is stupid as a system call
Posted Nov 23, 2009 15:08 UTC (Mon) by jch (guest, #51929)
This is analogous to the *at system calls (openat, fstatat, ...) that have been introduced in Linux and included in the latest revision of POSIX.
A suggestion
Posted Nov 12, 2009 5:17 UTC (Thu) by jlmassir (guest, #48904)
Maybe then the solution to this problem would be:
1. Never allow overcommit when calling malloc
2. Allow overcommit on fork/exec, but kill the child process if it tries to
write to more than 10% of its virtual size.
This way, buggy programs that malloc too much memory and never use it
would be fixed and fork bombs would be killed, while still allowing
system calls between fork and exec.
What do you think?
A suggestion
Posted Nov 14, 2009 20:52 UTC (Sat) by Gady (subscriber, #1141)
Killing the child process if it uses more than 10% is kinda cruel. There are no rules against the child doing that. What should be done is that in this case the memory is allocated, and if that cannot be done, then the child is killed.
A suggestion
Posted Nov 15, 2009 20:03 UTC (Sun) by jlmassir (guest, #48904)
Killing a child if there is no memory for a fork-exec is kinda cruel.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 18:27 UTC (Thu) by nix (subscriber, #2304)
OK, so granted that... how do you plan to prevent processes allocating
stack space? A process with a lot of threads, mostly idle, could easily be
using gigabytes for stack address space, all potentially allocatable, but
only actually be using a tiny fraction of that (4K out of every 8MB chunk,
say).
So overcommit doesn't just break programs that use fork/exec under high
load, forcing failure far sooner than necessary: it breaks programs that
use threads in the same way. Doesn't leave much, does it...
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 18:28 UTC (Thu) by nix (subscriber, #2304)
That is to say, *disabling* overcommit breaks these things. I hate negated
emphatic terms :/
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 9:10 UTC (Fri) by epa (subscriber, #39769)
Somebody else will have to suggest a possible answer to the stack space problem :-(. It might not be possible to turn off overcommit entirely for desktop systems. But anything that can be done to make overcommit happen less often - or, equally, to make strict allocation usable for a normal workload - narrows the gap between what the kernel promises and what it can deliver, and makes the OOM killer less likely to run.
(Doesn't a process have some way to specify the max. stack size that it will use for each thread?)
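For what it's worth, POSIX threads do offer a per-thread knob: pthread_attr_setstacksize(). A rough sketch follows, with made-up helper names; whether a smaller stack actually reduces what the kernel counts as committed depends on how the C library maps thread stacks, which is not checked here.

```c
#include <pthread.h>
#include <stddef.h>

/*
 * Start a thread with an explicit stack size instead of the default.
 * Returns 0 on success, otherwise the error code reported by pthreads.
 * Hypothetical helper, for illustration only.
 */
int start_thread_with_stack(pthread_t *t, size_t stack_bytes,
                            void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* Fails with EINVAL if stack_bytes is below the system minimum. */
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_create(t, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}

/* Trivial thread body used only to show the call. */
void *set_flag(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}
```

A mostly idle worker could then be started with, say, a 256 KiB stack instead of the 8MB chunks mentioned earlier.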
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 0:20 UTC (Thu) by mikov (subscriber, #33179)
Overcommit is primarily about copy-on-write, not malloc(). For example, the kernel cannot predict how much actual memory will be needed after a fork(). Are you suggesting that when a 500MB process does a fork, the kernel should reserve another 500MB?
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 0:56 UTC (Thu) by JoeBuck (subscriber, #2330)
Exactly. On a desktop Linux system, you might have a Firefox instance you've been using for ages and it's up to 1.5 gigabytes virtual memory. You have other processes running and your swap is mostly full, so that there's only another 0.5G available. Then you download a film clip and you want to fire up totem to view it. Firefox does a fork, followed by exec. But you don't have 1.5 additional gigabytes. Solaris would refuse to do the fork, even though you don't really need that additional 1.5G: you might dirty one page before doing an exec of totem, which is much smaller. Linux and AIX will issue a loan.
Solaris works around this problem by recommending that developers use posix_spawn rather than fork followed by exec; however, they didn't add this call until Solaris 10.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 1:02 UTC (Thu) by mikov (subscriber, #33179)
Very interesting. How does Solaris deal with mapping shared libraries without overcommitting?
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 23, 2009 15:19 UTC (Mon) by jch (guest, #51929)
> How does Solaris deal with mapping shared libraries without overcommitting?
Shared libraries are backed by filesystem data, so a read-only map of a shared library does not involve overcommit.
why not posix_spawn()?
Posted Nov 5, 2009 8:02 UTC (Thu) by Cato (subscriber, #7643)
So why doesn't Linux have something like posix_spawn() in standard distros (doesn't seem to be in Ubuntu at least but must be in some kernel builds: http://linux.die.net/man/3/posix_spawn )?
Fork/exec really works well with smaller processes (as in the original Unix tools / shell pipeline approach), but forking a 1.5 GB Firefox process is insane...
It's got to the point that I can't click on mailto: links any more in Firefox (currently 850 MB resident memory) because it will take so long to fork and then exec Thunderbird.
why not posix_spawn()?
Posted Nov 5, 2009 8:30 UTC (Thu) by mikov (subscriber, #33179)
I can assure you that it is not the fork() that is causing the delay. In any modern Unix fork() doesn't actually copy anything - it creates a copy-on-write mapping of the process memory, which is a very fast operation (relatively speaking). This is where overcommit comes into play.
why not posix_spawn()?
Posted Nov 5, 2009 18:28 UTC (Thu) by khc (subscriber, #45209)
actually the page table entries are still copied, which can take a measurable amount of time even when the parent process is only using 100s of MB. Linux does have a posix_spawn, but it's implemented by fork/exec so it's not useful for this problem anyway.
why not posix_spawn()?
Posted Nov 5, 2009 20:44 UTC (Thu) by mikov (subscriber, #33179)
fork() doesn't copy all page tables - for shared memory it defers copying them until fault time.
Additionally, perhaps shared page tables will eventually improve on that. Does anybody know what the status of those patches is?
But anyway, I don't think that the slow starting of Thunderbird from Firefox, which the GP commented on, is caused by fork() copying the page tables.
why not posix_spawn()?
Posted Nov 5, 2009 23:21 UTC (Thu) by Cato (subscriber, #7643)
You're right about the original problem, which must have been something else - just retested with mailto: link from a large Firefox process and it's fine.
why not posix_spawn()?
Posted Nov 6, 2009 1:17 UTC (Fri) by khc (subscriber, #45209)
I am not sure if it copies the entire page table, but the effect is noticeable: http://hxbc.us/software/fork_bench.c (try running it with different c and n)
why not posix_spawn()?
Posted Nov 6, 2009 6:12 UTC (Fri) by mikov (subscriber, #33179)
You are absolutely right. Here are my results:
malloc(1MB)       3235 ms    0.161750 ms/iter
malloc(100MB)      390 ms    3.900000 ms/iter
malloc(500MB)     1663 ms   16.630000 ms/iter
malloc(1024MB)    3329 ms   33.290000 ms/iter
With a heap of 1G it takes 33ms to do a fork() on my machine, which to me is surprisingly long (although not that surprising when you consider the sheer size of the page tables needed with 4KB pages). While, as I said initially, it would definitely not be noticeable for interactive process creation, it is significant. The much maligned "slow" process creation on Windows is much faster for sure...
I did run a couple of more tests though, which improve the situation. First I confirmed that the page tables of shared memory mappings really are not copied. I replaced the malloc() with mmap( MAP_ANON | MAP_SHARED ):
mmap(1MB)         1204 ms    0.120400 ms/iter
mmap(100MB)       1201 ms    0.120100 ms/iter
mmap(500MB)       1231 ms    0.123100 ms/iter
mmap(1024MB)      1229 ms    0.122900 ms/iter
As you can see, there is no relation between fork() speed and mapping size.
Then I restored the malloc(), but replaced the fork() with vfork():
vfork+malloc(1MB)       102 ms    0.010200 ms/iter
vfork+malloc(100MB)     107 ms    0.010700 ms/iter
vfork+malloc(500MB)     105 ms    0.010500 ms/iter
vfork+malloc(1000MB)    106 ms    0.010600 ms/iter
The last result is really encouraging (and actually not surprising). Even though everybody seems to hate vfork(), for the case we are discussing (fork of a huge address space + exec), it should solve all problems, removing the need for the clumsy posix_spawn(), while preserving all the flexibility of fork(). Beat that, Windows!
Any good reasons why vfork() should be avoided?
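The vfork()+exec pattern being measured looks roughly like the sketch below. The helper name is an assumption; the important constraint is that the child does nothing except exec or _exit, because it runs in the parent's address space while the parent is suspended.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: make vfork() visible in strict modes */
#endif
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] (looked up in PATH) via vfork()+exec and wait for it.
 * Returns the child's exit status, 127 if the exec failed, -1 on error.
 * Hypothetical helper, for illustration only.
 */
int vfork_exec(char *const argv[])
{
    int status = -1;
    pid_t pid = vfork();        /* parent is suspended until exec or _exit */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: borrowed address space, so only exec or _exit here.
         * Never call exit() or return from this function. */
        execvp(argv[0], argv);
        _exit(127);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Anything more elaborate between vfork() and exec (closing descriptors, changing uids) is where the portability caveats start, so treat this as the simplest possible case.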
why not posix_spawn()?
Posted Nov 6, 2009 19:12 UTC (Fri) by khc (subscriber, #45209)
you are right that the speed of fork() is seldom noticeable in a GUI program, but it bites me all the time in daemons (big daemon wanting to launch many processes, one by one, to do some tasks). vfork() is too limiting because all you can do after is exec(), but sometimes you do want the extra flexibility that posix_spawn can provide.
I have to admit that I have never checked to see if posix_spawn fits my need, though. Since I only care about linux and posix_spawn on linux is the same as fork()/.../exec(), it's useless for me anyway.
why not posix_spawn()?
Posted Nov 6, 2009 19:28 UTC (Fri) by mikov (subscriber, #33179)
I am not sure what you mean. Unless I am missing something, vfork() is much more flexible and easier to use than posix_spawn().
If your purpose is to call exec() after fork(), you should just be able to mechanically replace all forks() with vforks() and get a big boost.
why not posix_spawn()?
Posted Nov 6, 2009 22:55 UTC (Fri) by cmccabe (subscriber, #60281)
> Any good reasons why vfork() should be avoided?
The manual page installed on my system says that copy-on-write makes vfork unnecessary. It concludes with "it is rather unfortunate that Linux revived this specter from the past." :)
However... it seems like the results you've posted show quite a substantial performance gain for vfork + exec as opposed to fork + exec, for processes with large heaps.
Maybe the "preferred" way to do this on Linux would be using clone(2)??
C.
why not posix_spawn()?
Posted Nov 6, 2009 23:23 UTC (Fri) by cmccabe (subscriber, #60281)
> Due to the implementation of the vfork() function, the parent process is
> suspended while the child process executes. If a user sends a signal to
> the child process, delaying its execution, the parent process (which is
> privileged) is also blocked. This means that an unprivileged process can
> cause a privileged process to halt, which is a privilege inversion
> resulting in a denial of service.
clone(CLONE_VM) + exec might be the win...
Colin
Memory required for fork()
Posted Nov 5, 2009 1:04 UTC (Thu) by vomlehn (subscriber, #45588)
Exactly. The kernel reserves as much virtual memory for the child as the parent has. If overcommit is disabled, a parent with more memory than half of CommitLimit will not be able to fork() successfully. It will still be able to vfork(), though.
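The arithmetic behind that can be sketched with the CommitLimit and Committed_AS fields from /proc/meminfo. Both function names are invented for the example, and the check is a deliberate simplification of the kernel's strict accounting (it treats the whole parent as private writable memory).

```c
#include <stdlib.h>
#include <string.h>

/*
 * Look up `key` (for example "CommitLimit") in /proc/meminfo-style text.
 * Returns the value in kB, or -1 if the key is not present.
 * Hypothetical helper, for illustration only.
 */
long meminfo_field_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return strtol(line + klen + 1, NULL, 10);  /* skips blanks */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;                                    /* start of next line */
    }
    return -1;
}

/*
 * Simplified strict-accounting check: fork() has to commit a second copy
 * of the parent's memory on top of what is already committed.
 */
int strict_fork_would_fit(long commit_limit_kb, long committed_as_kb,
                          long parent_kb)
{
    return committed_as_kb + parent_kb <= commit_limit_kb;
}
```

Since the parent's own memory is already part of Committed_AS, a parent larger than half of CommitLimit can never pass this check, which is the half-of-CommitLimit rule stated above.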
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 3:11 UTC (Thu) by smoogen (subscriber, #97)
I thought most modern OSes allow allocating more memory than actually exists. They either put in place some sort of OOM handling or they lock up when RAM really runs out. [I have fuzzy memories of seeing something like this on old SunOS and HP-UX boxes long ago. And Windows does the lockup thing.] I am not sure what happens with Mac OS X when it runs out.
Now the question is why do some of these do this? Is it a basic assumption of every non-embedded OS to be sloppy with memory? Is it POSIX? And if every system were set up with overcommit turned off, how much would break?
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 5:19 UTC (Thu) by mikov (subscriber, #33179)
It is not sloppiness. The OS cannot predict the future, so it can either be optimistic or pessimistic with memory. This is a very deliberate design choice. Turns out that in practice being optimistic is much much more efficient.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 12, 2009 9:50 UTC (Thu) by rlhamil (guest, #6472)
While optimistic overcommit might be _statistically_ better, it's not deterministic enough
for my liking. And my liking is that everything other than /dev/random is _totally_
deterministic (neglecting external input of course).
(I'd argue that overcommit-by-default is an invitation to denial of service attack, and, if
likely victims were more or less predictable, might be a "covert channel" as well.)
Solaris doesn't do overcommit, but does also offer MAP_NORESERVE, so that individual
mmap() operations can opt out of a reserve, in which case a write to a private mapping
(copy-on-write from a file) can cause the process to receive SIGSEGV or SIGBUS; see
I think that all that's missing is:
* a system call to turn on or off similar behavior for heap and/or stack, and to
turn on or off _implicit_ MAP_NORESERVE on all private mappings for that process
and its subsequently forked children (reset on exec)
* a shared library feature to implement system policy specifying which executables
should be subject to overcommit, with a settable default for all not explicitly specified
* an OS default of no overcommit
* no OOM killer needed
Distros could supply default policy that opted for overcommit on chronically hoggish
(and typically not critical to system integrity) apps such as browsers. People might e.g.
not mind their browser dying a few more times than it would anyway, but might be very
glad to be sure that their X server (desktop user) or database server process was safe from
nondeterministic behavior possibly triggered by another process.
That gets overcommit out of the OS, and pushes the decisions into user space. A process
could always override policy with the system calls, but it would have to know what it was
doing to do that.
The only limitation with implementing the defaults for an executable in the dynamic linker
is that it wouldn't be able to allow overcommit for static executables. If that was a serious
limitation, a new mechanism would be needed to push the policy settings into the kernel,
and execve() (or equivalents) would have to implement them, which is IMO more comprehensive
but otherwise uglier.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 21:49 UTC (Thu) by anton (subscriber, #25547)
If an application does not handle ENOMEM gracefully, it's better to
run it in overcommit mode. Hopefully it will never actually use all
that memory, then it will be better off than if it got ENOMEM. If it
gets OOM-killed, it won't be worse off. And being able to allocate
large amounts of memory without using it makes writing programs quite a bit
simpler in some cases.
OTOH, if an application is written to deal with ENOMEM gracefully,
it's better not to overcommit memory for this application, to give it
ENOMEM if there is no more committable memory, and then there is no
need to OOM kill such an application (instead, one of the other
overcommitting applications can be killed).
I have
written this up in more detail; there I suggested making it depend
on the process. In the meantime I have learned about the
MAP_NORESERVE flag, which makes it possible to do this
per-allocation. However, since the OOM-Killer kills a process, not an
allocation, it's probably better to use MAP_NORESERVE either on all or
no allocation in a process; but how to tell this to functions that
call mmap only indirectly (malloc(), fopen() etc.)?
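For a direct mmap() call, the per-allocation form is easy enough to sketch; the open question above about malloc() and fopen() is untouched by it. The wrapper name is made up, and on Linux the flag is reportedly ignored when strict overcommit accounting is selected, so take this as an illustration of the interface rather than a guarantee of behaviour.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose MAP_ANONYMOUS and MAP_NORESERVE */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of private, zero-filled anonymous memory and ask the
 * kernel not to reserve backing store for it up front.
 * Returns NULL on failure; release the mapping with munmap(p, len).
 * Hypothetical helper, for illustration only.
 */
void *map_noreserve(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A program that wants a large, sparsely used table could map it this way and accept that a later write may fail fatally if memory really has run out.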