Unfortunately you're simplifying things a lot by only considering malloced memory. The system uses memory for many more things than an explicit malloc. The traditional reason that disabling overcommit is so painful is not that it causes malloc() to fail, but that it causes fork/exec problems. Consider what happens on fork: you get a complete copy of the parent process INCLUDING ALL ITS HEAP. Obviously that's painful, especially since 98% of the time you'll just turn around and call exec anyway, and then you won't need all that RAM. So what the system does is provide COW behavior... but what happens if you DON'T exec, and you start writing memory? Now the kernel has to come up with all that RAM it pretended that the child process had, that wasn't really committed. There's no way to catch that failure either.
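The copy-on-write behaviour described here can be watched from userspace. What follows is only a minimal sketch, assuming a POSIX system; the function name `child_write_is_private` and the buffer size used in the test are illustrative and not from the original comment.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, let the child overwrite a heap buffer inherited from the parent,
 * and report whether the parent's copy stayed intact.
 * Returns 1 if the parent's data is unchanged, 0 if not, -1 on error. */
int child_write_is_private(size_t size)
{
    if (size == 0)
        return -1;

    unsigned char *buf = malloc(size);
    if (buf == NULL)
        return -1;
    memset(buf, 0xAA, size);        /* touch every page in the parent */

    pid_t pid = fork();             /* pages are now shared copy-on-write */
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        /* Child: every page written here needs a private copy. With
         * overcommit enabled, this is the point where memory can run
         * out, long after fork() itself reported success. */
        memset(buf, 0x55, size);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        free(buf);
        return -1;
    }

    /* Parent: the child's writes must not be visible here. */
    int intact = 1;
    for (size_t i = 0; i < size; i++) {
        if (buf[i] != 0xAA) {
            intact = 0;
            break;
        }
    }
    free(buf);
    return intact;
}
```

The fork() call returns success before any page has been copied; the memory cost only appears when the child's memset() runs, which is exactly the failure that cannot be reported back through a return value.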
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 9:54 UTC (Thu) by epa (subscriber, #39769)
Which is why fork/exec, while having a nice conceptual simplicity and a long Unix tradition, really needs to be replaced by a 'run child process' call that wouldn't double the memory usage for an instant only to throw it away again on exec.
(vfork() as in classical BSD is one answer, but still a bit crufty IMHO: rather than a special kind of fork that you can only use before exec, better to just say what you mean and have fork+exec be a single call.)
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 9:55 UTC (Thu) by epa (subscriber, #39769)
Ah... and I see that posix_spawn(3) does exist. Now we just need to fix userspace to use it.
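For readers who have not met it, a minimal sketch of posix_spawn(3) follows. The helper name `spawn_quiet`, the use of /bin/sh, and the redirection to /dev/null are assumptions chosen for illustration; only the posix_spawn* calls themselves are the standard interface.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Run "/bin/sh -c <command>" with stdout sent to /dev/null, using
 * posix_spawn() rather than fork()+exec().
 * Returns the child's exit status, or -1 on any failure. */
int spawn_quiet(const char *command)
{
    char *const argv[] = { "sh", "-c", (char *)command, NULL };
    posix_spawn_file_actions_t actions;
    pid_t pid;
    int status;

    if (posix_spawn_file_actions_init(&actions) != 0)
        return -1;

    /* Work that fork()+exec() code would do in the child is instead
     * recorded as a list of actions for the spawn call to replay. */
    int err = posix_spawn_file_actions_addopen(&actions, STDOUT_FILENO,
                                               "/dev/null", O_WRONLY, 0);
    if (err == 0)
        err = posix_spawn(&pid, "/bin/sh", &actions, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&actions);
    if (err != 0)
        return -1;

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Anything the child needs that cannot be expressed as a file action or a spawn attribute has no place to go in this interface.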
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 18:25 UTC (Thu) by nix (subscriber, #2304)
posix_spawn*() is a bloody abomination. Nobody uses it because *despite*
being horrendously complicated it is *still* not flexible enough for
things that regular applications do all the time. And it never will be:
you'd have to implement a Turing-complete interpreter in there to approach
the flexibility of the fork/exec model...
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 8:55 UTC (Fri) by epa (subscriber, #39769)
At the moment, if a 500 MB process forks itself, the kernel has no idea whether it's about to exec() something else, in which case almost all those pages in the child process will be discarded, or whether the child is going to continue on its way, in which case the pages are going to be needed and may well be written to. That ambiguity leads to a default policy of allowing the fork to succeed, but when that turns out to be the wrong judgement the OOM killer has to run.
It would be better for applications to give the kernel more clues about their intention, so the kernel can make better decisions on memory management.
I agree that posix_spawn, like almost anything that comes out of a committee, is a complicated monster. Perhaps a better answer would be to refine the distinction between fork() and vfork(), or to introduce a new fork-like call fork_intend_to_exec_soon(). Then the kernel could know that for an ordinary fork() it has to be cautious and check all the required memory is available, while fork_intend_to_exec_soon() has the current optimistic behaviour.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 13:43 UTC (Fri) by nix (subscriber, #2304)
fork_intend_to_exec_soon() should be the default, because *most* forks are
rapidly followed by exec()s. Whatever you choose, getting it used by much
software would be a long slow slog :/
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 19:14 UTC (Fri) by dlang (✭ supporter ✭, #313)
and if an application misuses this and does fork_intend_to_exec_soon() and then doesn't exec soon, what would the penalty be?
if applications can misuse this without a penalty they will never get it right (especially when using it wrong will let their app keep running in cases where the fork would otherwise fail)
but forget the fork then exec situation for a moment and consider the real fork situation. for a large app, most of the memory will never get modified by either process, and so even there it will almost never use the 2x memory that has been reserved for it.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 8, 2009 21:21 UTC (Sun) by epa (subscriber, #39769)
I don't think it matters much if a few slightly-buggy applications use the wrong variant. If 90% of userspace including the most important programs such as shells passes the right hint to the kernel, the kernel can make better decisions than it does now, and the need for the OOM killer will be reduced. It's a similar situation with raw I/O, for example: a disk-heavy program such as a database server might know that it will scan through a large file just once. Ordinarily this file's contents might clog up the page cache and evict more useful things. To help get more consistent performance, apps can be coded to hint to the kernel that it needn't bother to cache a particular I/O request. The default is still to cache it, and it's not catastrophic if one or two userspace programs haven't been tuned to use the new fancy hinting mechanism.
"but forget the fork then exec situation for a moment and consider the real fork situation. for a large app, most of the memory will never get modified by either process, and so even there it will almost never use the 2x memory that has been reserved for it."
Very true, but of course there's no way for the kernel to know this. I expect most apps would prefer the fork to either succeed for sure, or fail at once if not enough memory can be guaranteed. There may be a few where optimistically hoping for the best and perhaps killing a random process later is the ideal behaviour.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 8, 2009 23:32 UTC (Sun) by nix (subscriber, #2304)
Yeah, but forking to exec immediately afterwards is the common case. If
you make that some weird nonportable new variant, 90% of programs are
never going to use it, and none of the rest will until some considerable
time has passed (time for this call to percolate down into the kernel and
glibc --- and try getting this call past Ulrich, ho ho.)
(Anyway, we *have* fork_to_exec_soon(). It's called vfork().)
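A minimal sketch of the vfork() pattern, for comparison. The helper name `run_with_vfork` is hypothetical; the important constraint is that the child does nothing except exec or _exit, because it is borrowing the parent's address space until it does one of those.

```c
#define _DEFAULT_SOURCE 1   /* vfork() is no longer in POSIX; ask glibc for it */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program at `path` with argument vector `argv` using vfork().
 * Returns the child's exit status, 127 if the exec failed, or -1 on
 * other errors. */
int run_with_vfork(const char *path, char *const argv[])
{
    pid_t pid = vfork();    /* parent is suspended until exec or _exit */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: no copy of the parent's memory was made, so the only
         * safe things to do here are exec or _exit. */
        execv(path, argv);
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because no address space is duplicated, there is nothing for the kernel to overcommit, but the price is that arbitrary setup code between vfork() and exec is not safe.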
posix_spawn is stupid as a system call
Posted Nov 5, 2009 13:10 UTC (Thu) by helge.bahmann (subscriber, #56804)
how many parameters do you want to expend for a "unified fork+exec" system
call?
not to mention the various clone flags (fs namespace etc.) and any wild
combination of the above
I think I have needed all of the above in various circumstances, sometimes
two or three things between fork+exec at a time
bonus question: how many more parameters do you want to add to a "combined
fork/exec" syscall to make it future proof for other things that might
need to be done before the new process image is executed?
posix_spawn is stupid as a system call
Posted Nov 5, 2009 13:45 UTC (Thu) by madscientist (subscriber, #16861)
Exactly. The number of useful and important things you can, want, and need to do between a fork and an exec is far too great to make it one call, in all but the simplest cases. I had this _exact_ argument with a Windows API proponent who felt that fork+exec was crazy (not related to OOM issues, just having more than one call). But the number of flags and arguments you'd need to replicate that behavior is insane and you'd STILL run into situations where you can't do what you want.
Fork+exec definitely has its downsides in some of the technical implementation requirements, but from a higher level language perspective it's brilliant.
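As a concrete illustration of setup done between fork and exec, here is a minimal sketch. The helper name `pwd_in_dir` and the choice of chdir() plus a pipe redirection are assumptions picked for the example; real programs do many other things at this point.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "pwd" in directory `dir` with its stdout redirected into a pipe,
 * and copy the first line of output into `out`.
 * Returns 0 on success, -1 on failure. */
int pwd_in_dir(const char *dir, char *out, size_t outlen)
{
    int fds[2];
    if (outlen == 0 || pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child-only setup, written as ordinary code rather than as
         * flags or parameters to a combined spawn call. */
        if (chdir(dir) < 0)
            _exit(126);
        if (dup2(fds[1], STDOUT_FILENO) < 0)
            _exit(126);
        close(fds[0]);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", "pwd", (char *)NULL);
        _exit(127);
    }

    /* Parent: collect the child's output from the read end. */
    close(fds[1]);
    size_t used = 0;
    ssize_t n;
    while (used < outlen - 1 &&
           (n = read(fds[0], out + used, outlen - 1 - used)) > 0)
        used += (size_t)n;
    close(fds[0]);
    out[used] = '\0';
    out[strcspn(out, "\n")] = '\0';   /* keep only the first line */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Each extra requirement (dropping privileges, changing resource limits, closing descriptors) is one more ordinary call in the child branch, which is the flexibility being defended here.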
posix_spawn is stupid as a system call
Posted Nov 6, 2009 9:07 UTC (Fri) by epa (subscriber, #39769)
You're right, others pointed out the same thing; no single system call can handle all the things you might want to set up in the child process before exec()ing.
But that said, why does the whole child process (including, potentially, a complete copy of its parent's core pages, all ready to be written to) need to be created just to set a few uids or open some files? Perhaps it would work better to first prepare a new process structure, then set uids and open files for it, and as the last stage breathe life into it by giving a file to exec(). For example
    pid_t child = new_waiting_process();
    // Now child is an entry in the process table, but it is not running.
    // Use the p_ variants of some system calls to set things up for
    // this child process.
    p_setuid(child, uid);
    p_close(child, 0);
    p_open(child, "infile", O_RDONLY);
    // Finished setup, start it running.
    p_exec_and_start(child, "/bin/cat");
    waitpid(child, NULL, 0);
This would give almost the same flexibility, but without the need to overcommit memory. The kernel would just need to create a new process in a not-runnable state, and the p_whatever system calls allow performing operations on another process rather than yourself. (Of course they would only allow manipulating your own not-yet-started child process, except perhaps for root.)
A process created with new_waiting_process() would inherit its parent's file descriptors, current directory, environment and so on as for fork(), but it would not inherit the parent's core.
posix_spawn is stupid as a system call
Posted Nov 6, 2009 10:07 UTC (Fri) by helge.bahmann (subscriber, #56804)
The idea in itself is workable, but the number of system calls you have to
duplicate is _huge_. It would perhaps be easier to create an "almost
empty" process image (with at least one stack and executable code page set
up) in suspended state, and then use ptrace or something similar to inject
system calls into the new process image -- this is tricky, but at least
the kernel is not burdened with an exploding number of system calls.
Alternatively, you could also provide a "fork" variant that explicitly
declares which pages of the address space are to be COWed into the new
process (if you are extra-smart, all you ever need to COW are the stack
pages, but calling library functions before execve is probably going to
spoil that -- but then, finding out which pages a library requires is by
no means easier, so you have to exercise a lot of discipline).
Might be an interesting research project to attempt any of the above in
Linux :)
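Linux already has a related, though inverted, knob: madvise() with MADV_DONTFORK marks a range that a forked child should not inherit at all. It excludes pages rather than listing the ones to copy, so it is not the proposal above, only an existing neighbour of it. A minimal Linux-specific sketch, with the hypothetical helper name `region_hidden_from_child`:

```c
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS and MADV_DONTFORK on glibc */
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one page, mark it MADV_DONTFORK, fork, and let the child touch it.
 * Returns 1 if the child was killed by SIGSEGV (the page was not
 * inherited), 0 if the child could read it, -1 on error. */
int region_hidden_from_child(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 1;               /* make sure the page really exists */

    if (madvise(p, len, MADV_DONTFORK) != 0) {
        munmap(p, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: the mapping is absent here, so this read should fault. */
        unsigned char c = *(volatile unsigned char *)p;
        _exit(c ? 0 : 1);
    }

    int status;
    int rc = waitpid(pid, &status, 0);
    munmap(p, len);
    if (rc < 0)
        return -1;
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) ? 1 : 0;
}
```

Pages excluded this way never need copy-on-write backing in the child, but as the comment notes, working out which pages a child can safely live without is the hard part.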
posix_spawn is stupid as a system call
Posted Nov 6, 2009 13:51 UTC (Fri) by nix (subscriber, #2304)
You could reduce the set of necessary syscalls to one:
int masquerade_as (pid_t pid)
which issues syscalls in 'pid' instead of the current process. ('pid' is a
process you'd be allowed to ptrace, so immediate children are permitted).
This is a per-thread attribute, and passing a pid of 0 flips back to the
parent again.
Then all you need is this (ignoring error checking just as the OP did,
what a horrible name that new_waiting_process() has got, vvfork() would
surely be better):
Note the subtleties here: execution always continues after execve()
because the execve() was done to another process image. Non-syscalls are
very dangerous to run because they might update userspace storage in the
wrong process: we'd really need support for this in libc for it to be
usable.
(In practice this latter constraint destroys the whole idea no matter how
good it might be: Ulrich would say no, as he does to every idea anyone
else originates. Personally I suspect this idea sucks in any case :) )
posix_spawn is stupid as a system call
Posted Nov 8, 2009 21:26 UTC (Sun) by epa (subscriber, #39769)
From a purist point of view, all these 'new' calls are generalizations of the existing ones taking an extra pid argument, so they can just replace them, with the old ones provided by the C library; of course in the real world there is such a thing as backward compatibility :-p.
posix_spawn is stupid as a system call
Posted Nov 8, 2009 23:34 UTC (Sun) by nix (subscriber, #2304)
Yeah, breaking the entire installed base of Linux apps would probably be a
*bad* move :) I think, if you wanted to do this, you'd have to introduce a
huge pile of new syscalls and reimplement the old ones as thin wrappers
(inside the kernel so as not to force everyone to upgrade glibc) calling
the new ones.
posix_spawn is stupid as a system call
Posted Nov 23, 2009 15:08 UTC (Mon) by jch (guest, #51929)
This is analogous to the *at system calls (openat, fstatat, ...) that have been introduced in Linux and included in the latest revision of POSIX.
A suggestion
Posted Nov 12, 2009 5:17 UTC (Thu) by jlmassir (guest, #48904)
Maybe then the solution to this problem would be:
1. Never allow overcommit when calling malloc
2. Allow overcommit on fork/exec, but kill the child process if it tries to
write to more than 10% of its virtual size.
This way, buggy programs that malloc too much memory and never use it
would be fixed and fork bombs would be killed, while still allowing
system calls between fork and exec.
What do you think?
A suggestion
Posted Nov 14, 2009 20:52 UTC (Sat) by Gady (subscriber, #1141)
Killing the child process if it uses more than 10% is kinda cruel. There are no rules against the child doing that. What should be done is that in this case the memory is allocated, and if that cannot be done, then the child is killed.
A suggestion
Posted Nov 15, 2009 20:03 UTC (Sun) by jlmassir (guest, #48904)
Killing a child if there is no memory for a fork-exec is kinda cruel.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 18:27 UTC (Thu) by nix (subscriber, #2304)
OK, so granted that... how do you plan to prevent processes allocating
stack space? A process with a lot of threads, mostly idle, could easily be
using gigabytes for stack address space, all potentially allocatable, but
only actually be using a tiny fraction of that (4K out of every 8MB chunk,
say).
So overcommit doesn't just break programs that use fork/exec under high
load, forcing failure far sooner than necessary: it breaks programs that
use threads in the same way. Doesn't leave much, does it...
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 18:28 UTC (Thu) by nix (subscriber, #2304)
That is to say, *disabling* overcommit breaks these things. I hate negated
emphatic terms :/
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 9:10 UTC (Fri) by epa (subscriber, #39769)
Somebody else will have to suggest a possible answer to the stack space problem :-(. It might not be possible to turn off overcommit entirely for desktop systems. But anything that can be done to make overcommit happen less often - or, equally, to make strict allocation usable for a normal workload - narrows the gap between what the kernel promises and what it can deliver, and makes the OOM killer less likely to run.
(Doesn't a process have some way to specify the max. stack size that it will use for each thread?)
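On the closing question: POSIX threads do let a program choose the stack size per thread, through pthread_attr_setstacksize(). A minimal sketch follows; the helper name `run_with_stack_size` and the sizes used in the test are illustrative assumptions, and older toolchains may need the -pthread flag to build it.

```c
#include <pthread.h>
#include <stddef.h>

/* Thread body that does nothing, standing in for a mostly idle worker. */
static void *idle_thread(void *arg)
{
    return arg;
}

/* Start one thread with an explicit stack size and wait for it.
 * Returns 0 on success or an errno-style error code. */
int run_with_stack_size(size_t stack_bytes)
{
    pthread_attr_t attr;
    pthread_t tid;

    int err = pthread_attr_init(&attr);
    if (err != 0)
        return err;

    /* Ask for a specific stack size instead of the system default;
     * sizes below PTHREAD_STACK_MIN are rejected with EINVAL. */
    err = pthread_attr_setstacksize(&attr, stack_bytes);
    if (err == 0)
        err = pthread_create(&tid, &attr, idle_thread, NULL);
    pthread_attr_destroy(&attr);
    if (err != 0)
        return err;

    return pthread_join(tid, NULL);
}
```

This only helps when the program knows its stack needs in advance, and it has to be adopted thread by thread, so it narrows the stack-space problem rather than removing it.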