posix_spawn is stupid as a system call

Posted Nov 5, 2009 13:10 UTC (Thu) by helge.bahmann (subscriber, #56804)
In reply to: The existence of OOM is one of the few really stupid things in Linux by epa
Parent article: Toward a smarter OOM killer

How many parameters do you want to spend on a "unified fork+exec" system
call?

fork + set*uid/set*gid + exec
fork + chroot + chdir + exec
fork + close(write_side_of_pipe/read_side_of_pipe) + dup2() + exec
fork + open(arbitrary set of files) + exec
fork + sched_setscheduler/sched_setparam/sched_setaffinity... + exec
fork + personality + exec
fork + setpgid + exec

Not to mention the various clone flags (fs namespace etc.) and any wild
combination of the above.

I think I have needed all of the above in various circumstances, sometimes
two or three things between fork+exec at a time.
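
To make the pipe/dup2 entry in the list above concrete, here is a minimal sketch in C. The helper name run_and_capture is made up for illustration; it runs a command with its stdout connected to a pipe and collects the output in the parent. Everything between fork() and execvp() is ordinary code running in the child, which is the flexibility being argued for here.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (name chosen for this sketch): run argv[0] with its
 * stdout connected to a pipe and copy up to cap-1 bytes of its output into
 * buf. Returns the child's exit status, or -1 on error. */
int run_and_capture(char *const argv[], char *buf, size_t cap)
{
    int fds[2];
    if (cap == 0 || pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: the setup steps between fork and exec. */
        close(fds[0]);                      /* read side is not needed here */
        if (dup2(fds[1], STDOUT_FILENO) < 0)
            _exit(127);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127);                         /* only reached if exec failed */
    }

    /* Parent: close the write side so read() sees EOF once the child exits. */
    close(fds[1]);
    size_t used = 0;
    ssize_t n;
    while (used < cap - 1 &&
           (n = read(fds[0], buf + used, cap - 1 - used)) > 0)
        used += (size_t)n;
    buf[used] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Adding a chdir(), setpgid() or setuid() call would just be one more line in the child branch, with no change to any interface.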

Bonus question: how many more parameters do you want to add to a "combined
fork/exec" syscall to make it future-proof for other things that might
need to be done before the new process image is executed?



posix_spawn is stupid as a system call

Posted Nov 5, 2009 13:45 UTC (Thu) by madscientist (subscriber, #16861)

Exactly. The number of useful and important things you can, want, and need to do between a fork and an exec is far too great to make it one call, in all but the simplest cases. I had this _exact_ argument with a Windows API proponent who felt that fork+exec was crazy (not related to OOM issues, just having more than one call). But the number of flags and arguments you'd need to replicate that behavior is insane and you'd STILL run into situations where you can't do what you want.

Fork+exec definitely has its downsides in some of the technical implementation requirements, but from a higher level language perspective it's brilliant.

posix_spawn is stupid as a system call

Posted Nov 6, 2009 9:07 UTC (Fri) by epa (subscriber, #39769)

You're right, others pointed out the same thing; no single system call can handle all the things you might want to set up in the child process before exec()ing.

But that said, why does the whole child process (including, potentially, a complete copy of its parent's core pages, all ready to be written to) need to be created just to set a few uids or open some files? Perhaps it would work better to first prepare a new process structure, then set uids and open files for it, and as the last stage breathe life into it by giving it a file to exec(). For example:

    pid_t child = new_waiting_process();
    // Now child is an entry in the process table, but it is not running.
    // Use the p_ variants of some system calls to set things up for
    // this child process.
    p_setuid(child, uid);
    p_close(child, 0);
    p_open(child, "infile");
    // Finished setup, start it running.
    p_exec_and_start(child, "/bin/cat");
    waitpid(child, NULL, 0);

This would give almost the same flexibility, but without the need to overcommit memory. The kernel would just need to create a new process in a not-runnable state, and the p_whatever system calls allow performing operations on another process rather than yourself. (Of course they would only allow manipulating your own not-yet-started child process, except perhaps for root.)

A process created with new_waiting_process() would inherit its parent's file descriptors, current directory, environment and so on as for fork(), but it would not inherit the parent's core.
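
For comparison, posix_spawn() already works in roughly this style for a fixed set of operations: the parent records "file actions" that are carried out in the child before the exec. The sketch below mirrors the cat example using only standard calls; the helper name spawn_cat is made up for illustration. Note that there is no file action or spawn attribute for an arbitrary setuid(), which is exactly the limitation raised earlier in the thread.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper (name chosen for this sketch): run /bin/cat with its
 * stdin redirected from `infile`. Returns the child's exit status, or -1
 * if the child could not be started or waited for. */
int spawn_cat(const char *infile)
{
    posix_spawn_file_actions_t fa;
    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;

    /* Recorded now, performed in the child before the exec: descriptor 0 is
     * replaced by `infile`, which covers both the close(0) and the open(). */
    int rc = posix_spawn_file_actions_addopen(&fa, STDIN_FILENO,
                                              infile, O_RDONLY, 0);

    pid_t child = -1;
    char *argv[] = { "/bin/cat", NULL };
    if (rc == 0)
        rc = posix_spawn(&child, "/bin/cat", &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    if (rc != 0)
        return -1;

    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Whether a failed file action shows up as an error return from posix_spawn() or as a nonzero exit status of the child depends on the C library, so callers should treat any nonzero result as failure.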

posix_spawn is stupid as a system call

Posted Nov 6, 2009 10:07 UTC (Fri) by helge.bahmann (subscriber, #56804)

The idea in itself is workable, but the number of system calls you have to
duplicate is _huge_. It would perhaps be easier to create an "almost
empty" process image (with at least one stack and executable code page set
up) in suspended state, and then use ptrace or something similar to inject
system calls into the new process image -- this is tricky, but at least
the kernel is not burdened with an exploding number of system calls.

Alternatively, you could also provide a "fork" variant that explicitly
declares which pages of the address space are to be COWed into the new
process (if you are extra-smart, all you ever need to COW are the stack
pages, but calling library functions before execve is probably going to
spoil that -- but then, finding out which pages a library requires is by
no means easier, so you have to exercise a lot of discipline).
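
Linux already offers something that approaches this from the opposite direction: madvise() with MADV_DONTFORK marks a range that fork() children do not inherit, so rather than declaring which pages to COW you declare which ones to leave out. The following is a minimal, Linux-specific sketch, not a recommendation; the helper name is made up for illustration, and it uses mincore() in the child purely as a way to check that the range is gone.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (name chosen for this sketch): map one anonymous
 * page, exclude it from fork with MADV_DONTFORK, and report whether the
 * child still has it mapped.
 * Returns 1 if the page is absent in the child, 0 if present, -1 on error. */
int page_excluded_from_child(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0)
        return -1;
    size_t len = (size_t)pagesz;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 1;                       /* touch the page so it really exists */

    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, len);
        return -1;
    }
    if (pid == 0) {
        unsigned char vec;
        /* mincore() fails with ENOMEM when the range is not mapped. */
        int absent = (mincore(buf, len, &vec) == -1 && errno == ENOMEM);
        _exit(absent ? 0 : 1);
    }

    int status;
    pid_t w = waitpid(pid, &status, 0);
    munmap(buf, len);
    if (w < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

This only helps for regions the program knows about and can mark itself; it does nothing for the heap and library data the comment above is worried about.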

Might be an interesting research project to attempt any of the above in
Linux :)

posix_spawn is stupid as a system call

Posted Nov 6, 2009 13:51 UTC (Fri) by nix (subscriber, #2304)

You could reduce the set of necessary syscalls to one:

    int masquerade_as(pid_t pid);

which issues syscalls in 'pid' instead of the current process. ('pid' is a
process you'd be allowed to ptrace, so immediate children are permitted).
This is a per-thread attribute, and passing a pid of 0 flips back to the
parent again.

Then all you need is this (ignoring error checking just as the OP did; and
what a horrible name new_waiting_process() is, vvfork() would surely be
better):

    pid_t child = new_waiting_process();
    masquerade_as(child);
    setuid(uid);
    close(0);
    open("infile", O_RDONLY);
    // Finished setup, start it running.
    execve("/bin/cat", (char *[]){ "/bin/cat", NULL }, environ);
    masquerade_as(0);
    waitpid(child, NULL, 0);

Note the subtleties here: execution always continues after execve()
because the execve() was done to another process image. Non-syscalls are
very dangerous to run because they might update userspace storage in the
wrong process: we'd really need support for this in libc for it to be
usable.

(In practice this latter constraint destroys the whole idea no matter how
good it might be: Ulrich would say no, as he does to every idea anyone
else originates. Personally I suspect this idea sucks in any case :) )

posix_spawn is stupid as a system call

Posted Nov 8, 2009 21:26 UTC (Sun) by epa (subscriber, #39769)

From a purist point of view, all these 'new' calls are generalizations of the existing ones taking an extra pid argument, so they can just replace them, with the old ones provided by the C library; of course in the real world there is such a thing as backward compatibility :-p.

posix_spawn is stupid as a system call

Posted Nov 8, 2009 23:34 UTC (Sun) by nix (subscriber, #2304)

Yeah, breaking the entire installed base of Linux apps would probably be a
*bad* move :) I think, if you wanted to do this, you'd have to introduce a
huge pile of new syscalls and reimplement the old ones as thin wrappers
(inside the kernel so as not to force everyone to upgrade glibc) calling
the new ones.

posix_spawn is stupid as a system call

Posted Nov 23, 2009 15:08 UTC (Mon) by jch (guest, #51929)

This is analogous to the *at system calls (openat, fstatat, ...) that have been introduced in Linux and included in the latest revision of POSIX.

A suggestion

Posted Nov 12, 2009 5:17 UTC (Thu) by jlmassir (guest, #48904)

Maybe then the solution to this problem would be:

1. Never allow overcommit when calling malloc
2. Allow overcommit on fork/exec, but kill the child process if it tries to
write to more than 10% of its virtual size.

This way, buggy programs that malloc too much memory and never use it
would be fixed and fork bombs would be killed, while still allowing
system calls between fork and exec.

What do you think?

A suggestion

Posted Nov 14, 2009 20:52 UTC (Sat) by Gady (subscriber, #1141)

Killing the child process if it uses more than 10% is kinda cruel. There are no rules against the child doing that. What should be done is that in this case the memory is allocated, and if that cannot be done, then the child is killed.

A suggestion

Posted Nov 15, 2009 20:03 UTC (Sun) by jlmassir (guest, #48904)

Killing a child if there is no memory for a fork-exec is kinda cruel.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds