
Process creation in io_uring

By Jonathan Corbet
December 20, 2024
Back in 2022, Josh Triplett presented a plan to implement a "spawn new process" functionality in the io_uring subsystem. There was a fair amount of interest at the time, but developers got distracted, and the work did not progress. Now, Gabriel Krisman Bertazi has returned with a patch series updating and improving Triplett's work. While interest in this functionality remains, it may still take some time before it is ready for merging into the mainline.

A new process in Linux is created with one of the variants of the clone() system call. As its name suggests, clone() creates a copy of the calling process, running the same code. Much of the time, though, the newly created process quickly calls execve() or execveat() to run a different program, perhaps after performing a bit of cleanup. There has long been interest in a system call that would combine these operations efficiently, but nothing like that has ever found its way into the Linux kernel. There is a posix_spawn() function, but that is implemented in the C library using clone() and execve().

Arguably, part of the problem is that, while the clone()-to-execve() pattern is widespread, the details of what happens between those two calls can vary quite a bit. Some files may need to be closed, signal handling changed, scheduling policies tweaked, environment adjusted, and so on; the specific pattern will be different for every case. posix_spawn() tries to provide a general mechanism to specify these actions but, as can be seen by looking at the function's argument list, it quickly becomes complex.

Io_uring, meanwhile, is primarily thought of as a way of performing operations asynchronously. User space can queue operations in a ring buffer; the kernel consumes that buffer, executes the operations asynchronously, then puts the results into another ring buffer (the "completion ring") as each operation completes. Initially, only basic I/O operations were supported, but the list of operations has grown over the years. At this point, io_uring can be thought of as a sort of alternative system-call interface for Linux that is inherently asynchronous.

An important io_uring feature, for the purposes of implementing something like posix_spawn(), is the ability to create chains of linked operations. When the kernel encounters a chain, it will only initiate the first operation; the next operation in the chain will only run after the first completes. The failure of an operation in a chain will normally cause all remaining operations to be canceled, but a "hard link" between two operations will cause execution to continue regardless of the success of the first of the two. Linking operations in this way essentially allows simple programs to be loaded into the kernel for asynchronous execution; these programs can run in parallel with any other io_uring operations that have been submitted.

The new patch set creates two new io_uring operations, each with some special semantics. The first of those is IORING_OP_CLONE, which causes the creation of a new process to execute any operations that follow in the same chain. In a difference from a full clone() call, though, much of the calling task's context is unavailable to the process created by IORING_OP_CLONE. Without that context, io_uring operations in the newly created process can no longer be asynchronous; every operation in the chain must complete immediately, or the chain will fail. In practice, that means that operations like closing files can be executed, but complicated I/O operations are no longer possible. Krisman hopes to be able to at least partially lift that constraint in the future.

Once the chain completes, the new process will be terminated, with one important exception: if it invokes the second new operation, IORING_OP_EXEC, which performs the equivalent of an execveat() call, replacing the running program with a new executable. At this point, the new process is completely detached from the original, is running its own program, and the processing of the io_uring chain is complete; the process will, rather than being terminated, go off to run the new program. Placing any other operations after IORING_OP_EXEC in the chain usually makes no sense; any operations after a successful IORING_OP_EXEC will be canceled. It also does not make sense to use IORING_OP_EXEC in any context other than a new process created with IORING_OP_CLONE, so that usage is not allowed.

There is one case where it can be useful to link operations into the chain after IORING_OP_EXEC — efficiently implementing a path search in the kernel. Often, the execution of a new program involves searching for it in a number of directories, usually specified by the PATH environment variable. One way of doing this in the io_uring context, as shown in this test program, is to enqueue a series of IORING_OP_EXEC operations, each trying a different location in the path. If hard links are used to chain these operations, execution will continue past failed operations until the one that actually finds the target program succeeds; after that, any subsequent operations will be discarded. The entire search runs in the kernel, without the need to repeatedly switch between kernel and user space.
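In outline, that chain looks something like the following pseudocode; IORING_OP_CLONE and IORING_OP_EXEC come from the proposed (unmerged) patch set, and the helper names here are invented for illustration:

```
// Pseudocode sketch of the proposed API, not buildable on mainline.
sqe = get_sqe(ring);
prep_clone(sqe);                    // IORING_OP_CLONE: create the new process
sqe->flags |= IOSQE_IO_LINK;

for each dir in $PATH:
    sqe = get_sqe(ring);
    prep_exec(sqe, dir + "/prog", argv, envp);   // IORING_OP_EXEC
    sqe->flags |= IOSQE_IO_HARDLINK;  // continue past ENOENT failures

submit(ring);
// The first IORING_OP_EXEC that finds the program succeeds; the
// remaining operations are discarded and the new process runs it.
```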

Most of the comments on the proposal so far have come from Pavel Begunkov, who has expressed some concerns about it. He did not like some aspects of the implementation, the special quirks associated with IORING_OP_CLONE and the process it creates, and the use of links, "which already a bad sign for a bunch of reasons" (he did not specify what the reasons are). He suggested that io_uring might not be the best place for this functionality; perhaps a list of operations could be passed to a future version of clone() instead, mirroring how the posix_spawn() interface works.

Krisman answered that combining everything into a single system call would add complexity while making the solution less flexible. Io_uring makes it easy to put together a set of operations to be run in the kernel in an arbitrary order. The hope is to increase the set of possible operations over time, enabling the implementation of complex logic for the spawning of a new task. It is hard to see how combining all of this functionality into a single system call could work as well.

In any case, this is early-stage work; getting it to a point where it can be considered for the mainline will require smoothing a number of the rough edges and reducing the number of limitations. It will also certainly require wider review; this work is proposing a significant addition to the kernel's user-space ABI that would have to be supported indefinitely. The developers involved will surely want to get the details right before committing to that support.

Index entries for this article
Kernel: io_uring



BPF!

Posted Dec 20, 2024 16:32 UTC (Fri) by willy (subscriber, #9762) [Link] (23 responses)

Clearly the right solution is to load a BPF program into the kernel to do the clone and setup.

/s in case it wasn't clear.

BPF!

Posted Dec 20, 2024 17:47 UTC (Fri) by gutschke (subscriber, #27910) [Link] (20 responses)

I am not even sure the "/s" is warranted.

clone()/exec() is a very powerful pattern that nicely fits in with how POSIX has designed its API. The ability to customize the newly launched process prior to loading the binary is crucial in a lot of scenarios. And I don't see that going away.

But ever since the advent of threads (and possibly even in the presence of signals), this has gotten incredibly difficult to do correctly. There are just too many subtle race conditions that involve hidden state in the various run-time libraries or even in the dynamic link loader. If there were a way to do everything that you can currently do with system calls from user space, but moved entirely into the kernel, most of these problems would immediately go away. So, I see a lot of value in being able to call clone() and exec() from a BPF program, or maybe from io_uring. The elephant in the room with BPF is that this new API would then likely be limited to privileged processes.

You can approximate a solution in user space by very carefully picking which system calls you invoke, and by avoiding any calls into libc, including accidental calls into the dynamic link loader. This requires some amount of assembly code to be 100% reliable. It's very tedious and extremely fragile. It is often not worth the effort, and instead you have to live with the occasional random crash.

In some cases, a possible work-around is to launch a "zygote" helper process that executes before any threads are created. The latter is difficult to ensure though, as some libraries create threads when they are loaded into memory.

BPF!

Posted Dec 20, 2024 19:05 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (17 responses)

> clone()/exec() is a very powerful pattern that nicely fits in with how POSIX has designed its API. The ability to customize the newly launched process prior to loading the binary is crucial in a lot of scenarios. And I don't see that going away.

POSIX's API is badly designed. clone() creates a copy of the entire VM and then just discards it. It's a lot of wasted work.

A better API would create an "empty shell" suspended process, then the calling process can poke it (using FD-based APIs), and finally un-suspend it. There's a strange aversion in Linux/UNIX land to this model (it's too sane), so we get closer and closer to it with these kinds of workarounds.

BPF!

Posted Dec 20, 2024 19:49 UTC (Fri) by epa (subscriber, #39769) [Link]

It’s not only wasted work, but it makes it hard not to overcommit memory (at least in the case of a full fork()). If a process with a gigabyte of address space forks, requiring a gigabyte of free memory is far too cautious if it will exec() shortly afterwards, yet if you assume it always exec()s you will get caught out if the child process starts to use the memory you promised it.

BPF!

Posted Dec 21, 2024 1:08 UTC (Sat) by comex (subscriber, #71521) [Link] (3 responses)

clone() is not POSIX. POSIX includes fork(), vfork(), and posix_spawn().

BPF!

Posted Dec 21, 2024 1:47 UTC (Sat) by gutschke (subscriber, #27910) [Link] (2 responses)

posix_spawn() is well-intentioned, but it doesn't really address the main problem with all of these APIs. As far as I can tell, POSIX doesn't guarantee that posix_spawn() is thread-safe. And when I looked at the source code (admittedly, this was years ago), the implementation in glibc most definitely didn't make any effort to ensure thread safety. Also, posix_spawn() is just too limited to be a general solution. It's a fine response to the problem of Windows not having a fork()/exec() API. But it isn't really a solution for safely starting processes from any context.

fork() is a decent general solution for single-threaded applications, and that's why we've been using it for so many decades. The kernel-level API is amenable to writing thread-safe code using fork()/exec(). But that requires that, after fork() returns in the child, no further entries into any libraries are allowed. In fact, I am not even convinced that it is always safe to call the glibc version of fork() instead of making a direct system call.

Both the various wrappers that glibc puts around system calls, and the hidden invocations of the dynamic link loader, are potential sources of deadlocks or crashes. Depending on how your program has been linked, this can even mean that you can no longer access any global symbols. Everything has to be on the local stack.

The upshot of all of this is that you not only need to carefully screen the system calls that you want to make for potential process-wide side-effects, you also have to call them from inlined assembly instead of deferring to glibc. In addition, fork() only really works with memory over-committing enabled, and for large programs this system call can be expensive.

vfork() solves the over-committing problem, but it requires even more careful programming. I don't see how it can be made to work in a fully portable fashion, but it probably is the best solution for code that should run on more than just Linux. Some amount of porting to different OSes will be involved if you need to spawn a new process from within a multi-threaded environment.

clone() is the pragmatic solution. Once you come to the realization that this code is impossible to implement within the constraints of POSIX alone, you might as well take advantage of everything that Linux can provide to you. It's going to be hairy code to write, but there really is no way around it. Also, just to point out the obvious, the glibc wrapper around clone() is completely unsuitable for the purposes of what we need here. But a direct system call will work fine.

Of course, in 99% of the cases, you won't hit any of the race conditions. They are a little tricky to trigger accidentally, and a lot of them are relatively benign. Who cares about an occasional errno value that isn't set correctly, or a file descriptor that sometimes leaks to a child process? Only in very rare cases will you trigger a deadlock, crash, or worse. So, many programs simply don't bother, and nobody ever notices that the code is buggy. It's the really big programs that everyone uses that need to worry about these things, as you suddenly have millions of running instances and countless numbers of spawned processes. If there is a way for something to go wrong, it eventually will.

A zygote process is a time-tested alternative. And that's great, assuming you can modify the startup phase of the program. If you can guarantee that your code executes before any threads are created, then a zygote that is fork'd() proactively will avoid all of these complications. But with bigger pieces of software that rely on lots of third-party libraries, that's not always feasible. These days, you should assume that all code is always multi-threaded -- if only because the graphics libraries decide to start threads as soon as they get linked into the program, or something similarly frustrating.

BPF!

Posted Dec 21, 2024 15:18 UTC (Sat) by khim (subscriber, #9252) [Link]

> A zygote process is a time-tested alternative.

Zygote solves an entirely different problem: how to start not one process, but many processes while executing an initialization part only once.

It works, but that's an entirely different task.

> vfork() solves the over-comitting problem, but it requires even more careful programming. I don't see how it can be made to work in a fully portable fashion, but it probably is the best solution for code that should run on more than just Linux.

It's also the simplest way to do everything reliably and efficiently on Linux.

For some unfathomable reason, everyone's attention is on an unsolvable problem: how to prepare a new process's state using remnants of the old code that is intertwined with the state of your program.

Just ditch all that! Start from a clean state! Create new setup code, push whatever you need/want in there, then execute vfork()/exec() (with zero steps between them, using fexecve()) and voilà: no races, no possibility of corrupting anything; everything is very clear, simple, and guaranteed.

The only downside: you have to develop that in an arch-dependent way… but so what? If you compare that to the insane amount of effort one would need to support all these bazillion zygote-based solutions, then adding some kind of portable wrapper with arch-dependent guts, even for just the 3-4 most popular architectures, is not too hard.

Best property of that solution: it's not supposed to be perfect! If you would find out that it doesn't work – nobody stops you from redoing that portable API and adding or removing something to it. Because you ship it with your code or in a shared library it's replaceable without any in-kernel politics.

P.S. I think it can be called “double-exec” solution, and it requires Linux-specific syscalls, but the best part: all these syscalls are already there and are not even especially new.

BPF!

Posted Dec 27, 2024 1:52 UTC (Fri) by alkbyby (subscriber, #61687) [Link]

I predict this will be an interesting discussion.

Can you please elaborate more specifically on the thread-unsafety of posix_spawn() implementations? POSIX might not explicitly say that posix_spawn() is safe to use in multi-threaded programs, but its main purpose is clearly to fix fork()'s problems with threads. So it has to be MT-safe.

Fork+exec and threads are too unsafe in practice. Even our esteemed editor made an error above. Here: "details of what happens between those two calls can vary quite a bit. <skipped> environment adjusted, and so on".

Thing is, updating process environment (e.g. via setenv) typically invokes malloc. And calling malloc in-between fork and exec is unsafe.

As per posix (quoting from man 3posix fork): "If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called."

In practice, malloc implementations go to some lengths to make malloc() possible after fork() by carefully setting up pthread_atfork() or alternatives. But this is a big enough can of worms. And, for example, "abseil" tcmalloc explicitly doesn't (https://github.com/search?q=repo%3Agoogle%2Ftcmalloc%20at...). As per Google's internal policy, pthread_atfork() is forbidden (which is another, but somewhat related, topic).

So posix_spawn() is the right thing IMO. And any exotic process-setup things that might be missing in your favorite libc (e.g. stuff like unshare/prctl) you can always do in a small helper program. You exec into it. It gets a clean slate, and can do whatever syscalls and mallocs and whatnot, single-threadedly. And then it execs into the real thing.

As for the original discussion, I am really hoping io_uring is kept only for perf-critical stuff. Spawning children isn't.

Empty shell

Posted Dec 21, 2024 4:58 UTC (Sat) by IAmLiterallyABee (subscriber, #144892) [Link]

> A better API would create an "empty shell" suspended process, then the calling process can poke it (using FD-based APIs), and finally un-suspend it

IIRC, Fuchsia does something like that
https://fuchsia.dev/fuchsia-src/reference/kernel_objects/...

BPF!

Posted Dec 22, 2024 7:30 UTC (Sun) by joib (subscriber, #8541) [Link] (8 responses)

There was a nice article a few years ago that describes the problems with fork+exec, and indeed ends up recommending something like what you describe as a potential solution. https://www.microsoft.com/en-us/research/uploads/prod/201...

Generalizing system calls to operate on other processes

Posted Dec 27, 2024 17:01 UTC (Fri) by epa (subscriber, #39769) [Link] (3 responses)

That's a great read. I particularly liked the idea of making a system call that would apply to another process. So all system calls get an additional process id argument (or a pidfd, I guess) and, where reasonably possible, you are allowed to call them to apply to another process, as long as it's one of your children and executing as the same user, or you are root.

That means instead of fork() and in the child process opening file handles, the parent process could take care of all this. Create the child, which is initially not schedulable, make any system calls you want to set up the child's execution environment, then finally an exec_pidfd() to apply to the child process and mark it schedulable. That's a great way to apply "the Unix philosophy", composing the existing simple system calls rather than creating a kitchen sink like posix_spawn(), while avoiding the known problems of forking.

Existing code should be translatable to the new scheme without too much trouble. (Indeed you could even have a fork emulation layer in the C library which, on returning from fork() or vfork(), acts as though you were in the child process, so that calling open() actually calls into open_pidfd(), and then the final exec() call runs exec_pidfd() and then continues with the parent process's code. That's a bit kooky but might be a quick way to migrate older code which just wants to spawn a subprocess.)

Generalizing system calls to operate on other processes

Posted Dec 28, 2024 14:01 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (2 responses)

> Indeed you could even have a fork emulation layer in the C library which, on returning from fork() or vfork(), acts as though you were in the child process, so that calling open() actually calls into open_pidfd(), and then the final exec() call runs exec_pidfd() and then continues with the parent process's code.

How would that work? The "magic" of `fork()` (and related functions) is that it returns twice: once with a `0` return value and once with a `pid`. How would a library do any kind of emulation to allow taking *both* sides of the `if` condition it (eventually) leads to without some kind of language magic?

Generalizing system calls to operate on other processes

Posted Dec 28, 2024 14:11 UTC (Sat) by daroc (editor, #160859) [Link]

You can write your own magical twice-returning functions with setjmp() and longjmp(), although (as always in C) there are caveats around using those correctly.

Generalizing system calls to operate on other processes

Posted Dec 28, 2024 20:58 UTC (Sat) by epa (subscriber, #39769) [Link]

I imagined it would first return from fork() in the parent process, returning zero so you think you are in the child process. But now, any call to open() is actually open_pidfd() applied to the child. And exec() is also redirected so that it calls exec_pidfd() for the child and then jumps back to the end of the fork() call, this time returning the child's pid, so the parent continues executing. That could maybe be done with setjmp/longjmp or with some even darker magic that the C standard library is able to perform, perhaps with the help of inline assembly.

I wouldn't be surprised if similar hacks have existed to help port Unix code to single-tasking operating systems like MS-DOS.

BPF!

Posted Dec 29, 2024 18:42 UTC (Sun) by ma4ris8 (subscriber, #170509) [Link] (3 responses)

Gitaly's experience of fork overhead was an amazing read:
https://about.gitlab.com/blog/2018/01/23/how-a-fix-in-go-...

Thus one of Go's performance secrets has been to use posix_spawn() since 2017.

Java on Linux moved from fork() to posix_spawn() around 2018, using "jspawnhelper" as a cleanup process:
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-...

Oracle's Java for Solaris took the change earlier, in 2013:
https://bugs.java.com/bugdatabase/view_bug?bug_id=5049299

Rust language: glibc sometimes uses posix_spawn():
https://kobzol.github.io/rust/2024/01/28/process-spawning...

The major problem with fork()/exec():
- Memory overcommit (see the Gitaly article): large programs clone resources during fork(). After exec(), that memory has to be cleaned up by the kernel and recycled for reuse.

The benefits of fork()/exec():
- Threaded processes: after fork(), the set of inherited file descriptors is vague. The forked process can investigate and clean up that state, so that the process doing exec() does not need to know about the caller.
- This benefit is actually a workaround: why leak file descriptors, only to search for and close them before exec()?
- Could the forking thread simply list the interesting set of fds, and skip copying the uninteresting ones?

The posix_spawn() design:
- The new process uses the parent's memory, so there is no memory overcommit. The calling thread sleeps until the new process calls exec(); the new process's thread must do as little work as possible and exec into a middle process, after which the calling thread continues. The middle helper process (jspawnhelper) closes leaked file descriptors, remaps stdin, stdout, and stderr, and then execs the final process.
- This still copies all file descriptors, so the fd cleanup must be done by a helper program. The speed increase comes from avoiding the memory copy-on-write work for big-memory programs.

Thus the optimal (BPF or io_uring) solution would be to declare what needs to be cloned, remapped, and changed. Best would be if nothing unnecessary needed to be created and then destroyed (by a middle process).

The world of big-memory programs (Go, Java) has already moved from fork() to posix_spawn().

So there is a big amount of kernel work to be avoided if the io_uring approach (and/or a BPF enhancement) can be used to avoid first cloning resources just to tear them down: launch a child process with given arguments, remap stdin, stdout, and stderr, pass some extra file descriptors (individual, or a tail range), and set the child process's working directory.

Instead of cloning everything, tearing it down, and making the calling thread sleep until the child process is launched, we could have something simple that defines (declaratively, programmatically, or a hybrid of those) the configuration for a new process, and does a clean process launch with a single (0-1) io_uring submit, without forcing the calling thread to sleep until the child process is launched.

BPF!

Posted Dec 29, 2024 18:56 UTC (Sun) by ma4ris8 (subscriber, #170509) [Link]

Here is Rust Maelstrom's analysis of the memory-usage overhead in fork().
It shows how the overhead rises as the calling process's memory mapping grows.
https://maelstrom-software.com/blog/spawning-processes-on-linux/

BPF!

Posted Dec 29, 2024 19:17 UTC (Sun) by bluca (subscriber, #118303) [Link]

> Thus one of Go's performance secrets is to use posix_spawn() since 2017.

I switched systemd to use pidfd_spawn() (which is posix_spawn() but with clone3(), getting back a pidfd instead of a pid and cloning into the target cgroup atomically) in v255 last year for similar reasons, as the copy-on-write trap overhead was hitting the Azure fleet hard. I should probably do a write-up about that at some point...

BPF!

Posted Dec 31, 2024 7:37 UTC (Tue) by izbyshev (guest, #107996) [Link]

> Thus one of Go's performance secrets is to use posix_spawn() since 2017.

On Linux, Go uses raw syscalls for almost all standard functionality, and Go programs usually don't even link to a C library. So, Go uses a vfork() equivalent followed by execve(), not the posix_spawn() library function [1].

> Linux Java from fork() into posix_spawn() near 2018

No, it migrated from vfork() [2] (which has been used by default since forever). So, overcommit issues weren't present in the first place.

The CPython issue [3] for migrating subprocess from fork() to vfork() contains a lot of useful links on the topic. In some parts, it's outdated [4].

[1] https://github.com/golang/go/blob/194de8fbfaf4c3ed54e1a3c...
[2] https://github.com/openjdk/jdk/commit/e21cb12d358c22350cb...
[3] https://github.com/python/cpython/issues/80004
[4] https://github.com/python/cpython/issues/113117

BPF!

Posted Dec 31, 2024 15:19 UTC (Tue) by surajm (subscriber, #135863) [Link]

That sounds like the fuchsia api. I'm sure it's popular on many microkernels as well.

Re: empty shell

Posted Jan 12, 2025 2:33 UTC (Sun) by chexo4 (subscriber, #169500) [Link]

Do you think we'll get there at some point? What do you mean by aversion, exactly? This seems like a really useful mechanism to me and I don't see why it wouldn't be a great option to have for the many cases where you don't want/need to inherit the parent process' address space.

BPF!

Posted Dec 20, 2024 19:33 UTC (Fri) by magfr (subscriber, #16052) [Link]

I have been intrigued by the BeOS variant since I first saw it.
They have some variant of posix_spawn() that can always be called, and they also have fork()/exec(), but allow those system calls only in single-threaded environments.

To further mess with people this clone abstraction isn't strong enough to handle all cases - I have a little variation on tee which forks, sets up the child as a daemon process which does the writing, and then execs in the parent in order to keep the parent/child link with the grandparent.
(The child terminates on end of input)

BPF!

Posted Dec 21, 2024 3:05 UTC (Sat) by geofft (subscriber, #59789) [Link]

> The elephant in the room with BPF is that this new API would then likely be limited to privileged processes.

Yeah, that was also my concern. It seems like people are not going to be comfortable making eBPF available to unprivileged users any time soon.

On the other hand, classic BPF is still around and is accessible to unprivileged users in a few ways, most notably via seccomp mode 2, but also by creating an unprivileged user+net namespace (allowed by default in the upstream kernel and in most but not all distros) and using it for its original purpose of packet filtering. Could you allow userspace to upload a cBPF program and some data for its use and have that be enough to make system calls?

I think my specific proposal would be to extend clone3()'s struct clone_args with three fields: a pointer to a cBPF program in user memory, and a pointer and length of memory to copy-on-write into the new process. So if you want traditional behavior for some reason, you can specify NULL and ~0 and deal with the overcommit issues of doing that, but more likely you just need a page or two of memory for the filename, argv, maybe the value of $PATH, and maybe some additional info like how to reorder file descriptors.

Add a new cBPF opcode BPF_SYSCALL that is only valid in this context, which makes the syscall stored in the BPF accumulator with the arguments in the BPF registers and returns a value to the accumulator. This syscall is treated as a real syscall (it is not eBPF's BPF_CALL, pointer arguments point to userspace, etc.). When it calls execve, normal behavior resumes. If the cBPF program returns instead of calling either execve or exit, then it returns to the userspace instruction pointer where clone3 was originally called, so you can use it just like a normal use of clone if you want. If that address is no longer mapped, the process dies with a segfault.

BPF!

Posted Dec 20, 2024 17:50 UTC (Fri) by edeloget (subscriber, #88392) [Link] (1 responses)

Right now, the commands really look like a set of instructions which are executed by a specific in-kernel VM, so my guess is more that with enough time, the complexity of the subsystem will grow enough to warrant the creation of an "io uring language" of some sort.

Which will /then/ be interpreted by a BPF program :)

BPF!

Posted Dec 20, 2024 18:08 UTC (Fri) by adobriyan (subscriber, #30858) [Link]

> in-kernel VM

It will be incomplete until it is possible to create new uring with uring interface
__attribute__((sarcasm)).

Why not just have a one-step spawn?

Posted Dec 20, 2024 18:44 UTC (Fri) by jbills (subscriber, #161176) [Link] (40 responses)

Dumb question: why can't we just have a single step function that starts a new process with a clean state without needing to do a whole load of operations in that process's context? Other operating systems get away with process creation without a magic dance.

Why not just have a one-step spawn?

Posted Dec 20, 2024 23:14 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (7 responses)

The basic problem is that, historically, the standard behavior is that "everything" is inherited, unless explicitly listed at [1]. If you change the rules now, lots of old libraries will not handle it gracefully. There are also a lot of awkward questions about miscellaneous process-wide state, such as the umask and working directory. Do those get "zeroed out" in some sensible way, or do you just inherit them?

The other basic problem is that systemd --user has already solved quite a lot of practical use cases anyway, so there is reduced motivation to expand the kernel's semantics when we already have code that works today.

[1]: https://pubs.opengroup.org/onlinepubs/9799919799/function...

Why not just have a one-step spawn?

Posted Dec 21, 2024 0:11 UTC (Sat) by jbills (subscriber, #161176) [Link] (2 responses)

I mean, it makes sense to keep the legacy API the way it is, but if we are designing an entirely new API surface in io_uring, why not do something better?

Why not just have a one-step spawn?

Posted Dec 21, 2024 1:18 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (1 responses)

The question is, if you build it, will they come? Not if it's flagrantly incompatible with everything... unless you combine it with exec and end up with posix_spawn, of course, but then you need umpteen different flags to tell posix_spawn how to do its job, which is not fun either.

Why not just have a one-step spawn?

Posted Dec 21, 2024 2:28 UTC (Sat) by josh (subscriber, #17465) [Link]

The design of the io_uring-based mechanism should allow using it to implement posix_spawn in many cases. (Some flags may require new uring operations.)

Why not just have a one-step spawn?

Posted Dec 21, 2024 15:25 UTC (Sat) by khim (subscriber, #9252) [Link] (3 responses)

> The basic problem is that, historically, the standard behavior is that "everything" is inherited, unless explicitly listed at.

And why is that an issue? A new process can always ditch whatever it doesn't need.

Heck, you may supply it with all the information needed to do that. You only need five syscalls: memfd_create/write/vfork/execveat/execveat.

No need for io_uring, BPF, or other madness; everything can be done entirely in userspace using syscalls that have existed for years and years.

Why not just have a one-step spawn?

Posted Dec 21, 2024 19:31 UTC (Sat) by roc (subscriber, #30627) [Link] (2 responses)

Writing arbitrary code into a memfd and then exec'ing it would get around security subsystems that try to prevent running unsigned/unvalidated binaries. So that's not a general-enough solution. Instead you would need to have a prebuilt helper binary (signed if necessary) that does the work based on some parameters. But then why not just inline the stub into the "carefully written threadsafe library code" to avoid the double exec? And that's basically posix_spawn() today.

You might say that the "users write arbitrary code into a memfd" part is essential for flexibility. Even if we ignore the security issues, it would be nasty to program directly. People would inevitably wrap it in some kind of tiny, portable virtual machine for users to express their setup code ... and then again, you can implement that approach without doing the double exec.

Why not just have a one-step spawn?

Posted Dec 21, 2024 20:09 UTC (Sat) by khim (subscriber, #9252) [Link] (1 responses)

> But then why not just inline the stub into the "carefully written threadsafe library code" to avoid the double exec?

Precisely because then your sign-verifying machinery couldn't verify your code. You are executing things in the context that's polluted by gigabytes of long-living code that may affect your carefully prepared binary.

Even if it was sign-verified and correct when the process was started, it's not guaranteed to stay sign-verified and correct by the time you [try to] execute it.

> Instead you would need to have a prebuilt helper binary (signed if necessary) that does the work based on some parameters.

You could do what Turbo Pascal did decades ago: concatenate the binary and the parameters for said binary into one executable.

So there would be a signed part and unsigned parameters; the signature can be easily checked when the binary is loaded, even if it's loaded from a memfd.

> and then again, you can implement that approach without doing the double exec.

It's possible in theory, but it's not done today. And it doesn't eliminate the issue of that code being corrupted and abused before the new binary is spawned.

And if we are not fighting that with the io_uring proposal, then I don't even have an idea what we are fighting for and against.

The big problem with the article we are discussing here is that it carefully describes the answer to some issues, but it entirely neglects to list the issues that we are trying to fix!

Not as bad as the infamous “42” as the answer to “the ultimate question of life, the universe, and everything”, but very close to it: sure, that's a mechanism with certain properties… but what is it trying to do? What's the problem that can be easily solved with it but is impossible to solve without it?

I have no idea, and as long as I have no idea I can't even say whether it's a good idea or not!

That's why I'm talking about “buzzword compliance”: if “hey, it's io_uring, new and shiny” is not the goal, then what is the goal? Where is that solution supposed to take us? And why couldn't we arrive there via simpler means?

Why not just have a one-step spawn?

Posted Dec 22, 2024 11:29 UTC (Sun) by ballombe (subscriber, #9523) [Link]

> I have no idea and as long as I have no idea I couldn't even say if that's a good idea or not!

Well, I am glad to see that this does not impair your ability to write essay-sized posts about it.

Why not just have a one-step spawn?

Posted Dec 21, 2024 0:18 UTC (Sat) by josh (subscriber, #17465) [Link] (28 responses)

I think this would be a good idea. At the very least, there might be value in having a CLONE_ flag that makes the new process have an empty memory map rather than inheriting the caller's memory map.

However, typically, you do want to inherit at least some state from the current process. You *have* to inherit permissions (though root could override them), you may want to inherit at least some file descriptors, and so on.

And in practice you *may* want the option of having access to your memory map before doing the exec, at least for some operations.

It might well be useful to have pidfd operations to set up a new process from an existing one, but there's value in batching those operations, in the style of uring.

Why not just have a one-step spawn?

Posted Dec 21, 2024 0:39 UTC (Sat) by willy (subscriber, #9762) [Link] (26 responses)

It's generally considered good form to have at least one text segment in your address space ... you can try to munmap(NULL, -1) if you want, but it will not end well.

Why not just have a one-step spawn?

Posted Dec 21, 2024 0:46 UTC (Sat) by josh (subscriber, #17465) [Link] (25 responses)

If you don't have any userspace (yet), and your userspace is going to get completely replaced by an execveat, what would go wrong if you have zero pages mapped in the address space?

Why not just have a one-step spawn?

Posted Dec 21, 2024 0:55 UTC (Sat) by willy (subscriber, #9762) [Link] (24 responses)

Oh, when you said CLONE flag, I thought you were talking about clone(). From your reply it seems like you're talking about some other operation where the caller operates on its child.

Why not just have a one-step spawn?

Posted Dec 21, 2024 1:12 UTC (Sat) by josh (subscriber, #17465) [Link] (23 responses)

I was talking about clone(), but I was imagining a mode in which you combine "no initial memory map" with "don't start running yet". You'd then do your setup remotely, and then make some pidfd call to allow the process to start running.

That would work well in the io_uring case too, where you could keep the pidfd in an in-ring file descriptor and do a series of operations on it.

Why not just have a one-step spawn?

Posted Dec 21, 2024 2:11 UTC (Sat) by gutschke (subscriber, #27910) [Link] (22 responses)

We don't have a full set of system calls for remotely doing everything that a process can do by itself.

Every once in a while, there has been talk of a new system call to inject system calls into child processes. But it never seems to go far.

Until then, you need to at least have some memory that is already mapped into the child. And presumably, you could then use ptrace() to make the child do what you need to do. But by the time you jump through all these hoops, you might as well create a new process that has some pre-mapped memory pages that the parent filled out before starting the child.

It's still a major pain to program, but better than starting with no initial memory map. I could see working with a version of clone() that takes an aligned memory address and a number of pages to preserve. It won't be fun to program, but that's something that could be implemented in a library once, so that nobody else needs to worry about it. It'll solve a number of the concerns that people have with (v)fork() and clone().

Why not just have a one-step spawn?

Posted Dec 21, 2024 2:13 UTC (Sat) by josh (subscriber, #17465) [Link] (19 responses)

This is exactly the motivation that led me to propose io_uring as the primary mechanism here. That way, we don't have to add a distinct set of system calls for remotely manipulating a process, we can use the same set of io_uring operations we already have.

Why not just have a one-step spawn?

Posted Dec 21, 2024 15:39 UTC (Sat) by khim (subscriber, #9252) [Link] (18 responses)

A much simpler approach would be to just add some code that would do that setup in the empty process. And we already have memfd_create/execveat combo that can do that.

If you want, add a flag to clone() that would call execveat. Then the new code, running in an entirely empty image, can do whatever it needs to prepare for the execution of the real binary that you want to execute.

Why shove io_uring into something that can already be done entirely from userspace? Buzzword compliance?

Why not just have a one-step spawn?

Posted Dec 21, 2024 16:10 UTC (Sat) by corbet (editor, #1) [Link] (17 responses)

Khim, if you have a better idea, please submit a patch showing it. But please stop insulting the work of others, that does not help anybody.

Why not just have a one-step spawn?

Posted Dec 21, 2024 16:44 UTC (Sat) by khim (subscriber, #9252) [Link] (1 responses)

> But please stop insulting the work of others, that does not help anybody.

Where do you see insults? I've faced the need to mangle simple, easy-to-understand, easy-to-implement ideas into pretzels to include all the right buzzwords at my $DAYJOBs often enough that I can easily see buzzword compliance as an explicit, or more likely implicit, part of the requirements.

And very often it's even the most important one: if you can't cause enough buzz around your idea, then it will die (unless there are some concrete tasks for concrete customers that need it), even if it's pretty good; with enough buzz around your idea you may push it through even if it's totally stupid and would hurt everyone in the long run.

> Khim, if you have a better idea, please submit a patch showing it.

There is no patch because the in-kernel parts were already done… years ago, in fact.

And to discuss the userspace part we need some idea about who plans to use that mechanism, why, and how.

The list of interested parties is not in the article, thus it's hard for me to offer anything concrete, because it's not clear to me how much flexibility is needed or wanted.

An implementation of posix_spawn is doable, but it would be a significant amount of work without any clear benefits: do we have lots of users of that function? If yes, then where are they; if not, then why are they so rare?

IOW: I don't see enough of the picture related to that work to judge it fairly, and if “buzzword compliance” is part of the reasoning (even an implicit one), then it could be that the io_uring-based solution is the best way forward. Especially if it's a solution in search of a problem: it's much easier to make someone excited about an io_uring solution than about a solution that just combines well-known syscalls in a way that makes posix_spawn safer.

Why not just have a one-step spawn?

Posted Dec 21, 2024 16:48 UTC (Sat) by corbet (editor, #1) [Link]

"Buzzword compliance" takes the work of people who are trying to improve the system and casts it as something useless. If it were my work, I would find that insulting. I do not believe that the people working on this are concerned about buzzwords, they are trying to solve real problems. Please try being a bit more respectful toward them.

Why not just have a one-step spawn?

Posted Dec 23, 2024 9:26 UTC (Mon) by gutschke (subscriber, #27910) [Link] (14 responses)

I am still not 100% convinced that khim's solution is necessarily easier or more robust. But since I was curious whether the proposal to use existing kernel APIs in somewhat unconventional ways would be viable at all, I wrote proof-of-concept code and uploaded it to: https://github.com/gutschke/safeexec/blob/main/safeexec.c

Not surprisingly, since we are doing things that weren't quite intended to be done this way, there are warts and pitfalls. If my code were to be turned into a production-quality library, a good amount of additional polishing would be necessary. But as is, this is evidence that khim's suggestion can address several of the concerns raised in these comments.

(The best way to play with the code is to run it under the control of "strace". All it does is call "/bin/true" in a very round-about fashion.)

Why not just have a one-step spawn?

Posted Dec 23, 2024 14:39 UTC (Mon) by khim (subscriber, #9252) [Link] (12 responses)

TL;DR version: this approach is better because, in case of adoption failure (which is quite likely), one may just throw it away and forget about it, whereas if a similar failure happened with the io_uring solution, the code and the special properties of these chained operations would have to stay in the kernel forever.

> I am still not 100% convinced that khim's solution is necessarily easier nor more robust.

It's “easier” in the sense that you can use it in applications for RHEL8+ and Android8+ (and most other distributions also have kernels with memfd_create, too).

That means that you could model your io_uring solution and test it on a wide set of real-world tasks (since it can be used in production).

Even if the final solution is to add either a dedicated syscall or a set of io_uring operations (plus the set of special “chaining” rules needed to make them usable), you would collect lots of data telling you what works and what doesn't.

If you start with an addition to the kernel, on the other hand, then all these “large parent processes” deployed in various places wouldn't be able to use it for many years – and by the time you have real data collected from real apps… the kernel API would be long-established and, most likely, unused (just like posix_spawn is barely used today).

P.S. Of course, if you plan to eventually go with io_uring anyway, then it would be a good idea to design the API of your safeexec in a way that makes it easy to switch to io_uring at some point. Apps wouldn't even need to know that they stopped using the “double exec” trick and switched to io_uring on kernels that support it; it would all be transparent to them.

Why not just have a one-step spawn?

Posted Dec 23, 2024 16:29 UTC (Mon) by bluca (subscriber, #118303) [Link] (2 responses)

> It's “easier” in a sense that you can use it in applications for RHEL8+ and Android8+

Actually I don't think you could use it in either, given it requires writable + executable memory, which is blocked by SELinux by default. Most sandboxing systems restrict that as well, as it's a very commonly used attack vector.

Why not just have a one-step spawn?

Posted Dec 23, 2024 16:36 UTC (Mon) by khim (subscriber, #9252) [Link]

That can be solved by creating two mappings: a read/write one and a read/execute one. Or even just create a read/write mapping, fill it, and then change it to read/execute before the vfork/spawn.

These tricks are already used by JITs, and most distributions, even very “enterprise” ones, have knobs to allow JITs; only iOS disables that completely (and I don't think iOS is in scope for that project).

Changing SELinux settings is needed in any solution; even if you introduce a new syscall, it's highly unlikely that SELinux wouldn't stop it until you retune the policy.

Why not just have a one-step spawn?

Posted Dec 23, 2024 17:00 UTC (Mon) by gutschke (subscriber, #27910) [Link]

No writable/executable mapping is used in my proof of concept. Once the ephemeral ELF image has been exec()'d, there is only a single readable/executable mapping.

I use a single mapping for both code and read-only data. That approach slightly simplified the already painfully complicated open-coded serialization of the various data structures that need to be passed into the child. But that could be split into two separate mappings for a production release.

Or instead of passing data as part of the ELF image, all data could be passed into the ephemeral child over a pipe(). Those design details are certainly up for review.

Chained operations in io_uring

Posted Dec 24, 2024 11:57 UTC (Tue) by farnz (subscriber, #17727) [Link] (8 responses)

special properties of these chained operations would have to stay in kernel forever.

The neat thing about the proposed io_uring solution is that the special properties of these chained operations already exist for other reasons - in order to allow you to queue up an I/O operation with an appropriate response on error, chains and hard links already exist[1], and to allow io_uring to operate asynchronously to process context, it already knows how to handle trying to return to a userspace that isn't running.

The only new things here are IORING_OP_CLONE that creates a new process (not able to run) and IORING_OP_EXEC that replaces the program text and turns it into a ready-to-run process. Everything else already exists in io_uring for I/O purposes.

[1] The intent is that you can do something like write to a WAL, fsync the WAL if the WAL write succeeds, write to the final location if the WAL write and fsync succeed, and then regardless of success of the WAL and final location writes trigger a futex wake, all in a single submission to the kernel.
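That WAL chain can be written down with the existing liburing API. The fragment below is illustrative only (ring setup, buffers, and error handling are omitted, and it is not a complete program); `io_uring_prep_futex_wake()` requires a recent liburing and kernel. The key point is that IOSQE_IO_LINK on an SQE makes the *next* SQE depend on it, while the final futex wake carries no link and so runs regardless of earlier failures.

```
/* Illustrative liburing fragment: WAL write -> fsync -> final write,
 * each step conditional on the previous via IOSQE_IO_LINK, plus an
 * unconditional futex wake, all in one submission. */
struct io_uring_sqe *sqe;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, wal_fd, wal_buf, wal_len, wal_off);
sqe->flags |= IOSQE_IO_LINK;      /* fsync runs only if this succeeds */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_fsync(sqe, wal_fd, 0);
sqe->flags |= IOSQE_IO_LINK;      /* final write runs only if fsync succeeds */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, data_fd, data_buf, data_len, data_off);
/* no link flag here: the futex wake below is independent of the chain */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_futex_wake(sqe, &futex_word, 1, FUTEX_BITSET_MATCH_ANY,
                         FUTEX2_SIZE_U32, 0);

io_uring_submit(&ring);           /* one system call for the whole chain */
```

If any linked step fails, the remaining linked operations complete with -ECANCELED, but the unlinked wake still fires, which is exactly the "notify the waiter regardless" behavior described above.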

Chained operations in io_uring

Posted Dec 24, 2024 14:10 UTC (Tue) by khim (subscriber, #9252) [Link] (7 responses)

> The neat thing about the proposed io_uring solution is that the special properties of these chained operations already exist for other reasons

Have you actually read the article? That sentence, specifically: Krisman hopes to be able to at least partially lift that constraint in the future.

It's extremely clear to me that the interface, as presented, is not finished and not tested. Or, even worse, tested and just fed to kernel developers in an insidious way to convince them to adopt a huge hairball of an API that would be immediately rejected if presented in its full capacity… that's even worse than an “unfinished and untested” API in my book (and I sincerely hope it's not that: Hanlon's razor and all that).

> The only new things here are IORING_OP_CLONE that creates a new process (not able to run)

Which is something that Linux hasn't supported until today. Currently, “load a new executable into a process” is one atomic operation that starts from a state where the process has something mapped and executable in its address space and ends in a state where the process has something mapped and executable in its address space.

An attempt to split that operation in two looks innocuous enough, but it's not at all clear what strange pitfalls it may hit.

> The intent is that you can do something like write to a WAL, fsync the WAL if the WAL write succeeds, write to the final location if the WAL write and fsync succeed, and then regardless of success of the WAL and final location writes trigger a futex wake, all in a single submission to the kernel.

Yes. And that's fine, because the code before and after comes from the exact same codebase. If some steps are omitted and/or fail, then the code that started the whole mess can, presumably, handle those failures gracefully.

Compare with the io_uring clone/exec attempt: you are doing some cleanup work after clone which is, well, important (or we wouldn't worry so much about doing it in the first place) – and if it fails, we execute foreign code anyway.

This sounds, to me, like “hey, we have added a nice security vulnerability to the kernel API, we just have no idea how to exploit it in the wild… the contest is starting”!

The most likely consequence would be a pile of special cases forbidding some “likely exploitable” operations in the sequence between IORING_OP_CLONE and IORING_OP_EXEC.

With ongoing maintenance as they are discovered, and open questions about what to do about apps that rely on those operations.

Of course, safeexec also has all the same issues (and probably some more), but there's a big difference: because it's not a kernel API, and it can be easily embedded into your application by linking it statically, there is no need to support all the warts of the first version indefinitely. It can be tuned and fixed [relatively] freely, without a commitment to support it forever (because each released version is self-contained and will work as it did on day one).

Chained operations in io_uring

Posted Dec 24, 2024 14:41 UTC (Tue) by daroc (editor, #160859) [Link] (6 responses)

Let's please not get too heated; even though we try to make articles as clear as possible, it's easy to have slightly different understandings of complex technical topics, and the best way to resolve that is usually with examples and more explanations.

In any case — nothing obliges developers to use hard links between io_uring operations. If an important cleanup operation is necessary, and it is not safe to execute the new program if it fails, don't use a hard link. While it is arguably suboptimal to introduce yet another API that must be used correctly or risk security problems, it is hardly the first such API in the kernel. Nothing prevents a poorly-written program from leaking various kinds of state to another program with the current fork()/exec() workflow.

Chained operations in io_uring

Posted Dec 24, 2024 16:04 UTC (Tue) by khim (subscriber, #9252) [Link] (5 responses)

> In any case — nothing obliges developers to use hard links between io_uring operations.

How would anything work without hard links? After IORING_OP_CLONE your process is in an “undead” state. It's neither alive nor entirely dead, but the important part is: it doesn't have any userspace that can act and make decisions.

That's the whole point of that patch series: to introduce a way to “clean up” that “undead” process by doing some operations when userspace is entirely gone.

If you wouldn't use hard links… what is supposed to happen? How would non-hardlinked operations work without userspace? What is supposed to happen if some operation fails? We don't have any agent that can receive information about the failure!

Sure, we would know that the combined operation failed without running the program, but we would just make an already stupid situation (where people try to execute a program with a non-standard loader, get the message “file not found”, and then spend hours trying to understand why a program that's clearly there with all the proper permissions can't be executed) even worse.

> While it is arguably suboptimal to introduce yet another API that must be used correctly or risk security problems, it is hardly the first such API in the kernel.

It's worse than that: currently it's not “must be used correctly or risk security problems” but “must be extended further before it can go beyond the proof-of-concept phase, because in its current form it's impossible to use it safely”.

And we have no idea how much more it should be extended to become actually usable. That's precisely why I have said that we need to know who plans to use that mechanism, why, and how – because without such information we have no idea what needs to be added to make it actually usable.

> Nothing prevents a poorly-written program from leaking various kinds of state to another program with the current fork()/exec() workflow.

Sure, but a correctly-written program can do everything correctly. And handle failures safely. Even if glibc fails to do that, it's possible, at least in theory. That's currently impossible to do with the new mechanism.

Can it be extended to handle these things? Sure: you can make it possible to receive information about io_uring operations in the parent process. Or introduce some high-level “cleanup” operations. Or make many other extensions… but before we do all that, the main question needs to be answered: what are we actually trying to do with that mechanism?

> the best way to resolve that is usually with examples and more explanations.

Sure, but why are you directing that request to me? The main advantage touted in the original work was speed: 6-10% faster than vfork() and 30+% faster than posix_spawn().

I have no idea who would really need that speedup (most of the time, the time spent in fork/exec is minuscule compared to the time needed to run the dynamic loader, verify signatures, and so on), but that sounded somewhat sensible.

But if all that complexity (including fighting kernel corruption after a few spawns) and less reliability than what the existing mechanisms provide are justified, then it would be really nice to know who executes so many processes that they care about fork/exec time, why the time needed to actually start a process is not impeding their work, and so on.

Because if it's some silly specialized unikernel or some kind of cluster-management software – then it may very well be handled better with a more focused, more specialized API instead of the Jenga tower that this patch series starts to build.

Chained operations in io_uring

Posted Dec 24, 2024 19:12 UTC (Tue) by daroc (editor, #160859) [Link] (4 responses)

> How would anything work without hard links? After IORING_OP_CLONE your process is in “undead” state. It's neither alive nor entirely dead, but the important part: it doesn't have any userspace that may act and do some decisions.

It's possible that I've misunderstood how the patch series works, but I thought that if the whole series of operations fails, the program that originally started the operation is notified in the normal way (via an item in the io_uring completion queue). You can see that in this example: https://lwn.net/ml/all/20241209234421.4133054-3-krisman@s...

So a program submits the chain of io_uring operations, and then it either succeeds (and a new process is created) or it fails, and the program that submitted it can choose how and whether to retry. So hard links aren't needed, and it's perfectly possible to write a correct program that closes important files with the current patch series.

Chained operations in io_uring

Posted Dec 24, 2024 19:37 UTC (Tue) by khim (subscriber, #9252) [Link] (3 responses)

> You can see that in this example:

Which test is that? AFAICS the only test function that doesn't use linking, test_unlinked_clone_sequence, just issues an unlinked IORING_OP_CLONE and then expects this:

if (cqe->res != -EINVAL)
	… Unlinked clone should have failed …

That's it. All other examples use linked operations, as they should.

It's possible that I have misunderstood something, but at least at first glance it's obvious why it has to be done that way: after IORING_OP_CLONE is executed, the whole io_uring machinery (I suspect 99% of Linux functionality) becomes, temporarily, “untouchable”… with some operations still permitted – only in linked form. And then it either succeeds (while silently ignoring errors, leading to an unknown state of the executed process) or fails – as a whole.

> So a program submits the chain of io_uring operations, and then it either succeeds (and a new process is created) or it fails, and the program that submitted it can choose how and whether to retry.

I couldn't see that. At least not in that patch series and set of examples.

And it's obvious why: if you add that machinery then, suddenly, instead of a simple and non-invasive patch that just adds a couple of io_uring commands, one would need to design completely new machinery that supports inter-process io_uring! With execution happening in the context of one process and a communication channel opened to another process.

Sure, that's not impossible to create, but… do we really want to add so many new subtle features for 6% speedup?

> So hard links aren't needed, and it's perfectly possible to write a correct program that closes important files with the current patch series.

Show me the code. I couldn't find it. And I suspect that it's precisely as I have said: we only see 10% of the iceberg here; the other 90%, the majority of the changes, either doesn't exist or hasn't been submitted for review.

Chained operations in io_uring

Posted Dec 24, 2024 19:49 UTC (Tue) by corbet (editor, #1) [Link] (2 responses)

You are really determined to sink this patch series, I'm not sure why.

Normally, you would not use hard-linked operations in the newly cloned child context. If one of the setup operations fails, the entire chain fails, with the status returned to the parent. No silently ignored failures. No unknown state.

Hard links can be used in the execveat() sequence to implement a search path. In that case, continuing after failure is the desired outcome; you want to go until you find something you can actually execute.

I am sorry if the article did not adequately convey that.

Doubtless there is interesting work to do to expand the range of actions available in the just-cloned child context; we will have to see what shape that takes. But I see no reason to suspect some sort of evil plot here.

Chained operations in io_uring

Posted Dec 24, 2024 20:36 UTC (Tue) by khim (subscriber, #9252) [Link]

> You are really determined to sink this patch series, I'm not sure why.

I want to understand what that patch series hopes to achieve, mainly. This part is not reassuring: Krisman hopes to be able to at least partially lift that constraint in the future. And this is even more worrying: The hope is to increase the set of possible operations over time, enabling the implementation of complex logic for the spawning of a new task.

In essence we are supposed to accept some piece of the whole solution without us knowing where the whole thing even leads.

And, worse yet, it's not clear what problem this whole thing is even supposed to solve!

If it's the safety of creating a new process, then it's one thing (there is no need for io_uring; we already have all the pieces); if it's a 30% speedup for posix_spawn, then it's another thing.

> But I see no reason to suspect some sort of evil plot here.

Evil plot is unlikely. But it really looks like a solution in search of a problem… and I want to see the problem and, more importantly, an explanation of why that's the best solution for it.

As was noted in the article, one alternative solution would be to just create a dedicated system call that includes all the required operations. Or the “double exec” if we just want to safely implement posix_spawn.

And if it's “an attempt to see where it goes”, then I don't really want to sink it but rather to “flesh it out”: understand how the full, final solution would look and, again, who would use it, why, and how.

Because as it stands currently, it's not clear to me what the goal of all that activity is – and that matters much more than the minor details of the implementation in its current form.

Even if we achieve the final goal of being able to execute all io_uring commands in the sequence between clone and exec, why are we so sure it would be enough?

Where do we plan to arrive with that change and what do we plan to achieve?

> Hard links can be used in the execveat() sequence to implement a search path. In that case, continuing after failure is the desired outcome; you want to go until you find something you can actually execute.

Ah. I see. While this, again, looks like a solution in a search of a problem (why to look up for the executable before executing it? what's the point of moving this pretty much optional functionality into the kernel? do we really want to try to continue after finding “kinda-sorta-suitable” binary that would end up being broken, for some reason?) at least now I understand what I didn't understood about that patch set.

Thanks for explaining it: while I am still not sure how useful implementing what it tries to implement would be (because, again, I couldn't see the end goal), at least some operations can be implemented in a safe manner. That's better than how I had understood it to work. Thanks for the explanation.

Chained operations in io_uring

Posted Dec 24, 2024 23:03 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> Normally, you would not use hard-linked operations in the newly cloned child context. If one of the setup operations fails, the entire chain fails

How exactly is this going to be achieved for processes? As I understand it, there is going to be a new, visible intermediate state for the process as the operations are being executed, unless the io_uring sequence locks the entire kernel.

This can also cause a problem for userspace process migration: how do you interrupt the io_uring sequence to suspend it? After reading the patch series, I don't see how it would prevent long-running operations like read() from being introduced into the middle of the sequence.

It really is a poorly-designed API. It is very much in line with the good old UNIX tradition of screwing up process management APIs.

Why not just have a one-step spawn?

Posted Jan 12, 2025 18:17 UTC (Sun) by mrugiero (guest, #153040) [Link]

I believe it would be easier (though it would take more execs) to just use execline in your first exec to set up the environment correctly. It's a nice little scripting language that is already designed for that.

Why not just have a one-step spawn?

Posted Dec 21, 2024 3:14 UTC (Sat) by geofft (subscriber, #59789) [Link]

I think the idea of allowing a subset of pages to be preserved into the new program makes sense (I just suggested a variant of it in another comment).

Agree that the complexity can be dealt with in a library once, but I also think it's less hard to program than you'd fear. One approach that would make it relatively pleasant to implement would be to write a tiny standalone binary to do the post-fork actions, and embed that compiled binary as a big constant in this helper library.

Then the pre-fork operation (which the library would do for you) is to mmap some new pages to hold the binary and its stack, copy over the binary and fill in the stack appropriately, and tell clone3 to start running that binary from its entry point. Then immediately munmap those pages in the parent. (Or, if you want to get fancy, add a clone3 flag to move pages from the parent to the child instead of CoWing them.)

This lets you avoid thinking too much about how the compiler is laying out memory and what parts you need to preserve, because you're essentially running a new program in the child. (In other words, it basically gives you kernel support for the "zygote" approach.)

Why not just have a one-step spawn?

Posted Dec 21, 2024 19:30 UTC (Sat) by quotemstr (subscriber, #45331) [Link]

> We don't have a full set of system calls for remotely doing everything that a process can do by itself.

In a world with a more regular system interface, *every* system call would require callers to specify, explicitly, every object on which it operates. We wouldn't have operations that work on "the current process" or "the current thread". In that world, the process bootstrapping the GP proposes would fall naturally out of the general shape of the API surface.

Why not just have a one-step spawn?

Posted Dec 21, 2024 15:32 UTC (Sat) by khim (subscriber, #9252) [Link]

> However, typically, you do want to inherit at least some state from the current process.

Just package it neatly and pass it into a new process, damn it!

> At the very least, there might be value in having a CLONE_ flag that makes the new process have an empty memory map rather than inheriting the caller's memory map.

That's not possible: you need something in the process that you can execute; you cannot start from zero. But if you instead passed the number of an fd containing the image that should be loaded there, then with a simple, almost trivial in-kernel change you would enable fully-userspace solutions.

But hey, that's too simple! There are not enough buzzwords in that approach! How can we accept something so sane? Nope, we need to push for io_uring, BPF, or maybe even WebAssembly! A more complicated, more invasive, yet much more buzzword-compliant approach!

Why not just have a one-step spawn?

Posted Dec 21, 2024 1:07 UTC (Sat) by comex (subscriber, #71521) [Link] (2 responses)

That's essentially posix_spawn.

On Linux, posix_spawn is just a userland wrapper for a vfork/exec dance. But on macOS, posix_spawn is its own syscall. The kernel creates the new process without having to bother with forking the virtual memory space and all that.

Why not just have a one-step spawn?

Posted Dec 21, 2024 17:26 UTC (Sat) by ma4ris8 (subscriber, #170509) [Link] (1 responses)

Let's have a threaded program. It opens and closes file descriptors; some of those have FD_CLOEXEC set. The task is to create a child program that has three file descriptors: the parent's three fds mapped as the child's stdin, stdout, and stderr. Close all unrelated file descriptors. Perhaps Valgrind's file descriptors, with fds near the upper bound of 1024, are also allowed to pass through.

One way is to fork, then open /proc/self/fd, close the unrelated fds, and remap the related ones onto 0, 1, and 2. After that, exec the final child with a clean state. If the parent has a large memory footprint, this is heavy.

The other way is posix_spawn(): spawn an intermediate process, which closes the unrelated fds and remaps the related ones onto 0, 1, and 2. After the cleanup, it executes the final child process. The drawback is needing the middle process to do the cleanup, but if the parent has a large memory footprint this is light compared to fork.

Third way: how could the cleanups be done in an elegant and memory-safe way, without the separate middle process?

Why not just have a one-step spawn?

Posted Dec 23, 2024 13:44 UTC (Mon) by fweimer (guest, #111581) [Link]

You can use Solaris, which offers posix_spawn_file_actions_addclosefrom_np. It's also in glibc 2.34 or later. The other historically missing bits are posix_spawn_file_actions_addchdir_np and posix_spawn_file_actions_addfchdir_np (glibc 2.29 and later).

In general, though, this is an anti-pattern, because we have to keep adding stuff that is easily expressed elsewhere in code. One issue is that one gets just a single error code for an entire list of actions, and that's bad from a debugging point of view. And one day we'll need to wrap something where a first action produces a value needed by a second action, and we cannot easily force that value to a caller-supplied choice (as we do for file descriptors today).

One silver lining is that vfork may not be as bad as we thought for a while. (The TCB sharing is empirically quite harmless for a wide variety of programs.) Running compiled C code instead of walking an action list may be the better approach in the end.

zygotes

Posted Dec 21, 2024 0:20 UTC (Sat) by josh (subscriber, #17465) [Link]

Fun trick you could pull with this, once it has full support for arbitrary io_uring operations:

Clone a new process, do some initial setup, do a futex wait, a blocking read, or a wait on a uring message, and when that completes, do an execveat() of the new program. Now you can have a "pool" of ready-to-start processes, blocked in the kernel, waiting to exec.

More work to do for tracing execs

Posted Dec 21, 2024 1:57 UTC (Sat) by kxxt (subscriber, #172895) [Link]

I suppose this means more work to do for tracing execs in the future...

On x86_64, tracing execs is as easy as hooking __x64_sys_execve{,at} (and __ia32_compat_sys_execve{,at} for 32-bit; execsnoop doesn't handle the latter, but my tracexec does).

Of course, there is the sched_process_exec tracepoint, but it fires only for successful execs.

Missing a beat

Posted Dec 21, 2024 2:06 UTC (Sat) by marcH (subscriber, #57642) [Link] (1 responses)

> The first of those is IORING_OP_CLONE, which causes the creation of a new process to execute any operations that follow in the same chain. In a difference from a full clone() call, though, much of the calling task's context is unavailable to the process created by IORING_OP_CLONE.

and that is the whole point, right?

> Without that context, io_uring operations in the newly created process can no longer be asynchronous; every operation in the chain must complete immediately, or the chain will fail.

I'm afraid I'm missing a beat here. I mean, I miss the ... link (pun intended) between "without context" and "asynchronous". Could someone elaborate?

Missing a beat

Posted Dec 21, 2024 3:35 UTC (Sat) by geofft (subscriber, #59789) [Link]

My guess, and someone correct me if I'm wrong: once you've called IORING_OP_CLONE, there's no userspace for this process yet, and you can't return from a syscall like io_uring_enter if you don't have a userspace to return to. So all the operations you do in processing the ring have to be operations that are handled synchronously in kernelspace while the system call keeps running; none of them can be an operation that would cause the kernel to return from the syscall and say "yeah, I'll do this asynchronously", at least not until you've done an exec and loaded a new program into userspace.

(And when you exec, you want to start the process from the beginning like normal; you don't want to act like you're returning from an io_uring system call, hence the rule that you can have no further io_uring operations.)

Fork/exec Did a Job Simply & Well - Most of the Problems Mentioned Have Untreated Causes

Posted Feb 3, 2025 12:17 UTC (Mon) by roblucid (guest, #48964) [Link] (2 responses)

Modern developers criticizing an early-70s design that gave clean and simple primitives for solving problems shouldn't blame fork/exec, but rather the failure to establish clearly better alternatives as the computing environment changed. The multi-user machines of the day were continually swapping whole processes out to disk and reloading them; individual processes did not have VM tables. The fork(2) model allowed multi-tasking, and the parent's code running in the child process could null out everything the child did not require, with only a small amount of memory copying, before overlaying its own code via exec(2).

The environment solved the problem of child processes having to know about everything that could be set, so banishing it unnecessarily incurs a future maintenance problem. Just imagine if a DB-style solution for user preferences, with a single point of failure, had been imposed instead. Having used alternative OS solutions, I found that vast numbers of variables had to be set up and constantly reset in practice, rather than just using a bit-copy of an area of process memory.

A better question is: why are huge monolithic programs with masses of VM mappings forking anyway? Why aren't small main programs setting up cooperating processes that have memory isolation and can then fire up threads for their heavyweight processing? Surely that's what you want for cache efficiency, so that parts with tightly coupled gang processing can share L3 while loosely coupled parts can be scheduled separately.

It seems to me that the real problem is the coordination between the program's pieces: effectively, an efficient message-passing system, so that loosely coupled parts of a program can avoid sharing address spaces.

When I read about clone being inefficient because it "copies the whole VM, which is mostly unused", I see poor application architecture. Again, people mentioned graphics libraries starting threads when linked; well, that's what dynamic loading avoids.

When applications statically link huge amounts of common library code in multiple copies, which may then use threading (requiring an understanding of a memory model and safe operations), all in ONE big pot, it's a bit rich to moan about the inefficiency of copying VM tables. Who created this huge memory-management problem by putting everything in one large pot?

Browsers moved away from a single process because of security requirements, despite the duplication it causes; they still manage to be snappy enough, even if it's NOT what you would do in a twitch shooter game.

Fork/exec Did a Job Simply & Well - Most of the Problems Mentioned Have Untreated Causes

Posted Feb 3, 2025 18:47 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> A better question is why are huge monolithic programs with masses of VM mappings forking anyway? Why aren't small main programs, setting up co-operating processes which have memory isolation and then can fire up threads for their heavy weight processing?

Why would you reinvent threading and shared memory?

Fork/exec Did a Job Simply & Well - Most of the Problems Mentioned Have Untreated Causes

Posted Mar 25, 2025 9:38 UTC (Tue) by gstrauss (guest, #176692) [Link]

A bit late to the party; my take is similar to @roblucid's.

If the problem is that fork()/execve() is expensive **for large processes** with many memory mappings, file descriptors, etc, then a solution is for those expensive monolithic processes to avoid fork().

One user-space solution: a large process could use AF_UNIX sockets to contact a lightweight process-creation daemon already pre-configured to be ready to do some minimal setup, vfork() or clone(), and then execve(). The lightweight process-creation daemon does not even have to be running all the time. It could conceivably be started on-demand via xinetd or systemd socket triggers.

I wrote one such user-space process-creation daemon over a decade ago, called `proxyexec` (https://github.com/gstrauss/bsock).
A heavyweight or differently-privileged process can contact proxyexec on an AF_UNIX socket and pass argv, the environment, and fds for stdin, stdout, and stderr. proxyexec can run under different credentials from the heavyweight process, and can even run in a different container connected by the AF_UNIX named socket. This is not theoretical: an earlier version of proxyexec was (and maybe still is) used by CloudLinux to remove setuid binaries from their containers while still providing privileged services (running under different user accounts and outside the containers) via proxyexec.

tl;dr: Alternative user-space application designs, e.g. using a service-oriented architecture (directly or via a process-execution proxy), should be compared and contrasted with extending io_uring to provide IORING_OP_CLONE and IORING_OP_EXEC.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds