Process creation in io_uring
A new process in Linux is created with one of the variants of the clone() system call. As its name suggests, clone() creates a copy of the calling process, running the same code. Much of the time, though, the newly created process quickly calls execve() or execveat() to run a different program, perhaps after performing a bit of cleanup. There has long been interest in a system call that would combine these operations efficiently, but nothing like that has ever found its way into the Linux kernel. There is a posix_spawn() function, but that is implemented in the C library using clone() and execve().
Arguably, part of the problem is that, while the clone()-to-execve() pattern is widespread, the details of what happens between those two calls can vary quite a bit. Some files may need to be closed, signal handling changed, scheduling policies tweaked, environment adjusted, and so on; the specific pattern will be different for every case. posix_spawn() tries to provide a general mechanism to specify these actions but, as can be seen by looking at the function's argument list, it quickly becomes complex.
Io_uring, meanwhile, is primarily thought of as a way of performing operations asynchronously. User space can queue operations in a ring buffer; the kernel consumes that buffer, executes the operations asynchronously, then puts the results into another ring buffer (the "completion ring") as each operation completes. Initially, only basic I/O operations were supported, but the list of operations has grown over the years. At this point, io_uring can be thought of as a sort of alternative system-call interface for Linux that is inherently asynchronous.
An important io_uring feature, for the purposes of implementing something like posix_spawn(), is the ability to create chains of linked operations. When the kernel encounters a chain, it will only initiate the first operation; the next operation in the chain will only run after the first completes. The failure of an operation in a chain will normally cause all remaining operations to be canceled, but a "hard link" between two operations will cause execution to continue regardless of the success of the first of the two. Linking operations in this way essentially allows simple programs to be loaded into the kernel for asynchronous execution; these programs can run in parallel with any other io_uring operations that have been submitted.
The new patch set from Gabriel Krisman Bertazi creates two new io_uring operations, each with some special semantics. The first of those is IORING_OP_CLONE, which causes the creation of a new process to execute any operations that follow in the same chain. In a difference from a full clone() call, though, much of the calling task's context is unavailable to the process created by IORING_OP_CLONE. Without that context, io_uring operations in the newly created process can no longer be asynchronous; every operation in the chain must complete immediately, or the chain will fail. In practice, that means that operations like closing files can be executed, but complicated I/O operations are no longer possible. Krisman hopes to be able to at least partially lift that constraint in the future.
Once the chain completes, the new process will be terminated, with one important exception: if it invokes the second new operation, IORING_OP_EXEC, which performs the equivalent of an execveat() call, replacing the running program with a new executable. At this point, the new process is completely detached from the original, is running its own program, and the processing of the io_uring chain is complete; the process will, rather than being terminated, go off to run the new program. Placing any other operations after IORING_OP_EXEC in the chain usually makes no sense; any operations after a successful IORING_OP_EXEC will be canceled. It also does not make sense to use IORING_OP_EXEC in any context other than a new process created with IORING_OP_CLONE, so that usage is not allowed.
There is one case where it can be useful to link operations into the chain after IORING_OP_EXEC — efficiently implementing a path search in the kernel. Often, the execution of a new program involves searching for it in a number of directories, usually specified by the PATH environment variable. One way of doing this in the io_uring context, as shown in this test program, is to enqueue a series of IORING_OP_EXEC operations, each trying a different location in the path. If hard links are used to chain these operations, execution will continue past failed operations until the one that actually finds the target program succeeds; after that, any subsequent operations will be discarded. The entire search runs in the kernel, without the need to repeatedly switch between kernel and user space.
Most of the comments on the proposal so far have come from Pavel Begunkov, who has expressed some concerns about it. He did not like some aspects of the implementation, the special quirks associated with IORING_OP_CLONE and the process it creates, and the use of links, "which [is] already a bad sign for a bunch of reasons" (he did not specify what those reasons are). He suggested that io_uring might not be the best place for this functionality; perhaps a list of operations could be passed to a future version of clone() instead, mirroring how the posix_spawn() interface works.
Krisman answered that combining everything into a single system call would add complexity while making the solution less flexible. Io_uring makes it easy to put together a set of operations to be run in the kernel in an arbitrary order. The hope is to increase the set of possible operations over time, enabling the implementation of complex logic for the spawning of a new task. It is hard to see how combining all of this functionality into a single system call could work as well.
In any case, this is early-stage work; getting it to a point where it can be considered for the mainline will require smoothing a number of the rough edges and reducing the number of limitations. It will also certainly require wider review; this work is proposing a significant addition to the kernel's user-space ABI that would have to be supported indefinitely. The developers involved will surely want to get the details right before committing to that support.
Posted Dec 20, 2024 16:32 UTC (Fri)
by willy (subscriber, #9762)
[Link] (23 responses)
/s in case it wasn't clear.
Posted Dec 20, 2024 17:47 UTC (Fri)
by gutschke (subscriber, #27910)
[Link] (20 responses)
clone()/exec() is a very powerful pattern that nicely fits in with how POSIX has designed its API. The ability to customize the newly launched process prior to loading the binary is crucial in a lot of scenarios. And I don't see that going away.
But ever since the advent of threads (and possibly even in the presence of signals), this has gotten incredibly difficult to do correctly. There are just too many subtle race conditions that involve hidden state in the various run-time libraries or even in the dynamic link loader. If there were a way to do everything that you can currently do with system calls from userspace, but moved entirely into the kernel, most of these problems would immediately go away. So, I see a lot of value in being able to call clone() and exec() from a BPF program, or maybe from io_uring. The elephant in the room with BPF is that this new API would then likely be limited to privileged processes.
You can approximate a solution in userspace by very carefully picking which system calls you invoke, and by avoiding any calls into libc, including accidental calls into the dynamic link loader. Making this 100% reliable involves some amount of assembly code. It's very tedious and extremely fragile. It is often not worth the effort, and instead you have to live with the occasional random crash.
In some cases, a possible work-around is to launch a "zygote" helper process that executes before any threads are created. The latter is difficult to ensure though, as some libraries create threads when they are loaded into memory.
Posted Dec 20, 2024 19:05 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (17 responses)
POSIX's API is badly designed. clone() creates a copy of the entire VM and then just discards it. That is a lot of wasted work.
A better API would create an "empty shell" suspended process, then the calling process can poke it (using FD-based APIs), and finally un-suspend it. There's a strange aversion in Linux/UNIX land to this model (it's too sane), so we get closer and closer to it with these kinds of workarounds.
Posted Dec 20, 2024 19:49 UTC (Fri)
by epa (subscriber, #39769)
[Link]
Posted Dec 21, 2024 1:08 UTC (Sat)
by comex (subscriber, #71521)
[Link] (3 responses)
Posted Dec 21, 2024 1:47 UTC (Sat)
by gutschke (subscriber, #27910)
[Link] (2 responses)
fork() is a decent general solution for single-threaded applications, and that's why we've been using it for so many decades. The kernel-level API is amenable to writing thread-safe code using fork()/exec(). But that requires that, after fork() returns in the child, no further entries into any libraries are allowed. In fact, I am not even convinced that it is always safe to call the glibc version of fork() instead of making a direct system call.
Both the various wrappers that glibc puts around system calls and the hidden invocations of the dynamic link loader are potential sources of deadlocks or crashes. Depending on how your program has been linked, this can even mean that you can no longer access any global symbols. Everything has to be on the local stack.
The upshot of all of this is that you not only need to carefully screen the system calls that you want to make for potential process-wide side-effects, you also have to call them from inlined assembly instead of deferring to glibc. In addition, fork() only really works with memory over-committing enabled, and for large programs this system call can be expensive.
vfork() solves the over-committing problem, but it requires even more careful programming. I don't see how it can be made to work in a fully portable fashion, but it probably is the best solution for code that should run on more than just Linux. Some amount of porting to different OSes will be involved if you need to spawn a new process from within a multi-threaded environment.
clone() is the pragmatic solution. Once you come to the realization that this code is impossible to implement within the constraints of POSIX alone, you might as well take advantage of everything that Linux can provide to you. It's going to be hairy code to write, but there really is no way around it. Also, just to point out the obvious, the glibc wrapper around clone() is completely unsuitable for the purposes of what we need here. But a direct system call will work fine.
Of course, in 99% of the cases, you won't hit any of the race conditions. They are a little tricky to trigger accidentally, and a lot of them are relatively benign. Who cares about an occasional errno value that isn't set correctly, or a file descriptor that sometimes leaks to a child process? Only in very rare cases will you trigger a deadlock, crash, or worse. So, many programs simply don't bother, and nobody ever notices that the code is buggy. It's the really big programs that everyone uses that need to worry about these things, as you suddenly have millions of running instances and countless numbers of spawned processes. If there is a way for something to go wrong, it eventually will.
A zygote process is a time-tested alternative. And that's great, assuming you can modify the startup phase of the program. If you can guarantee that your code executes before any threads are created, then a zygote that is fork'd() proactively will avoid all of these complications. But with bigger pieces of software that rely on lots of third-party libraries, that's not always feasible. These days, you should assume that all code is always multi-threaded -- if only because the graphics libraries decide to start threads as soon as they get linked into the program, or something similarly frustrating.
Posted Dec 21, 2024 15:18 UTC (Sat)
by khim (subscriber, #9252)
[Link]
Zygote solves an entirely different problem: how to start not one process, but many processes, while executing an initialization part only once. It works, but that's an entirely different task. It's also the simplest way to do everything reliably and efficiently on Linux.

For some unfathomable reason, everyone's attention is on an unsolvable problem: how to prepare a new process state using remnants of the old code that is intertwined with the state of your program. Just ditch all that! Start from the clean state! Create new setup code, push whatever you need/want in there, then execute it.

The only downside: you have to develop that in an arch-dependent way… but so what? If you compare that to the insane amount of effort one would need to support all these bazillion zygote-based solutions, then adding some kind of portable wrapper with arch-dependent guts, even for the 3-4 most popular architectures, is not too hard.

Best property of that solution: it's not supposed to be perfect! If you find out that it doesn't work, nobody stops you from redoing that portable API and adding or removing something. Because you ship it with your code or in a shared library, it's replaceable without any in-kernel politics.

P.S. I think it can be called the “double-exec” solution, and it requires Linux-specific syscalls, but the best part: all these syscalls are already there and are not even especially new.
Posted Dec 27, 2024 1:52 UTC (Fri)
by alkbyby (subscriber, #61687)
[Link]
Can you please elaborate more specifically on the thread-unsafety of posix_spawn implementations? POSIX might not explicitly say that posix_spawn is safe to use in MT programs, but its main purpose is clearly to fix fork's problems with threads. So it has to be MT-safe.
Fork+exec and threads are just too unsafe in practice. Even our esteemed editor made an error above. Here: "details of what happens between those two calls can vary quite a bit. [...] environment adjusted, and so on".
Thing is, updating process environment (e.g. via setenv) typically invokes malloc. And calling malloc in-between fork and exec is unsafe.
As per posix (quoting from man 3posix fork): "If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called."
In practice, malloc implementations go to some lengths to make malloc() possible after fork by carefully setting up pthread_atfork or alternatives. But this is a big enough can of worms. And, for example, "abseil" tcmalloc explicitly doesn't (https://github.com/search?q=repo%3Agoogle%2Ftcmalloc%20at...). As per Google's internal policy, pthread_atfork is forbidden (which is another, but somewhat related, topic).
So posix_spawn is the right thing IMO. And any exotic process-setup things that might be missing in your favorite libc (e.g. stuff like unshare/prctl) you can always do in a small helper program. You exec into it. It gets a clean slate, can do whatever syscalls and mallocs and whatnot, single-threadedly. And then it execs into the real thing.
As for the original discussion, I am really hoping io_uring is kept only for perf-critical stuff. Spawning children isn't.
Posted Dec 21, 2024 4:58 UTC (Sat)
by IAmLiterallyABee (subscriber, #144892)
[Link]
IIRC, Fuchsia does something like that
Posted Dec 22, 2024 7:30 UTC (Sun)
by joib (subscriber, #8541)
[Link] (8 responses)
Posted Dec 27, 2024 17:01 UTC (Fri)
by epa (subscriber, #39769)
[Link] (3 responses)
That means instead of fork() and in the child process opening file handles, the parent process could take care of all this. Create the child, which is initially not schedulable, make any system calls you want to set up the child's execution environment, then finally an exec_pidfd() to apply to the child process and mark it schedulable. That's a great way to apply "the Unix philosophy", composing the existing simple system calls rather than creating a kitchen sink like posix_spawn(), while avoiding the known problems of forking.
Existing code should be translatable to the new scheme without too much trouble. (Indeed you could even have a fork emulation layer in the C library which, on returning from fork() or vfork(), acts as though you were in the child process, so that calling open() actually calls into open_pidfd(), and then the final exec() call runs exec_pidfd() and then continues with the parent process's code. That's a bit kooky but might be a quick way to migrate older code which just wants to spawn a subprocess.)
Posted Dec 28, 2024 14:01 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
How would that work? The "magic" of `fork()` (and related functions) is that it returns twice: once with a `0` return value and once with a `pid`. How would a library do any kind of emulation to allow taking *both* sides of the `if` condition it (eventually) leads to without some kind of language magic?
Posted Dec 28, 2024 14:11 UTC (Sat)
by daroc (editor, #160859)
[Link]
Posted Dec 28, 2024 20:58 UTC (Sat)
by epa (subscriber, #39769)
[Link]
I wouldn't be surprised if similar hacks have existed to help port Unix code to single-tasking operating systems like MS-DOS.
Posted Dec 29, 2024 18:42 UTC (Sun)
by ma4ris8 (subscriber, #170509)
[Link] (3 responses)
Thus one of Go's performance secrets has been to use posix_spawn() since 2017.
Linux Java moved from fork() to posix_spawn() around 2018, using "jspawnhelper" as a clean-up process:
Oracle's Java for Solaris took the change earlier, in 2013:
Rust language: glibc sometimes uses posix_spawn():
Fork/Exec major problem:
Fork/Exec benefits:
posix_spawn() design:
Thus the optimal (BPF, io_uring) solution would be to declare what needs to be cloned, re-mapped,
The world of big-memory programs (Go, Java) has already moved from fork() to posix_spawn().
Thus there is a big amount of kernel work to be avoided if the io_uring approach (and/or BPF enhancement)
So instead of cloning everything, then tearing it down, and making the caller thread sleep until the child process is launched,
Posted Dec 29, 2024 18:56 UTC (Sun)
by ma4ris8 (subscriber, #170509)
[Link]
Posted Dec 29, 2024 19:17 UTC (Sun)
by bluca (subscriber, #118303)
[Link]
I switched systemd to use pidfd_spawn() (which is posix_spawn() but with clone3(), which gets back a pidfd instead of a pid, and can clone into the target cgroup atomically) in v255 last year for similar reasons, as the copy-on-write trap overhead was hitting the Azure fleet hard. I should probably do a write-up about that at some point...
Posted Dec 31, 2024 7:37 UTC (Tue)
by izbyshev (guest, #107996)
[Link]
On Linux, Go uses raw syscalls for almost all standard functionality, and Go programs usually don't even link to a C library. So, Go uses a vfork() equivalent followed by execve(), not the posix_spawn() library function [1].
> Linux Java from fork() into posix_spawn() near 2018
No, it migrated from vfork() [2] (which has been used by default since forever). So, overcommit issues weren't present in the first place.
The CPython issue [3] for migrating subprocess from fork() to vfork() contains a lot of useful links on the topic. In some parts, it's outdated [4].
[1] https://github.com/golang/go/blob/194de8fbfaf4c3ed54e1a3c...
Posted Dec 31, 2024 15:19 UTC (Tue)
by surajm (subscriber, #135863)
[Link]
Posted Jan 12, 2025 2:33 UTC (Sun)
by chexo4 (subscriber, #169500)
[Link]
Posted Dec 20, 2024 19:33 UTC (Fri)
by magfr (subscriber, #16052)
[Link]
To further mess with people: this clone abstraction isn't strong enough to handle all cases. I have a little variation on tee which forks, sets up the child as a daemon process that does the writing, and then execs in the parent in order to keep the parent/child link with the grandparent.
Posted Dec 21, 2024 3:05 UTC (Sat)
by geofft (subscriber, #59789)
[Link]
Yeah, that was also my concern. It seems like people are not going to be comfortable making eBPF available to unprivileged users any time soon.
On the other hand, classic BPF is still around and is accessible to unprivileged users in a few ways, most notably via seccomp mode 2, but also by creating an unprivileged user+net namespace (allowed by default in the upstream kernel and in most but not all distros) and using it for its original purpose of packet filtering. Could you allow userspace to upload a cBPF program and some data for its use and have that be enough to make system calls?
I think my specific proposal would be to extend clone3's struct clone_args with three fields: a pointer to a cBPF program in user memory, and a pointer and length of memory to copy-on-write into the new process. So if you want traditional behavior for some reason, you can specify NULL and ~0 and deal with the overcommit issues of doing that, but more likely you just need a page or two of memory for the filename, argv, maybe the value of $PATH, and maybe some additional info like how to reorder file descriptors. Add a new cBPF opcode BPF_SYSCALL that is only valid in this context, which makes the syscall stored in the BPF accumulator with the arguments in the BPF registers and returns a value to the accumulator. This syscall is treated as a real syscall (it is not eBPF's BPF_CALL, pointer arguments point to userspace, etc.). When it calls execve, normal behavior resumes. If the cBPF program returns instead of calling either execve or exit, then it returns to the userspace instruction pointer where clone3 was originally called, so you can use it just like a normal use of clone if you want. If that address is no longer mapped, the process dies with a segfault.
Posted Dec 20, 2024 17:50 UTC (Fri)
by edeloget (subscriber, #88392)
[Link] (1 responses)
Which will /then/ be interpreted by a BPF program :)
Posted Dec 20, 2024 18:08 UTC (Fri)
by adobriyan (subscriber, #30858)
[Link]
It will be incomplete until it is possible to create a new uring with the uring interface.
Posted Dec 20, 2024 18:44 UTC (Fri)
by jbills (subscriber, #161176)
[Link] (40 responses)
Posted Dec 20, 2024 23:14 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (7 responses)
The other basic problem is that systemd --user has already solved quite a lot of practical use cases anyway, so there is reduced motivation to expand the kernel's semantics when we already have code that works today.
[1]: https://pubs.opengroup.org/onlinepubs/9799919799/function...
Posted Dec 21, 2024 0:11 UTC (Sat)
by jbills (subscriber, #161176)
[Link] (2 responses)
Posted Dec 21, 2024 1:18 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Dec 21, 2024 2:28 UTC (Sat)
by josh (subscriber, #17465)
[Link]
Posted Dec 21, 2024 15:25 UTC (Sat)
by khim (subscriber, #9252)
[Link] (3 responses)
And why is that an issue? The new process can always ditch whatever it doesn't need. Heck, you may supply it with all the information needed to do that. You only need five syscalls: memfd_create/write/vfork/execveat/execveat No need for
Posted Dec 21, 2024 19:31 UTC (Sat)
by roc (subscriber, #30627)
[Link] (2 responses)
You might say that the "users write arbitrary code into a memfd" part is essential for flexibility. Even if we ignore the security issues, it would be nasty to program directly. People would inevitably wrap it in some kind of tiny, portable virtual machine for users to express their setup code ... and then again, you can implement that approach without doing the double exec.
Posted Dec 21, 2024 20:09 UTC (Sat)
by khim (subscriber, #9252)
[Link] (1 responses)
Precisely because then your sign-verifying machinery couldn't verify your code. You are executing things in a context that's polluted by gigabytes of long-living code that may affect your carefully prepared binary. Even if it was sign-verified and correct when the process was started, it's not guaranteed to stay sign-verified and correct by the time you [try to] execute it. You could do what Turbo Pascal did decades ago: concatenate the binary and the parameters for said binary into one executable. So there would be a signed part and unsigned parameters; the signature can be easily checked when the binary is loaded, even if it's loaded from It's possible in theory, but it's not done today. And it doesn't eliminate issues of that code being corrupted and abused before the new binary is spawned. And if we are not fighting that with The big problem of the article that we are discussing here is that it carefully describes the answer to some issues, but it entirely neglects to list the issues that we are trying to fix! Not as bad as the infamous “42” as the answer to “the ultimate question of life, the universe, and everything”, but very close to it: sure, that's a mechanism with certain properties… but what does it try to do? What's the problem that couldn't be easily solved with it but is impossible to solve without it? I have no idea, and as long as I have no idea I couldn't even say if that's a good idea or not! That's why I'm talking about “buzzword compliance”: simply because if “hey, it's
Posted Dec 22, 2024 11:29 UTC (Sun)
by ballombe (subscriber, #9523)
[Link]
Well, I am glad to see that this does not impair your ability to write essay-sized posts about it.
Posted Dec 21, 2024 0:18 UTC (Sat)
by josh (subscriber, #17465)
[Link] (28 responses)
However, typically, you do want to inherit at least some state from the current process. You *have* to inherit permissions (though root could override them), you may want to inherit at least some file descriptors, and so on.
And in practice you *may* want the option of having access to your memory map before doing the exec, at least for some operations.
It might well be useful to have pidfd operations to set up a new process from an existing one, but there's value in batching those operations, in the style of uring.
Posted Dec 21, 2024 0:39 UTC (Sat)
by willy (subscriber, #9762)
[Link] (26 responses)
Posted Dec 21, 2024 0:46 UTC (Sat)
by josh (subscriber, #17465)
[Link] (25 responses)
Posted Dec 21, 2024 0:55 UTC (Sat)
by willy (subscriber, #9762)
[Link] (24 responses)
Posted Dec 21, 2024 1:12 UTC (Sat)
by josh (subscriber, #17465)
[Link] (23 responses)
That would work well in the io_uring case too, where you could keep the pidfd in an in-ring file descriptor and do a series of operations on it.
Posted Dec 21, 2024 2:11 UTC (Sat)
by gutschke (subscriber, #27910)
[Link] (22 responses)
Every once in a while, there has been talk of a new system call to inject system calls into child processes. But it never seems to go far.
Until then, you need to at least have some memory that is already mapped into the child. And presumably, you could then use ptrace() to make the child do what you need to do. But by the time you jump through all these hoops, you might as well create a new process that has some pre-mapped memory pages that the parent filled out before starting the child.
It's still a major pain to program, but better than starting with no initial memory map. I could see working with a version of clone() that takes an aligned memory address and a number of pages to preserve. It won't be fun to program, but that's something that could be implemented in a library once, and then nobody else needs to worry about it. It'll solve a number of the concerns that people have with (v)fork() and clone().
Posted Dec 21, 2024 2:13 UTC (Sat)
by josh (subscriber, #17465)
[Link] (19 responses)
Posted Dec 21, 2024 15:39 UTC (Sat)
by khim (subscriber, #9252)
[Link] (18 responses)
A much simpler approach would be to just add some code that would do that setup in the empty process. And we already have memfd_create/execveat combo that can do that. If you want – add flag to the
Why shove
Posted Dec 21, 2024 16:10 UTC (Sat)
by corbet (editor, #1)
[Link] (17 responses)
Posted Dec 21, 2024 16:44 UTC (Sat)
by khim (subscriber, #9252)
[Link] (1 responses)
Where do you see insults? I've faced the need to mangle simple, easy-to-understand-and-implement ideas into pretzels to include all the right buzzwords at my $DAYJOBs often enough that I can easily see buzzword compliance as an explicit or, more likely, implicit part of the requirements. And very often it's even the most important one: if you couldn't cause enough buzz around your idea then it would die (except if there are some concrete tasks for concrete customers that may need it) even if it's pretty good, but with enough buzz around your idea you may push it even if it's totally stupid and would hurt everyone in the long run. There is no patch because the in-kernel parts are already done… years ago, in fact. And to discuss the userspace part we need some idea about who, why, and how plans to use that mechanism. The list of interested parties is not in the article, thus it's hard for me to offer anything concrete, because it's not clear to me how much flexibility is needed or wanted. Implementation of IOW: I don't see enough of the picture related to that work to judge it fairly, and if “buzzword-compliance” is part of the reasoning (even if an implicit one) then it could be that
Posted Dec 21, 2024 16:48 UTC (Sat)
by corbet (editor, #1)
[Link]
Posted Dec 23, 2024 9:26 UTC (Mon)
by gutschke (subscriber, #27910)
[Link] (14 responses)
Not surprisingly, since we are doing things that weren't quite intended to be done this way, there are warts and pitfalls. If my code were to be turned into a production-quality library, a good amount of additional polishing would be necessary. But as is, this is evidence that khim's suggestion can address several of the concerns raised in these comments.
(The best way to play with the code is to run it under the control of "strace". All it does is call "/bin/true" in a very round-about fashion.)
Posted Dec 23, 2024 14:39 UTC (Mon)
by khim (subscriber, #9252)
[Link] (12 responses)
TL;DR version: this approach is better because, in case of adoption failure (which is quite likely), one may just throw it away and forget about it, whereas if a similar failure happened with It's “easier” in a sense that you can use it in applications for RHEL8+ and Android8+ (and most other distributions also have kernels with That means that you could model your Even if the final solution would be to add either a dedicated syscall or a set of If you start with an addition to the kernel, on the other hand, then all these “large parent processes” deployed in various places wouldn't be able to use it for many years – and by the time when you would have real data collected from real apps… the kernel API would be long-established and, most likely, not used (just like P.S. Of course if you plan to eventually go with
Posted Dec 23, 2024 16:29 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (2 responses)
Actually I don't think you could use it in either, given it requires writable + executable memory, which is blocked by SELinux by default. Most sandboxing systems restrict that as well, as it's a very commonly used attack vector.
Posted Dec 23, 2024 16:36 UTC (Mon)
by khim (subscriber, #9252)
[Link]
That can be solved if you create two mappings: a read/write one and a read/execute one. Or even just create a read/write mapping, fill it, and then change it to read/execute before These tricks are already used by JITs, and most distributions, even very “enterprise” ones, have knobs to allow JITs; only iOS disabled that completely (and I don't think iOS is in scope for that project). Changing SELinux settings is needed in any solution; even if you introduce a new syscall, it's highly unlikely that SELinux wouldn't stop it until you retune it.
Posted Dec 23, 2024 17:00 UTC (Mon)
by gutschke (subscriber, #27910)
[Link]
I use a single mapping for both code and read-only data. That approach slightly simplified the already painfully complicated open-coded serialization of the various data structures that need to be passed into the child. But that could be split into two separate mappings for a production release.
Or instead of passing data as part of the ELF image, all data could be passed into the ephemeral child over a pipe(). Those design details are certainly up for review.
Posted Dec 24, 2024 11:57 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (8 responses)
The neat thing about the proposed io_uring solution is that the special properties of these chained operations already exist for other reasons - in order to allow you to queue up an I/O operation with an appropriate response on error, chains and hard links already exist[1], and to allow io_uring to operate asynchronously to process context, it already knows how to handle trying to return to a userspace that isn't running.
The only new things here are IORING_OP_CLONE that creates a new process (not able to run) and IORING_OP_EXEC that replaces the program text and turns it into a ready-to-run process. Everything else already exists in io_uring for I/O purposes.
[1] The intent is that you can do something like write to a WAL, fsync the WAL if the WAL write succeeds, write to the final location if the WAL write and fsync succeed, and then regardless of success of the WAL and final location writes trigger a futex wake, all in a single submission to the kernel.
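The chain in that footnote maps onto ordinary control flow; here is a synchronous Python sketch of the same link semantics (the paths and the notify event are illustrative stand-ins, not part of any real API — with io_uring proper all four steps go to the kernel in one submission and run asynchronously):

```python
import os, threading

done = threading.Event()  # stands in for the futex that waiters sleep on

def wal_then_write(wal_path, dst_path, data):
    """write WAL -> fsync WAL -> write destination, each step linked so it
    runs only if the previous one succeeded; the wakeup is 'hard-linked',
    so it fires whether or not the writes succeeded."""
    try:
        wal = os.open(wal_path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(wal, data)   # IOSQE_IO_LINK: the fsync needs this
            os.fsync(wal)         # IOSQE_IO_LINK: the final write needs this
        finally:
            os.close(wal)
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(dst, data)
        finally:
            os.close(dst)
        return True
    except OSError:
        return False              # a failed link cancels the rest of the chain
    finally:
        done.set()                # hard link: runs regardless of failures
```

The point of the io_uring version is that this whole ordering constraint is expressed declaratively in the submission ring instead of in user-space control flow.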
Posted Dec 24, 2024 14:10 UTC (Tue)
by khim (subscriber, #9252)
[Link] (7 responses)
Have you actually read the article? That one, specifically: Krisman hopes to be able to at least partially lift that constraint in the future. It's extremely clear to me that the interface, as presented, is not finished and not tested. Or, even worse, tested and just fed to kernel developers in an insidious way to convince them to adopt a huge hairball of API that would be immediately rejected if presented in its full capacity… that's even worse than an “unfinished and untested” API in my book (and I sincerely hope it's not that: Hanlon's razor and all that).
> The only new things here are IORING_OP_CLONE that creates a new process (not able to run) and IORING_OP_EXEC that replaces the program text and turns it into a ready-to-run process.
Which is something that Linux hasn't supported till today. Currently “load new executable in a process” is one atomic operation that starts from the state where the kernel has something mapped and executable in its address space and ends in the state where the kernel has something mapped and executable in its address space. An attempt to split that process in two looks innocuous enough, but it's entirely not clear what strange pitfalls it may hit. Yes. And that's fine because the code before and after comes from the exact same codebase. If some steps are omitted and/or fail, then the code that started the whole mess could, presumably, handle these failures gracefully. Compare with the io_uring attempt to do clone/exec: you are doing some important cleanup work after clone which is, well, important (or we wouldn't worry so much about doing it in the first place) – and if it fails we execute foreign code anyway. This sounds, to me, like “hey, we have added a nice security vulnerability to the kernel API, we just have no idea how to exploit it in the wild… the contest is starting”! The most likely consequence would be a pile of special cases that forbid some “likely exploitable” instructions in the sequence between IORING_OP_CLONE and IORING_OP_EXEC, with ongoing maintenance when they would be discovered and open questions about what to do about apps that rely on these operations. Of course safeexec also includes all the same issues (and probably some more), but there's a big difference: because it's not a kernel API and it can be easily embedded into your application by linking it statically, there is no need to support all the warts of the first version indefinitely. It can be tuned and fixed [relatively] freely without commitment to support it forever (because each released version is self-contained and would work like it did on day one).
Posted Dec 24, 2024 14:41 UTC (Tue)
by daroc (editor, #160859)
[Link] (6 responses)
In any case — nothing obliges developers to use hard links between io_uring operations. If an important cleanup operation is necessary, and it is not safe to execute the new program if it fails, don't use a hard link. While it is arguably suboptimal to introduce yet another API that must be used correctly or risk security problems, it is hardly the first such API in the kernel. Nothing prevents a poorly-written program from leaking various kinds of state to another program with the current fork()/exec() workflow.
Posted Dec 24, 2024 16:04 UTC (Tue)
by khim (subscriber, #9252)
[Link] (5 responses)
How would anything work without hard links? After IORING_OP_CLONE your process is in an “undead” state. It's neither alive nor entirely dead, but the important part: it doesn't have any userspace that may act and make decisions. That's the whole point of that patch series: to introduce a way to “clean up” that “undead” process by doing some operations when userspace is entirely gone. If you wouldn't use hard links… what is supposed to happen? How would non-hardlinked operations work without userspace? What is supposed to happen if some operation would fail? We don't have any agent that may receive information about the failure! Sure, we would know that the combined operation failed without running the program, but we would just make an already stupid situation (where people try to execute programs with a non-standard loader, get the message “file not found” and then spend hours trying to understand why a program that's clearly there, with all the proper permissions, couldn't be executed) even worse. It's worse than that: currently it's not “must be used correctly or risk security problems” but “it must be extended more before it would go beyond the ‘proof of concept’ phase, because in its current form it's impossible to use it safely”. And we have no idea how much more it should be extended to become actually usable. That's precisely why I have said that we need to know who, why and how plans to use that mechanism – because without such information we have no idea what needs to be added to it to make it actually usable. Sure, but a correctly-written program can do everything correctly. And handle failures safely. Even if glibc fails to do that, it's possible, at least in theory. That's currently impossible to do with the new mechanism. Can it be extended to handle these things? Sure: you can make it possible to receive information about io_uring operations in the parent process. Or introduce some high-level “cleanup” operations. Or do many other extra extensions… but before we do all that, the main question needs to be answered: what are we actually trying to do with that mechanism? Sure, but why are you directing that request to me? The main advantage that was touted in the original work was speed that's 6-10% faster than vfork() and 30+% faster than posix_spawn().
I have no idea who would really need that speedup (most of the time, the time spent in fork/exec is minuscule compared to the time needed to run the dynamic loader, verify signatures and so on), but that sounded somewhat sensible. But if all that complexity (including a fight with kernel corruption after a few spawns) and less reliability than what existing mechanisms provide is justified, then it would be really nice to know who executes so many processes that they do care about fork/exec time, why the time needed to actually start a process is not impeding their work, and so on. Because if it's some silly specialized unikernel or some kind of cluster management software – then it may very well be handled better with a more focused, more specialized API instead of the jenga tower that this patch series starts to build.
Posted Dec 24, 2024 19:12 UTC (Tue)
by daroc (editor, #160859)
[Link] (4 responses)
It's possible that I've misunderstood how the patch series works, but I thought that if the whole series of operations fails, the program that originally started the operation is notified in the normal way (via an item in the io_uring completion queue). You can see that in this example: https://lwn.net/ml/all/20241209234421.4133054-3-krisman@s...
So a program submits the chain of io_uring operations, and then it either succeeds (and a new process is created) or it fails, and the program that submitted it can choose how and whether to retry. So hard links aren't needed, and it's perfectly possible to write a correct program that closes important files with the current patch series.
Posted Dec 24, 2024 19:37 UTC (Tue)
by khim (subscriber, #9252)
[Link] (3 responses)
> You can see that in this example:
Which test is that? AFAICS the only test function that doesn't use linking, test_unlinked_clone_sequence, just issues an unlinked IORING_OP_CLONE and then expects this:
    if (cqe->res != -EINVAL)
        … Unlinked clone should have failed …
That's it. All other examples use linked operations, as they should. It's possible that I have misunderstood something, but at least at first glance it's obvious why it has to be done that way: after IORING_OP_CLONE is executed the whole io_uring machinery (I suspect 99% of Linux functionality) becomes, temporarily, “untouchable”… with some operations still permitted – only in linked form. And then it either succeeds (while silently ignoring errors, leading to unknown state of the executed process) or fails – as a whole. Couldn't see that. At least in that patch series and set of examples. And it's obvious why: if you add that machinery then, suddenly, instead of a simple and non-invasive patch that just adds a couple of io_uring commands, one would need to design completely new machinery which can support inter-process io_uring support! With execution happening in the context of one process and a communication channel opened to another process. Sure, that's not impossible to create, but… do we really want to add so many new subtle features for a 6% speedup? Show me the code. I couldn't find it. And I suspect that it's precisely as I have said: we only see 10% of the iceberg here; the majority of changes, the other 90% of the iceberg, either doesn't exist or is not submitted for review.
Posted Dec 24, 2024 19:49 UTC (Tue)
by corbet (editor, #1)
[Link] (2 responses)
Normally, you would not use hard-linked operations in the newly cloned child context. If one of the setup operations fails, the entire chain fails, with the status returned to the parent. No silently ignored failures. No unknown state.
Hard links can be used in the execveat() sequence to implement a search path. In that case, continuing after failure is the desired outcome; you want to go until you find something you can actually execute.
I am sorry if the article did not adequately convey that.
Doubtless there is interesting work to do to expand the range of actions available in the just-cloned child context; we will have to see what shape that takes. But I see no reason to suspect some sort of evil plot here.
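The search-path use of hard links described above can be done today in plain userspace; this hedged sketch uses fork() plus execv() to stand in for a hard-linked chain of execveat() attempts, with the "continue after failure" behavior made explicit:

```python
import os

def spawn_via_search(candidates, argv):
    """Try each candidate path in order; a failed exec just falls through
    to the next one, like a hard-linked chain of execveat() calls."""
    pid = os.fork()
    if pid == 0:
        for path in candidates:
            try:
                os.execv(path, argv)  # only returns on failure
            except OSError:
                continue              # "hard link": keep going after failure
        os._exit(127)                 # nothing on the list was executable
    return os.waitstatus_to_exitcode(os.waitpid(pid, 0)[1])
```

For example, `spawn_via_search(["/nonexistent/true", "/bin/true"], ["true"])` only succeeds on the second attempt, which is precisely the semantics a hard link gives the in-kernel chain.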
Posted Dec 24, 2024 20:36 UTC (Tue)
by khim (subscriber, #9252)
[Link]
> You are really determined to sink this patch series, I'm not sure why.
I want to understand what that patch series hopes to achieve, mainly. This part is not reassuring: Krisman hopes to be able to at least partially lift that constraint in the future. And this is even more worrying: The hope is to increase the set of possible operations over time, enabling the implementation of complex logic for the spawning of a new task. In essence we are supposed to accept some piece of the whole solution without knowing where the whole thing even leads. And, worse yet, it's not clear what problem this whole thing is even supposed to solve! If it's safety of creation of a new process, then that's one thing (there is no need for io_uring, we already have all the pieces); if it's a 30% speedup for posix_spawn, then it's another thing. Evil plot is unlikely. But it really looks like a solution in search of a problem… and I want to see the problem and, more importantly, an explanation why that's the best solution for it. As was noted in the article, one alternative solution would be to just create a dedicated system call that would include all the required operations. Or “double exec” if we just want to safely implement posix_spawn. And if it's “an attempt to see where it goes” then I don't really want to sink it but more to “flesh it out”, understand how the full, final solution would look and, again, who, why and how would use it. Because as it stands currently, it's not clear to me what the goal of all that activity is – and that matters much more than minor details of the implementation in its current form. Even if we would achieve the final goal of being able to execute all io_uring commands in the sequence of instructions between clone and execute, why are we so sure it would be enough? Where do we plan to arrive with that change and what do we plan to achieve? Ah. I see. While this, again, looks like a solution in search of a problem (why look up the executable before executing it? what's the point of moving this pretty much optional functionality into the kernel? do we really want to continue after finding a “kinda-sorta-suitable” binary that would end up being broken, for some reason?) at least now I understand what I didn't understand about that patch set.
Thanks for explaining it: while I still am not sure how useful it would be to implement what it tries to implement (because, again, I couldn't see the end goal), at least some operations can be implemented in a safe manner. That's better than how I understood it working. Thanks for the explanation.
Posted Dec 24, 2024 23:03 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
How exactly is this going to be achieved for processes? As I understand, there's going to be a new visible intermediate state for the process, as the operations are being executed, unless the io_uring sequence locks the entire kernel.
This also can cause a problem for userspace process migration. How do you interrupt the io_uring sequence to suspend it? After reading the patch series, I don't see how it would prevent long-running operations like read() from being introduced into the middle of the sequence.
It really is a poorly-designed API. It is very much in line with the good old UNIX tradition of screwing up process management APIs.
Posted Jan 12, 2025 18:17 UTC (Sun)
by mrugiero (guest, #153040)
[Link]
Posted Dec 21, 2024 3:14 UTC (Sat)
by geofft (subscriber, #59789)
[Link]
Agree that the complexity can be dealt with in a library once, but also I think it's less hard to program than you'd fear - one approach that would make it relatively pleasant to implement would be to write a tiny standalone binary to do the post-fork actions, and embed that compiled binary as a big constant in this helper library. Then the pre-fork operation (which the library would do for you) is to mmap some new pages to hold the binary and its stack, copy over the binary and fill in the stack appropriately, and tell clone3 to start running that binary from its entry point. Then immediately munmap those pages in the parent. (Or, if you want to get fancy, make a clone3 flag to move pages from the parent to the child instead of CoWing them.) This lets you avoid thinking too much about how the compiler is laying out memory and what parts you need to preserve, because you're essentially running a new program in the child. (In other words, it basically gives you kernel support for the "zygote" approach.)
Posted Dec 21, 2024 19:30 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link]
In a world with a more regular system interface, *every* system call would require callers to specify, explicitly, all objects on which to operate. We wouldn't have operations that work on "the current process" or "the current thread". In this world, the process bootstrapping the GP proposes would fall naturally out of the general shape of the API surface.
Posted Dec 21, 2024 15:32 UTC (Sat)
by khim (subscriber, #9252)
[Link]
Just package it neatly and pass it into a new process, damn it! That's not possible: you need something in the process that you may execute. You can not start from zero. But if you would instead pass an fd number that contains the image that should be loaded there, then with a simple, almost trivial in-kernel change you would enable fully-userspace solutions.
But hey, that's too simple! There are not enough buzzwords in that approach! How can we accept something so sane? Nope, we need to push for io_uring, BPF or maybe even webasm! More complicated, more invasive, yet a much more buzzword-compliant approach!
Posted Dec 21, 2024 1:07 UTC (Sat)
by comex (subscriber, #71521)
[Link] (2 responses)
On Linux, posix_spawn is just a userland wrapper for a vfork/exec dance. But on macOS, posix_spawn is its own syscall. The kernel creates the new process without having to bother with forking the virtual memory space and all that.
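On Linux, Python exposes posix_spawn() directly, file actions and all, which makes the shape of the API easy to see. A small sketch (the output path is illustrative): the child's stdout is redirected by a declarative file-action list rather than by code running between clone() and exec():

```python
import os

def spawn_to_file(out_path):
    """posix_spawn with file actions: the child's stdout is redirected
    into out_path without disturbing the parent's descriptors."""
    fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        pid = os.posix_spawn(
            "/bin/sh", ["sh", "-c", "echo spawned"], dict(os.environ),
            # One (action, *args) tuple per step; here dup2(fd, stdout).
            file_actions=[(os.POSIX_SPAWN_DUP2, fd, 1)],
        )
    finally:
        os.close(fd)
    return os.waitstatus_to_exitcode(os.waitpid(pid, 0)[1])
```

The action-list style is exactly what the comments below debate: it is convenient, but each new kind of between-clone-and-exec work needs a new action type.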
Posted Dec 21, 2024 17:26 UTC (Sat)
by ma4ris8 (subscriber, #170509)
[Link] (1 responses)
Task is to create a child program. The child program will have three file descriptors, the parent's three fds mapped as the child's stdin, stdout and stderr. Close all unrelated file descriptors. Perhaps Valgrind's file descriptors, with fds near the upper bound, 1024, are also allowed to pass through.
One way is to fork, then open /proc/self/fd, close unrelated fds, remap related ones into 0, 1 and 2. After that, exec the final child with a clean state. If the parent has a large memory footprint, this is heavy.
The other way is to do posix_spawn(): spawn an intermediate process, which closes unrelated fds, remaps related ones into 0, 1 and 2, and after clean-up executes the final child process. The drawback is needing the middle process to do the clean-up, but if the parent has a large memory footprint, this is light compared to fork.
Third way: how to do it so that the cleanups could be done in an elegant and memory-safe way, without the separate middle process?
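The first pattern can be written directly; this sketch scans /proc/self/fd in the forked child and closes everything outside the keep-set before the exec (on recent kernels, close_range(2) is the shortcut for the same job):

```python
import os

def exec_with_clean_fds(path, argv, keep=(0, 1, 2)):
    """fork(), close every inherited descriptor not in `keep` by scanning
    /proc/self/fd, then exec the final program."""
    pid = os.fork()
    if pid == 0:
        dirfd = os.open("/proc/self/fd", os.O_RDONLY)
        for name in os.listdir(dirfd):   # listdir accepts a directory fd
            fd = int(name)
            if fd not in keep and fd != dirfd:
                try:
                    os.close(fd)
                except OSError:
                    pass                 # already gone; ignore
        os.close(dirfd)
        try:
            os.execv(path, argv)
        except OSError:
            pass
        os._exit(127)
    return os.waitstatus_to_exitcode(os.waitpid(pid, 0)[1])
```

This is the "heavy" variant from the comment: the fork still copies the parent's page tables, which is exactly the cost the io_uring proposal is trying to avoid.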
Posted Dec 23, 2024 13:44 UTC (Mon)
by fweimer (guest, #111581)
[Link]
In general, this is an anti-pattern, though, because we have to keep adding stuff that's easily expressed elsewhere in code. One issue is that one gets just one error code for an entire list of actions, and that's bad from a debugging point of view. And one day, we'll need to wrap something where a first action produces a value needed by a second action, and we cannot easily force the value to a caller-supplied choice (like we do for file descriptors today).
One silver lining is that vfork may not be as bad as we thought it was for a while. (The TCB sharing is empirically quite harmless for a wide variety of programs). Running compiled C code instead of walking an action list may be the better approach in the end.
Posted Dec 21, 2024 0:20 UTC (Sat)
by josh (subscriber, #17465)
[Link]
Clone a new process, do some initial setup, do a futex wait or blocking read or wait on a uring message, and when that completes, do an execveat of the new process. Now you can have a "pool" of ready-to-start processes, blocked in the kernel, waiting to exec.
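That pattern works today with a pipe in place of the futex or uring wait; a hedged sketch, using plain fork() and execv() where the comment imagines clone and execveat:

```python
import os

def prestart():
    """Create a child early; it blocks reading a pipe, and execs whatever
    path the parent eventually sends down that pipe."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(w)
        path = os.read(r, 4096).decode()  # sleeps in the kernel until released
        os.close(r)
        try:
            os.execv(path, [os.path.basename(path)])
        except OSError:
            pass
        os._exit(127)
    os.close(r)
    return pid, w

# Later, when the real work item arrives:
pid, release = prestart()
os.write(release, b"/bin/true")  # wake the parked child, tell it what to exec
os.close(release)
code = os.waitstatus_to_exitcode(os.waitpid(pid, 0)[1])
```

The pool version keeps several such parked children around, paying the process-creation cost ahead of time instead of on the latency-critical path.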
Posted Dec 21, 2024 1:57 UTC (Sat)
by kxxt (subscriber, #172895)
[Link]
On x86_64, tracing execs is as easy as hooking __x64_sys_execve{,at} and __ia32_compat_sys_execve{,at} for 32-bit (execsnoop doesn't handle it, but my tracexec handles it).
Of course there is sched_process_exec but only for successful execs.
Posted Dec 21, 2024 2:06 UTC (Sat)
by marcH (subscriber, #57642)
[Link] (1 responses)
and that is the whole point, right?
> Without that context, io_uring operations in the newly created process can no longer be asynchronous; every operation in the chain must complete immediately, or the chain will fail.
I'm afraid I'm missing a beat here. I mean, I miss the ... link (pun intended) between "without context" and "asynchronous". Could someone elaborate?
Posted Dec 21, 2024 3:35 UTC (Sat)
by geofft (subscriber, #59789)
[Link]
Posted Feb 3, 2025 12:17 UTC (Mon)
by roblucid (guest, #48964)
[Link] (2 responses)
The environment solved the problem of child processes having to know about everything that could be set, so banishing it unnecessarily incurs a future maintenance problem. Just imagine if a DB-style solution for user preferences had been imposed, with a single point of failure. Having used alternative OS solutions, vast amounts of variables had to be set up and constantly reset in practice, rather than just using a bit copy of an area in process memory.
A better question is why are huge monolithic programs with masses of VM mappings forking anyway? Why aren't small main programs, setting up co-operating processes which have memory isolation and then can fire up threads for their heavy weight processing?
Seems to me the issue is the problem is the coordination between the program pieces, effectively an efficient message passing system so loosely coupled parts of a program can avoid sharing address spaces.
When I read about clone being inefficient because it "copies the whole VM which is mostly unused", I see poor application architecture.
When applications are statically linking huge amounts of common library code in multiple copies, that then may use threading which requires understanding of a memory model and safe operations, all in ONE big pot, it's a bit rich to moan about inefficiency caused by copying VM tables.
Browsers moved away from single process because of security requirements despite it causing duplication, they still manage to be snappy enough, even if it's NOT what you would do in a twitch shooter game.
Posted Feb 3, 2025 18:47 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Why would you reinvent threading and shared memory?
Posted Mar 25, 2025 9:38 UTC (Tue)
by gstrauss (guest, #176692)
[Link]
If the problem is that fork()/execve() is expensive **for large processes** with many memory mappings, file descriptors, etc, then a solution is for those expensive monolithic processes to avoid fork().
One user-space solution: a large process could use AF_UNIX sockets to contact a lightweight process-creation daemon already pre-configured to be ready to do some minimal setup, vfork() or clone(), and then execve(). The lightweight process-creation daemon does not even have to be running all the time. It could conceivably be started on-demand via xinetd or systemd socket triggers.
I wrote one such user-space process-creation daemon over a decade ago, called `proxyexec` (https://github.com/gstrauss/bsock)
tl;dr: Alternative user-space application designs, e.g. possibly using a service oriented architecture (directly or via a process-execution proxy) should be compared and contrasted with extending io_uring to provide IORING_OP_CLONE, IORING_OP_EXEC.
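The fd-passing half of that design is standard SCM_RIGHTS; here is a minimal sketch of how a proxyexec-style daemon would receive descriptors from a client, with a socketpair standing in for the named AF_UNIX socket:

```python
import array, os, socket

def send_fds(sock, fds):
    """Ship open descriptors to the peer in one SCM_RIGHTS message."""
    sock.sendmsg([b"\0"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                            array.array("i", fds))])

def recv_fds(sock, maxfds):
    """Receive descriptors; the kernel installs fresh fd numbers here."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(maxfds * fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return list(fds)

# Demo: the "client" passes the write end of a pipe to the "daemon" side.
client, daemon = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe()
send_fds(client, [w])
(w_in_daemon,) = recv_fds(daemon, 1)
os.write(w_in_daemon, b"hello")
os.close(w_in_daemon); os.close(w)
greeting = os.read(r, 5)
```

In the proxyexec design the daemon would receive stdin/stdout/stderr this way, then vfork()/exec() the requested program with those descriptors installed, so the heavyweight client never forks at all.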
> A zygote process is a time-tested alternative.
vfork/exec (with zero steps between them, using fexecve) and voilà: no races, no possibility of corrupting anything, everything is very clear, simple and guaranteed.
Empty shell
https://fuchsia.dev/fuchsia-src/reference/kernel_objects/...
Some background reading on why fork()/exec() is expensive for large processes:
https://about.gitlab.com/blog/2018/01/23/how-a-fix-in-go-...
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-...
https://bugs.java.com/bugdatabase/view_bug?bug_id=5049299
https://kobzol.github.io/rust/2024/01/28/process-spawning...
https://maelstrom-software.com/blog/spawning-processes-on-linux/
[2] https://github.com/openjdk/jdk/commit/e21cb12d358c22350cb...
[3] https://github.com/python/cpython/issues/80004
[4] https://github.com/python/cpython/issues/113117
The last blog post shows how the overhead rises as the caller process's memory mapping gets bigger.
- Memory overcommit (the Gitaly article): large programs clone resources during fork(). After exec(), that memory has to be cleaned up by the kernel and recycled for re-use.
- Threaded process: after fork(), the file descriptor set is vague. The forked process can investigate and clean up the state, so that the process doing exec() does not need to know about the caller process. This benefit is actually a workaround: why leak file descriptors, just to search for and remove them before exec()? Could the forking thread simply enlist the interesting set of fds, and skip copying the uninteresting ones?
- vfork(): the new process uses the parent's memory, so there is no memory overcommit. The caller thread sleeps until the new process calls exec(). The new process's thread must do as little work as possible, and exec into a middle process; then the caller thread continues. The middle helper process (jspawnhelper) closes leaked file descriptors, re-maps stdin, stdout and stderr, and then execs the final process. This also copies all file descriptors, so the fd clean-up must be done with a helper program. The speed increase comes from avoiding the memory "copy on write" work for big-memory programs.
Best is if nothing unnecessary needs to be created and then destroyed (by a middle process). io_uring can be used to avoid first cloning resources just to tear them down, and to call a child process with given arguments, re-mapping stdin, stdout and stderr, passing some extra file descriptors (individual, or a tail range), and setting the child process's working directory. We could have something simple, which defines (declaratively, programmatically, or a hybrid of those) the configuration for a new process, and does a clean process launch with a single (0-1) io_uring queue submit, without forcing the caller thread to sleep until the child process is launched.
Re: empty shell
They have some variant of posix_spawn which can always be called, and they also have fork/exec, but they only allow those system calls in single-threaded environments.
(The child terminates on end of input)
__attribute__((sarcasm)).
> The basic problem is that, historically, the standard behavior is that "everything" is inherited, unless explicitly listed.
No io_uring, BPF or other madness: everything entirely in userspace, using syscalls that have existed for years and years.
> But then why not just inline the stub into the "carefully written threadsafe library code" to avoid the double exec?
The stub can just as well be shipped in a memfd. If even that is not acceptable for the io_uring proposal then I don't even have an idea what we are fighting for and against. If “io_uring, new and shiny” is not the goal then what is the goal? Where is that solution supposed to send us? And why couldn't we arrive there via simpler means?
Use a clone that would call execveat. And then new code in an entirely empty image can do whatever it needs to prepare for the execution of the real binary that you want to execute. Why drag io_uring into something that already can be done entirely from userspace? Buzzword compliance?
Khim, if you have a better idea, please submit a patch showing it. But please stop insulting the work of others, that does not help anybody.
> But please stop insulting the work of others, that does not help anybody.
A dedicated posix_spawn syscall is doable but would be a significant amount of work without any clear benefits: do we have lots of users of that syscall? If yes, then where are they; if not, then why are they so rare? And I am not at all sure the io_uring-based solution is the best way forward. Especially if it's a solution-in-a-search-of-a-problem: it's much easier to make someone excited about an io_uring solution than about a solution that just combines well-known syscalls in a way that makes posix_spawn safer.
"Buzzword compliance" takes the work of people who are trying to improve the system and casts it as something useless. If it were my work, I would find that insulting. I do not believe that the people working on this are concerned about buzzwords, they are trying to solve real problems. Please try being a bit more respectful toward them.
Fork/exec Did a Job Simply & Well - Most of the Problems Mentioned Have Untreated Causes
The problem was never fork()/exec() itself, but the failure to establish clearly better alternatives as the computing environment changed. The multi-user machines of the day were continually swapping out whole processes to disk and reloading them; single processes did not have VM tables. The fork(2) model allowed multi-tasking, and the parent code in the child process can set to NULL everything the child does not require, with a small amount of memory copying, and then overlay its own code via exec(2).
Surely that's what you want for cache efficiency, so parts with tightly coupled gang processing can share L3, while loosely coupled parts can be scheduled separately.
Again, people mentioned graphics libraries starting threads when linked; well, that's what dynamic loading avoids.
Who created this huge memory management problem, putting everything in one large pot?
A heavyweight or differently-privileged process can contact proxyexec on an AF_UNIX socket and then pass argv, env, and the fds for stdin, stdout, and stderr. proxyexec can run under different credentials from the heavyweight process, and can even run in a different container connected by the AF_UNIX named socket. This is not theoretical: an earlier version of proxyexec was (and maybe still is) used by CloudLinux to remove setuid binaries from their containers while still providing privileged services -- running under different user accounts and outside the containers -- via proxyexec.