Introducing io_uring_spawn
The traditional mechanism for launching a program in a new process on Unix systems—forking and execing—has been with us for decades, but it is not really the most efficient of operations. Various alternatives have been tried along the way but have not supplanted the traditional approach. A new mechanism created by Josh Triplett adds process creation to the io_uring asynchronous I/O API and shows great promise; he came to the 2022 Linux Plumbers Conference (LPC) to introduce io_uring_spawn.
Triplett works in a variety of areas these days, much of it using the Rust language, though he has also been doing some kernel work of late. He is currently working on build systems as well. Build systems are notorious for spawning lots of processes as part of their job, "so I care about launching processes quickly". As with others at this year's LPC, Triplett said that he was happy to see a return to in-person conferences.
Spawning a process
He began with a description of how a Unix process gets started. There are a number of setup tasks that need to be handled before a new process gets executed; these are things like setting up file descriptors and redirection, setting process priority and CPU affinities, dealing with signals and masks, setting user and group IDs, handling namespaces, and so on. There needs to be code to do that setup, but where does it come from?
![Josh Triplett](https://static.lwn.net/images/2022/lpc-triplett-sm.png)
The setup code for the traditional fork() and exec (e.g. execve()) approach must be placed in the existing process. fork() duplicates the current process into a second process that is a copy-on-write version of the original; it does not copy the memory of the process, just the page metadata. Then exec "will promptly throw away that copy and replace it with a new program; if that sounds slightly wasteful, that's because it's slightly wasteful".
He wanted to measure how expensive the fork-and-exec mechanism actually is; he described the benchmarking tool that he would be using for the tests in the talk. The test creates a pipe, reads the start time from the system, then spawns a child process using whichever mechanism is under test; the exec of the child uses a path search to find the binary in order to make it more realistic. The child simply writes the end time to the pipe and exits, using a small bit of optimized assembly code.
The parent blocks waiting to read the end time from the pipe, then calculates the time spent. It does that 2000 times and reports the lowest value; the minimum is used because anything higher than that is in some fashion overhead that he wants to eliminate from his comparison. The intent is to capture the amount of time between the start of the spawn operation and the first instruction in the new process. Using that, he found that fork() and exec used 52µs on his laptop.
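In rough terms, the measurement loop has the shape of the sketch below. This is not Triplett's actual harness: the child binary's name, the use of clock_gettime(), and a C-based child are stand-ins for illustration (his real child is hand-optimized assembly, as noted above).

```c
/* Sketch of a spawn-latency benchmark of the kind described in the talk.
 * Assumption: "timestamp-child" is a hypothetical helper binary that takes
 * a file-descriptor number as its argument and writes its own start time,
 * as a struct timespec, to that descriptor. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static double elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

int main(void)
{
    double best = 1e18;

    for (int i = 0; i < 2000; i++) {
        int pipefd[2];
        struct timespec start, end;

        if (pipe(pipefd) < 0)
            exit(1);
        clock_gettime(CLOCK_MONOTONIC, &start);

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: exec with a path search, as in the talk's benchmark,
             * handing over the write end of the pipe. */
            char fdbuf[16];
            snprintf(fdbuf, sizeof(fdbuf), "%d", pipefd[1]);
            execlp("timestamp-child", "timestamp-child", fdbuf, (char *)NULL);
            _exit(127);
        }

        close(pipefd[1]);
        if (read(pipefd[0], &end, sizeof(end)) == sizeof(end)) {
            double us = elapsed_us(start, end);
            if (us < best)
                best = us;
        }
        close(pipefd[0]);
        waitpid(pid, NULL, 0);
    }

    printf("minimum spawn time: %.1f us\n", best);
    return 0;
}
```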
But that is just a baseline for a process without much memory. If the parent allocates 1GB, the cost goes up a little bit to 56.4µs. But it turns out that Linux has some "clever optimizations" to handle the common case where processes allocate a lot more memory than they actually use. If the parent process touches all of the 1GB that it allocated, though, things get much worse: over 7500µs (or 7.5ms).
There are more problems with fork() beyond just performance, however. For example, "fork() interacts really badly with threads"; any locks held by other threads will remain held in the child forever. The fork() only copies the current thread, but copies all of the memory, which could contain locked locks; calling almost any C library function could then simply deadlock, he said.
There is a list of safe C library functions in the signal-safety man page, but it lacks some expected functions such as chroot() and setpriority(). So if you fork a multi-threaded process, you cannot safely change its root directory or set its priority; "let alone things like setting up namespaces", he said. Using fork() is just not a good option for multi-threaded code.
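A toy program (not from the talk) makes the hazard concrete; it hangs by design, because the child inherits a mutex that was locked by a thread that no longer exists there:

```c
/* Demonstration of the fork()-plus-threads deadlock described above.
 * The same thing happens invisibly when the lock in question belongs to
 * malloc(), stdio, or the dynamic linker. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *holder(void *arg)
{
    pthread_mutex_lock(&lock);      /* hold the lock "at the wrong moment" */
    sleep(3600);
    return arg;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, holder, NULL);
    sleep(1);                       /* ensure the lock is held before forking */

    if (fork() == 0) {
        /* Only the forking thread exists here, but the copied memory still
         * says the mutex is locked, so this blocks forever. */
        pthread_mutex_lock(&lock);
        printf("never reached\n");
        _exit(0);
    }
    wait(NULL);
    return 0;
}
```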
"As long as we are talking about things that are terribly broken, let's talk about vfork()". Unlike fork(), vfork() does not copy the current process to the child process, instead it "borrows" the current process. It is, effectively creating an unsynchronized thread as the child, which runs using the same stack as the parent.
After the vfork() call, the child can do almost nothing: it can exec or exit—"that's the entire list". It cannot write to any memory, including the local stack (except for a single process ID value), and cannot return or call anything. He rhetorically wondered what happens if the child happens to receive a signal; that is "among the many things that can go horribly wrong". Meanwhile, vfork() does not provide any means for doing the kind of setup that might be needed for a new process.
So given that vfork() is broken, he said, "let's at least hope it's broken and fast". His benchmark shows that it is, in fact, fast, coming in at 31.5µs for the base test and there is only a tiny increase, to 31.9µs, for allocating and accessing 1GB. That makes sense because vfork() is not copying any of the process memory or metadata.
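For reference, a rule-abiding vfork() user looks something like the sketch below; strictly speaking, even building the argument vector on the child's stack frame bends the letter of those rules, but in practice that is what real callers do.

```c
/* Minimal vfork()-and-exec sketch: the child does nothing but exec or
 * _exit(); any other setup has to happen in the parent beforehand. */
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = vfork();

    if (pid == 0) {
        /* Child: borrows the parent's memory and stack. */
        char *argv[] = { "/bin/true", NULL };
        char *envp[] = { NULL };
        execve("/bin/true", argv, envp);
        _exit(127);     /* exec failed; must not return into the parent */
    }

    /* The parent was suspended until the exec (or _exit) above happened. */
    waitpid(pid, NULL, 0);
    return 0;
}
```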
Another option is posix_spawn(), which is kind of like a safer vfork() that combines process creation and exec all in one call. It does provide a set of parameters to create a new process with certain attributes, but programmers are limited to that set; if there are other setup options needed, posix_spawn() is not the right choice. It has performance in between vfork() and fork() (44.5µs base); as with vfork(), there is almost no penalty for allocating and accessing 1GB (44.9µs).
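As an illustration (not taken from the talk), redirecting a child's standard output with posix_spawn()'s file actions looks roughly like this:

```c
/* posix_spawn() sketch: declare the setup (here, stdout redirection) as
 * file actions rather than running code between fork and exec. */
#include <fcntl.h>
#include <spawn.h>
#include <stdlib.h>
#include <sys/wait.h>

extern char **environ;

int main(void)
{
    pid_t pid;
    posix_spawn_file_actions_t fa;
    char *argv[] = { "ls", "-l", NULL };

    posix_spawn_file_actions_init(&fa);
    /* In the child: open "out.txt" and make it file descriptor 1. */
    posix_spawn_file_actions_addopen(&fa, 1, "out.txt",
                                     O_WRONLY | O_CREAT | O_TRUNC, 0644);

    /* posix_spawnp() does a path search, like the benchmark above. */
    if (posix_spawnp(&pid, "ls", &fa, NULL, argv, environ) != 0)
        exit(1);

    posix_spawn_file_actions_destroy(&fa);
    waitpid(pid, NULL, 0);
    return 0;
}
```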
The main need for a copy of the original process is to have a place where the configuration code for the new process can live. fork(), vfork(), and posix_spawn() allow varying amounts of configuration for the new process. But a more recent kernel development provides even more flexibility—and vastly better performance—than any of the other options.
Enter io_uring
The io_uring facility provides a mechanism for user space to communicate with the kernel through two shared-memory ring buffers, one for submission and another for completion. It is similar to the NVMe and Virtio protocols. Io_uring avoids the overhead of entering and exiting the kernel for every operation as a system-call-based approach would require; that's always been a benefit, but the additional cost imposed by speculative-execution mitigations makes avoiding system calls even more attractive.
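For readers who have not used the API, a minimal liburing round trip (queue one read, submit, reap the completion) looks like the sketch below; it is purely illustrative and leaves out error handling.

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];

    int fd = open("/etc/hostname", O_RDONLY);
    io_uring_queue_init(8, &ring, 0);          /* 8-entry rings */

    sqe = io_uring_get_sqe(&ring);             /* grab a submission slot */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    io_uring_submit(&ring);                    /* one syscall for the batch */
    io_uring_wait_cqe(&ring, &cqe);            /* wait for the completion */
    printf("read returned %d\n", cqe->res);

    io_uring_cqe_seen(&ring, cqe);             /* mark the CQE as consumed */
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```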
In addition, io_uring supports linked operations, so that the outcome of one operation can affect further operations in the submission queue. For example, a read operation might depend on a successful file-open operation. These links form chains in the submission queue, which serialize the operations in each chain. There are two kinds of links that can be established: a regular link, where the next operation (and any subsequent operations in the chain) will not be performed if the current operation fails, and a "hard" link that will continue performing operations in the chain even when there are failures.
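A classic example of a chain is write-then-fsync, where the fsync should only run if the write succeeded; a sketch (assuming an already-initialized ring) looks like this:

```c
#include <liburing.h>
#include <string.h>

/* Submit a write linked to an fsync; the fsync runs only if the write
 * succeeds.  With IOSQE_IO_HARDLINK it would run even after a failure. */
int write_then_sync(struct io_uring *ring, int fd, const char *msg)
{
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, msg, strlen(msg), 0);
    sqe->flags |= IOSQE_IO_LINK;          /* chain to the next SQE */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, 0);

    io_uring_submit(ring);

    /* One completion comes back per link in the chain. */
    for (int i = 0; i < 2; i++) {
        io_uring_wait_cqe(ring, &cqe);
        io_uring_cqe_seen(ring, cqe);
    }
    return 0;
}
```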
So, he asked, what if we used io_uring for process setup and launch? A ring of linked operations to all be performed by the kernel—in the kernel—could take care of the process configuration and then launch the new process. When the new process is ready, the kernel does not need to return to the user-space process that initiated the creation, thus it does not need to throw away a bunch of stuff that it had to copy as it would with fork().
To that end, he has added two new io_uring operations. IORING_OP_CLONE creates a new task, then runs a series of linked operations in that new task in order to configure it; IORING_OP_EXEC will exec a new program in the task and, if that is successful, skip any subsequent operations in the ring. The two operations are independent: one can clone without doing an exec, or replace the current program by doing an exec without first performing a clone. But they are meant to be used together.
If the chain following an IORING_OP_CLONE runs out of ring operations to perform, the process is killed with SIGKILL since there is nothing for that process to do at that point. It is important to stop processing any further operations after a successful exec, Triplett said, or a trivial security hole can be created; if there are operations on the ring after an exec of a setuid-root program, for example, they would be performed with elevated privileges. If the exec operation fails, though, hard links will still be processed; doing a path search for an executable is likely to result in several failures of this sort, for example.
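Put together, a spawn chain might look something like the fragment below. To be clear, this is only an approximation: the two opcode names come from the talk, but the prep helpers and their signatures are invented here for illustration, since the patches are not upstream and the eventual liburing interface may differ. It assumes an initialized ring and existing logfd, argv, and envp variables.

```c
struct io_uring_sqe *sqe;

/* 1. Create the new task; the linked operations that follow run in it. */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_clone(sqe);                       /* hypothetical helper */
sqe->flags |= IOSQE_IO_LINK;

/* 2. Configuration runs in the child, e.g. wiring up its stdout. */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_dup2(sqe, logfd, STDOUT_FILENO);  /* hypothetical helper */
sqe->flags |= IOSQE_IO_LINK;

/* 3. Exec the new program; on success, any later SQEs in the chain are
 *    skipped so they cannot run with the new program's privileges. */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_exec(sqe, "/usr/bin/make", argv, envp);  /* hypothetical */

io_uring_submit(&ring);
```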
Beyond performance
There are advantages to this beyond just performance. Since there is no user space involved, the mechanism bypasses the C library wrappers and avoids the "user-space complexity, and it is considerable", especially for vfork(). Meanwhile, the problems with spawning from multi-threaded programs largely disappear. He showed a snippet of code that uses liburing to demonstrate that the combination "makes this remarkably simple to do". That example can be seen on slide 55 of his slides, or in the YouTube video of the talk.
Because he had been touting the non-performance benefits of using io_uring to spawn new programs, perhaps some in the audience might be thinking that the performance was not particularly good, he said. That is emphatically not the case; "actually it turns out that it is faster across the board". What he is calling "io_uring_spawn" took 29.5µs in the base case, 30.2µs with 1GB allocated, and 28.6µs with 1GB allocated and accessed.
That is 6-10% faster than vfork() and 30+% faster than posix_spawn(), while being much safer, easier to use, and allowing arbitrary configuration of the new process. "This is the fastest available way to launch a process now."
"Now" should perhaps be in quotes, at least the moment, as he is working with io_uring creator Jens Axboe to get the feature upstream. Triplett still needs to clean up the code some and they need to decide where the right stopping point, between making it faster and getting it upstream, lies. The development of the feature is just getting started at this point, he said; there are multiple opportunities for optimization that should provide even better performance down the road.
Next steps
He has some plans for further work, naturally, including implementing posix_spawn() using io_uring_spawn. That way, existing applications could get a 30% boost on their spawning speed for free. He and Axboe are working on a pre-spawned process-pool feature that would be useful for applications that will be spawning at least a few processes over their lifetime. The pool would contain "warmed up" processes that could be quickly used for an exec operation.
The clone operation could also be optimized further, Triplett thinks. Right now, it uses all of the same code that other kernel clone operations use, but a more-specialized version may be in order; it may make sense to reduce the amount of user-space context that is being created since it is about to be thrown away anyway. He would also like to experiment with creating a process from scratch, rather than copying various pieces from the existing process; the io_uring pre-registered file descriptors could be used to initialize the file table, for example.
Triplett closed his talk with a shout-out to Axboe "who has been incredibly enthusiastic about this". Axboe has been "chomping at the bit to poke at it and make it faster". At some point, Triplett had to push back so that he had time to write the talk; since that is now complete, he expects to get right back into improving io_uring_spawn. He is currently being sponsored on GitHub for io_uring_spawn, Rust, and build systems; he encouraged anyone interested in this work to get in touch.
After a loud round of applause, he took questions. Christian Brauner said that systemd is planning to use io_uring more and this would fit in well; he wondered if there was a plan to add support in io_uring for additional system calls that would be needed for configuring processes. Triplett said that he was in favor of adding any system call needed to io_uring, but he is not the one who makes that decision. "I will happily go on record as saying I would love to see a kernel that has exactly two system calls: io_uring_setup() and io_uring_submit()."
Kees Cook asked how Triplett envisioned io_uring_spawn interacting with Linux security modules (LSMs) and seccomp(). Triplett said that it would work as well as io_uring works with those security technologies today; if the hooks are there and do not slow down io_uring, he would expect them to keep working. Paul Moore noted some of the friction that occurred when the command-passthrough feature was added to io_uring; he asked that the LSM mailing list be copied on the patches.
A remote attendee asked about io_uring support for Checkpoint/Restore in Userspace (CRIU). Axboe said that there is currently no way for CRIU to gather up in-progress io_uring buffers so that they can be restored; that is not a problem specific to io_uring_spawn, though, Triplett said. Brauner noted that there is a Google Summer of Code project to add support for io_uring to CRIU if anyone is interested in working on it.
Brauner asked whether the benchmarks included the time needed to set up the ring buffers; Triplett said that they did not, but it is something that is on his radar for the upstream submission. It is not likely to be a big win (or, maybe, a win at all) for a process that is just spawning one other process, but for programs like make, which spawn a huge number of other programs, the ring-buffer-creation overhead will fade into the noise.
[I would like to thank LWN subscribers for supporting my travel to Dublin for Linux Plumbers Conference.]
| Index entries for this article | |
|---|---|
| Kernel | io_uring |
| Conference | Linux Plumbers Conference/2022 |
Posted Sep 20, 2022 22:15 UTC (Tue)
by unixbhaskar (guest, #44758)
[Link]
Posted Sep 20, 2022 22:30 UTC (Tue)
by josh (subscriber, #17465)
[Link] (4 responses)
Following up on this more concretely, there were several more conversations at LPC about how seccomp and io_uring can interact. io_uring already has a mechanism for self-imposed restrictions on an io_uring; a first pass would be to externally impose those restrictions on a process. That would provide maximum performance for the common cases. Then, on top of that, we could use seccomp to further restrict specific operations, such as opening files or execing processes.
In that regard, I hope that io_uring can do *better* with seccomp than existing syscalls do. Traditionally in seccomp it has been difficult to enforce restrictions on userspace pointers, without providing userspace helpers. But io_uring normally copies arguments from userspace to the kernel when preparing to run an operation, *before* actually running the operation. That would make it easier for a seccomp filter to filter those arguments in a race-free manner.
I'm hopeful that we can build a robust and comprehensive security solution for io_uring without sacrificing performance to do so.
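For context, the self-imposed restriction mechanism mentioned above is registered on a ring that starts out disabled; a sketch using liburing (error handling omitted) looks like this:

```c
#include <liburing.h>

/* Create a ring that will only ever accept read and write submissions. */
int make_restricted_ring(struct io_uring *ring)
{
    struct io_uring_restriction res[2] = {
        { .opcode = IORING_RESTRICTION_SQE_OP, .sqe_op = IORING_OP_READ  },
        { .opcode = IORING_RESTRICTION_SQE_OP, .sqe_op = IORING_OP_WRITE },
    };

    /* The ring starts disabled so restrictions apply before first use. */
    if (io_uring_queue_init(8, ring, IORING_SETUP_R_DISABLED) < 0)
        return -1;
    if (io_uring_register_restrictions(ring, res, 2) < 0)
        return -1;
    return io_uring_enable_rings(ring);
}
```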
Posted Sep 21, 2022 2:05 UTC (Wed)
by gutschke (subscriber, #27910)
[Link] (2 responses)
Race conditions with regards to file handles are of course a well-understood problem. They can result in pipes accidentally being kept open when the application expects them to be closed; and that can break event loops in surprising ways. This can happen if two parts of the application try to spawn at the same time, and in principle that's possible even without having threads. All it takes is a signal handler. And you never know when those are present, as libraries can set their own signal handlers. CLOEXEC can partially mitigate this situation, but isn't a complete solution.
But fortunately, by creating pipes in the child and then passing only one end back to the parent, this issue can generally be avoided. Tricky. Subtle. But solvable without having to resort to black magic.
Maybe forking from a signal context is crazy talk and shouldn't be supported, but that still leaves threads. And annoyingly, there was simply no way to avoid the presence of threads; those potentially result in even more subtle issues. Even if you fork'd a helper process early on, a constructor in a randomly loaded dependency could very well have started threads already. This means you don't get around dealing with the whole mess of orphaned thread locks in the child. Glibc tries to do the right thing, but I couldn't convince myself that it would always work 100%.
So, direct system calls it is. By avoiding all calls into Glibc, we can avoid accidentally blocking on a stale thread lock. Writing this type of code is tedious, but since all that nastiness is hidden in a library, hopefully it isn't too offensive. Only, closer inspection of Glibc shows that a bunch of system calls have wrappers, and some of these wrappers can cause issues too. That means we are now committed to maintaining our own system call library, avoiding Glibc's implementation altogether.
And just when we think we are finally done, the dynamic linker rears its ugly head. A lot of symbols won't get resolved until needed, and of course the dynamic linker has its own set of locks and file descriptors that we don't control. So, now we need to worry about not invoking the linker from the fork'd process.
These are just the major pain points that I remember from working on this code some 10 years ago. I am sure there is more and I have suppressed the memory. Spawning a process with 99% reliability is easy and a lot of the complexity in my code was very rarely needed; but 100% reliable spawning is tough. Fork()/exec() was a great model in the early days of UNIX, but it all came crashing down when threads showed up. Posix_spawn() is a fine alternative, if you don't need particularly complex initialization of the child's environment. But unfortunately, that wasn't an option for us.
I so love the idea of side-stepping all the problems that are caused by undefined global state in the child; moving processing out of user space and into the kernel is brilliant. That's such an elegant solution that gets rid of the entire class of potential problems.
So, while the performance benefits are very welcome, the improvements in reliability and reduction in API complexity have me the most excited. Can't wait to follow progress on this API and see what other system calls will be made available from within io_uring.
Posted Sep 21, 2022 15:04 UTC (Wed)
by wilevers (subscriber, #110407)
[Link]
Posted Oct 6, 2022 13:25 UTC (Thu)
by Shabbyx (guest, #104730)
[Link]
Otherwise, there's no guarantee the new implementation isn't secretly suffering from some of the same problems.
Posted Sep 21, 2022 15:30 UTC (Wed)
by kees (subscriber, #27264)
[Link]
I'm hoping we can do some more perf examination to check "normal" workloads too.
Posted Sep 20, 2022 22:37 UTC (Tue)
by milesrout (subscriber, #126894)
[Link]
Posted Sep 21, 2022 2:22 UTC (Wed)
by jreiser (subscriber, #11027)
[Link] (5 responses)
That is an exaggeration. Counterexample: this works on fedora 5.17.12-100.fc34.x86_64, at least if stderr (fd 2) is a tty:
The child of vfork() can do anything as long as it recognizes that side effects (in both user and kernel space) affect both child and parent.
Posted Sep 21, 2022 22:45 UTC (Wed)
by geofft (subscriber, #59789)
[Link] (4 responses)
I believe it is accurate to say that the child of vfork can make any (raw) syscalls, modify registers, and read from memory, as long as it does not write to memory in a way incompatible with the parent process. But this is an impossible constraint to express in C.
(Relatedly, the list of async-signal-safe functions is a list of async-signal-safe C library functions, not syscalls. All raw syscalls are async-signal-safe, because they are atomic and reentrant from the point of view of userspace; you just might struggle to actually call them.)
Posted Sep 22, 2022 10:35 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (2 responses)
Why do you think that? The child is free to do whatever it likes below the initial stack pointer (just like a signal handler might). It can even modify any other memory providing that it leaves it in a consistent state. So printf() should be fine.
This line in the article:
> It is, effectively, creating an unsynchronized thread as the child
is incorrect. The child IS synchronised. To quote from the fine man page:
> vfork() differs from fork(2) in that the calling thread is suspended
"suspended" is a form of synchronisation and ensures, for example, and the caller doesn't do anything to any of the stack below the stack pointer.
Posted Sep 22, 2022 14:44 UTC (Thu)
by dezgeg (subscriber, #92243)
[Link] (1 responses)
Posted Sep 25, 2022 21:48 UTC (Sun)
by neilbrown (subscriber, #359)
[Link]
Posted Oct 9, 2022 7:55 UTC (Sun)
by izbyshev (guest, #107996)
[Link]
[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attrib...
Posted Sep 21, 2022 2:31 UTC (Wed)
by jreiser (subscriber, #11027)
[Link] (1 responses)
Posted Oct 9, 2022 12:47 UTC (Sun)
by izbyshev (guest, #107996)
[Link]
* PTRACE_EVENT_CLONE is reported for the parent thread after the child process has been woken up. Given that IORING_OP_CLONE is executed asynchronously, who reports it, if at all? The kernel worker thread instead of the real parent thread? If so, is there a simple way to get the real parent (without looking in /proc, if it has been filled at this point at all)?
* The child process starts with either SIGSTOP or PTRACE_EVENT_STOP, depending on the ptrace mode. Given that the process created by IORING_OP_CLONE never returns to userspace before IORING_OP_EXEC, what happens to this notification?
I couldn't find io_uring_spawn patches on the net to check any of this.
Posted Sep 21, 2022 8:09 UTC (Wed)
by liam (guest, #84133)
[Link] (4 responses)
Posted Sep 21, 2022 16:12 UTC (Wed)
by ma4ris5 (guest, #151140)
[Link] (2 responses)
epa suggests that launching another process should provide a mechanism to close unlisted file descriptors.
With threads, all kinds of unknown things can happen, including race conditions with open() and "CLOEXEC", but posix_spawn() lacks a flag to prevent those race conditions by closing the unrelated fds before exec.
io_uring implementation would need similar implementation to be robust,
Posted Oct 9, 2022 8:23 UTC (Sun)
by izbyshev (guest, #107996)
[Link]
Modern glibc has posix_spawn_file_actions_addclosefrom_np().
See also the thread at https://www.openwall.com/lists/libc-coord/2022/01/24/7
AFAIK POSIX resists standardization of such functionality because it could interfere with "private" descriptors used by the implementation.
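For illustration, using that extension to drop every inherited descriptor above the standard three might look like the following (a sketch; it needs glibc 2.34 or later):

```c
#define _GNU_SOURCE
#include <spawn.h>
#include <sys/types.h>

extern char **environ;

/* Spawn argv[0] (with a path search) while closing all fds >= 3 in the child. */
pid_t spawn_with_clean_fds(char *const argv[])
{
    pid_t pid;
    posix_spawn_file_actions_t fa;

    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_addclosefrom_np(&fa, 3);  /* keep only 0, 1, 2 */

    if (posix_spawnp(&pid, argv[0], &fa, NULL, argv, environ) != 0)
        pid = -1;
    posix_spawn_file_actions_destroy(&fa);
    return pid;
}
```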
Posted Oct 23, 2022 2:34 UTC (Sun)
by nybble41 (subscriber, #55106)
[Link]
I can see the utility of this if you're trying to set up a constrained environment like a sandbox or container, but IMHO closing file descriptors your process didn't open would be a mistake in the general case. You can't be certain the subprocess won't need them. Even if you control the subprocess and know exactly which paths it accesses (so you can rule out a user-specified reference to /dev/fd/N), consider the possibility of something like an LD_PRELOAD library making use of an inherited file descriptor identified through an environment variable you know nothing about.
Posted Sep 23, 2022 10:24 UTC (Fri)
by Lennie (subscriber, #49641)
[Link]
https://medium.com/wasmer/running-webassembly-on-the-kern...
Also this presentation might be relevant: https://www.destroyallsoftware.com/talks/the-birth-and-de...
Posted Sep 21, 2022 14:34 UTC (Wed)
by abatters (✭ supporter ✭, #6932)
[Link] (1 responses)
The tl;dr is that glibc switched from using vfork() internally for posix_spawn() to using clone(CLONE_VM | CLONE_VFORK) with a separate stack allocated for the child and masking signals.
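A rough sketch of that approach (not glibc's actual code, and omitting the signal masking and stack-alignment care that a real implementation needs) might look like:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static int child_fn(void *arg)
{
    char **argv = arg;
    execvp(argv[0], argv);
    _exit(127);
}

/* CLONE_VM shares memory (no copy), CLONE_VFORK suspends the parent until
 * the child execs or exits, and the child runs on its own small stack. */
pid_t spawn_on_own_stack(char **argv)
{
    size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (!stack)
        return -1;

    /* The stack grows down, so pass the top of the allocation. */
    pid_t pid = clone(child_fn, stack + stack_size,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, argv);

    /* Thanks to CLONE_VFORK, the child is done with the stack by now. */
    free(stack);
    return pid;
}
```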
Posted Oct 9, 2022 8:34 UTC (Sun)
by izbyshev (guest, #107996)
[Link]
The need for blocking signals applies to both clone(CLONE_VM | CLONE_VFORK) and vfork().
Posted Sep 22, 2022 10:57 UTC (Thu)
by jpfrancois (subscriber, #65948)
[Link]
Posted Sep 24, 2022 23:43 UTC (Sat)
by tullmann (subscriber, #20149)
[Link] (2 responses)
Posted Sep 25, 2022 10:41 UTC (Sun)
by malmedal (subscriber, #56172)
[Link] (1 responses)
If vfork is unsuitable for some reason it is still possible to avoid this issue with CLONE_VM, but it is somewhat annoying.
E.g.:

void target_function() {
    exec(...)
}

char *stack = malloc(65536)
int pid = clone(target_function, stack, CLONE_VM, 0);
You can free or reuse the stack after target_function has called exec, but the annoyance is knowing when exactly that has happened.
Probably the easiest is to set up a pipe and set the filedescriptor to close-on-exec in the child and check for this in the parent.
Posted Oct 9, 2022 9:02 UTC (Sun)
by izbyshev (guest, #107996)
[Link]
* In older glibc it would trash the parent pid/tid cache (while vfork() wouldn't)[1].
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=19957
Posted Jan 9, 2023 11:58 UTC (Mon)
by pabs (subscriber, #43278)
[Link]
https://lore.kernel.org/lkml/202209161637.9EDAF6B18@keesc...
Introducing io_uring_spawn
> After the vfork() call, the child can do almost nothing: it can exec or exit—"that's the entire list". It cannot write to any memory, including the local stack (except for a single process ID value), and cannot return or call anything.
vfork is not so encumbered
#include <unistd.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
if (0==vfork()) {
char msg[]= "hello\n";
exit(write(2, msg, -1+ sizeof(msg)));
}
else {
char msg[] = "goodbye\n";
exit(write(2, msg, -1+ sizeof(msg)));
}
}
and strace -f proves:
vfork(strace: Process 119381 attached
<unfinished ...>
[pid 119381] write(2, "hello\n", 6hello
) = 6
[pid 119381] exit_group(6) = ?
[pid 119380] <... vfork resumed>) = 119381
[pid 119381] +++ exited with 6 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=119381, si_uid=1000, si_status=6, si_utime=0, si_stime=0} ---
write(2, "goodbye\n", 8goodbye
) = 8
exit_group(8) = ?
vfork is not so encumbered
vfork is not so encumbered
> until the child terminates (either normally, by calling _exit(2), or
> abnormally, after delivery of a fatal signal), or it makes a call to
> execve(2)
vfork is not so encumbered
vfork is not so encumbered
Maybe the best advice is to put such code into a separate .c file, and build that file with '-O0'.
vfork is not so encumbered
strace -f, which follows all descendant processes, must give meaningful output. Similarly, (gdb) set follow-fork-mode child must allow interception in the context of a descendant, but before the first instruction is executed.
remember debugging
remember debugging
Anyone recall this Microsoft Research: A fork() in the road?
Introducing io_uring_spawn
The conclusions aren't the same but neither are the motivations (IMHO), and it's still interesting to note the parallels.
Introducing io_uring_spawn
mm7323 mentions then, that opening /proc/self/fd (after vfork()) would enable efficient way of doing that. Also valgrind fds are mentioned.
so that intermediate process, which does the file descriptor clean up after exec(), wouldn't be needed.
glibc implementation of posix_spawn
https://sourceware.org/bugzilla/show_bug.cgi?id=14750
https://sourceware.org/git/?p=glibc.git;a=commit;h=9ff72d...
https://sourceware.org/git/?p=glibc.git;a=commit;h=ccfb29...
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/...
glibc implementation of posix_spawn
Introducing io_uring_spawn
* You need to care about the direction of stack growth, and your code example is actually wrong because of that (the stack grows down on all currently alive arches).
* You need to care about whether the stack pointer is aligned according to the ABI[2].
* You might discover that clone() doesn't exist anymore and that you need to use clone3() instead, as almost happened for LoongArch[2].
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=27902
[3] https://www.openwall.com/lists/musl/2022/05/12/1
Introducing io_uring_spawn