LWN: Comments on "Process creation in io_uring"

Fork/exec Did a Job Simply & Well - Most of the Problems Mentioned Have Untreated Causes

gstrauss — Tue, 25 Mar 2025 09:38:03 +0000

A bit late to the party, similar to @roblucid.

If the problem is that fork()/execve() is expensive **for large processes** with many memory mappings, file descriptors, etc, then a solution is for those expensive monolithic processes to avoid fork().

One user-space solution: a large process could use AF_UNIX sockets to contact a lightweight process-creation daemon already pre-configured to be ready to do some minimal setup, vfork() or clone(), and then execve(). The lightweight process-creation daemon does not even have to be running all the time. It could conceivably be started on-demand via xinetd or systemd socket triggers.

I wrote one such user-space process-creation daemon over a decade ago, called `proxyexec` (https://github.com/gstrauss/bsock).
A heavyweight or differently-privileged process can contact proxyexec on an AF_UNIX socket and then pass argv, env, and fds for stdin, stdout, and stderr. proxyexec can even run under different credentials from the heavyweight process, and can even run in a different container connected by the AF_UNIX named socket. This is not theoretical: an earlier version of proxyexec was (and maybe still is) used by CloudLinux to remove setuid binaries from their containers, and to still provide privileged services -- running under different user accounts and outside the containers -- via proxyexec.
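
A minimal sketch of the descriptor-passing step, assuming the standard SCM_RIGHTS mechanism on a connected AF_UNIX socket. The names `send_fd` and `recv_fd` are hypothetical, and this is not proxyexec's actual wire protocol, only an illustration of how one fd can cross a process boundary:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one file descriptor over a connected
 * AF_UNIX socket. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 'F';  /* a one-byte payload accompanies the descriptor */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;  /* forces correct alignment of buf */
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one file descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

A spawning daemon built this way would call `recv_fd()` three times (stdin, stdout, stderr), dup2() the results onto fds 0-2, and then execve(); the heavyweight client never forks at all.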

tl;dr: Alternative user-space application designs, e.g. possibly using a service oriented architecture (directly or via a process-execution proxy) should be compared and contrasted with extending io_uring to provide IORING_OP_CLONE, IORING_OP_EXEC.

Fork/exec Did a Job Simply & Well - Most of the Problems Mentioned Have Untreated Causes

Cyberax — Mon, 03 Feb 2025 18:47:42 +0000

> A better question is why are huge monolithic programs with masses of VM mappings forking anyway? Why aren't small main programs, setting up co-operating processes which have memory isolation and then can fire up threads for their heavy weight processing?

Why would you reinvent threading and shared memory?

Fork/exec Did a Job Simply & Well - Most of the Problems Mentioned Have Untreated Causes

roblucid — Mon, 03 Feb 2025 12:17:28 +0000

Modern developers criticising an early-'70s design, which gave clean & simple primitives for solving problems, shouldn't blame fork/exec but the failure to establish clearly better alternatives as the computing environment changed. The multi-user machines of the day were continually swapping out whole processes to disk and reloading them; single processes did NOT have VM tables. The fork(2) model allowed multi-tasking: the parent's code, running in the child process, can set to NULL everything the child does not require with a small amount of memory copying, and then overlay its own code via exec(2).

The environment solved the problem of child processes having to know about everything that could be set, so banishing it unnecessarily incurs a future maintenance problem. Just imagine if a DB-style solution for user preferences had been imposed, with a single point of failure. Having used alternative OS solutions, I found that vast amounts of variables had to be set up and constantly reset in practice, rather than just using a bit copy of an area in process memory.

A better question is why are huge monolithic programs with masses of VM mappings forking anyway? Why aren't small main programs, setting up co-operating processes which have memory isolation and then can fire up threads for their heavy weight processing?
Surely that's what you want for cache efficiency, so parts with tightly coupled gang processing can share L3, while loosely coupled parts can be scheduled separately.

Seems to me the problem is the coordination between the program pieces: effectively, what is needed is an efficient message-passing system so that loosely coupled parts of a program can avoid sharing address spaces.

When I read about clone being inefficient because it "copies the whole VM which is mostly unused", I see poor application architecture.
Again people mentioned graphics libraries starting threads when linked, well that's what dynamic loading avoids.

When applications are statically linking huge amounts of common library code in multiple copies, that then may use threading which requires understanding of a memory model and safe operations, all in ONE big pot, it's a bit rich to moan about inefficiency caused by copying VM tables.
Who created this huge memory management problem putting everything in one large pot?

Browsers moved away from single process because of security requirements despite it causing duplication, they still manage to be snappy enough, even if it's NOT what you would do in a twitch shooter game.

Why not just have a one-step spawn?

mrugiero — Sun, 12 Jan 2025 18:17:53 +0000

I believe it would be easier (but would take more execs) to just use execline on your first exec to set up the environment correctly. Nice little scripting language which is already designed for that.

Re: empty shell

chexo4 — Sun, 12 Jan 2025 02:33:28 +0000

Do you think we'll get there at some point? What do you mean by aversion, exactly? This seems like a really useful mechanism to me and I don't see why it wouldn't be a great option to have for the many cases where you don't want/need to inherit the parent process' address space.

BPF!

surajm — Tue, 31 Dec 2024 15:19:46 +0000

That sounds like the fuchsia api. I'm sure it's popular on many microkernels as well.

BPF!

izbyshev — Tue, 31 Dec 2024 07:37:23 +0000

> Thus one of Go's performance secrets is to use posix_spawn() since 2017.

On Linux, Go uses raw syscalls for almost all standard functionality, and Go programs usually don't even link to a C library. So, Go uses a vfork() equivalent followed by execve(), not the posix_spawn() library function [1].

> Linux Java from fork() into posix_spawn() near 2018

No, it migrated from vfork() [2] (which has been used by default since forever). So, overcommit issues weren't present in the first place.

The CPython issue [3] for migrating subprocess from fork() to vfork() contains a lot of useful links on the topic. In some parts, it's outdated [4].

[1] https://github.com/golang/go/blob/194de8fbfaf4c3ed54e1a3c...
[2] https://github.com/openjdk/jdk/commit/e21cb12d358c22350cb...
[3] https://github.com/python/cpython/issues/80004
[4] https://github.com/python/cpython/issues/113117

BPF!

bluca — Sun, 29 Dec 2024 19:17:39 +0000

> Thus one of Go's performance secrets is to use posix_spawn() since 2017.

I switched systemd to use pidfd_spawn (which is posix_spawn but with clone3(), which gets back a pidfd instead of a pid, and can clone into the target cgroup atomically) in v255 last year for similar reasons, as the copy-on-write trap overhead was hitting the Azure fleet hard. I should probably do a write-up about that at some point...

BPF!

ma4ris8 — Sun, 29 Dec 2024 18:56:10 +0000

Here is the Rust Maelstrom analysis of the memory usage overhead in fork().
It shows how the overhead rises when the caller process has bigger memory mappings.
https://maelstrom-software.com/blog/spawning-processes-on-linux/

BPF!

ma4ris8 — Sun, 29 Dec 2024 18:42:05 +0000

Gitaly's experience of fork overhead was an amazing read:
https://about.gitlab.com/blog/2018/01/23/how-a-fix-in-go-...

Thus one of Go's performance secrets is to use posix_spawn() since 2017.

Linux Java from fork() into posix_spawn() near 2018, using "jspawnhelper" as a clean up process:
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-...

Oracle's Java for Solaris took the change earlier, in 2013:
https://bugs.java.com/bugdatabase/view_bug?bug_id=5049299

Rust language: sometimes uses glibc's posix_spawn():
https://kobzol.github.io/rust/2024/01/28/process-spawning...

Fork/Exec major problem:
- Memory overcommit (Gitaly article): large programs clone resources during fork. After exec,
memory has to be cleaned up by the kernel, and recycled for re-use.

Fork/Exec benefits:
- Threaded process: After fork, file descriptor set is vague. Forked process can investigate and clean up the state,
so that process doing exec() does not need to know about the caller process.
- This benefit is actually a work around: why to leak file descriptors, just to search for, and remove those before exec()?
- Could forking thread simply enlist the interesting set of fds, and then skip copying the uninteresting ones?

posix_spawn() design:
- New process uses parent's memory. No memory overcommit.
Caller thread sleeps, until new process does "exec()".
New process's thread must do as little work as possible, and then exec into the middle process.
Caller thread continues.
Middle helper process (jspawnhelper) closes leaked file descriptors, re-maps stdin,stdout,stderr,
and then exec's to the final process.
- This also copies all file descriptors, thus the fd clean up must be done with a helper program.
Speed increase comes from avoiding the memory "Copy on Write" work though for big memory programs.

Thus the optimal (bpf, io_uring) solution would be to declare what needs to be cloned, re-mapped,
and changed. Best is if nothing unnecessary needs to be created and then destroyed (by the middle process).

The world for big memory programs (Go, Java), has already moved from fork() into posix_spawn.

Thus there is big amount of Kernel work to be avoided, if the io_uring approach (and/or BPF enhancement)
can be used to avoid first cloning resources, just to tear those down, and to call a child process
with given arguments, re-mapping stdin, stdout, stderr, passing some extra file descriptors (individual, tail range),
and setting child process working directory.

So instead of cloning everything, and tearing down, and making the caller thread sleep until the child process is launched,
we could have something simple, which defines (declaratively, programmatically, or a hybrid of those) the configuration for
a new process, and does a clean process launch with a single (0-1) io_uring queue submit,
without forcing the caller thread to sleep until the child process is launched.

Generalizing system calls to operate on other processes

epa — Sat, 28 Dec 2024 20:58:05 +0000

I imagined it would first return from fork() in the parent process, returning zero so you think you are in the child process. But now, any call to open() is actually open_pidfd() applied to the child. And exec() is also redirected so that it calls exec_pidfd() for the child and then jumps back to the end of the fork() call, this time returning the child's pid, so the parent continues executing. That could maybe be done with setjmp/longjmp or with some even darker magic that the C standard library is able to perform, perhaps with the help of inline assembly.

I wouldn't be surprised if similar hacks have existed to help port Unix code to single-tasking operating systems like MS-DOS.

Generalizing system calls to operate on other processes

daroc — Sat, 28 Dec 2024 14:11:11 +0000

You can write your own magical twice-returning functions with setjmp() and longjmp(), although (as always in C) there are caveats around using those correctly.
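
A minimal sketch of such a twice-returning call site, using only standard `setjmp()`/`longjmp()`. The function name `count_setjmp_returns` is hypothetical. It also shows one of the caveats: a local that is modified between the two returns must be `volatile`, and `setjmp()` should only appear in a simple controlling expression.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical demo: counts how many times the single setjmp() call
 * site below returns. */
int count_setjmp_returns(void)
{
    jmp_buf env;
    /* volatile: modified after setjmp() and read after longjmp();
     * without it the value would be indeterminate. */
    volatile int returns = 0;

    if (setjmp(env) == 0) {
        /* First return: the direct call yields 0 (think "child" side). */
        returns = returns + 1;
        longjmp(env, 1);  /* jump back; setjmp() now yields 1 */
    } else {
        /* Second return from the same call site ("parent" side). */
        returns = returns + 1;
    }
    return returns;
}
```

Calling `count_setjmp_returns()` yields 2: both branches of the `if` ran from one call, which is the property a fork() emulation layer would need.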

Generalizing system calls to operate on other processes

mathstuf — Sat, 28 Dec 2024 14:01:42 +0000

> Indeed you could even have a fork emulation layer in the C library which, on returning from fork() or vfork(), acts as though you were in the child process, so that calling open() actually calls into open_pidfd(), and then the final exec() call runs exec_pidfd() and then continues with the parent process's code.

How would that work? The "magic" of `fork()` (and related functions) is that it returns twice: once with a `0` return value and once with a `pid`. How would a library do any kind of emulation to allow taking *both* sides of the `if` condition it (eventually) leads to without some kind of language magic?

Generalizing system calls to operate on other processes

epa — Fri, 27 Dec 2024 17:01:11 +0000

That's a great read. I particularly liked the idea of making a system call that would apply to another process. So all system calls get an additional process id argument (or a pidfd, I guess) and, where reasonably possible, you are allowed to call them to apply to another process, as long as it's one of your children and executing as the same user, or you are root.

That means instead of fork() and in the child process opening file handles, the parent process could take care of all this. Create the child, which is initially not schedulable, make any system calls you want to set up the child's execution environment, then finally an exec_pidfd() to apply to the child process and mark it schedulable. That's a great way to apply "the Unix philosophy", composing the existing simple system calls rather than creating a kitchen sink like posix_spawn(), while avoiding the known problems of forking.

Existing code should be translatable to the new scheme without too much trouble. (Indeed you could even have a fork emulation layer in the C library which, on returning from fork() or vfork(), acts as though you were in the child process, so that calling open() actually calls into open_pidfd(), and then the final exec() call runs exec_pidfd() and then continues with the parent process's code. That's a bit kooky but might be a quick way to migrate older code which just wants to spawn a subprocess.)

BPF!

alkbyby — Fri, 27 Dec 2024 01:52:29 +0000

I predict this will be an interesting discussion.

Can you please elaborate more specifically on thread-unsafety of posix_spawn implementations? POSIX might not explicitly say that posix_spawn is safe to use in MT programs, but its main purpose is clearly to fix fork's problems with threads. So it has to be MT-safe.

Fork+exec and threads are too unsafe in practice. Even our esteemed editor made an error above. Here: "details of what happens between those two calls can vary quite a bit. <skipped>environment adjusted, and so on".

Thing is, updating process environment (e.g. via setenv) typically invokes malloc. And calling malloc in-between fork and exec is unsafe.

As per posix (quoting from man 3posix fork): "If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called."
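
A minimal sketch of a fork/exec sequence that stays within that rule, assuming the parent builds argv and the environment before forking so the child never needs setenv() or malloc(). The helper name `fork_exec_wait` and the exit codes 126/127 are illustrative choices, not taken from any particular libc:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run path with argv/envp, optionally redirecting
 * the child's stdout to out_fd (pass -1 to leave stdout alone).
 * Returns the child's exit status, or -1 on error or abnormal exit. */
int fork_exec_wait(const char *path, char *const argv[],
                   char *const envp[], int out_fd)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: only async-signal-safe calls from here on
         * (dup2, execve, _exit). No malloc, setenv, or stdio. */
        if (out_fd >= 0 && dup2(out_fd, STDOUT_FILENO) < 0)
            _exit(126);
        execve(path, argv, envp);  /* environment was prepared by the parent */
        _exit(127);                /* reached only if execve failed */
    }
    /* Parent: wait for the child, retrying if interrupted by a signal. */
    while (waitpid(pid, &status, 0) < 0)
        if (errno != EINTR)
            return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The design point is that every allocation happens on the parent side of fork(); the child only rearranges descriptors and replaces itself.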

In practice malloc implementations go to some lengths to make malloc() possible after fork by carefully setting up pthread_atfork or alternatives. But this is a big enough can of worms. And for example "abseil" tcmalloc explicitly doesn't (https://github.com/search?q=repo%3Agoogle%2Ftcmalloc%20at...). As per Google's internal policy pthread_atfork is forbidden (which is another, but somewhat related topic).

So posix_spawn is the right thing IMO. And any exotic process setup things that might be missing in your favorite libc (e.g. stuff like unshare/prctl) you can always do in a small helper program. You exec into it. It gets clean slate, can do whatever syscalls and mallocs and what not. Single-threadly. And then exec into real thing.
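
A minimal sketch of that route, using the standard `posix_spawn()` file-actions API to redirect stdout without running any caller code in the child. The helper name `spawn_and_wait` is hypothetical; anything the file actions and spawn attributes cannot express would go into the small helper program described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: spawn path with argv, optionally redirecting the
 * child's stdout to out_fd (pass -1 to leave it alone), then wait.
 * Returns the child's exit status, or -1 on error or abnormal exit. */
int spawn_and_wait(const char *path, char *const argv[], int out_fd)
{
    posix_spawn_file_actions_t fa;
    pid_t pid;
    int status, rc;

    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;
    /* Declare the fd change up front; libc applies it in the child. */
    if (out_fd >= 0 &&
        posix_spawn_file_actions_adddup2(&fa, out_fd, STDOUT_FILENO) != 0) {
        posix_spawn_file_actions_destroy(&fa);
        return -1;
    }
    /* posix_spawn() returns an error number directly, not -1/errno. */
    rc = posix_spawn(&pid, path, &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    if (rc != 0)
        return -1;

    while (waitpid(pid, &status, 0) < 0)
        if (errno != EINTR)
            return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, spawning `/bin/sh -c "printf hi"` with the write end of a pipe as `out_fd` leaves the two bytes `hi` readable from the pipe's read end.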

As for the original discussion, I am really hoping io_uring is kept only for perf-critical stuff. Spawning children isn't.

Chained operations in io_uring

Cyberax — Tue, 24 Dec 2024 23:03:06 +0000

> Normally, you would not use hard-linked operations in the newly cloned child context. If one of the setup operations fails, the entire chain fails

How exactly is this going to be achieved for processes? As I understand, there's going to be a new visible intermediate state for the process, as the operations are being executed, unless the io_uring sequence locks the entire kernel.

This also can cause a problem for userspace process migration. How do you interrupt the io_uring sequence to suspend it? After reading the patch series, I don't see how it would prevent long-running operations like read() from being introduced into the middle of the sequence.

It really is a poorly-designed API. It is very much in line with the good old UNIX tradition of screwing up process management APIs.

Chained operations in io_uring

khim — Tue, 24 Dec 2024 20:36:37 +0000

> You are really determined to sink this patch series, I'm not sure why.

I want to understand what that patch series hopes to achieve, mainly. This part is not reassuring: Krisman hopes to be able to at least partially lift that constraint in the future. And this is even more worrying: The hope is to increase the set of possible operations over time, enabling the implementation of complex logic for the spawning of a new task.

In essence we are supposed to accept some piece of the whole solution without us knowing where the whole thing even leads.

And, worse yet, it's not clear what problem this whole thing is even supposed to solve!

If it's safety of creation of a new process then it's one thing (there is no need for io_uring, we already have all the pieces), if it's 30% speedup for posix_spawn, then it's another thing.

> But I see no reason to suspect some sort of evil plot here.

Evil plot is unlikely. But it really looks like a solution in a search of a problem… and I want to see the problem and, more importantly, explanation why that's the best solution for it.

As was noted in article one alternative solution would be to just create a dedicated system call that would include all the required operations. Or “double exec” if we just want to safely implement posix_spawn.

And if it's “an attempt to see where it goes” then I don't really want to sink but more to “flesh it out”, understand how the full, final, solution would look like and, again, who, why and how would use it.

Because as it stands currently, it's not clear to me what's the goal of all that activity – and that matters much more than minor details of the implementation in the current form.

Even if we would achieve the final goal of being able to execute all io_uring commands in this sequence of instructions between clone and execute, why are we so sure it would be enough?

Where do we plan to arrive with that change and what do we plan to achieve?

> Hard links can be used in the execveat() sequence to implement a search path. In that case, continuing after failure is the desired outcome; you want to go until you find something you can actually execute.

Ah. I see. While this, again, looks like a solution in search of a problem (why look up the executable before executing it? what's the point of moving this pretty much optional functionality into the kernel? do we really want to try to continue after finding a “kinda-sorta-suitable” binary that would end up being broken, for some reason?) at least now I understand what I didn't understand about that patch set.

Thanks for explaining it: while I still am not sure how useful would it be to implement what it tries to implement (because, again, I couldn't see the end goal), at least some operations can be implemented in safe manner. That's better than how I understood it working. Thanks for explanation.

Chained operations in io_uring

corbet — Tue, 24 Dec 2024 19:49:22 +0000

You are really determined to sink this patch series, I'm not sure why.

Normally, you would not use hard-linked operations in the newly cloned child context. If one of the setup operations fails, the entire chain fails, with the status returned to the parent. No silently ignored failures. No unknown state.

Hard links can be used in the execveat() sequence to implement a search path. In that case, continuing after failure is the desired outcome; you want to go until you find something you can actually execute.
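
A minimal userspace model of that search-path behaviour, assuming plain fork() and execv() rather than the patch series' io_uring opcodes. The names `exec_first_found` and `run_first_found` are hypothetical. Each failed exec is simply followed by the next attempt, which is the "continue after failure" semantics described for hard links:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: try each candidate path in order. A successful
 * execv() never returns, so this returns only if every attempt failed. */
void exec_first_found(const char *const candidates[], char *const argv[])
{
    for (size_t i = 0; candidates[i] != NULL; i++)
        execv(candidates[i], argv);  /* on failure, fall through to the next */
}

/* Hypothetical wrapper for demonstration: fork, run the search in the
 * child, and return the exit status (127 if nothing could be executed). */
int run_first_found(const char *const candidates[], char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        exec_first_found(candidates, argv);
        _exit(127);  /* whole search failed */
    }
    while (waitpid(pid, &status, 0) < 0)
        if (errno != EINTR)
            return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With candidates `{"/nonexistent/bin/sh", "/bin/sh", NULL}`, the first attempt fails and the second one runs. Setup steps, by contrast, would not want this fall-through, which is why they are not hard-linked.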

I am sorry if the article did not adequately convey that.

Doubtless there is interesting work to do to expand the range of actions available in the just-cloned child context; we will have to see what shape that takes. But I see no reason to suspect some sort of evil plot here.

Chained operations in io_uring

khim — Tue, 24 Dec 2024 19:37:53 +0000

> You can see that in this example:

Which test is that? AFAICS the only test function that doesn't use linking, test_unlinked_clone_sequence just issues unlinked IORING_OP_CLONE and then expects this:

if (cqe->res != -EINVAL)
	… Unlinked clone should have failed …

That's it. All other examples use linked operations, as they should.

It's possible that I have misunderstood something, but at least at first glance it's obvious why it has to be done that way: after IORING_OP_CLONE is executed the whole io_uring machinery (I suspect 99% of Linux functionality) becomes, temporarily, “untouchable”… with some operations still permitted – only in linked form. And then it either succeeds (while silently ignoring errors leading to unknown state of the executed process) or fails – as a whole.

> So a program submits the chain of io_uring operations, and then it either succeeds (and a new process is created) or it fails, and the program that submitted it can choose how and whether to retry.

Couldn't see that. At least in that patch series and set of examples.

And it's obvious why: if you add that machinery then, suddenly, instead of simple and non-invasive patch that just adds couple of io_uring commands one would need to design completely new machinery which can support inter-process io_uring support! With execution happening in the context of one process and communication channel opened to another process.

Sure, that's not impossible to create, but… do we really want to add so many new subtle features for 6% speedup?

> So hard links aren't needed, and it's perfectly possible to write a correct program that closes important files with the current patch series.

Show me the code. I couldn't find it. And I suspect that it's precisely as I have said: we only see 10% of the iceberg here, the majority of the changes, 90% of the iceberg, either doesn't exist or has not been submitted for review.

Chained operations in io_uring

daroc — Tue, 24 Dec 2024 19:12:47 +0000

> How would anything work without hard links? After IORING_OP_CLONE your process is in “undead” state. It's neither alive nor entirely dead, but the important part: it doesn't have any userspace that may act and do some decisions.

It's possible that I've misunderstood how the patch series works, but I thought that if the whole series of operations fails, the program that originally started the operation is notified in the normal way (via an item in the io_uring completion queue). You can see that in this example: https://lwn.net/ml/all/20241209234421.4133054-3-krisman@s...

So a program submits the chain of io_uring operations, and then it either succeeds (and a new process is created) or it fails, and the program that submitted it can choose how and whether to retry. So hard links aren't needed, and it's perfectly possible to write a correct program that closes important files with the current patch series.

Chained operations in io_uring

khim — Tue, 24 Dec 2024 16:04:29 +0000

> In any case — nothing obliges developers to use hard links between io_uring operations.

How would anything work without hard links? After IORING_OP_CLONE your process is in “undead” state. It's neither alive nor entirely dead, but the important part: it doesn't have any userspace that may act and do some decisions.

That's the whole point of that patch series: to introduce a way to “clean up” that “undead” process by doing some operations when userspace is entirely gone.

If you wouldn't use hard links… what is supposed to happen? How would non-hardlinked operations work without userspace? What is supposed to happen if some operation would fail? We don't have any agent that may receive information about failure!

Sure, we would know that the combined operation would fail without running the program, but we would just make an already stupid situation (where people try to execute programs with a non-standard loader, get the message “file not found” and then spend hours trying to understand why a program that's clearly there with all the proper permissions couldn't be executed) even worse.

> While it is arguably suboptimal to introduce yet another API that must be used correctly or risk security problems, it is hardly the first such API in the kernel.

It's worse than that: currently it's not “must be used correctly or risk security problems” but “it must be extended more before it would go beyond “proof of concept” phase, because in its current form it's impossible to use it safely”.

And we have no idea how much more should it be extended to become actually usable. That's precisely why I have said that we need to know who, why and how plans to use that mechanism – because without such information we have no idea what needs to be added to it to make it actually usable.

> Nothing prevents a poorly-written program from leaking various kinds of state to another program with the current fork()/exec() workflow.

Sure, but correctly-written program can do everything correctly. And handle failures safely. Even if glibc fails to do that it's possible, at least in theory. That's currently impossible to do with the new mechanism.

Can it be extended to handle these things? Sure: you can make it possible to receive information about io_uring operations in the parent process. Or introduce some high-level “cleanup” operations. Or do many other extra extensions… but before we would do all that the main question needs to be answered: what we are actually trying to do with that mechanism?

> the best way to resolve that is usually with examples and more explanations.

Sure, but why are you directing that request to me? The main advantage that was touted in the original work was speed: 6-10% faster than vfork() and 30+% faster than posix_spawn().

I have no idea who would really need that speedup (most of the time, the time spent in fork/exec is minuscule compared to the time needed to run the dynamic loader, verify signatures and so on), but that sounded somewhat sensible.

But if all that complexity (including fight with a kernel corruption after a few spawns) and less reliability than what existing mechanisms provide is justified then it would be really nice to know who executes so many processes that they do care about fork/exec time, why the time needed to actually start process is not impeding their work and so on.

Because if it's some silly specialized unikernel or some kind of cluster management software – then it may very well be handled better with a more focused, more specialized API instead of jenga tower that this patch series starts to build.

Chained operations in io_uring

daroc — Tue, 24 Dec 2024 14:41:15 +0000

Let's please not get too heated; even though we try to make articles as clear as possible, it's easy to have slightly different understandings of complex technical topics, and the best way to resolve that is usually with examples and more explanations.

In any case — nothing obliges developers to use hard links between io_uring operations. If an important cleanup operation is necessary, and it is not safe to execute the new program if it fails, don't use a hard link. While it is arguably suboptimal to introduce yet another API that must be used correctly or risk security problems, it is hardly the first such API in the kernel. Nothing prevents a poorly-written program from leaking various kinds of state to another program with the current fork()/exec() workflow.

Chained operations in io_uring

khim — Tue, 24 Dec 2024 14:10:07 +0000

> The neat thing about the proposed io_uring solution is that the special properties of these chained operations already exist for other reasons

Have you actually read the article? That one, specifically: Krisman hopes to be able to at least partially lift that constraint in the future.

It's extremely clear to me that the interface, as presented, is not finished and not tested. Or, even worse, it is tested and is just being fed to kernel developers in an insidious way to convince them to adopt a huge hairball of an API that would be immediately rejected if presented in its full capacity… that's even worse than an “unfinished and untested” API in my book (and I sincerely hope it's not that: Hanlon's razor and all that).

> The only new things here are IORING_OP_CLONE that creates a new process (not able to run)

Which is something that Linux hasn't supported till today. Currently “load new executable in a process” is one atomic operation that starts from the state where the kernel has something mapped and executable in the process's address space and ends in the state where the kernel has something mapped and executable in the process's address space.

An attempt to split that process in two looks innocuous enough, but it's entirely not clear what strange pitfalls it may hit.

> The intent is that you can do something like write to a WAL, fsync the WAL if the WAL write succeeds, write to the final location if the WAL write and fsync succeed, and then regardless of success of the WAL and final location writes trigger a futex wake, all in a single submission to the kernel.

Yes. And that's fine because code before and after comes from the exact same codebase. If some steps are omitted and/or failed then code that started the whole mess could, presumably, handle these failures gracefully.

Compare with io_uring attempt to do clone/exec attempt: you are doing some important cleanup work after clone which is, well, important (or we wouldn't worry so much about doing it in the first place) – and if it fails we execute foreign code, anyway.

This sounds, to me, like “hey, we have added nice security vulnerability to the kernel API, we just have no idea how to exploit it in the wild… contest is starting”!

The most likely consequence would be a pile of special cases that forbid some “likely exploitable” instructions in the sequence between IORING_OP_CLONE and IORING_OP_EXEC.

With ongoing maintenance as they are discovered, and open questions about what to do about apps that rely on these operations.

Of course safeexec also includes all the same issues (and probably some more), but there's a big difference: because it's not a kernel API and it can be easily embedded into your application by linking it statically, there is no need to support all the warts of the first version indefinitely. It can be tuned and fixed [relatively] freely without commitment to support it forever (because each released version is self-contained and would work like it did on day one).

Chained operations in io_uring

farnz — Tue, 24 Dec 2024 11:57:06 +0000

> special properties of these chained operations would have to stay in kernel forever.

The neat thing about the proposed io_uring solution is that the special properties of these chained operations already exist for other reasons - in order to allow you to queue up an I/O operation with an appropriate response on error, chains and hard links already exist^[1], and to allow io_uring to operate asynchronously to process context, it already knows how to handle trying to return to a userspace that isn't running.

The only new things here are IORING_OP_CLONE that creates a new process (not able to run) and IORING_OP_EXEC that replaces the program text and turns it into a ready-to-run process. Everything else already exists in io_uring for I/O purposes.

^[1] The intent is that you can do something like write to a WAL, fsync the WAL if the WAL write succeeds, write to the final location if the WAL write and fsync succeed, and then regardless of success of the WAL and final location writes trigger a futex wake, all in a single submission to the kernel.
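
A synchronous model of that chain's control flow, written with plain write()/fsync() rather than io_uring itself, so it only illustrates the ordering rules. The function name `wal_then_commit` and the boolean `woken` (standing in for the futex wake) are assumptions for illustration. In real io_uring code the dependent steps would be chained with `IOSQE_IO_LINK`, and the step that must run regardless would use `IOSQE_IO_HARDLINK`:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical model: returns true only if the WAL write, the WAL fsync
 * and the final write all succeeded. *woken is set in every case. */
bool wal_then_commit(const char *wal_path, const char *data_path,
                     const void *buf, size_t len, bool *woken)
{
    bool ok = false;
    int wal_fd = open(wal_path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    int data_fd = open(data_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (wal_fd >= 0 && data_fd >= 0) {
        /* Each step runs only if the previous one succeeded, which is
         * what an ordinary (soft) link between requests expresses. */
        ok = write(wal_fd, buf, len) == (ssize_t)len
          && fsync(wal_fd) == 0
          && write(data_fd, buf, len) == (ssize_t)len;
    }
    if (wal_fd >= 0)
        close(wal_fd);
    if (data_fd >= 0)
        close(data_fd);

    /* Runs regardless of the outcome above: the role played by the
     * hard-linked futex wake in the footnote's description. */
    *woken = true;
    return ok;
}
```

The caller learns from the return value whether the commit happened, and the waiter is always released; the io_uring version gets the same guarantees from a single submission.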

Why not just have a one-step spawn?

gutschke — Mon, 23 Dec 2024 17:00:31 +0000

No writable/executable mapping is used in my proof of concept. Once the ephemeral ELF image has been exec()'d, there is only a single readable/executable mapping.

I use a single mapping for both code and read-only data. That approach slightly simplified the already painfully complicated open-coded serialization of the various data structures that need to be passed into the child. But that could be split into two separate mappings for a production release.

Or instead of passing data as part of the ELF image, all data could be passed into the ephemeral child over a pipe(). Those design details are certainly up for review.

Why not just have a one-step spawn?

khim — Mon, 23 Dec 2024 16:36:22 +0000

That can be solved if you create two mappings: a read/write one and a read/execute one. Or even just create a read/write mapping, fill it, and then change it to read/execute before vfork/spawn.

These tricks are already used by JITs and most distributions, even very “enterprise” ones, have knobs to allow JITs, only iOS disabled that completely (and I don't think iOS is in scope for that project).

Changing SELinux settings is needed in any solution, even if you introduce new syscall it's highly unlikely that SELinux wouldn't stop that till you retune it.

Why not just have a one-step spawn?

bluca — Mon, 23 Dec 2024 16:29:48 +0000

> It's “easier” in a sense that you can use it in applications for RHEL8+ and Android8+

Actually I don't think you could use it in either, given it requires writable + executable memory, which is blocked by SELinux by default. Most sandboxing systems restrict that as well, as it's a very commonly used attack vector.

Why not just have a one-step spawn?

khim — Mon, 23 Dec 2024 14:39:37 +0000

TL;DR version: this approach is better because in case of adoption failure (which is quite likely) one may just throw it away and forget about it, whereas if a similar failure happened with the io_uring solution, the code and special properties of these chained operations would have to stay in kernel forever.

> I am still not 100% convinced that khim's solution is necessarily easier nor more robust.

It's “easier” in a sense that you can use it in applications for RHEL8+ and Android8+ (and most other distributions also have kernels with memfd_create, too).

That means that you could model your io_uring solution and test it on wide set of real-world tasks (since it can be used in production).

Even if final solution would be to add either a dedicated syscall or set of io_uring operations (plus set of special “chaining” rules needed to make them usable) you would collect lots of data which would tell you what works and what doesn't work.

If you start with addition to the kernel, on the other hand, then all these “large parent processes” deployed in various places wouldn't be able to use it for many years – and by the time when you would have real data collected from real apps… kernel API would be long-established and, most likely, not used (just like posix_spawn is barely used today).

P.S. Of course if you plan to eventually go with io_uring, anyway, then it would be good idea to have API of your safeexec designed in way that would make it easy to switch to io_uring, at some point. Apps wouldn't even need to know that they stopped using “double exec” trick and switched to io_uring on kernels that have io_uring support, it would all be transparent for them.

Why not just have a one-step spawn?

fweimer — Mon, 23 Dec 2024 13:44:34 +0000

You can use Solaris, which offers posix_spawn_file_actions_addclosefrom_np. It's also in glibc 2.34 or later. The other historically missing bits are posix_spawn_file_actions_addchdir_np and posix_spawn_file_actions_addfchdir_np (glibc 2.29 and later).

In general, this is an anti-pattern, though, because we have to keep adding stuff that's easily expressed elsewhere in code. One issue is that one gets just one error code for an entire list of actions, and that's bad from a debugging point of view. And one day, we'll need to wrap something where a first action produces a value needed by a second action, and we cannot easily force the value to a caller-supplied choice (like we do for file descriptors today).

One silver lining is that vfork may not be as bad as we thought it was for a while. (The TCB sharing is empirically quite harmless for a wide variety of programs). Running compiled C code instead of walking an action list may be the better approach in the end.

Why not just have a one-step spawn?

gutschke — Mon, 23 Dec 2024 09:26:25 +0000

I am still not 100% convinced that khim's solution is necessarily easier nor more robust. But since I was curious whether the proposal to use existing kernel API's in somewhat unconventional ways would be viable at all, I wrote proof-of-concept code and uploaded it to: https://github.com/gutschke/safeexec/blob/main/safeexec.c

Not surprisingly, since we are doing things that weren't quite intended to be done this way, there are warts and pitfalls. If my code were to be turned into a production-quality library, a good amount of additional polishing would be necessary. But as is, this is evidence that khim's suggestion can address several of the concerns raised in these comments.

(The best way to play with the code is to run it under the control of "strace". All it does is call "/bin/true" in a very round-about fashion.)

Why not just have a one-step spawn?

ballombe — Sun, 22 Dec 2024 11:29:47 +0000

> I have no idea and as long as I have no idea I couldn't even say if that's a good idea or not!

Well, I am glad to see that this does not impair your ability to write essay-sized posts about it.

BPF!

joib — Sun, 22 Dec 2024 07:30:16 +0000

There was a nice article a few years ago that describes the problems with fork+exec, and indeed ends up recommending something like what you describe as a potential solution. https://www.microsoft.com/en-us/research/uploads/prod/201...

Why not just have a one-step spawn?

khim — Sat, 21 Dec 2024 20:09:41 +0000

> But then why not just inline the stub into the "carefully written threadsafe library code" to avoid the double exec?

Precisely because then your sign-verifying machinery couldn't verify your code. You are executing things in the context that's polluted by gigabytes of long-living code that may affect your carefully prepared binary.

Even if it was sign-verified and correct when process was started it's not guaranteed to stay sign-verified and correct by the time you [try to] execute it.

> Instead you would need to have a prebuilt helper binary (signed if necessary) that does the work based on some parameters.

You could do what Turbo Pascal did decades ago: concatenate binary and parameters for said binary into one executable.

So there would be signed part and unsigned parameters, signature can be easily checked when binary is loaded, even if it's loaded from memfd.
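A rough sketch of what such a concatenated image might look like, written in Python for brevity. The trailer layout (`MAGIC`, a length field, JSON parameters) is purely hypothetical and not taken from safeexec or any existing tool; the point is only that the signed binary stays a byte-for-byte prefix that can be checked on its own, while the parameters ride along after it.

```python
import json
import struct

# Hypothetical trailer: 8-byte marker followed by the parameter blob length.
MAGIC = b"SPAWNCFG"
TRAILER = struct.Struct("<8sQ")


def append_params(binary: bytes, params: dict) -> bytes:
    """Return binary + parameter blob + trailer; the binary bytes are untouched."""
    blob = json.dumps(params).encode()
    return binary + blob + TRAILER.pack(MAGIC, len(blob))


def split_image(image: bytes):
    """Recover (binary, params) from an image built by append_params()."""
    if len(image) < TRAILER.size:
        raise ValueError("image too short to hold a trailer")
    end = len(image) - TRAILER.size
    magic, size = TRAILER.unpack_from(image, end)
    if magic != MAGIC or size > end:
        raise ValueError("no valid parameter trailer found")
    start = end - size
    return image[:start], json.loads(image[start:end])
```

A verifier would run its signature check over the first element returned by `split_image()` only, and the stub would read the second element as its instructions.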

> and then again, you can implement that approach without doing the double exec.

It's possible in theory but it's not done today. And it doesn't eliminate the issue of that code being corrupted and abused before the new binary is spawned.

And if we are not fighting that with io_uring proposal then I don't even have an idea what we are fighting for and against.

The big problem of article that we are discussing here is that it carefully describes the answer to some issues, but it entirely neglects to list the issues that we are trying to fix!

Not as bad as the infamous “42” as the answer to “the ultimate question of life, the universe, and everything”, but very close to it: sure, that's a mechanism with certain properties… but what is it trying to do? What's the problem that can be easily solved with it but is impossible to solve without it?

I have no idea and as long as I have no idea I couldn't even say if that's a good idea or not!

That's why I'm talking about “buzzword compliance”: simply because if “hey, it's io_uring, new and shiny” is not the goal then what is the goal? Where is that solution supposed to take us? And why couldn't we arrive there via simpler means?

Why not just have a one-step spawn?

roc — Sat, 21 Dec 2024 19:31:12 +0000

Writing arbitrary code into a memfd and then exec'ing it would get around security subsystems that try to prevent running unsigned/unvalidated binaries. So that's not a general-enough solution. Instead you would need to have a prebuilt helper binary (signed if necessary) that does the work based on some parameters. But then why not just inline the stub into the "carefully written threadsafe library code" to avoid the double exec? And that's basically posix_spawn() today.

You might say that the "users write arbitrary code into a memfd" part is essential for flexibility. Even if we ignore the security issues, it would be nasty to program directly. People would inevitably wrap it in some kind of tiny, portable virtual machine for users to express their setup code ... and then again, you can implement that approach without doing the double exec.

Why not just have a one-step spawn?

quotemstr — Sat, 21 Dec 2024 19:30:16 +0000

> We don't have a full set of system calls for remotely doing everything that a process can do by itself.

In a world with a more regular system interface, *every* system call would require callers specify all object on which to operate, explicitly. We wouldn't have operations that work on "the current process" or "the current thread". In this world, the process bootstrapping the GP proposes would fall naturally out of the general shape of the API surface.

Why not just have a one-step spawn?

ma4ris8 — Sat, 21 Dec 2024 17:26:49 +0000

Let's have a threaded program. It opens and closes file descriptors. Some of those have FD_CLOEXEC.
Task is to create a child program. Child program will have three file descriptors, parent's three fds
mapped as child's stdin, stdout and stderr. Close all unrelated file descriptors.
Perhaps Valgrind's file descriptors with fds near upper bound, 1024, are also allowed to pass thru.

One way is to fork, then open /proc/self/fd, close unrelated fds, remap related ones into 0,1 and 2.
After that, then exec the final child with a clean state. If parent has large memory foot print, this is heavy.
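A minimal Python sketch of this first approach (Python only for brevity; `os` wraps the same syscalls). The function name is made up for illustration, and it assumes the three source descriptors are all numbered 3 or higher so the dup2() calls cannot clobber each other. In a threaded parent, running this much code between fork and exec is itself hazardous, which is part of the problem being discussed.

```python
import os


def fork_exec_clean(argv, stdin_fd, stdout_fd, stderr_fd):
    """Fork, remap three fds onto 0/1/2, close everything else, then exec."""
    pid = os.fork()
    if pid != 0:
        return pid  # parent: the caller is expected to waitpid() on this
    try:
        # Remap the three chosen descriptors (assumed to be >= 3).
        os.dup2(stdin_fd, 0)
        os.dup2(stdout_fd, 1)
        os.dup2(stderr_fd, 2)
        # Close every other descriptor listed in /proc/self/fd.
        for name in os.listdir("/proc/self/fd"):
            fd = int(name)
            if fd > 2:
                try:
                    os.close(fd)
                except OSError:
                    pass  # e.g. the fd listdir() itself used, already closed
        os.execv(argv[0], argv)
    finally:
        os._exit(127)  # only reached if something above failed
```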

The other way is to do posix_spawn(). Spawn intermediate process, which closes unrelated fds, remaps related
ones into 0,1 and 2. After clean up, execute the final child process. Drawback is to have the middle process
to do the clean up, but if parent has large memory foot print, this is light, compared to fork.
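For the simple three-descriptor case, the remapping can also be expressed as posix_spawn file actions instead of a helper process. The sketch below shows that variant, not the intermediate-process design itself. It uses Python's `os.posix_spawn` for brevity; the function name is hypothetical, and it again assumes the source descriptors are 3 or higher. It does not close unrelated descriptors explicitly: it relies on them carrying the close-on-exec flag, which Python sets on descriptors it creates but which is not guaranteed for descriptors inherited from elsewhere.

```python
import os


def spawn_with_stdio(argv, stdin_fd, stdout_fd, stderr_fd):
    """Spawn argv[0] with the given fds installed as stdin/stdout/stderr."""
    file_actions = [
        # Each dup2 action runs in the child before exec; the duplicated
        # descriptor does not carry the close-on-exec flag.
        (os.POSIX_SPAWN_DUP2, stdin_fd, 0),
        (os.POSIX_SPAWN_DUP2, stdout_fd, 1),
        (os.POSIX_SPAWN_DUP2, stderr_fd, 2),
    ]
    # argv[0] must be a full path: posix_spawn() does not search PATH.
    return os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
```

The return value is the child's pid, which the caller reaps with `os.waitpid()`.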

Third way: how to do it so, that the cleanups could be done in an elegant and memory safe way,
without the separate middle process?

Why not just have a one-step spawn?

corbet — Sat, 21 Dec 2024 16:48:03 +0000

"Buzzword compliance" takes the work of people who are trying to improve the system and casts it as something useless. If it were my work, I would find that insulting. I do not believe that the people working on this are concerned about buzzwords, they are trying to solve real problems. Please try being a bit more respectful toward them.

Why not just have a one-step spawn?

khim — Sat, 21 Dec 2024 16:44:02 +0000

> But please stop insulting the work of others, that does not help anybody.

Where do you see insults? I've faced the need to mangle simple and easy to understand and implement ideas into pretzels to include all the right buzzwords at my $DAYJOBs often enough that I can easily see buzzword compliance as explicit, or more likely, implicit part of the requirements.

And very often it's even the most important one: if you couldn't cause enough buzz around your idea then it would die (except if there are some concrete tasks for concrete customers that may need it) even if it's pretty good, but with enough buzz around your idea you may push it even if it's totally stupid and would hurt everyone in the long run.

> Khim, if you have a better idea, please submit a patch showing it.

There is no patch because the in-kernel parts are already done… years ago, in fact.

And to discuss userspace part we need some idea about who, why and how plans to use that mechanism.

The list of interested parties is not in the article thus it's hard for me to offer anything concrete because it's not clear to me how much flexibility is needed or wanted.

Implementation of posix_spawn is doable but would be significant amount of work without any clear benefits: do we have lots of users of that syscall? If yes, then where are they, if not then why are they so rare?

IOW: I don't see enough of a picture related to that work to judge it fairly and if “buzzword-compliance” is part of reasoning (even if an implicit one) then it could be that io_uring-based solution is the best way forward. Especially if it's a solution-in-a-search-of-a-problem: it's much easier to make someone excited about io_uring solution than about solution that just combines well-known syscalls in a way that makes posix_spawn safer.

Why not just have a one-step spawn?

corbet — Sat, 21 Dec 2024 16:10:33 +0000

Khim, if you have a better idea, please submit a patch showing it. But please stop insulting the work of others, that does not help anybody.

Why not just have a one-step spawn?

khim — Sat, 21 Dec 2024 15:39:47 +0000

A much simpler approach would be to just add some code that would do that setup in the empty process. And we already have memfd_create/execveat combo that can do that.

If you want – add a flag to clone that would call execveat. And then new code in an entirely empty image can do whatever it needs to prepare for the execution of the real binary that you want to execute.

Why shove io_uring into something that already can be done entirely from userspace? Buzzword compliance?