
BPF!

Posted Dec 21, 2024 1:08 UTC (Sat) by comex (subscriber, #71521)
In reply to: BPF! by Cyberax
Parent article: Process creation in io_uring

clone() is not POSIX. POSIX includes fork(), vfork(), and posix_spawn().



BPF!

Posted Dec 21, 2024 1:47 UTC (Sat) by gutschke (subscriber, #27910)

posix_spawn() is well-intentioned, but it doesn't really address the main problem with all of these APIs. As far as I can tell, POSIX doesn't guarantee that posix_spawn() is thread-safe. And when I looked at the source code (admittedly, this was years ago), the implementation in glibc most definitely didn't make any effort to ensure thread-safety. Also, posix_spawn() is just too limited to be a general solution. It's a fine response to the problem of Windows not having a fork()/exec() API. But it isn't really a solution for safely starting processes from any context.

fork() is a decent general solution for single-threaded applications, and that's why we've been using it for so many decades. The kernel-level API is amenable to writing thread-safe code using fork()/exec(). But that requires that after fork() returns in the child, no further entries into any libraries are allowed. In fact, I am not even convinced that it is always safe to call the glibc version of fork() instead of making a direct system call.
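As an illustration of the constraint described above, here is a minimal sketch of a fork()/exec() pair in which the child calls nothing but async-signal-safe functions. The helper name spawn_and_wait() is made up for this example. Note that it still goes through the glibc wrappers for fork() and execve(), which is exactly what the comment is skeptical about, so treat it as the conventional baseline rather than a fully hardened version.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run `path` with `argv` and return its exit status,
 * or -1 if the child could not be created or did not exit normally. */
int spawn_and_wait(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: execve() and _exit() are async-signal-safe.
         * No malloc(), no stdio, no exit() (which would run atexit handlers). */
        execve(path, argv, environ);
        _exit(127);  /* reached only if execve() failed */
    }

    /* Parent: reap the child, retrying if a signal interrupts the wait. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling spawn_and_wait("/bin/sh", args) with args set to { "sh", "-c", "exit 3", NULL } should return the shell's exit status.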

Both the various wrappers that glibc puts around system calls and the hidden invocations of the dynamic linker are potential sources of deadlocks or crashes. Depending on how your program has been linked, this can even mean that you can no longer access any global symbols. Everything has to be on the local stack.

The upshot of all of this is that you not only need to carefully screen the system calls that you want to make for potential process-wide side-effects, you also have to call them from inlined assembly instead of deferring to glibc. In addition, fork() only really works with memory over-committing enabled, and for large programs this system call can be expensive.

vfork() solves the overcommit problem, but it requires even more careful programming. I don't see how it can be made to work in a fully portable fashion, but it probably is the best solution for code that should run on more than just Linux. Some amount of porting to different OSes will be involved if you need to spawn a new process from within a multi-threaded environment.
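For reference, a minimal vfork() sketch might look like the following. The name vfork_spawn() is invented for the example. The child borrows the parent's address space until it execs or exits, so it does nothing except call execve() and _exit(); it must not return from the function or modify the parent's variables.

```c
#define _DEFAULT_SOURCE  /* glibc: make sure vfork() is declared */
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: spawn `path` via vfork() and return its exit status,
 * or -1 on failure to create or reap the child. */
int vfork_spawn(const char *path, char *const argv[], char *const envp[])
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: runs on the parent's memory while the parent is suspended.
         * Only execve() and _exit() are called here. */
        execve(path, argv, envp);
        _exit(127);  /* exec failed; never use exit() or return here */
    }

    /* Parent resumes once the child has exec'd or exited. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because no copy of the address space is set up, this avoids the overcommit and page-table-copy cost of fork(), at the price of the stricter rules for the child.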

clone() is the pragmatic solution. Once you come to the realization that this code is impossible to implement within the constraints of POSIX alone, you might as well take advantage of everything that Linux can provide to you. It's going to be hairy code to write, but there really is no way around it. Also, just to point out the obvious, the glibc wrapper around clone() is completely unsuitable for the purposes of what we need here. But a direct system call will work fine.
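To make the "direct system call" point concrete, here is a rough sketch that bypasses the glibc fork() and clone() wrappers by invoking SYS_clone through syscall(2). Several things here are assumptions for illustration: the helper name raw_clone_spawn(), the choice of plain fork-like flags (SIGCHLD only, no CLONE_VM), and an architecture where flags is the first raw argument (true on x86-64 and aarch64; the raw argument order differs on some others). It also still uses the thin glibc syscall() and execve() wrappers instead of inline assembly, so it only goes part of the way toward what the comment describes.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: create the child with a raw clone system call,
 * skipping glibc's fork() wrapper and any pthread_atfork() handlers. */
int raw_clone_spawn(const char *path, char *const argv[], char *const envp[])
{
    /* flags = SIGCHLD only: the child gets its own copy of the address
     * space, as with fork(). All remaining arguments are zero, so their
     * architecture-specific order does not matter here. */
    long pid = syscall(SYS_clone, (unsigned long)SIGCHLD, 0L, 0L, 0L, 0L);
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: exec immediately, touching no library state. */
        execve(path, argv, envp);
        _exit(127);
    }

    int status;
    while (waitpid((pid_t)pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A version using CLONE_VM | CLONE_VFORK would need its own stack handling (or assembly) for the child, which is where the "hairy code" mentioned above comes in.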

Of course, in 99% of the cases, you won't hit any of the race conditions. They are a little tricky to trigger accidentally, and a lot of them are relatively benign. Who cares about an occasional errno value that isn't set correctly, or a file descriptor that sometimes leaks to a child process? Only in very rare cases will you trigger a deadlock, crash, or worse. So, many programs simply don't bother, and nobody ever notices that the code is buggy. It's the really big programs that everyone uses that need to worry about these things, as you suddenly have millions of running instances and countless numbers of spawned processes. If there is a way for something to go wrong, it eventually will.

A zygote process is a time-tested alternative. And that's great, assuming you can modify the startup phase of the program. If you can guarantee that your code executes before any threads are created, then a zygote that is fork()ed proactively will avoid all of these complications. But with bigger pieces of software that rely on lots of third-party libraries, that's not always feasible. These days, you should assume that all code is always multi-threaded -- if only because the graphics libraries decide to start threads as soon as they get linked into the program, or something similarly frustrating.

BPF!

Posted Dec 21, 2024 15:18 UTC (Sat) by khim (subscriber, #9252)

> A zygote process is a time-tested alternative.

Zygote solves an entirely different problem: how to start not one process, but many processes while executing the initialization part only once.

It works, but that's an entirely different task.

> vfork() solves the overcommit problem, but it requires even more careful programming. I don't see how it can be made to work in a fully portable fashion, but it probably is the best solution for code that should run on more than just Linux.

It's also the simplest way to do everything reliably and efficiently on Linux.

For some unfathomable reason everyone's attention is on an unsolvable problem: how to prepare a new process state using remnants of the old code that is intertwined with the state of your program.

Just ditch all that! Start from the clean state! Create a new setup code, push whatever you need/want in there, then execute vfork/exec (with zero steps between them, using fexecve) and voilà: no races, no possibility of corrupting anything, everything is very clear, simple and guaranteed.
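One possible reading of this idea, sketched under assumptions: the setup helper binary is opened once ahead of time, and each spawn is then a vfork() followed directly by fexecve() on that descriptor. The names helper_open() and helper_run() are invented for the example, and how the helper receives its instructions (here, just argv) is a guess rather than something the comment specifies.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: open the setup helper once, e.g. early at program start.
 * O_CLOEXEC keeps the descriptor from leaking into the exec'd image. */
int helper_open(const char *path)
{
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Hypothetical: vfork(), then exec the pre-opened helper with nothing
 * in between. Returns the helper's exit status, or -1 on failure. */
int helper_run(int fd, char *const argv[], char *const envp[])
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        fexecve(fd, argv, envp);  /* replaces the child with the helper */
        _exit(127);               /* reached only if fexecve() failed */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In this arrangement all descriptor juggling, credential changes and similar setup would happen inside the helper, which starts single-threaded with a clean state, before it execs the real target.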

The only downside: you have to develop that in an arch-dependent way… but so what? If you compare that to the insane amount of effort one would need to support all these bazillion zygote-based solutions, then adding some kind of portable wrapper with arch-dependent guts, even for the 3-4 most popular architectures, is not too hard.

Best property of that solution: it's not supposed to be perfect! If you find out that it doesn't work, nobody stops you from redoing that portable API and adding something to it or removing something from it. Because you ship it with your code or in a shared library, it's replaceable without any in-kernel politics.

P.S. I think it can be called “double-exec” solution, and it requires Linux-specific syscalls, but the best part: all these syscalls are already there and are not even especially new.

BPF!

Posted Dec 27, 2024 1:52 UTC (Fri) by alkbyby (subscriber, #61687)

I predict this will be an interesting discussion.

Can you please elaborate more specifically on the thread-unsafety of posix_spawn implementations? POSIX might not explicitly say that posix_spawn is safe to use in MT programs, but its main purpose is clearly to fix fork's problems with threads. So it has to be MT-safe.

Fork+exec and threads are too unsafe in practice. Even our esteemed editor made an error above. Here: "details of what happens between those two calls can vary quite a bit. <skipped>environment adjusted, and so on".

Thing is, updating process environment (e.g. via setenv) typically invokes malloc. And calling malloc in-between fork and exec is unsafe.

As per POSIX (quoting from man 3posix fork): "If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called."
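One way to stay within that rule when the child needs a modified environment is to build the new environment array in the parent, before fork(), and hand it to execve(). The sketch below assumes a made-up helper, spawn_with_extra_env(), that appends a single "NAME=value" string; it does not remove an existing entry with the same name, and it reads environ without any locking, so it assumes no other thread is changing the environment at that moment.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run `path` with the current environment plus one
 * extra "NAME=value" entry. Returns the exit status, or -1 on failure. */
int spawn_with_extra_env(const char *path, char *const argv[],
                         const char *extra)
{
    /* Parent side: allocation is fine here, before fork(). */
    size_t n = 0;
    if (environ != NULL)
        while (environ[n] != NULL)
            n++;

    char **envp = malloc((n + 2) * sizeof *envp);
    if (envp == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        envp[i] = environ[i];
    envp[n] = (char *)extra;
    envp[n + 1] = NULL;

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: no setenv(), no malloc(); just exec with the prepared array. */
        execve(path, argv, envp);
        _exit(127);
    }

    free(envp);  /* the child has its own copy of the array */
    if (pid < 0)
        return -1;

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```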

In practice malloc implementations go to some lengths to make malloc() possible after fork by carefully setting up pthread_atfork or alternatives. But this is a big enough can of worms. And for example "abseil" tcmalloc explicitly doesn't (https://github.com/search?q=repo%3Agoogle%2Ftcmalloc%20at...). As per Google's internal policy pthread_atfork is forbidden (which is another, but somewhat related topic).

So posix_spawn is the right thing IMO. And any exotic process setup things that might be missing in your favorite libc (e.g. stuff like unshare/prctl) you can always do in a small helper program. You exec into it. It gets a clean slate and can do whatever syscalls and mallocs and whatnot, single-threaded. And then it execs into the real thing.
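A small sketch of that pattern, using env(1) as a stand-in for the helper: env starts with a clean, single-threaded state, adjusts the environment, and then execs the real program. The function name spawn_via_helper() is made up for the example, and a real helper would do whatever setup (unshare, prctl and so on) the application needs.

```c
#include <errno.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: posix_spawnp() the program named by helper_argv[0]
 * (looked up in PATH) and return its exit status, or -1 on failure. */
int spawn_via_helper(char *const helper_argv[])
{
    pid_t pid;

    /* posix_spawnp() returns an error number directly instead of setting
     * errno. No file actions or spawn attributes are used in this sketch. */
    int rc = posix_spawnp(&pid, helper_argv[0], NULL, NULL,
                          helper_argv, environ);
    if (rc != 0)
        return -1;

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For instance, passing { "env", "DEMO=1", "sh", "-c", "test \"$DEMO\" = 1", NULL } spawns env, which sets DEMO and then execs sh; the calling process never touches its own environment.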

As for the original discussion, I am really hoping io_uring is kept only for perf-critical stuff. Spawning children isn't.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds