Out-of-memory victim selection with BPF

Posted Aug 19, 2023 16:02 UTC (Sat) by mb (subscriber, #50428)
In reply to: Out-of-memory victim selection with BPF by vadim
Parent article: Out-of-memory victim selection with BPF

Well, do programs actually still use fork()?
There is vfork(), and clone() with all its flags and stuff to avoid a memory COW duplication.
COW is still kind of expensive, because it requires page table duplication. So it's worth avoiding, even without overcommit in mind.

Out-of-memory victim selection with BPF

Posted Aug 19, 2023 17:05 UTC (Sat) by vadim (subscriber, #35271)

vfork() blocks the parent. I don't think much uses that.

clone() is Linux-specific and thus non-portable, and much more complicated to use than fork(). I'm sure there's stuff that uses it, but I think most things just won't bother without a good reason.

Out-of-memory victim selection with BPF

Posted Aug 19, 2023 17:42 UTC (Sat) by mb (subscriber, #50428)

> vfork() blocks the parent. I don't think much uses that.

Well, it blocks until the child calls execve(). Which is the only thing the child is supposed to do. That takes a microsecond or so (plus two context switches).

Out-of-memory victim selection with BPF

Posted Aug 19, 2023 20:32 UTC (Sat) by kleptog (subscriber, #1183)

> Well, it blocks until the child calls execve(). Which is the only thing the child is supposed to do.

Well, processes almost always do other things like close FDs, set up pipes, change permissions, configure signals, configure network/pid/ipc namespaces, etc. The fact that you need to actually do things between the fork() and execve() is why stuff like posix_spawn() never goes anywhere. There's an awful lot of state that gets inherited and you need to be able to manipulate all of it before starting the new process.
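A minimal Python sketch of the kind of work that happens between fork() and execve(): the child redirects its stdout into a pipe and changes directory before replacing its image. The function name `run_in_dir_capture` is hypothetical, and the example assumes a Unix system with Python 3.9 or later.

```python
import os

def run_in_dir_capture(argv, cwd):
    """Run argv in directory cwd; return (exit_code, captured stdout bytes)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: adjust inherited kernel state, then replace the process image.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)   # stdout now points at the pipe
            os.close(write_fd)
            os.chdir(cwd)          # per-process working directory
            os.execvp(argv[0], argv)
        finally:
            # Only reached if something above failed; never return into the
            # parent's code path from the child.
            os._exit(127)
    # Parent: read until the child closes its end of the pipe, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), output
```

Each step in the child is an ordinary syscall on state the new program inherits, which is the flexibility a fixed spawn interface has to reproduce option by option.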

Ideally you'd like a way to create a new (empty) process and be able to manipulate its execution state using the standard syscalls without actually forking and then at the last moment kick it off with the new ELF image directly. Probably some smart cookie has designed such an interface, but I don't see it taking off any time soon.

Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?

Out-of-memory victim selection with BPF

Posted Aug 19, 2023 21:03 UTC (Sat) by izbyshev (guest, #107996)

> Well, processes almost always do other things like close FDs, set up pipes, change permissions, configure signals, configure network/pid/ipc namespaces, etc.

Technically, all of these things are kernel state, not libc state, so they can be configured after vfork() via direct syscalls without ever touching libc. But this is rarely a good option for a typical application because there are some footguns with direct syscall usage, as well as with vfork() itself.

> Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?

This has been discussed: https://lwn.net/Articles/908268. No news after that, I'm afraid.

Out-of-memory victim selection with BPF

Posted Aug 20, 2023 3:01 UTC (Sun) by Cyberax (✭ supporter ✭, #52523)

> Well, processes almost always do other things like close FDs, set up pipes, change permissions, configure signals, configure network/pid/ipc namespaces, etc.

For 99% of cases none of that is needed.

> Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?

Ideally? Create a process in a suspended state, returning its file descriptor, then poke at it with process management functions that accept FDs, and finally let it continue.

Out-of-memory victim selection with BPF

Posted Aug 21, 2023 16:33 UTC (Mon) by ibukanov (subscriber, #3942)

One does not even need to extend the kernel API with syscalls that take a process handle. Just have a single API to set, on the current thread, the process handle that the following APIs will use.

Out-of-memory victim selection with BPF

Posted Aug 21, 2023 16:48 UTC (Mon) by mathstuf (subscriber, #69389)

Oh man, that sounds like quite a footgun. Error paths forgetting to restore it, what about other threads, etc. But I guess that's nothing new to those coding close to the kernel. Are there any other kinds of syscalls that can manipulate a scoped resource like that (I'm excepting intrinsic properties like pid, tid, etc. here…I suppose `cwd` is such a resource, but that feels more user-y than kernel-y to me)?

Out-of-memory victim selection with BPF

Posted Aug 19, 2023 20:50 UTC (Sat) by izbyshev (guest, #107996)

> Well, do programs actually still use fork()?

Last time I checked, the situation across languages was the following:

* musl and modern glibc implement posix_spawn() via vfork().
* OpenJDK has been using vfork() by default since forever.
* Modern .NET (née .NET Core) switched to vfork() at some point.
* CPython uses vfork() in subprocess on Linux by default since 3.10 (and can use posix_spawn() in some non-default cases since 3.8).
* Go uses vfork()-equivalent clone(CLONE_VM|CLONE_VFORK).
* IIRC Rust uses posix_spawn() with a fork() fallback for cases where they need something not supported by the former.

So it's feasible to avoid fork() nowadays in most cases.
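As an illustration of the posix_spawn() route, here is a hedged Python sketch that uses `os.posix_spawnp()` with file actions to redirect the child's stdout, with no explicit fork() in the calling code. The helper name `spawn_capture` is hypothetical; the example assumes Python 3.9 or later on a Unix system where `echo` is on the PATH.

```python
import os

def spawn_capture(argv):
    """Spawn argv via posix_spawnp(); return (exit_code, captured stdout bytes)."""
    read_fd, write_fd = os.pipe()
    # File actions are applied in the child before the new program starts.
    file_actions = [
        (os.POSIX_SPAWN_DUP2, write_fd, 1),   # child's stdout -> pipe
        (os.POSIX_SPAWN_CLOSE, read_fd),      # child does not need the read end
    ]
    pid = os.posix_spawnp(argv[0], argv, os.environ, file_actions=file_actions)
    # Parent: drop our copy of the write end so read() sees EOF when the child exits.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), output
```

Whether the underlying libc call avoids fork() depends on the libc in use; per the list above, musl and modern glibc implement it via vfork().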

Out-of-memory victim selection with BPF

Posted Aug 21, 2023 16:59 UTC (Mon) by ibukanov (subscriber, #3942)

Thanks for the info!

But then one should be able to disable overcommit on Linux and things will work, since most apps use vfork() and that does not require the kernel to reserve the memory, as the child shares the memory with the parent until exec. Or have I missed something?
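For reference, a small Python sketch for checking which overcommit policy a Linux machine is running with. It reads the `vm.overcommit_memory` sysctl through /proc; the function names are hypothetical, and "disabling overcommit" is taken here to mean mode 2.

```python
from pathlib import Path

# Values of the vm.overcommit_memory sysctl.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}

def describe_overcommit(text):
    """Translate the contents of /proc/sys/vm/overcommit_memory into words."""
    return OVERCOMMIT_MODES.get(int(text.strip()), "unknown mode")

def current_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Report the overcommit policy of the running kernel (Linux only)."""
    return describe_overcommit(Path(path).read_text())
```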

Out-of-memory victim selection with BPF

Posted Aug 21, 2023 22:37 UTC (Mon) by izbyshev (guest, #107996)

> But then one should be able to disable overcommit on Linux and things will work, since most apps use vfork()

While vfork() is used behind the scenes in many languages, I'd be surprised if most C/C++ programs migrated from fork(). And yes, posix_spawn() is suitable in many cases, but it still lacks portable options for some trivial stuff like changing the current directory (glibc and musl have the posix_spawn_file_actions_addfchdir_np() extension for that). One could use a simple wrapper executable to tweak the child attributes and then execve() to the real program, but all of this is annoying.
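The wrapper-executable workaround can be sketched in Python by using /bin/sh as the wrapper: the shell changes directory and then exec()s the real program, so the caller needs nothing beyond plain posix_spawn(). The helper name `spawn_in_dir` is hypothetical, and a POSIX shell at /bin/sh is assumed.

```python
import os

def spawn_in_dir(argv, cwd):
    """Spawn argv with cwd as its working directory; return its exit code."""
    # "$1" is the target directory; after the shift, "$@" is the real command.
    script = 'cd -- "$1" && shift && exec "$@"'
    wrapper_argv = ["/bin/sh", "-c", script, "sh", cwd, *argv]
    pid = os.posix_spawn("/bin/sh", wrapper_argv, os.environ)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

The cost is an extra execve() per spawn, and a failure to change directory is only reported through the wrapper's exit status.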

So I'd expect that whether disabling overcommit is fine or not still depends on your set of apps heavily.

And of course, even if fork() had been the main reason to default to overcommit, there is stuff like mmap'ing lots of memory (without MAP_NORESERVE) and then not touching most of it, like those sparse data structures that people mentioned in the comments. Hyrum's law means that after all these years somebody definitely relies on this.
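The sparse-mapping pattern looks roughly like the following Python sketch: a large private anonymous mapping is created (Python's mmap module does not pass MAP_NORESERVE here) and only a handful of pages are ever written. The sizes and the function name `sparse_touch` are arbitrary choices for illustration.

```python
import mmap

def sparse_touch(total_bytes=64 * 1024 * 1024, stride=16 * 1024 * 1024):
    """Map total_bytes of anonymous memory, write one byte per stride,
    and return how many locations were touched."""
    region = mmap.mmap(-1, total_bytes,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        touched = 0
        for offset in range(0, total_bytes, stride):
            region[offset] = 1   # faults in only the page holding this byte
            touched += 1
        return touched
    finally:
        region.close()
```

Under strict accounting the whole mapping counts against the commit limit at mmap() time, even though only a few pages end up backed by physical memory.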