Out-of-memory victim selection with BPF
Posted Aug 19, 2023 16:02 UTC (Sat) by mb (subscriber, #50428) In reply to: Out-of-memory victim selection with BPF by vadim
Parent article: Out-of-memory victim selection with BPF
There are vfork() and clone(), with all its flags, to avoid copy-on-write (COW) duplication of the parent's memory.
COW is still fairly expensive, because it requires duplicating the page tables. So it's worth avoiding, even without overcommit in mind.
Posted Aug 19, 2023 17:05 UTC (Sat)
by vadim (subscriber, #35271)
[Link] (6 responses)
clone() is Linux-specific and thus non-portable, and much more complicated to use than fork(). I'm sure there's stuff that uses it, but I think most things just won't bother without a good reason.
Posted Aug 19, 2023 17:42 UTC (Sat)
by mb (subscriber, #50428)
[Link] (5 responses)
Well, it blocks until the child calls execve(), which is the only thing the child is supposed to do. That takes a microsecond or so (plus two context switches).
Posted Aug 19, 2023 20:32 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (4 responses)
Well, processes almost always do other things, like close FDs, set up pipes, change permissions, configure signals, configure network/pid/ipc namespaces, etc. The fact that you actually need to do things between the fork() and execve() is why stuff like posix_spawn() never goes anywhere. There's an awful lot of state that gets inherited and you need to be able to manipulate all of it before starting the new process.
Ideally you'd like a way to create a new (empty) process and be able to manipulate its execution state using the standard syscalls without actually forking and then at the last moment kick it off with the new ELF image directly. Probably some smart cookie has designed such an interface, but I don't see it taking off any time soon.
Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?
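For comparison, here is a minimal sketch of how far the declarative approach gets you today, using Python's os.posix_spawn() with file actions and a signal reset instead of hand-written code between fork() and execve(). The helper name spawn_capture_stdout is made up for this example, it needs Python 3.9+ for os.waitstatus_to_exitcode(), and it only covers pipes, FD closing and signal dispositions; there is no equivalent here for namespaces and the like.

```python
import os
import signal


def spawn_capture_stdout(argv):
    """Run argv via posix_spawn() and return (exit_code, stdout_bytes).

    Hypothetical helper for illustration only.
    """
    read_fd, write_fd = os.pipe()
    file_actions = [
        # In the child: make the pipe's write end become stdout ...
        (os.POSIX_SPAWN_DUP2, write_fd, 1),
        # ... then close the original pipe descriptors there.
        (os.POSIX_SPAWN_CLOSE, read_fd),
        (os.POSIX_SPAWN_CLOSE, write_fd),
    ]
    pid = os.posix_spawn(
        argv[0],
        argv,
        os.environ,
        file_actions=file_actions,
        # Python ignores SIGPIPE; restore the default in the child.
        setsigdef=[signal.SIGPIPE],
    )
    # The parent must drop its write end, or read() never sees EOF.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), output
```

Something like spawn_capture_stdout(["/bin/sh", "-c", "echo hello"]) returns the child's exit code together with whatever it wrote to stdout. Anything that is not expressible as a file action or spawn attribute still needs the fork()-style escape hatch.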
Posted Aug 19, 2023 21:03 UTC (Sat)
by izbyshev (guest, #107996)
[Link]
Technically, all these things are the kernel state, not the libc state, so they can be configured after vfork() via direct syscalls without ever touching libc. But this is rarely a good option for a typical application because there are some footguns with direct syscall usage, as well as with vfork() itself.
> Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?
This has been discussed: https://lwn.net/Articles/908268. No news after that, I'm afraid.
Posted Aug 20, 2023 3:01 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
For 99% of cases none of that is needed.
> Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?
Ideally? Create a process in a suspended state, returning its file descriptor, then poke at it with process management functions that accept FDs, and finally let it continue.
Posted Aug 21, 2023 16:33 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (1 responses)
Posted Aug 21, 2023 16:48 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
Posted Aug 19, 2023 20:50 UTC (Sat)
by izbyshev (guest, #107996)
[Link] (2 responses)
Last time I checked, the situation across languages was the following:
* musl and modern glibc implement posix_spawn() via vfork().
So it's feasible to avoid fork() nowadays in most cases.
Posted Aug 21, 2023 16:59 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (1 responses)
But then one should be able to disable overcommit on Linux and things will work, since most apps use vfork() and that does not require the kernel to reserve memory, as the child shares memory with the parent until exec. Or have I missed something?
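For anyone who wants to check what their own machine does before experimenting, here is a small Python sketch that reads the current policy from /proc/sys/vm/overcommit_memory. It is Linux-only, and the helper name and the wording of the descriptions are mine, not anything official.

```python
from pathlib import Path

# Meaning of the vm.overcommit_memory values (descriptions paraphrased).
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the overcommit mode as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Not Linux, no /proc mounted, or unexpected file contents.
        return None


if __name__ == "__main__":
    mode = overcommit_mode()
    print(mode, OVERCOMMIT_MODES.get(mode, "unknown or unavailable"))
```

Mode 2 is the "overcommit disabled" setup being discussed here.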
Posted Aug 21, 2023 22:37 UTC (Mon)
by izbyshev (guest, #107996)
[Link]
While vfork() is used behind the scenes in many languages, I'd be surprised if most C/C++ programs had migrated away from fork(). And yes, posix_spawn() is suitable in many cases, but it still lacks portable options for some trivial stuff like changing the current directory (glibc and musl have the posix_spawn_file_actions_addfchdir_np() extension for that). One could use a simple wrapper executable to tweak the child attributes and then execve() to the real program, but all of this is annoying.
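A rough sketch of the wrapper trick, assuming /bin/sh is available and using it as the wrapper executable: the shell changes directory and then exec()s the real program, so no code has to run between fork() and execve() in the parent's image. The names CHDIR_WRAPPER and spawn_in_dir are hypothetical, chosen for this example.

```python
import os

# Hypothetical wrapper script: $1 is the target directory, the remaining
# arguments are the real command line, which replaces the shell via exec.
CHDIR_WRAPPER = 'cd -- "$1" && shift && exec "$@"'


def spawn_in_dir(directory, argv, file_actions=()):
    """Start argv with `directory` as its working directory; return the PID."""
    wrapper_argv = ["/bin/sh", "-c", CHDIR_WRAPPER, "sh", directory, *argv]
    return os.posix_spawn(
        "/bin/sh", wrapper_argv, os.environ, file_actions=file_actions
    )


if __name__ == "__main__":
    pid = spawn_in_dir("/", ["ls"])
    os.waitpid(pid, 0)
```

The cost is an extra execve() of the shell per child, which is part of why this is annoying rather than a real fix.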
So I'd expect that whether disabling overcommit is fine or not still depends on your set of apps heavily.
And of course, even if fork() had been the main reason to default to overcommit, there is stuff like mmap'ing lots of memory (without MAP_NORESERVE) and then not touching most of it, like those sparse data structures that people mentioned in the comments. Hyrum's law means that after all these years somebody definitely relies on this.
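To make the sparse-mapping pattern concrete, here is a minimal Python sketch: it maps a large private anonymous region without MAP_NORESERVE and touches only two pages. The 256 MiB size is an arbitrary number picked for illustration. Under the default heuristic mode this normally succeeds; with strict accounting, a process doing the same with a big enough region can be refused.

```python
import mmap

# Arbitrary size for illustration: 256 MiB of address space.
SIZE = 256 * 1024 * 1024


def make_sparse_buffer(size=SIZE):
    """Map `size` bytes of private anonymous memory without touching it."""
    return mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)


buf = make_sparse_buffer()

# Only the pages that are written should need physical memory on Linux;
# the rest of the mapping stays untouched and reads back as zeroes.
buf[0] = 1
buf[SIZE - 1] = 2
```

Under strict accounting the whole region counts against the commit limit even though only two pages are ever used.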
* OpenJDK has been using vfork() by default since forever.
* Modern .NET (nee .NET Core) switched to vfork() at some point.
* CPython uses vfork() in subprocess on Linux by default since 3.10 (and can use posix_spawn() in some non-default cases since 3.8).
* Go uses vfork()-equivalent clone(CLONE_VM|CLONE_VFORK).
* IIRC Rust uses posix_spawn() with a fork() fallback for cases where they need something not supported by the former.