Race-free process creation in the GNU C Library [LWN.net]

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 18:15 UTC (Fri) by dwest (guest, #110523) [Link] (31 responses)

Could you explain the objection to proc? I haven't heard any other complaints about it so I'm curious about whether this is some larger complaint that I've managed to miss entirely...

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 19:11 UTC (Fri) by mb (subscriber, #50428) [Link] (30 responses)

Well, the problem is that proc is not always available. e.g. in chroots or containers.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 19:21 UTC (Fri) by bluca (subscriber, #118303) [Link] (29 responses)

containers really should have it, and chroots - I can't imagine services tracking processes such as dbus or polkit or systemd would be running in a chroot? The way polkit/dbus do it now relies on parsing /proc anyway, so it wouldn't make much of a difference in that regard

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 19:27 UTC (Fri) by mb (subscriber, #50428) [Link] (17 responses)

>containers really should have it

One additional nail into the coffin of unprivileged containers?

>The way polkit/dbus

I'm talking about the fundamental pidfd API. Any process could use pidfds.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 19:35 UTC (Fri) by bluca (subscriber, #118303) [Link] (16 responses)

> One additional nail into the coffin of unprivileged containers?

I'm pretty sure those can have /proc too?

$ id -u
1000
$ unshare -U -m --mount-proc -p -f
$ mount | grep img
proc on /tmp/img type proc (rw,nosuid,nodev,noexec,relatime)

> I'm talking about the fundamental pidfd API. Any process could use pidfds.

Sure, to do process tracking - what kind of process would you need to track in a chroot? Besides, it's all moot, this is not glibc's fault, the kernel provides this interface, so that's what glibc can use to provide an abstraction

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 19:36 UTC (Fri) by bluca (subscriber, #118303) [Link]

(copy-pasta, that should have been --mount-proc=/tmp/img - give us an edit button already!)

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:57 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (11 responses)

> what kind of process would you need to track in a chroot

Any process that wants to spawn a process and use pidfd, but also write the pid in a log file or debug trace? Ignoring portability for a second, it could even be something like make or cargo.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 21:19 UTC (Fri) by bluca (subscriber, #118303) [Link] (10 responses)

That requires procfs to do today, no? So there shouldn't be a regression in that regard?

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 21:30 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (9 responses)

It doesn't require procfs if it uses the (inferior) pid-based API and SIGCHLD. So it's a regression if this hypothetical program wants to switch to pidfd. An ioctl does seem to be a good idea; it can return ESRCH in case of a race.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 23:23 UTC (Fri) by bluca (subscriber, #118303) [Link] (4 responses)

Ok - sounds like those use cases need to make a choice: continue to use pid-based APIs and no procfs, or switch to pidfds and mount procfs with hidepid= to sandbox it

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 23:46 UTC (Fri) by josh (subscriber, #17465) [Link] (3 responses)

Or bypass glibc and call the nice race-free function the kernel provides, and continue advocating that glibc provide clone3.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 0:43 UTC (Sat) by bluca (subscriber, #118303) [Link] (2 responses)

The kernel doesn't provide functions to resolve pidfds

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:08 UTC (Sat) by josh (subscriber, #17465) [Link] (1 responses)

Given access to clone3, you can directly obtain a pidfd and a pid simultaneously when you first create the process, rather than retrieving the pid later.

(That operation would still be useful when passed a pidfd from elsewhere, but not *necessary* for the common case where you got the pidfd by creating a process.)

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:37 UTC (Sat) by bluca (subscriber, #118303) [Link]

The case when you want to resolve a pidfd received via SO_PEERPIDFD/SCM_PIDFD is exactly where you need that, and what is enabled by all these new APIs that have recently been added, and where this resolving glibc function comes in. I know because I had to reimplement it across 4 projects...

Race-free process creation in the GNU C Library

Posted Sep 3, 2023 4:14 UTC (Sun) by IanKelling (subscriber, #89418) [Link] (3 responses)

> So it's a regression if this hypothetical program wants to switch to pidfd.

I don't think it is hypothetical. From my sysadmin perspective, I often build software in a chroot without a /proc mount. Very rarely, the build has needed it and I wanted to know why. Bind mounting /proc, I see find shows 546,160 user-listable files and 304,803 user-readable files. Making that a requirement to create processes just because you opt in to an API that avoids a race condition would be roughly a regression in my book.

Race-free process creation in the GNU C Library

Posted Sep 3, 2023 10:26 UTC (Sun) by bluca (subscriber, #118303) [Link] (2 responses)

Why would compiling some stuff require resolving pidfds?

Race-free process creation in the GNU C Library

Posted Sep 4, 2023 9:16 UTC (Mon) by taladar (subscriber, #68407) [Link] (1 responses)

Why wouldn't it? Compiling spawns lots of processes and that kind of thing usually involves printing the PID when logging what you are doing to be able to distinguish between different instances of the same program (e.g. the compiler when spawned by some sort of build tool).

Race-free process creation in the GNU C Library

Posted Sep 4, 2023 9:53 UTC (Mon) by bluca (subscriber, #118303) [Link]

Then the tools that spawn such processes, if they want to implement tracking by pidfd, will need to implement appropriate fallbacks (which are easy to add as the error codes are different). They'll need that anyway for compatibility with older kernels. So still not sure where the regression would be?
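A rough Python sketch of what such a fallback could look like, assuming the tool only needs to decide between pidfd-based and pid-based tracking. The helper names (try_pidfd_open, describe_tracking) are made up for illustration, and the exact set of error codes a real tool should treat as "fall back" is an assumption; ENOSYS for a kernel without the syscall is the one case that is certain.

```python
import errno
import os

def try_pidfd_open(pid):
    """Return (pidfd, None) on success, or (None, errno_value) on failure."""
    # os.pidfd_open exists in Python 3.9+ and needs Linux 5.3+.
    opener = getattr(os, "pidfd_open", None)
    if opener is None:
        return None, errno.ENOSYS
    try:
        return opener(pid), None
    except OSError as exc:
        return None, exc.errno

def describe_tracking(pid):
    """Decide how a (hypothetical) build tool would track this child."""
    pidfd, err = try_pidfd_open(pid)
    if pidfd is not None:
        os.close(pidfd)
        return "pidfd"
    if err == errno.ESRCH:
        # The process is already gone; nothing left to track.
        return "gone"
    # ENOSYS (old kernel), EPERM (sandbox) and anything else:
    # keep using the pid-based API.
    return "pid-fallback"

print(describe_tracking(os.getpid()))
```

The point is only that the failure mode is visible in errno, so the caller can choose the old code path without guessing.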

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 22:07 UTC (Fri) by geofft (subscriber, #59789) [Link] (2 responses)

There's a practical problem that a Kubernetes container that is not marked "privileged" (which is a Kubernetes concept, rather different from the ordinary meaning of "privileged" as in "runs as root") gets certain things in /proc overmounted, e.g., /proc/sysrq-trigger and /proc/kcore, as a form of sandboxing. The goal is to reduce the impact of a malicious uid 0 inside a container. (User namespacing would also work, but most Kubernetes deployments don't use it yet - it's an alpha feature on k8s' end and only supports one container runtime.) This is, in isolation, an understandable / defensible feature, and I can see systems other than Kubernetes doing it (e.g., I can totally see it being a systemd Restrict option down the line).

Meanwhile, the kernel has a feature where, if your current /proc is in any way overmounted, you're not allowed to mount a new /proc - because that would give you access to the files that are supposed to be hidden to you. This is also, in isolation, an understandable / defensible feature.

The intersection of these features is that you can't correctly mount /proc inside a nested container or container-like thing inside a non-privileged Kubernetes container. If you make a new pidns (either because you're root or via a new userns, as in your example), all the paths in /proc are wrong because they refer to outer PIDs.

(The intersection of these features also ceases to be really defensible in the case where you don't allow your Kubernetes workloads to run as uid 0, which is a really good idea on its own.)

There have been some patches for a second procfs (whose exact name I'm forgetting) that provides /proc/$pid/ and the /proc/self/ symlink but not anything else in /proc, but I don't think they've been merged. If those could get merged and guaranteed mountable by anyone with CAP_SYS_ADMIN in their current namespace, regardless of what the existing /proc outside it looks like or even whether it exists, that would satisfactorily address the issue.

I suppose another option would be for /proc to always enumerate the calling process's PID namespace, but maybe that gets weird with open file descriptors passed between PID namespaces.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 22:28 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

Isn't that what the hidepid= mount options (and systemd's ProtectProc=) do? To resolve pidfds you just need /proc/self/fd/ and /proc/self/fdinfo/, which are both available under those sandboxing options

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:56 UTC (Sat) by cyphar (subscriber, #110703) [Link]

subset=pids has no effect on the mount_too_revealing() check because all of the "are the flags the same" checks are based on the generic VFS flags not FS-specific ones. So if you only have an overmounted procfs you cannot mount subset=pids even if the overmounts are paths that don't exist with subset=pids.

In fact this also means you can bypass the check entirely -- if you have a "safe" subset=pids mount in your namespace, the kernel will allow you to mount an unmasked (fully-fledged) procfs.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:24 UTC (Fri) by wahern (subscriber, #37304) [Link] (8 responses)

procfs has historically been a reliable vector for exploits, both for the kernel and applications. Principle of least privilege says that if you don't need procfs, don't mount procfs. But if libc relies on procfs for basic features, you necessarily have to expose many other parts (if not necessarily all) of procfs even beyond those strictly needed for those features.

Moreover, procfs requires opening descriptors. But what if you've already hit your descriptor limit? Now rather than getting EMFILE, you get unexpected errors from syscall wrappers. And to avoid descriptor leaks, libc has to go through herculean efforts to make the syscall wrapper async- and thread-safe, and those efforts are definitely not always bug-free; or alternatively, now there's another threading/fork foot gun lying around.

None of these issues may be of concern to *you*, but they're of concern to other people, and have been for decades. Moreover, PID fds is an interface which people concerned about reliability, correctness, and security have been desiring for a long time; PID fd usability being tied to procfs substantially reduces the net value. Not all process management can be shoe-horned into systemd and other global services; far from it. Process management is often something one needs to perform *after* dropping various privileges. That not all privilege separating or privilege reducing tasks can be performed immediately before or after exec, or cannot be reduced to one-line configuration directives, is precisely why OpenBSD's pledge and unveil are infinitely more ergonomic than comparable Linux solutions.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:48 UTC (Fri) by bluca (subscriber, #118303) [Link] (6 responses)

> procfs has historically been a reliable vector for exploits, both for the kernel and applications. Principle of least privilege says that if you don't need procfs, don't mount procfs. But if libc relies on procfs for basic features, you necessarily have to expose many other parts (if not necessarily all) of procfs even beyond those strictly needed for those features.

Nah, procfs supports various sandboxing features nowadays, and especially when unprivileged it necessarily implies a pid namespace so you do not have visibility in the rest of the system, only on processes in your pid namespace, and if it's a chroot that's going to be just the shell. If you are privileged, you can use the ProtectProc= systemd option (or if you are running on the 0.000x% of Linux install, mount /proc with the various hidepid= options that provide equivalent functionality)

> Moreover, procfs requires opening descriptors. But what if you've already hit your descriptor limit?

The 1980s are calling and want their problems back ;-) In 2023 and on modern Linux, file descriptors are only limited by available memory. Open as many as you want.

> PID fd usability being tied to procfs substantially reduces the net value.

Considering they've been available as-is for 4 years and nobody bothered to do anything about that, and have been providing great net value in the meanwhile, I'll have to take that with a grain of salt.

> Process management is often something ones needs to perform *after* dropping various privileges.

Not sure what that has to do with using procfs?

> is precisely why OpenBSD's pledge and unveil are infinitely more ergonomic than comparable Linux solutions.

I mean, if you dislike modern Linux so much and prefer OpenBSD, then just use OpenBSD? That's an absolutely fine thing to do.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 21:34 UTC (Fri) by bbockelm (subscriber, #71069) [Link] (1 responses)

> The 1980s are calling and want their problems back ;-) In 2023 and on modern Linux, file descriptors are only limited by available memory. Open as many as you want.

Oh, the youthful banter of someone who hasn't spent a few hours this week debugging issues caused by file descriptor exhaustion!

(In this case, it was due to a hypervisor that booted a VM with trivial amounts of memory, the VM kernel adjusted system-wide file descriptor limits down accordingly, then the hypervisor would hotplug another 32GB of RAM later...)

For what it's worth, I agree this _should_ have been a problem relegated to history. I want to live in the future!

Race-free process creation in the GNU C Library

Posted Sep 6, 2023 8:39 UTC (Wed) by lathiat (subscriber, #18567) [Link]

I debugged the exact same issue, with proxmox having "Memory Ballooning" enabled. Despite having a "minimum memory" of 16GB, it would boot with 1GB and plug the rest in later. This gave a very low maximum number of processes on the system, producing "fork: retry: Resource temporarily unavailable" errors inside a Docker container.

I found the following very low default:

# systemctl show --property=DefaultTasksMax
DefaultTasksMax=981

Which you also see in cgroupfs:
# find /sys/fs/cgroup -name pids.max -exec grep -H . {} \;

The systemd docs state this is set based on threads-max: "Configure the default value for the per-unit TasksMax= setting. See systemd.resource-control(5) for details. This setting applies to all unit types that support resource control settings, with the exception of slice units. Defaults to 15% of the minimum of kernel.pid_max=, kernel.threads-max= and root cgroup pids.max. Kernel has a default value for kernel.pid_max= and an algorithm of counting in case of more than 32 cores. For example with the default kernel.pid_max=, DefaultTasksMax= defaults to 4915, but might be greater in other systems or smaller in OS containers."

We then find a very low /proc/sys/kernel/threads-max of 6541. According to the kernel docs "During initialization the kernel sets this value such that even if the maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages."
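The numbers above line up with the quoted formula. A small Python check, assuming the result is rounded down and assuming a pid_max of 4194304 on this VM (that value is a guess; anything above 6541 gives the same answer). The function name is made up for illustration.

```python
def default_tasks_max(pid_max, threads_max, root_pids_max=None):
    """15% of the smallest of the limits named in the systemd docs."""
    limits = [pid_max, threads_max]
    if root_pids_max is not None:
        limits.append(root_pids_max)
    # Integer arithmetic; rounding down is an assumption that
    # happens to match both figures quoted above.
    return min(limits) * 15 // 100

# threads-max of 6541 from this VM; pid_max is an assumed value.
print(default_tasks_max(4194304, 6541))   # 981

# Default pid_max of 32768 with a (hypothetical) larger threads-max.
print(default_tasks_max(32768, 126000))   # 4915
```

So the 981 is simply 15% of the threads-max that the kernel computed while the VM still had 1GB of RAM.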

Despite being a pretty experienced Linux performance engineer it took me a bit to find that one, as it only showed up in the cgroup limits and not in /proc/PID/limits.

Good times :)

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 21:56 UTC (Fri) by dezgeg (subscriber, #92243) [Link] (2 responses)

> The 1980s are calling and want their problems back ;-) In 2023 and on modern Linux, file descriptors are only limited by available memory. Open as many as you want.

Is it really common to have no ulimit for them? A limit of 1024 fds has been very typical in what I've seen (since the default FD_SETSIZE is that, most programs that use select() will break on high fds)

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 22:29 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

That's the default soft limit, yes, but for many years the default hard limit has been the highest the kernel can give out. So a process defaults to 1024 to avoid breaking the legacy select() interfaces, but can raise the soft limit at will in case it doesn't use those interfaces, which I'd hope is most things these days.
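For illustration, a minimal Python sketch of raising the soft limit up to the hard limit. The helper names and the 65536 target are arbitrary choices for the example, and this is something a program would do for itself once it knows it doesn't use select().

```python
import resource

def raised_soft_limit(soft, hard, wanted):
    """Pick a new soft limit: never lower than now, never above the hard limit."""
    if soft == resource.RLIM_INFINITY:
        return soft
    if hard == resource.RLIM_INFINITY:
        return max(soft, wanted)
    return max(soft, min(wanted, hard))

def raise_nofile_limit(wanted=65536):
    """Raise RLIMIT_NOFILE's soft limit towards 'wanted'; returns (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = raised_soft_limit(soft, hard, wanted)
    if new_soft != soft:
        # An unprivileged process may move the soft limit anywhere
        # up to the hard limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft, hard
```

Calling raise_nofile_limit() early in main() would be the typical use.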

Race-free process creation in the GNU C Library

Posted Sep 4, 2023 20:18 UTC (Mon) by comex (subscriber, #71521) [Link]

However, a library probably shouldn’t assume that the program it’s linked into isn’t using select(), or try to raise the soft limit itself.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 23:50 UTC (Fri) by josh (subscriber, #17465) [Link]

> Considering they've been available as-is for 4 years and nobody bothered to do anything about that,

People have been bothering to do something about that, and it has taken this long to get something on a potential path to acceptance.

It's the fault of libc that we cannot simply call clone3 directly. It's the responsibility of libc to *stop hiding the underlying useful functionality* just because it thinks it knows better.

Race-free process creation in the GNU C Library

Posted Nov 14, 2023 23:57 UTC (Tue) by Rudd-O (guest, #61155) [Link]

Excellent arguments. Thank you. Yuck to requiring more procfs in basic libc stuff.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 22:31 UTC (Sat) by DemiMarie (subscriber, #164188) [Link]

Sandstorm deliberately does not mount /proc for sandboxing reasons.

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 6:37 UTC (Tue) by fw (subscriber, #26023) [Link]

Somewhat recently, systemd-nspawn removed proc support from nested chroots: https://bugzilla.redhat.com/show_bug.cgi?id=2210335

So it's unfortunately not the case that proc is universally available or can be made so.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 18:28 UTC (Fri) by bluca (subscriber, #118303) [Link] (6 responses)

I'm not convinced using proc is a problem, but in any case, that's the only interface there is to resolve pidfds, there is no alternative

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:32 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Perhaps it should be added? It can be a very simple syscall. Or even an ioctl.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:36 UTC (Fri) by bluca (subscriber, #118303) [Link]

Thanks for volunteering! :-P

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:10 UTC (Sat) by josh (subscriber, #17465) [Link] (3 responses)

The alternative is to obtain the pidfd and pid simultaneously when calling clone3.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 1:38 UTC (Sat) by bluca (subscriber, #118303) [Link] (2 responses)

Doesn't work when you receive a pidfd from somewhere else

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 12:05 UTC (Tue) by hmh (subscriber, #3838) [Link] (1 responses)

Yes, a pidfd_getpid() *syscall* is clearly missing. It might not have been really needed before (modulo system libraries lacking appropriate (if non-portable) functionality to actually return you the pidfd *and* pid on clone and/or fork), but now that you can send pidfds around the system over sockets, a syscall is clearly very desirable, it seems to be the best way to solve the underlying problem.

While it looks at first glance that it would be "easy" to write one, that's for someone already used to working in that area of the kernel -- there are likely permission checks one needs to get perfectly right to not create a security mishap, namespace concerns, etc. Experience in the specific area of the kernel you're working with almost always helps a lot with the quality of the first public version of a patch, and with faster acceptance in mainline for non-controversial changes.

Race-free process creation in the GNU C Library

Posted Sep 5, 2023 12:29 UTC (Tue) by bluca (subscriber, #118303) [Link]

Sure, but that's really nothing to do with glibc and its developers/maintainers, it's something an experienced kernel developer would be in the best position to implement, as you noted. I mean the proposed glibc API could even transparently switch to such a syscall, if/when it becomes available in the future.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 20:48 UTC (Fri) by Karellen (subscriber, #67644) [Link] (5 responses)

I wonder if there would be any value in a system call that does the equivalent of open("/proc", O_PATH|O_DIRECTORY|O_CLOEXEC) and return an fd to the proc filesystem - even if /proc is mounted elsewhere or not at all? And similarly for /dev and /sys?

Then again, if admins wanted to limit access to those filesystems for a container, they'd need to implement some kind of seccomp-bpf/pledge style block, instead of just... not mounting those filesystems in the container.

Race-free process creation in the GNU C Library

Posted Sep 1, 2023 21:26 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Putting my SRE/sysadmin hat on for a moment: If an application wants to read from procfs, it should just try to open /proc. If that doesn't work, it's the sysadmin's problem, not the application's problem. It is far too late to "upgrade" all extant applications to call open_proc() (or whatever name you prefer) instead of open("/proc/...") directly, so introducing open_proc and then saying "use seccomp if you want to lock it down" is just giving sysadmins another knob we have to twiddle, for no discernable benefit. We'll still have to deal with open("/proc") compat. issues for old applications anyway (unless you propose that libc should somehow intercept those calls and redirect them to open_proc, which is IMHO insane).

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 2:03 UTC (Sat) by cyphar (subscriber, #110703) [Link] (3 responses)

This does kind of exist with fsopen(), but it requires privileges and because it is a mount you are at the whims of the mount_too_revealing() checks, which means it won't work in containers or in a namespace where there is no proc mount at all.

I have wondered whether it would be possible to allow fsopen("proc") to unprivileged processes but only for subset=pids -- this would solve many hacks needed in container runtimes to defend against certain attacks. Unfortunately, I suspect that even the new mount infrastructure is probably not going to be considered safe for unprivileged users to touch.

Race-free process creation in the GNU C Library

Posted Sep 7, 2023 9:06 UTC (Thu) by Jonno (subscriber, #49613) [Link] (2 responses)

> I have wondered whether it would be possible to allow fsopen("proc") to unprivileged processes but only for subset=pids

That would still let the unprivileged process learn of other processes on the system that it otherwise would be oblivious about.

But perhaps allowing something like `openat(pidfd, ".", O_DIRECTORY)` to get a fd equivalent to the /proc/<pid> directory except you can't ".." out of it would work.

Race-free process creation in the GNU C Library

Posted Sep 9, 2023 5:03 UTC (Sat) by cyphar (subscriber, #110703) [Link] (1 responses)

v1 pidfds kind of worked this way, my understanding is that there were a bunch of issues with creating handles to procfs mounts and thus only a few pidfd operations work with that style -- the new ones are all anonymous inodes (like most other fd interfaces).

It's a bit of a shame, because that could've been the nicest behaviour -- though the contents of quite a few procfs files depend on the pid namespace associated with the procfs in ways that will cause confusion when sending them between processes and I'm not sure there would be a nice solution for that.

Race-free process creation in the GNU C Library

Posted Sep 16, 2023 14:35 UTC (Sat) by Jonno (subscriber, #49613) [Link]

> v1 pidfds kind of worked this way

Not quite. The first version of fd references to a pid was by open("/proc/«pid»", O_DIRECTORY) [or open("/proc/self", O_DIRECTORY)], giving you a directory fd that was guaranteed to never refer to a newer process, even if the pid was reused (it would instead refer to an unlinked directory). The problem was that this (1) required a mounted procfs to work, and (2) could not be used for polling or waitid. The upshot was that, being a directory fd, you could use it to open files in the procfs directory of the process in question.
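For anyone who hasn't used that style, a Python sketch of the directory-fd pattern. The helper names are made up, and the demo runs against a temporary directory with a fake "status" file so it works without procfs; with procfs mounted you would pass "/proc/«pid»" to open_dir() instead.

```python
import os
import tempfile

def open_dir(path):
    """Open a directory and return its fd (the old-style 'pid reference')."""
    return os.open(path, os.O_RDONLY | os.O_DIRECTORY)

def read_relative(dir_fd, name):
    """openat()-style read of a file relative to an already-open directory fd."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    with os.fdopen(fd, "r") as f:
        return f.read()

# Stand-in for /proc/<pid>: a temp directory holding a fake 'status' file.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "status"), "w") as f:
        f.write("Name:\tdemo\n")
    dfd = open_dir(d)
    try:
        content = read_relative(dfd, "status")
    finally:
        os.close(dfd)

print(content)
```

Every later lookup goes through the directory fd rather than a path containing the pid number, which is what made the old scheme immune to pid reuse.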

To re-gain that ability without the old problems you need some race-free way of going from a pidfd to the corresponding dirfd without a mounted procfs. Simply getting a procfs reference for use in *at syscalls without actually mounting procfs (as proposed by Karellen) would make it possible for live processes, but not for exited processes still referred to by a pidfd, and it wouldn't be race-free. My proposal using openat, or some new flag to dup3 or fcntl, would solve it fully.

Race-free process creation in the GNU C Library

Posted Sep 2, 2023 4:39 UTC (Sat) by iabervon (subscriber, #722) [Link]

One non-obvious thing is that the use of proc in pidfd_getpid is not about the other process. It's actually that the process that has a pidfd for some other process opens something under /proc/self in order to get miscellaneous further information about one of its own file descriptors, in fdinfo/(n).
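Roughly, in Python, assuming the fdinfo entry for a pidfd carries a "Pid:" line (the sample text below is made up, and the function names are only for illustration):

```python
def parse_fdinfo_pid(fdinfo_text):
    """Return the integer from the 'Pid:' line of fdinfo text, or None if absent."""
    for line in fdinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Exact match, so the separate 'NSpid:' line is not picked up.
        if sep and key.strip() == "Pid":
            return int(value.strip())
    return None

def pidfd_to_pid(pidfd):
    """Resolve a pidfd via this process's own fdinfo entry (needs /proc mounted)."""
    with open(f"/proc/self/fdinfo/{pidfd}") as f:
        return parse_fdinfo_pid(f.read())

# Made-up sample of what such an fdinfo file might contain.
sample = "pos:\t0\nflags:\t02000002\nmnt_id:\t15\nino:\t1234\nPid:\t4242\nNSpid:\t4242\n"
print(parse_fdinfo_pid(sample))  # 4242
```

Note that the only path opened is under /proc/self; nothing under the target process's /proc/«pid» directory is touched.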

It really seems like it would be sensible for the kernel to make the information that's in /proc/self available to the process itself without access to procfs more generally or use of absolute paths. On the other hand, that's a separate issue from the pidfd stuff.