Completing the pidfd API

By Jonathan Corbet
July 26, 2019

Over the last few kernel releases, the kernel has gained the concept of a "pidfd" — a file descriptor that represents a process. What started as a way of sending signals to processes without race conditions has evolved into a more complete process-management interface. Now one of the last pieces is being put into place: the ability to wait for processes using pidfds. But, naturally, that API has had to go through some revisions first.

A pidfd recap

Unix-like systems traditionally represent many objects as files, but processes have always been an exception. They are, instead, represented by process IDs (PIDs), which are small integers — limited to 32767 by default, though that limit can be raised on Linux systems. There are a few problems with this representation, but the biggest one is arguably that PIDs are reused; when a process exits, its PID can be assigned to a new, unrelated process, and this can happen quickly. That creates a race condition where code that operates on a process, most often by sending it a signal, might end up performing an action on the wrong process.

A pidfd is, instead, a file descriptor that refers to an existing process. Once the pidfd exists, it will only refer to that one process, so it can be used to send signals without worry that the wrong process might end up being the recipient. This feature is valuable enough that some process-management systems, most notably the one used by Android, are being rewritten to take advantage of it.

There are two ways to create a pidfd. The preferred method in most cases will be to supply the CLONE_PIDFD flag to the clone() system call (or perhaps clone3() in the future); upon successful process creation, a pidfd representing the child will be returned to the parent. It is also possible to create a pidfd for an existing process with pidfd_open(), which was merged for the 5.3 kernel.
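
As a rough sketch of both creation paths, the following C fragment makes a child with clone() and CLONE_PIDFD, then opens a second pidfd for the same child with pidfd_open(). The helper names (demo_child(), create_pidfds_demo()) are made up for illustration, and the fallback definitions assume the values used on common architectures; older C libraries ship no pidfd_open() wrapper, so the raw system call is used here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values for common architectures). */
#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

#define DEMO_STACK_SIZE (64 * 1024)

/* Hypothetical child body: exit immediately with status 7. */
static int demo_child(void *arg)
{
    (void)arg;
    return 7;
}

/*
 * Returns the child's exit status (7) if both pidfds could be
 * created, or a negative errno value otherwise.
 */
static int create_pidfds_demo(void)
{
    char *stack = malloc(DEMO_STACK_SIZE);
    if (!stack)
        return -ENOMEM;

    /* Path 1: with CLONE_PIDFD, the kernel stores the new pidfd
     * in the location passed as the parent_tid argument. */
    int clone_fd = -1;
    pid_t child = clone(demo_child, stack + DEMO_STACK_SIZE,
                        CLONE_PIDFD | SIGCHLD, NULL, &clone_fd);
    if (child < 0) {
        int err = -errno;
        free(stack);
        return err;
    }

    /* Path 2: pidfd_open() on an existing process. The child has not
     * been reaped yet, so its PID cannot have been recycled. */
    int open_fd = (int)syscall(SYS_pidfd_open, child, 0);
    int err = open_fd < 0 ? -errno : 0;
    if (err == 0 && clone_fd < 0)
        err = -ENOSYS;  /* the kernel ignored CLONE_PIDFD */

    int status = 0;
    waitpid(child, &status, 0);
    if (open_fd >= 0)
        close(open_fd);
    if (clone_fd >= 0)
        close(clone_fd);
    free(stack);

    if (err)
        return err;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -ECHILD;
}
```

The clone() path closes the window between process creation and pidfd creation entirely; pidfd_open() is only race-free when, as here, the caller already knows that the PID still refers to the intended process.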

A process that holds a pidfd can send a signal to the process it refers to using pidfd_send_signal():

    int pidfd_send_signal(int pidfd, int signal, siginfo_t *info, unsigned int flags);

The 5.3 kernel also adds the ability to pass a pidfd to poll(), which will provide a notification when the process represented by that pidfd exits.
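
Put together, signaling and exit notification might look like the sketch below. It assumes a 5.3 or later kernel, uses raw system calls with assumed fallback numbers for older headers, and the function name signal_and_poll_demo() is purely illustrative.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values for common architectures). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif

/*
 * Fork a child that sleeps forever, terminate it through its pidfd,
 * and use poll() to learn that it has exited. Returns the signal that
 * killed the child, or a negative errno value on failure.
 */
static int signal_and_poll_demo(void)
{
    pid_t child = fork();
    if (child < 0)
        return -errno;
    if (child == 0) {
        for (;;)
            pause();
    }

    int ret = 0;
    int pidfd = (int)syscall(SYS_pidfd_open, child, 0);
    if (pidfd < 0)
        ret = -errno;

    /* The signal goes to whatever the pidfd refers to, never to a
     * later process that happens to reuse the same PID. */
    if (ret == 0 &&
        syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0) < 0)
        ret = -errno;

    /* The pidfd becomes readable once the process has exited. */
    if (ret == 0) {
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        int n = poll(&pfd, 1, 5000);
        if (n < 0)
            ret = -errno;
        else if (n == 0)
            ret = -ETIMEDOUT;
    }

    /* On any failure, fall back to the PID; the child is still
     * unreaped, so the PID is safe to use here. */
    if (ret != 0)
        kill(child, SIGKILL);

    int status = 0;
    if (waitpid(child, &status, 0) < 0 && ret == 0)
        ret = -errno;
    if (pidfd >= 0)
        close(pidfd);

    if (ret != 0)
        return ret;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -ECHILD;
}
```

Note that poll() only reports that the process is gone; the exit information still has to be collected with one of the wait() calls.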

Waiting on a pidfd

While it is now possible to use poll() to learn when a process has exited, that is not a complete solution for process-management systems, which need to be able to wait for specific processes and reap the exit information once they are done. That requires some sort of variant on the wait() system call. To fill in that gap, Christian Brauner proposed the addition of yet another new system call:

    int pidfd_wait(int pidfd, int *stat_addr, siginfo_t *info,
                   struct rusage *rusage, int states, int flags);

This call would wait for the given pidfd; the states parameter can be used to specify which state transitions (WSTOPPED for when the process receives a stop signal, for example) to wait for. The flags field offers additional options, including WNOHANG for non-blocking operation; see the above-linked patch cover letter for the full list.

This call, Brauner said, is "one of the few missing pieces to make it possible to manage processes using only pidfds". It is destined to remain missing, though, at least in that form; Linus Torvalds made it clear that he didn't like it. He had no objection to the desired functionality, but questioned the need for a new system call; instead, he said, the waitid() system call should simply be extended with a new flag.

That is exactly what was done in a new patch series posted by Brauner; waitid() has gained a new P_PIDFD ID-type value that causes the given ID to be interpreted as a pidfd. This approach ended up being a rather smaller patch that does not need to add a new system call; there have been no responses to it as of this writing, but it would be unsurprising if this change were to be merged for 5.4.
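
Assuming the P_PIDFD interface is merged as posted, reaping a child through its pidfd could be sketched as follows. The value 3 used as a fallback for P_PIDFD is an assumption for C libraries that do not define it yet, and wait_on_pidfd_demo() is a hypothetical name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef P_PIDFD
#define P_PIDFD 3
#endif

/*
 * Fork a child that exits with status 42 and reap it by pidfd.
 * Returns the child's exit status, or a negative errno value.
 */
static int wait_on_pidfd_demo(void)
{
    pid_t child = fork();
    if (child < 0)
        return -errno;
    if (child == 0)
        _exit(42);

    int pidfd = (int)syscall(SYS_pidfd_open, child, 0);
    if (pidfd < 0) {
        int err = -errno;
        waitpid(child, NULL, 0);
        return err;
    }

    /* With P_PIDFD, the id argument is interpreted as a pidfd
     * rather than as a PID. */
    siginfo_t info = { 0 };
    int ret;
    if (waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED) < 0) {
        ret = -errno;
        waitpid(child, NULL, 0);  /* fall back so no zombie is left */
    } else if (info.si_code == CLD_EXITED) {
        ret = info.si_status;
    } else {
        ret = -ECHILD;
    }

    close(pidfd);
    return ret;
}
```

A kernel without P_PIDFD support would be expected to reject the call with EINVAL, which the sketch passes back to the caller after reaping the child the traditional way.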

Beyond the ability to unambiguously specify which process should be waited for, this change will eventually enable another interesting feature: it will make it possible to wait for a process that is not a child — something that waitid() cannot do now. Since a pidfd is a file descriptor, it can be passed to another process via an SCM_RIGHTS datagram in the usual manner. The recipient of a pidfd will, once this functionality is completed, be able to use it in most of the ways that the parent can to operate on (or wait for) the associated process.
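
For readers unfamiliar with descriptor passing, here is a minimal sketch of moving a pidfd across a Unix-domain socket with SCM_RIGHTS. To keep it short, both ends of the socket pair live in one process and the pidfd refers to the caller itself; in real use the receiver would be a different process. The helpers send_fd(), recv_fd(), and pass_pidfd_demo() are illustrative names, and the syscall-number fallbacks are assumed values.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values for common architectures). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif

/* Control-message buffer, aligned as struct cmsghdr requires. */
union fd_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one descriptor along with a single dummy data byte. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg = { 0 };

    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns it, or -1 on failure. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg = { 0 };
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Returns 0 if the received copy of the pidfd is usable. */
static int pass_pidfd_demo(void)
{
    int sv[2], ret = 0, received = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -errno;

    int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
    if (pidfd < 0)
        ret = -errno;
    if (ret == 0 && send_fd(sv[0], pidfd) < 0)
        ret = -errno;
    if (ret == 0 && (received = recv_fd(sv[1])) < 0)
        ret = -EIO;
    /* Signal 0 performs only the existence and permission checks. */
    if (ret == 0 &&
        syscall(SYS_pidfd_send_signal, received, 0, NULL, 0) < 0)
        ret = -errno;

    if (received >= 0)
        close(received);
    if (pidfd >= 0)
        close(pidfd);
    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

The received descriptor is a new reference to the same process, so it stays valid even if the sender closes its own copy.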

There was one other interesting piece in the original pidfd_wait() proposal: a new clone() flag (CLONE_WAIT_PID) that would cause the newly created process to be invisible to most wait() calls. Only a variant of wait() that specified that process in particular (by specifying its pidfd, for example) would be able to reap its exit information. There are a few use cases for this functionality; one that was listed is a library that needs to create a helper process that won't show up if the calling application calls wait(). This feature was not part of the second patch set, but is expected to show up in a separate posting in the near future.

There will almost certainly be other pidfd-oriented enhancements in the future; this feature is new and should not be considered to be complete. But the ability to wait on a pidfd might be seen as the end of the first round of development for the pidfd concept. It has been a relatively quiet set of changes, but the move to pidfds is a fundamental change in how processes are managed on Linux systems.


Completing the pidfd API

Posted Jul 26, 2019 21:10 UTC (Fri) by clugstj (subscriber, #4020) [Link] (12 responses)

I think it would be better to extend poll() to allow it to receive the exit information of the process (maybe read the exit info from the pidfd). That way, a thread could wait for process termination and socket activity at the same time.

Completing the pidfd API

Posted Jul 26, 2019 22:42 UTC (Fri) by roc (subscriber, #30627) [Link]

Yes, that seems like an obvious thing to want.

Completing the pidfd API

Posted Jul 27, 2019 2:45 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (10 responses)

poll on pidfds already works

Completing the pidfd API

Posted Jul 27, 2019 3:01 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

But not read() afterwards.

Completing the pidfd API

Posted Jul 27, 2019 3:15 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (7 responses)

Sure, but doesn't pidfd_wait serve the role of read?

Completing the pidfd API

Posted Jul 27, 2019 3:19 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

Well, yes. But we're getting a waitid() flag instead.

Completing the pidfd API

Posted Jul 27, 2019 3:28 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (5 responses)

Spelling differences. Doesn't change the model.

Completing the pidfd API

Posted Jul 27, 2019 11:47 UTC (Sat) by ale2018 (guest, #128727) [Link] (4 responses)

Since it is an fd, it would seem natural to expect to be able to read or write to it. Reading a bit when the process exits is not quite managing, say, a pipe. A pipe?! Hm... stdpid?

How do I know if the process is busy crunching, sleeping, or waiting for input?

Just fooling...

Completing the pidfd API

Posted Jul 28, 2019 0:53 UTC (Sun) by quotemstr (subscriber, #45331) [Link]

This is not a useful comment.

Completing the pidfd API

Posted Jul 30, 2019 9:21 UTC (Tue) by cyphar (subscriber, #110703) [Link] (2 responses)

We need to be very careful about adding read()/write() support to control-related fds -- because you can always spawn a setuid program with a different set of stdio fds and potentially trick it into reading or writing something that was not intended for the control fd (and if the permission checks aren't done at open() time, then you have just created a security bug).

Completing the pidfd API

Posted Aug 1, 2019 7:16 UTC (Thu) by mezcalero (subscriber, #45103) [Link] (1 responses)

I think the lesson of this is probably not to introduce any new setuid programs anymore, and do privilege elevation only by IPC.

Completing the pidfd API

Posted Aug 2, 2019 9:14 UTC (Fri) by flussence (guest, #85566) [Link]

We should probably replace CAP_SYS_ADMIN programs (e.g. ffmpeg kmsgrab without running explicitly as root) with IPC first. setuid is less subversive, as at least it's visible in ls.

Completing the pidfd API

Posted Jul 28, 2019 1:02 UTC (Sun) by clugstj (subscriber, #4020) [Link]

Oh, I see now. Just poll() for whatever processes/sockets you want. When the poll() returns saying the process has exited, use pidfd_wait() to get the result.

Completing the pidfd API

Posted Jul 26, 2019 22:53 UTC (Fri) by roc (subscriber, #30627) [Link] (1 responses)

The discussion does not mention how this interacts with ptrace. rr could potentially benefit from the ability to hand ptrace control of a traced task from one ptracer process to another. I guess even if some pidfd-based API let other processes read wait statuses, those other processes still wouldn't be able to execute ptrace() commands because they're not the (sole) ptracer of the traced tasks.

Another question is whether this new API follows the ptrace/waitpid behavior, i.e. each ptraced thread of a process reports exit independently and is independently reaped. I really want that to be true, because that would give us a sane and reliable way to wait for some specific subset of all traced threads to exit, which is currently impossible.

Completing the pidfd API

Posted Aug 27, 2019 19:17 UTC (Tue) by nix (subscriber, #2304) [Link]

The fact that it's modelled on waitid() suggests not. waitid() throws away some of the ptrace()-necessary info waitpid() packs into its return value, so you can't use it if you're doing ptrace monitoring (though this is nowhere documented that I can see: you have to reverse-engineer it from the code and from the fact that the ptrace documentation never once mentions waitpid).

Completing the pidfd API

Posted Jul 27, 2019 20:07 UTC (Sat) by doublez13 (guest, #122213) [Link] (2 responses)

Can we get a link to the patches/RFCs for the Android work mentioned? Thanks.

Completing the pidfd API

Posted Jul 27, 2019 20:17 UTC (Sat) by brauner (subscriber, #109349) [Link] (1 responses)

Not sure about the actual LMKD work but the backports for the kernels at least do exist:
https://android-review.googlesource.com/q/topic:%22pidfd+...
https://android-review.googlesource.com/q/topic:%22pidfd+...
https://android-review.googlesource.com/q/topic:%22pidfd+...

Completing the pidfd API

Posted Jul 28, 2019 0:47 UTC (Sun) by doublez13 (guest, #122213) [Link]

Thank you! :)

Completing the pidfd API

Posted Jul 28, 2019 22:34 UTC (Sun) by naptastic (guest, #60139) [Link] (1 responses)

I'm really excited to see this work happening, even though most of my work is far removed from the kernel.

I think BSD saw the (valid, real) problems with /proc and took the wrong lesson, where Linux is now converging on something smarter: providing an even more UNIXy interface ("a process is now also a file") to the process space. I'm looking forward to using this functionality, even if only indirectly.

Completing the pidfd API

Posted Jul 29, 2019 10:19 UTC (Mon) by wahern (subscriber, #37304) [Link]

FreeBSD has had pdfork for almost 8 years (9.0 released Jan 2012): https://www.freebsd.org/cgi/man.cgi?query=pdfork&sekt...

The real dilemma after this is how to acquire process fds when children fork. The BSD kqueue framework has permitted tracking forks and exits of descendants since almost the beginning[1], though there's still no mechanism to acquire a process fd for them.

I mention this because there's no grand theory for a better process model, unless you count Capsicum from whence pdfork came. But in the Capsicum security model forking is normally disabled in descendants. Arguably one of the reasons it's taken Linux so long to get a process fd is precisely because of all the open ended questions about where to go next, which while unanswered have the effect of casting doubt on the utility of process fds, notwithstanding that most people agree that in the abstract they're a great idea.

[1] Sometime between 1999, when kqueue was originally merged, and 2003, the earliest hit I got with a naive Google search.

Completing the pidfd API

Posted Oct 8, 2019 2:45 UTC (Tue) by rvk (guest, #111525) [Link]

Would be nice to get this working with process events connector.

Completing the pidfd API

Posted Mar 5, 2020 3:46 UTC (Thu) by re:fi.64 (subscriber, #132628) [Link] (1 responses)

> Beyond the ability to unambiguously specify which process should be waited for, this change will eventually enable another interesting feature: it will make it possible to wait for a process that is not a child — something that waitid() cannot do now.

I think this is false; at least when I tried it, waitid always returned ESRCH.

Completing the pidfd API

Posted Mar 5, 2020 3:50 UTC (Thu) by re:fi.64 (subscriber, #132628) [Link]

Whoops, minor amendment: ECHILD, not ESRCH

Completing the pidfd API

Posted Jun 14, 2023 2:44 UTC (Wed) by jredfox_ (guest, #165585) [Link] (3 responses)

This is a quick fix for an invalid solution. The proper solution is the PID solution. You should have PID-CREATIONTIME in the arguments, where the creation time is preferably in milliseconds. I don't like jiffies or Unix seconds; they are not precise enough.

Or, an even better solution: create a call named reservePID(unsigned long PID). This would reserve the PID until the process that called it is closed. For security reasons, it should limit the number of reservations to about 200 PIDs per process for IPC (unrelated, non-child processes), with an unlimited number for child processes.

Completing the pidfd API

Posted Jun 14, 2023 5:07 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> you should have in the arguments PID-CREATIONTIME where creation time is MS preferably. I don't like JIFFIES or UNIX seconds it's not precise enough.

Why not UUIDs then?

And pidreserve doesn't prevent all race attacks.

Completing the pidfd API

Posted Jun 14, 2023 23:23 UTC (Wed) by jredfox_ (guest, #165585) [Link] (1 responses)

UUIDS MS creation time is way over 700! Also UUID collisions can and have occurred. currentMS states the current ms the creation time was fetched and the creation time of the process never changes. UUID's are problematic

Completing the pidfd API

Posted Jun 15, 2023 1:42 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Version 1 and 2 UUIDs include system time, so they can't collide unless the kernel is compromised.

But more practically, your system with time-based IDs is just ugly, just like UUIDs.

And pidreserve() won't help against targeted wraparound attacks.

Completing the pidfd API

Posted Sep 6, 2023 16:59 UTC (Wed) by bartoc (guest, #124262) [Link] (1 responses)

Does this get rid of some shared data structure in the kernel that maps PIDs to the actual resources of the process? I ask because I noticed that with pthreads one can write "pthread_detach(pthread_self())" to detach yourself, whereas presumably detaching yourself via a pidfd you got from "yourself" would be a no-op, as any resources would be held open by the other pidfd opened by your parent thread. I noticed this when implementing pthreads/C11 threads on Windows, where threads are represented by HANDLEs that are reference counted and work similarly to files. On Windows I came to the conclusion that allowing threads to detach themselves would require a shared data structure holding the mapping of PIDs to HANDLEs.

Completing the pidfd API

Posted Oct 17, 2024 0:34 UTC (Thu) by jengelh (guest, #33263) [Link]

>one can write "pthread_detach(pthread_self())" to detach yourself, whereas presumably detaching yourself via a pidfd you got from "yourself" would be a no-op

In glibc-nptl, pthread_self and _detach are functions that involve just userspace. There is not going to be a deadlock/deadlock-avoiding-noop as you envisioned.