A new filesystem for pidfds

By Jonathan Corbet
March 13, 2024

The pidfd abstraction is a Linux-specific way of referring to processes that avoids the race conditions inherent in Unix process ID numbers. Since a pidfd is a file descriptor, it needs a filesystem to implement the usual operations performed on files. As the use of pidfds has grown, they have stressed the limits of the simple filesystem that was created for them. Christian Brauner has created a new filesystem for pidfds that seems likely to debut in the 6.9 kernel, but it ran into a little bump along the way, demonstrating that things you cannot see can still hurt you.

In this case, the pidfd filesystem is indeed invisible; it cannot be mounted and accessed like most other filesystems. A pidfd is created with a system call like pidfd_open() or clone3(), so there is no need for a visible filesystem. (One could imagine such a filesystem as a way of showing all of the existing processes in the system, but /proc already exists for that purpose.) Since there was no need to implement many of the usual filesystem operations, pidfds were implemented using anon_inode_getfile(), a helper that creates file descriptors for simple, virtual filesystems. Over time, though, this filesystem has proved to be a bit too simple, leading to Brauner's pidfdfs proposal as a replacement.
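
As a rough illustration of what "a pidfd is a file descriptor" means from user space, the sketch below uses Python's os.pidfd_open() wrapper (Python 3.9 or later on Linux 5.3 or later) to wait for a process to exit with poll(). The helper name wait_for_exit() is invented for this example and is not part of any API discussed here.

```python
import os
import select


def wait_for_exit(pid, timeout=5.0):
    """Return True if process `pid` exits within `timeout` seconds."""
    # os.pidfd_open() wraps pidfd_open(2); the result is an ordinary fd.
    pidfd = os.pidfd_open(pid)
    try:
        poller = select.poll()
        # A pidfd becomes readable once the process it refers to has exited.
        poller.register(pidfd, select.POLLIN)
        return bool(poller.poll(int(timeout * 1000)))
    finally:
        os.close(pidfd)
```

Once the pidfd is open, it keeps referring to the same process even if the numeric PID is later recycled, which is the race that plain PIDs cannot avoid.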

So what was the problem with the anonymous-inode approach? Brauner provides a list of capabilities added by pidfdfs in the changelog to this patch. It allows system calls like statx() to be used on a pidfd, for example, and that, in turn, allows for direct comparison of two pidfds to see whether they refer to the same process. While not implemented yet, pidfdfs will enable functionality like automatically killing a process when the last pidfd referring to it is closed. The initial version of the series also used dentry_open() to set up the "file" behind the pidfd; that brought the opening of the pidfd under the control of Linux security modules and made the user-space file-notification system calls work with them as well.
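
The comparison mentioned above might look something like the following sketch, in which os.fstat() stands in for statx() and same_process() is a hypothetical helper name. It assumes a kernel with pidfdfs, where each process gets its own inode; on older kernels, as far as I understand, every pidfd shares one anonymous inode, so this test would report a match for any pair of pidfds and cannot be trusted there.

```python
import os


def same_process(pidfd_a, pidfd_b):
    """Compare two pidfds by the device and inode numbers behind them.

    Only meaningful on kernels where each process has a distinct pidfd inode.
    """
    st_a = os.fstat(pidfd_a)
    st_b = os.fstat(pidfd_b)
    return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
```

A caller would obtain both descriptors independently (for example, one from os.pidfd_open() and one received from another process) and pass them to same_process().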

The patch series subsequently had to evolve considerably, though. Linus Torvalds was not entirely happy with how it had been implemented, even though much of that implementation was borrowed from the existing namespace filesystem in the kernel. Some significant reworking followed, resulting in a cleaner implementation that Torvalds described as "quite nice".

That was not the end of the story, though. Nathan Chancellor reported that, with pidfdfs in the kernel, many services on his system failed at boot time; Heiko Carstens ran into similar problems. It turns out that, while users may or may not appreciate the robustness of race-free process management, they are, without exception, unimpressed by a system that lacks functional networking. So Brauner had to go looking for an explanation.

It seems, though, that he already knew where to look when "something fails for completely inexplicable reasons": the SELinux security module. As noted above, one of the advantages of the new filesystem is that it exposed pidfd operations to security modules, which is something that the policy maintainers had requested. The downside is that it exposed those operations to security modules, one of which promptly set about denying them.

There was, as Brauner later described, a bit of a cascade of failures here. SELinux started seeing events on a new type of file descriptor that it had no policy for; following fairly normal security practice, it responded by denying everything, causing attempts to work with pidfds to fail. The dbus-broker process, on seeing these failures, decided to just throw up its virtual hands and let the system collapse into a smoldering heap. This is somewhat ironic given that, as Brauner pointed out, that process has a PID-using fallback path that it uses on kernels that do not support pidfds at all, but it didn't use that path here. So, to truly fix this problem, there needs to be both an SELinux policy update and a D-Bus fix; patches for both have already been prepared and submitted.
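
The kind of fallback described here can be sketched as follows. This is not dbus-broker's code; process_handle() and its open_pidfd parameter are hypothetical names, and the sketch simply treats any OSError (whether the kernel lacks pidfds or a security policy refuses the operation) as a reason to fall back to the plain PID.

```python
import os
from typing import Callable, Optional, Tuple


def process_handle(pid: int,
                   open_pidfd: Optional[Callable[[int], int]] = None
                   ) -> Tuple[str, int]:
    """Return ("pidfd", fd) when a pidfd can be had, else ("pid", pid)."""
    # Use the supplied opener, or os.pidfd_open() if this Python has it.
    opener = open_pidfd or getattr(os, "pidfd_open", None)
    if opener is not None:
        try:
            return ("pidfd", opener(pid))
        except OSError:
            # Unsupported or denied: fall through to the PID-based path.
            pass
    return ("pid", pid)
```

Whether silently falling back after a policy denial is acceptable is a design decision for the service in question; the point is only that a denial need not be fatal.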

Even then, though, there was the little problem that some systems may get a new kernel before the above fixes arrive. The same users who have proved so strangely intolerant of broken networking are likely to also be slow to accept the idea that networking will only come back once their user-space code has been fixed and updated. Beyond that, Torvalds didn't like the idea that the internal filesystem change somehow caused the resulting descriptors to behave differently in user space, and requested that something better be done.

After a bit of discussion, Brauner found a solution. Rather than call dentry_open(), the new filesystem sets up the new file descriptor directly, using lower-level operations, and without invoking the problematic security hook. The people in charge of security modules still want to be able to intervene in pidfd creation, of course; that will be accommodated by adding a new security hook for that case. Once SELinux (or any other security module) is ready to make decisions about pidfds, it can use the new hook; until then, things will work as they did before. Torvalds liked this approach: "This is how new features go in: they act like the old ones, but have expanded capabilities that you can expose for people who want to use them".

With those changes, it would appear that the roadblocks to the addition of pidfdfs have been overcome. The code is in linux-next now, and will probably find its way to the mainline for the 6.9 release. Most users will, if all goes according to plan, never notice that anything has changed.

A new filesystem for pidfds

Posted Mar 13, 2024 12:20 UTC (Wed) by jkingweb (subscriber, #113039)

What a journey. That was a fun read.

A new filesystem for pidfds

Posted Mar 13, 2024 17:37 UTC (Wed) by mattdm (subscriber, #18)

Agreed -- thanks, LWN!

A new filesystem for pidfds

Posted Mar 13, 2024 13:20 UTC (Wed) by bluca (subscriber, #118303)

And already queued for 6.10, a new way to get a reference to the new filesystem: a pidfdfsfd. Which then unlocks the possibility of querying which process holds such a ref, by getting a pidpidfdfsfd. Which then brings us...

A new filesystem for pidfds

Posted Mar 14, 2024 0:21 UTC (Thu) by am (subscriber, #69042)

systemd-pidpidfdfsfdd.service - a clever solution to a difficult problem that you never realized you've always had!

A new filesystem for pidfds

Posted Mar 13, 2024 18:05 UTC (Wed) by ianmcc (subscriber, #88379)

Why wasn't /proc/pid reused for this? If there was a way to access /proc without it being mounted, then that would seem to solve some other problems.

A new filesystem for pidfds

Posted Mar 13, 2024 21:46 UTC (Wed) by NYKevin (subscriber, #129325)

According to pidfd_open(2), opening /proc/pid *does* give you a pidfd, but it has a few minor restrictions compared to a properly-obtained pidfd (from pidfd_open(2) or clone(2)). I am uncertain of the reason for those restrictions, but it does look like there was a conscious decision to create two different kinds of pidfd. Maybe someone who actually knows what they're talking about can shed further light on this issue.

A new filesystem for pidfds

Posted Mar 13, 2024 22:49 UTC (Wed) by fraetor (subscriber, #161147)

IIRC the main difference is that you can only get the pidfd from /proc after the process has been created. This leaves (at least a little) time for the desired process to exit and the pid to be recycled for another process.

This race condition is one of the main things pidfds exist to fix, so it's better to get the fd immediately at process creation.

A new filesystem for pidfds

Posted Mar 14, 2024 1:46 UTC (Thu) by magfr (subscriber, #16052)

But why is that a hindrance to clone returning an fd in procfs? Sure, there might not be any link to it left under /proc, but that is no problem, is it?

A new filesystem for pidfds

Posted Mar 14, 2024 4:29 UTC (Thu) by ibukanov (subscriber, #3942)

A pid can only be recycled after the parent process has called waitpid() or a related function. Thus, in a properly written application, there should be no race when using the pid.

However, that does complicate the code, especially when one writes a library that wants to start an external process, as arranging for the waitpid() call becomes problematic.

A new filesystem for pidfds

Posted Mar 14, 2024 8:14 UTC (Thu) by roc (subscriber, #30627)

It's extra-annoying because waitpid() only has a limited set of options for what you can wait for. You can't choose an arbitrary set of processes to wait on.

It's also a problem that only the parent can do the wait. You can't pass ownership of the subprocess to another process this way.
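
For contrast, here is a minimal sketch of waiting on an arbitrary set of processes with pidfds and poll(). It assumes Python 3.9 or later on Linux 5.3 or later; wait_any() is a name made up for this illustration.

```python
import os
import select


def wait_any(pids, timeout=None):
    """Block until at least one process in `pids` exits.

    Returns the PIDs that have exited (empty if the timeout expired).
    """
    fd_to_pid = {}
    try:
        poller = select.poll()
        for pid in pids:
            fd = os.pidfd_open(pid)
            fd_to_pid[fd] = pid
            # Each pidfd becomes readable when its process exits.
            poller.register(fd, select.POLLIN)
        wait_ms = None if timeout is None else int(timeout * 1000)
        return [fd_to_pid[fd] for fd, _event in poller.poll(wait_ms)]
    finally:
        for fd in fd_to_pid:
            os.close(fd)
```

The set of PIDs is entirely up to the caller, and the processes do not need to be the caller's children for the wait itself to work, though only the parent can still collect the exit status.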

A new filesystem for pidfds

Posted Mar 19, 2024 7:04 UTC (Tue) by NYKevin (subscriber, #129325)

> It's also a problem that only the parent can do the wait. You can't pass ownership of the subprocess to another process this way.

Technically, you can sorta kinda hand the ownership to another process by passing PR_SET_CHILD_SUBREAPER to prctl(2), but there are so many caveats with this method that it's not even funny:

* The destination process must be an ancestor of yours.
* You have to orphan the child process, which means you have to do a double-fork.
* It is a global (process-wide) flag on the destination process, which makes it assume ownership of all orphaned processes under it, not just your particular process. If the destination does not periodically call wait, it will leak zombies until it dies.
* It is really meant to be used by things like systemd. If you are not implementing something that resembles systemd, there are probably other pitfalls I'm unaware of.

In short: This may be reasonable if you are trying to make an entirely self-contained self-managing all-singing all-dancing daemon that does all of its own bookkeeping, sessions, etc., but in practice you're probably better off configuring systemd to do those things for you instead, unless you're one of those sysvinit-or-death people.
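
For the curious, the subreaper dance can be sketched roughly as below. Python's standard library has no prctl() wrapper, so this goes through ctypes; the constant value is taken from <linux/prctl.h>, and become_subreaper() and adopt_orphan() are hypothetical names. In this sketch the calling process plays the role of the destination, so it is trivially an ancestor of the orphan.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>


def become_subreaper():
    """Make the calling process adopt all orphaned descendants."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1),
                  zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def adopt_orphan():
    """Double-fork, then reap the orphan; return (orphan pid, reaped pid)."""
    become_subreaper()
    read_end, write_end = os.pipe()
    middle = os.fork()
    if middle == 0:
        # Intermediate child: fork the worker, report its PID, then exit
        # so that the worker is orphaned.
        os.close(read_end)
        worker = os.fork()
        if worker == 0:
            os.close(write_end)
            os._exit(0)
        os.write(write_end, str(worker).encode())
        os._exit(0)
    os.close(write_end)
    worker_pid = int(os.read(read_end, 32))
    os.close(read_end)
    os.waitpid(middle, 0)
    # This wait succeeds only because the orphan was reparented to us.
    reaped, _status = os.waitpid(worker_pid, 0)
    return worker_pid, reaped
```

Note that the flag stays set for the rest of the process's life, which is the process-wide caveat from the list above: every orphan beneath it now has to be reaped by this process.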

A new filesystem for pidfds

Posted Mar 14, 2024 9:52 UTC (Thu) by bluca (subscriber, #118303)

That only helps if it's the parent process that needs to reliably identify another process. That is not the case, and hasn't been for years.

A new filesystem for pidfds

Posted Mar 14, 2024 2:44 UTC (Thu) by pcmoore (subscriber, #37989)

It's frustrating that all of these decisions around LSM hooks and the SELinux implementation were done without CC'ing the LSM or SELinux mailing lists. We need to improve collaboration across Linux kernel subsystems, and adding the appropriate CCs to a discussion - even part way through the discussion - should be standard practice.