A new filesystem for pidfds
In this case, the pidfd filesystem is indeed invisible; it cannot be mounted and accessed like most other filesystems. A pidfd is created with a system call like pidfd_open() or clone3(), so there is no need for a visible filesystem. (One could imagine such a filesystem as a way of showing all of the existing processes in the system, but /proc already exists for that purpose.) Since there was no need to implement many of the usual filesystem operations, pidfds were implemented using anon_inode_getfile(), a helper that creates file descriptors for simple, virtual filesystems. Over time, though, this filesystem has proved to be a bit too simple, leading to Brauner's pidfdfs proposal as a replacement.
So what was the problem with the anonymous-inode approach? Brauner provides a list of capabilities added by pidfdfs in the changelog to this patch. It allows system calls like statx() to be used on a pidfd, for example, and that, in turn, allows for direct comparison of two pidfds to see whether they refer to the same process. While not implemented yet, pidfdfs will enable functionality like automatically killing a process when the last pidfd referring to it is closed. The initial version of the series also used dentry_open() to set up the "file" behind the pidfd; that brought the opening of the pidfd under the control of Linux security modules and made the user-space file-notification system calls work with them as well.
The patch series subsequently had to evolve considerably, though. Linus Torvalds was not entirely happy with how it had been implemented, even though much of that implementation was borrowed from the existing namespace filesystem in the kernel. Some significant reworking followed, resulting in a cleaner implementation that Torvalds described as "quite nice".
That was not the end of the story, though. Nathan Chancellor reported that, with pidfdfs in the kernel, many services on his system failed at boot time; Heiko Carstens ran into similar problems. It turns out that, while users may or may not appreciate the robustness of race-free process management, they are, without exception, unimpressed by a system that lacks functional networking. So Brauner had to go looking for an explanation.
It seems, though, that he already knew where to look when "something fails for completely inexplicable reasons": the SELinux security module. As noted above, one of the advantages of the new filesystem is that it exposed pidfd operations to security modules, which is something that the policy maintainers had requested. The downside is that it exposed those operations to security modules, one of which promptly set about denying them.
There was, as Brauner later described, a bit of a cascade of failures here. SELinux started seeing events on a new type of file descriptor that it had no policy for; following fairly normal security practice, it responded by denying everything, causing attempts to work with pidfds to fail. The dbus-broker process, on seeing these failures, decided to just throw up its virtual hands and let the system collapse into a smoldering heap. This is somewhat ironic given that, as Brauner pointed out, that process has a PID-using fallback path that it uses on kernels that do not support pidfds at all, but it didn't use that path here. So, to truly fix this problem, there needs to be both an SELinux policy update and a D-Bus fix; patches for both have already been prepared and submitted.
Even then, though, there was the little problem that some systems may get a new kernel before the above fixes arrive. The same users who have proved so strangely intolerant of broken networking are likely to also be slow to accept the idea that networking will only come back once their user-space code has been fixed and updated. Beyond that, Torvalds didn't like the idea that the internal filesystem change somehow caused the resulting descriptors to behave differently in user space, and requested that something better be done.
After a bit of discussion, Brauner found a solution. Rather than call dentry_open(), the new filesystem sets up the new file descriptor directly, using lower-level operations, and without invoking the problematic security hook. The people in charge of security modules still want to be able to intervene in pidfd creation, of course; that will be accommodated by adding a new security hook for that case. Once SELinux (or any other security module) is ready to make decisions about pidfds, it can use the new hook; until then, things will work as they did before. Torvalds liked this approach: "This is how new features go in: they act like the old ones, but have expanded capabilities that you can expose for people who want to use them".
With those changes, it would appear that the roadblocks to the addition of
pidfdfs have been overcome. The code is in linux-next now, and will
probably find its way to the mainline for the 6.9 release. Most users
will, if all goes according to plan, never notice that anything has
changed.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems |
| Kernel | pidfd |
Posted Mar 13, 2024 12:20 UTC (Wed)
by jkingweb (subscriber, #113039)
[Link] (1 responses)
Posted Mar 13, 2024 17:37 UTC (Wed)
by mattdm (subscriber, #18)
[Link]
Posted Mar 13, 2024 13:20 UTC (Wed)
by bluca (subscriber, #118303)
[Link] (1 responses)
Posted Mar 14, 2024 0:21 UTC (Thu)
by am (subscriber, #69042)
[Link]
Posted Mar 13, 2024 18:05 UTC (Wed)
by ianmcc (subscriber, #88379)
[Link] (7 responses)
Posted Mar 13, 2024 21:46 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (6 responses)
Posted Mar 13, 2024 22:49 UTC (Wed)
by fraetor (subscriber, #161147)
[Link] (5 responses)
This race condition is one of the main things pidfds exist to fix, so it's better to get the fd immediately at process creation.
Posted Mar 14, 2024 1:46 UTC (Thu)
by magfr (subscriber, #16052)
[Link]
Posted Mar 14, 2024 4:29 UTC (Thu)
by ibukanov (subscriber, #3942)
[Link] (3 responses)
However, that does complicate the code, especially when one writes a library that wants to start an external process, as arranging for the waitpid() call becomes problematic.
Posted Mar 14, 2024 8:14 UTC (Thu)
by roc (subscriber, #30627)
[Link] (1 responses)
It's also a problem that only the parent can do the wait. You can't pass ownership of the subprocess to another process this way.
Posted Mar 19, 2024 7:04 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link]
Technically, you can sorta kinda hand the ownership to another process by passing PR_SET_CHILD_SUBREAPER to prctl(2), but there are so many caveats with this method that it's not even funny:
* The destination process must be an ancestor of you.
* You have to orphan the child process, which means you have to do a double-fork.
* It is a global (process-wide) flag on the destination process, which makes it assume ownership of all orphaned processes under it, not just your particular process. If the destination does not periodically call wait, it will leak zombies until it dies.
* It is really meant to be used by things like systemd. If you are not implementing something that resembles systemd, there are probably other pitfalls I'm unaware of.
In short: This may be reasonable if you are trying to make an entirely self-contained self-managing all-singing all-dancing daemon that does all of its own bookkeeping, sessions, etc., but in practice you're probably better off configuring systemd to do those things for you instead, unless you're one of those sysvinit-or-death people.
Posted Mar 14, 2024 9:52 UTC (Thu)
by bluca (subscriber, #118303)
[Link]
Posted Mar 14, 2024 2:44 UTC (Thu)
by pcmoore (subscriber, #37989)
[Link]
