Toward race-free process signaling

December 6, 2018

This article was contributed by Marta Rybczyńska

Signals have existed in Unix systems for years, despite the general consensus that they are an example of a bad design. Extensions and new ways of using signals pop up from time to time, fixing the issues that have been found. A notable addition was the introduction of signalfd() nearly 10 years ago. Recently, the kernel developers have discussed how to avoid race conditions related to process-ID (PID) recycling, which occurs when a process terminates and another one is assigned the same PID. A process that fails to notice that its target has exited may try to send a signal to the wrong recipient, with potentially grave consequences. A patch set from Christian Brauner is trying to solve the issue by adding signaling via file descriptors.

PIDs increase for each new process up to the maximum value, and then go back to the beginning. For the maximum value, most distributions use the conservative value of 32768 to avoid breaking legacy systems. However, users can consult and change the maximum value in /proc/sys/kernel/pid_max. Signal-related APIs identify processes by PID. The disadvantage of this method is that, in the lifetime of a system, the same PID is reused as processes are created and terminated. What happens if a process has finished and another one has taken its PID? The PID value stays valid. Other processes, unaware of the situation, may try to send signals to the wrong process. This may have consequences as serious as terminating the wrong service. This race condition requires the PID space to wrap between the creation of the two processes, which is not uncommon.
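
As a small illustration of the limit mentioned above, the following sketch reads the current maximum from the file named in the text. The helper name `read_pid_max()` is made up for this example, and the code assumes a Linux system with /proc mounted; it is not taken from any of the patches discussed here.

```c
#include <assert.h>
#include <stdio.h>

/* Read the PID limit from the given file (normally
 * /proc/sys/kernel/pid_max). Returns -1 if the file cannot be
 * opened or parsed, for example when /proc is not mounted. */
long read_pid_max(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;

    fclose(f);
    return value;
}
```

Calling `read_pid_max("/proc/sys/kernel/pid_max")` gives the number of PIDs available before the counter wraps; the smaller that number, the sooner a PID can come back attached to a different process.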

`/proc/pid/kill` proposal

The recent discussion started when Daniel Colascione proposed adding a file called kill to each process's /proc directory. Writing the numerical value of the desired signal to that file would send that signal to the selected process. The proposal tried to close the race by holding the /proc directory open, with the intention of preventing the PID from being reused at the wrong time. The discussion showed that other developers were considering the same problem in parallel.
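
To make the proposed interface concrete, here is a rough sketch of how a program might have used it. The kill file is only a proposal and is not present in mainline kernels, so `proc_kill_open()` is expected to fail there. The helper names and the choice of writing the signal number as decimal text are assumptions made for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: open the proposed /proc/<pid>/kill file. The caller keeps
 * the descriptor and uses it later. Returns -1 if the file is missing. */
int proc_kill_open(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld/kill", (long)pid);
    return open(path, O_WRONLY);
}

/* Hypothetical: request delivery of sig by writing its number as decimal
 * text (an assumed format). Returns 0 on success, -1 on error. */
int proc_kill_send(int killfd, int sig)
{
    char buf[16];

    assert(sig >= 0);
    int len = snprintf(buf, sizeof(buf), "%d", sig);
    ssize_t written = write(killfd, buf, (size_t)len);
    return written == (ssize_t)len ? 0 : -1;
}
```

The point of the split is that the process is identified once, when the file is opened, rather than each time a signal is sent.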

While the problem was well understood, a debate started about the implementation. One part of the discussion concerned whether opening the /proc directory is enough to prevent PID reuse, in which case the patch would not have been necessary in the first place. A small test case was developed and showed that the answer is no. A modification in the kernel, like the proposed patch, is needed to solve the issue.

A heated debate followed on a proper API to deliver the signal: should it be a write to the file or another system call? Jann Horn suggested an ioctl(), followed by Tycho Andersen, who agreed that this would simplify the permission checks. Colascione replied by supporting the choice of using write() and stated that it is unsafe to call ioctl() on an unknown descriptor: if the descriptor turns out to be something else, there is no way to know what effect a random operation will have (though the same can be said of writing to an unknown file). His other option was adding a new system call.

Another part of the debate was started by Aleksa Sarai who added namespaces to the mix: there is a risk of processes sending signals between PID namespaces in situations when they normally do not have that ability. He suggested that only processes from the same PID namespace should be able to send such signals.

In the same thread, Brauner mentioned that he was working on a similar solution and proposed postponing the patch review until after the discussion at the Linux Plumbers Conference, which was then two weeks away. Colascione and other developers were interested to know more. Colascione also added some context to the discussion by noting that there had been previous, but unsuccessful, attempts to solve this problem. There are two options to fix the issue and keep the interface race-free: either keep a PID reserved while the handle is open, or keep a reference to the struct pid instead of the PID value.

Signaling by `/proc/pid`

After LPC, Brauner submitted a new patch set. It proposes to solve the signal delivery issue by using file descriptors to identify processes; these descriptors would be obtained by opening a process's /proc directory. The solution Brauner proposed is to store a handle to the process's struct pid in the inode associated with that descriptor; this gives a stable handle that does not have the disadvantages of the simple PID number. It turned out that the patch could be simplified; Eric W. Biederman explained that the handle is already present in the inode reference and proc_pid() is enough to get the handle from a file descriptor.

While the first part of the patch deals with getting the handle, the second part of the patch set implements sending the signal itself; it is done using a new system call named procfd_signal(). This system call operates on a file descriptor of a process; the previous discussions convinced Brauner that this is a solution preferred over an ioctl(). The new system call has the following prototype:

    long procfd_signal(int fd, int sig, siginfo_t *info, int flags);

It sends the signal sig to the process identified by the file descriptor fd. The optional info argument is a pointer to siginfo_t provided by the caller (used when sending realtime signals), and flags is reserved for future use and should be zero. On success, the system call returns zero; in case of an error it returns -1 and errno is set to the detailed error code: EBADF if the given file descriptor is not valid, EINVAL if the signal value is invalid or the file descriptor does not refer to a process, EPERM if the caller does not have sufficient permissions to send a signal to the target, and ESRCH if the target process does not exist.
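
A caller would typically translate these error codes into something readable. The sketch below is one way to do that; the function name `procfd_signal_strerror()` is invented for this example, and the messages simply paraphrase the list above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Map the errno values documented for the proposed system call to short
 * descriptions. Any other value falls back to strerror(). */
const char *procfd_signal_strerror(int err)
{
    assert(err >= 0);

    switch (err) {
    case 0:
        return "signal sent";
    case EBADF:
        return "file descriptor is not valid";
    case EINVAL:
        return "invalid signal or descriptor does not refer to a process";
    case EPERM:
        return "insufficient permission to signal the target";
    case ESRCH:
        return "target process does not exist";
    default:
        return strerror(err);
    }
}
```

Of these, ESRCH is the interesting case: with a descriptor-based handle it means that the original process is gone, not that the PID now belongs to some other process.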

The submission caused discussions of both the implementation and the use of signaling via file descriptors. While there has been no direct opposition, the developers noted a number of issues that should be taken into account. Sarai started a discussion about sending signals to other namespaces. As a result, a check has been added so that sending signals is possible only to processes in child PID namespaces. This avoids problems when file descriptors leak between namespaces, for example when the root file system is bind-mounted into a container. The ability to send signals to ancestor namespaces can always be added in the future.

The debate on which system call to use restarted with Andersen again preferring an ioctl() interface. Colascione and Brauner argued instead for a new system call. This kind of debate happens quite often in the kernel community. Some developers prefer adding a new ioctl(), because they think that adding a system call is too complex. On the other hand, ioctl() is considered a worse API. Andy Lutomirski added a twist to the debate and proposed a better version of the ioctl() system call. The discussion finished without a clear conclusion.

An example of how to use the mechanism has been posted in the cover letter. It is simple: the program opens the right /proc/pid directory and then sends the signal with all parameters.
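
The following is a minimal sketch of that pattern, not the example from the cover letter. It assumes a kernel in which this work is available under its later name, pidfd_send_signal() (Linux 5.1 and newer), and it defines the syscall number by hand when the headers lack it. The helper names are made up for illustration. On older kernels the call simply fails with ENOSYS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424 /* assumption: number used on most architectures */
#endif

/* Open /proc/<pid> to obtain a handle that keeps referring to this
 * particular process, even if the PID number is recycled later. */
int open_proc_dir(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld", (long)pid);
    return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Send sig to the process behind procfd. No siginfo is supplied and flags
 * must be zero. Returns 0 on success, -1 with errno set on failure. */
int signal_via_fd(int procfd, int sig)
{
    assert(sig >= 0);
    return (int)syscall(SYS_pidfd_send_signal, procfd, sig, NULL, 0);
}
```

A caller would open the directory once, for example `int fd = open_proc_dir(target);`, and later call `signal_via_fd(fd, SIGTERM)`. If the target has exited in the meantime, the call should fail with ESRCH instead of reaching an unrelated process.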

The second version of the submission included both 32-bit and 64-bit versions of the system call with two different entry points. Lutomirski objected, explaining that this design should be avoided for new system calls. The two versions were needed because the definition of siginfo_t differs between the two cases. An easy way to avoid creating multiple entry points was eventually found, and the patch set was reposted as taskfd_send_signal() with the same argument types.

The submission tries to fix a problem that has been experienced by multiple people and the developers seem motivated to have the work done. The solution follows the long-established convention of using file descriptors. It is not yet clear whether this approach will be accepted; more iterations will probably be needed. However, it seems likely that we will get an improvement in the robustness of signal usage in the not-too-distant future.


Toward race-free process signaling

Posted Dec 6, 2018 19:26 UTC (Thu) by daniele.nicolodi (subscriber, #94121) [Link] (23 responses)

I may be missing something but I don't see how this solves all race conditions. Unless a mechanism is introduced to fork a process and return a file handle pointing at /proc/pid there still is a race between the time fork() returns and /proc/pid is opened.

Toward race-free process signaling

Posted Dec 6, 2018 19:58 UTC (Thu) by jspenguin (guest, #120333) [Link] (22 responses)

There has never been an issue with a parent sending a signal to a direct child process. PIDs are never recycled until the parent process calls wait() or exits. This mechanism would only be necessary for processes sending signals to other processes that are not direct child processes.

Toward race-free process signaling

Posted Dec 6, 2018 20:07 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (21 responses)

There's a race here nevertheless: The pid of a process one wants to send a signal to can only be known at some time x + n, x being the time when fork returned and n the delay until the pid was communicated to the process desiring to use it. There's no way this process can determine if the pid is still used by the other process it wanted to send a signal to by the time it tries to use it, be it roundabout via proc or directly via kill.

Toward race-free process signaling

Posted Dec 6, 2018 20:11 UTC (Thu) by nopsled (guest, #129072) [Link] (7 responses)

I think there are plans to introduce a new variant of clone that has CLONE_NEW or so as an additional flag that returns a process descriptor (since clone has already run out of flags).

Toward race-free process signaling

Posted Dec 6, 2018 20:28 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (6 responses)

The next clone, instead of just having a bigger flag parameter, really should, instead, look something like this:

    #define CLONE2_SHARE_VM (1<<0) /* VALUE is a bool: default to true */
    #define CLONE2_OUTPUT_FD /* VALUE is an int*: on success, clone2 writes a procfd FD */
    ...

    struct clone2_parameter {
        void* value; /* In, in-out, or out depending on NAME */
        int name; /* CLONE2_XXX; after the pointer for 32-bit ABI compat crap */
    };

    /* Make a new task (process or thread). The new process's properties are mumble.
       clone2_parameters override these defaults. If multiple clone2 parameter structures
       have the same NAME, the last one wins. Return child PID on success; on failure, -1 with errno set. */
    int clone2(const struct clone2_parameter* parameters, int nparameters);

This way, we can create an interface as rich as we want, in an extensible way, without ever again having to make a new god damn process creation system call. It'd also let us implement a posix_spawn directly, without vfork.

Toward race-free process signaling

Posted Dec 8, 2018 5:51 UTC (Sat) by dw (subscriber, #12017) [Link] (5 responses)

I've been working on a deep dive into the history of process management on UNIX, and noticed all these kernel implementations of posix_spawn() popping up (NetBSD, OS X, Solaris). While their rationale is solid (avoiding huge pointless serializing VM copies), directly implementing posix_spawn() looks like a potential mistake, due to the perpetual incompleteness of that interface.

This led to thoughts about a compromise, and a potentially reasonable answer: something like a combined fork+exec, where the VM is not preserved and inherited FDs are explicitly specified. The target of the exec is a helper binary to implement the desired post-fork behavior. Configuration of the helper would be done e.g. over a pipe or a memfd.

If a new clone() variant is under discussion as described here, another option might be to address the VM problem by allowing the variant to specify memory regions preserved in the child. posix_spawn() only requires access to existing file descriptors and say, one page describing the new child configuration.

The point of both options is to keep that huge inflexible interface in userspace, and some system call that still provides all the performance benefit, for applications where posix_spawn() doesn't go far enough (e.g. changing security credentials or similar)

Toward race-free process signaling

Posted Dec 8, 2018 5:55 UTC (Sat) by dw (subscriber, #12017) [Link]

Forgot to mention, the biggest motivation for a userspace spawn may be the sheer size of an equivalent kernel implementation ;) https://github.com/SilkieDragon/xnu/blob/0a798f6738bc1db0...

Toward race-free process signaling

Posted Dec 8, 2018 6:00 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

I thought that one of the better ideas is to make all the relevant process-management functionality accept explicit process file descriptors. And then the final piece - ability to create a process in a "suspended" state.

Intermediary executables are just a crutch.

Toward race-free process signaling

Posted Dec 8, 2018 6:15 UTC (Sat) by dw (subscriber, #12017) [Link] (2 responses)

The trouble is that there is an almost unlimited set of fork() use cases that can't be addressed by posix_spawn(), and no sensible finite set of APIs can ever fully describe what was already possible post-fork. As a contrived example, creating a connected UNIX client socket attached to the new process stdio, where peercred accurately reflects the connecting process is the new process.

Toward race-free process signaling

Posted Dec 8, 2018 9:03 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> As a contrived example, creating a connected UNIX client socket attached to the new process stdio, where peercred accurately reflects the connecting process is the new process.
You still will be able to dup() sockets into the new process. I'm not sure how you would forge the peercred, though.

Toward race-free process signaling

Posted Dec 8, 2018 15:38 UTC (Sat) by dw (subscriber, #12017) [Link]

The question is not whether one contrived example has an incomplete solution in the proposed scheme, but whether even a reasonable subset of use cases could be catered for even when approaching something like 100% churn in the interface, implementation, locking requirements and security model for almost all existing system calls, without forcing users to resort to e.g. vfork or debugging APIs (which they may lack privilege to use) to fill the remaining gaps.

We don't need to answer it for ourselves, Windows had process handles since prehistory, where you find endless hacks like this to cope with their process creation API remaining incomplete after almost 30 years.

Toward race-free process signaling

Posted Dec 6, 2018 20:13 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (3 responses)

And that's why we'll eventually need a variant of clone(2) that returns a file descriptor. That's out of scope for the present patch, though.

Toward race-free process signaling

Posted Dec 6, 2018 20:47 UTC (Thu) by josh (subscriber, #17465) [Link] (2 responses)

I wrote CLONE_FD a while ago; if anyone wants to pick it up and run with it, the only open issue remaining from the discussion at the time was that people wanted a more detailed ptrace test suite to make sure nothing in the intricate interaction between ptrace, reparenting, and child processes broke.

Toward race-free process signaling

Posted Dec 9, 2018 1:01 UTC (Sun) by brauner (subscriber, #109349) [Link] (1 responses)

Fwiw, if this patch lands I intend to pick yours up and would very much like your input.

Toward race-free process signaling

Posted Dec 9, 2018 12:50 UTC (Sun) by josh (subscriber, #17465) [Link]

Great! Happy to help.

Toward race-free process signaling

Posted Dec 6, 2018 20:15 UTC (Thu) by jspenguin (guest, #120333) [Link] (8 responses)

But with fork, there's no race condition, because the process doing the signaling is the direct parent. Even if the forked child exits immediately, before fork even returns in the parent, the process will remain in the table as a zombie. As long as the parent process signals the child before calling wait, it is impossible for that pid to refer to any other process. Signaling a zombie has no effect, even SIGKILL.

Toward race-free process signaling

Posted Dec 6, 2018 20:42 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (6 responses)

There's still a race, even with direct children, when multiple components in a process don't coordinate with each other and each want to run child processes for their own purpose. For example, the Java VM basically has a constant waitpid(-1) running; if some shared library in that process makes a subprocess, and that subprocess dies, then Java's waitpid might reap it before the shared library does, making the shared library's process manipulation malfunction in various exciting ways.
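
A small sketch of that failure mode, with both "components" squeezed into one function for brevity: `reap_any_child()` stands in for the runtime's waitpid(-1) loop, and the rest plays the part of the library that created the child. The names are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a runtime that reaps whichever child exits next. */
static pid_t reap_any_child(void)
{
    return waitpid(-1, NULL, 0);
}

/* Returns the errno the "library" sees when it waits for its own child
 * after the reaper got there first (expected: ECHILD). Returns 0 if the
 * library's wait unexpectedly succeeds, -1 if fork() fails. */
int library_wait_after_reaper(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(0);

    /* The other component wins the race and collects the exit status. */
    (void)reap_any_child();

    /* The library now tries to wait for the child it created. */
    int status;
    if (waitpid(child, &status, 0) == child)
        return 0;
    return errno;
}
```

Once the child has been reaped behind the library's back, its PID is free for reuse, so any later kill() by the library on that number is exposed to exactly the race described in the article.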

One of my long-term goals is to make it possible for unrelated components that happen to share a process to easily manage their own sets of children.

Toward race-free process signaling

Posted Dec 6, 2018 21:01 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (4 responses)

In principle, this can be solved with an intermediate process acting as relay for "a component" desiring to run controlled processes in face of a 'hostile' VM interfering with that: It would create and destroy more processes on behalf of "the component" and would escape the VM-autoreaper by not terminating.

Toward race-free process signaling

Posted Dec 7, 2018 20:55 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

OTOH, this ("unrelated components sharing a process") is pretty much a travesty: As these components share an address space, they are related, regardless if they want this or not, and have to cooperate with each other. If they're really unrelated, they ought to run in different processes.

There's little point in coding against basic system design choices like this one (just because other systems are or were based on different design choices).

Toward race-free process signaling

Posted Dec 8, 2018 21:46 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (2 responses)

It's perfectly reasonable for a process to host components that are developed separately and that aren't aware of each other. That ship sailed a long time ago.

Toward race-free process signaling

Posted Dec 8, 2018 22:42 UTC (Sat) by smurf (subscriber, #17840) [Link]

Today, the best way to wait on N child processes in that kind of environment is to spawn N threads, each of which calls waitpid(). At least with a clone2-returning-a-process-FD you can reduce that to a single, race-free poll().
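
As a rough sketch of the single-poll() idea, the code below uses pidfd_open(), a system call from later kernels (Linux 5.3 and newer), as a stand-in for the hypothetical clone2 that returns a descriptor. The syscall number fallback, the helper name and the MAX_CHILDREN limit are assumptions made for this example. On kernels without pidfd_open() the function reports -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: number used on most architectures */
#endif

#define MAX_CHILDREN 16

/* Fork n short-lived children and wait for all of them with poll().
 * Returns the number of exits observed, or -1 if process descriptors
 * are unavailable or fork()/poll() fails. Children are always reaped. */
int wait_children_with_poll(int n)
{
    assert(n > 0 && n <= MAX_CHILDREN);

    struct pollfd fds[MAX_CHILDREN];
    pid_t pids[MAX_CHILDREN];
    int started = 0, seen = 0, ok = 1;

    for (int i = 0; i < n && ok; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            ok = 0;
            break;
        }
        if (pid == 0)
            _exit(0);

        pids[started] = pid;
        /* An unreaped direct child cannot have its PID recycled, so
         * opening a descriptor by PID is safe at this point. */
        fds[started].fd = (int)syscall(SYS_pidfd_open, pid, 0);
        fds[started].events = POLLIN;
        fds[started].revents = 0;
        if (fds[started].fd < 0)
            ok = 0;
        started++;
    }

    while (ok && seen < started) {
        if (poll(fds, (nfds_t)started, -1) < 0) {
            if (errno == EINTR)
                continue;
            ok = 0;
            break;
        }
        for (int i = 0; i < started; i++) {
            if (fds[i].fd >= 0 && (fds[i].revents & POLLIN)) {
                close(fds[i].fd);
                fds[i].fd = -1; /* poll() ignores negative descriptors */
                seen++;
            }
        }
    }

    for (int i = 0; i < started; i++) {
        if (fds[i].fd >= 0)
            close(fds[i].fd);
        waitpid(pids[i], NULL, 0);
    }
    return ok ? seen : -1;
}
```

One thread and one poll() call cover every child, and the same loop could watch sockets or pipes alongside the process descriptors.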

Toward race-free process signaling

Posted Dec 9, 2018 13:20 UTC (Sun) by rweikusat2 (subscriber, #117920) [Link]

"It's commonly done" is not the same as "it's perfectly reasonable". That's what I already wrote: The "hosted components" are not separate and have to be aware of each other if they're running in a shared address space/ single process. In the given context, a process can manage its children which includes reliably signalling them but multiple "components" sharing a process can't separate "their children" from each other. If this has to be done, a number of (possibly cooperating) processes have to be used. A process is the fundamental system abstraction for separating independently executing entities from each other. Hence, if such separation is desired, it's sensible to use processes for that.

One could argue backwards and forwards here whether or not this design choice is or was sensible but "that ship sailed a long time ago" and it's better to put the available facilities to (some) good use than to keep fighting them. Instead of bemoaning fork (an undercurrent in this conversation), one can try to find uses for it (and there are plenty -- I use fork much more often than exec).

Eg, some program I have been writing in the past (idea more recently reused in a different context) had to do "HTTP requests" to interact with some set of servers. In the 'modern' internet, where TCP connections can suddenly turn into black holes for all kinds of weird reasons (aka random "session timeouts" enforced by firewalls), this obviously needs some sort of timeout. An extremely simple way to implement this reliably is to fork a process tasked with doing the HTTP interaction synchronously (easily possible because the forked process runs a copy of the forking program) and let the parent enforce the timeout. Or use alarm to arrange for a SIGALRM to be delivered to the forked process after some time. The parent can then just wait until the child terminated, determine its fate via exit status and possibly retry the request.
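
A minimal sketch of the alarm variant might look like this. The function names are invented for illustration, and `quick_request()` and `stuck_request()` are placeholders for the real HTTP interaction, which is not shown.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run work() in a forked child that is terminated by SIGALRM after
 * timeout_s seconds. Returns the child's wait status, or -1 on error. */
int run_with_timeout(int (*work)(void), unsigned timeout_s)
{
    assert(work != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Make sure SIGALRM has its default action: terminate. */
        signal(SIGALRM, SIG_DFL);
        alarm(timeout_s);
        _exit(work() & 0xff);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}

/* Placeholder for a request that completes at once. */
int quick_request(void)
{
    return 0;
}

/* Placeholder for a connection that has turned into a black hole. */
int stuck_request(void)
{
    for (;;)
        pause();
    return 1;
}
```

The parent can tell the two outcomes apart from the status: WIFEXITED() for a request that finished on its own, WIFSIGNALED() with WTERMSIG() equal to SIGALRM for one that timed out and may be retried.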

Toward race-free process signaling

Posted Dec 7, 2018 14:27 UTC (Fri) by kjp (guest, #39639) [Link]

> One of my long-term goals is to make it possible for unrelated components that happen to share a process to easily manage their own sets of children.

Something like the windows Job API? https://docs.microsoft.com/en-us/windows/desktop/procthre...

Toward race-free process signaling

Posted Dec 6, 2018 20:45 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link]

That's obviously correct but there's no need for a new mechanism to send signals in this case, either. The parent can just use kill.

Toward race-free process signaling

Posted Dec 6, 2018 19:49 UTC (Thu) by nopsled (guest, #129072) [Link] (7 responses)

Hm, Daniel Colascione's approach as I understand it was to preserve the PID and *prevent* reuse, but anyway, the new approach seems fine to me. Except one thing.

Why does taskfd_send_signal() have to do permission checks like kill does? Isn't that very much NOT how Unix works, that is, shouldn't they be done once at open() time and not when sending signals? I can, for instance, drop privileges after getting a process handle and still send signals as if I had the same privileges back then, or pass it to some other process (think of scenarios where service managers delegate killing to a different supervisor for some unit, which has its own specialized needs).

The patchset even describes it being in the same vein as the write() system call, in a loose sense:

> Signaling processes through taskfds is the equivalent of writing to a file.

I can already see why, privilege elevation (and the curse i.e. setuid), but, it is then appropriate to invalidate the file descriptor completely for anyone but root. That should not stop userspace from having coherent and proper semantics, that do not drift away from the general pattern followed elsewhere in the kernel.

FreeBSD already committed this same mistake in their pdkill interface. I hope it is not repeated here.

Toward race-free process signaling

Posted Dec 6, 2018 20:19 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (5 responses)

See https://lore.kernel.org/lkml/CAKOZuetfqvn1uVqKJ=16iEzG4g4...

The short version: ordinarily, I'd agree with you, but processes are weird. They can change their credentials (and thus signal-ability) over their lifetimes without changing their identity. IMHO, this is a fundamental misdesign in the Unix process model, but we're stuck with it now. Consequently, we need to separate objects that refer to process *identity* from objects that give object-holders the capability to act on processes, and invalidate the latter on process credential change.

In the message above, I described how one would go about implementing this sort of interface. It would *work*, but from a pragmatic perspective, I don't see a need for this open-based capability check infrastructure, so I favor a simple direct system call instead. This capability FD approach could be implemented in the future as an extension to the current API. We're having a hard enough time getting this relatively small patch into the tree. I don't want to complicate it further.

Toward race-free process signaling

Posted Dec 6, 2018 20:48 UTC (Thu) by nopsled (guest, #129072) [Link] (4 responses)

> Consequently, we need to separate objects that refer to process *identity* from objects that give object-holders the capability to act on processes, and invalidate the latter on process credential change.

I do agree, but it appears to me that the wrong step has been taken already, and the case of extending this particular process descriptor API to a capability based model is already very weak. It would be better if this handle was not obtained through /proc/PID (because a process handle now means many things as far as capabilities one can exercise over it, and now needs another layer on top through flags in yet another system call(s) to further granularise access to what it offers), but something like /proc/PID/kill, so it only represents the capability to be able to send signals.

My 2 cents: How this API and anything else on top shapes (there is more stuff planned for inclusion, from what I gather) will very much depend on what the notion of this process handle represents, and if nothing like what we're talking of is in the minds of people pushing for this change, I find it difficult to understand if it will ever happen. If anything, system calls need very good reasons to get included (we already have too many), and if the current API proves to be good enough for the immediate use cases people have, it would never happen, really.

It is also worthwhile to note where the inspiration for process descriptors (FreeBSD's pdfork and friends), the original CLONE_FD patchset (porting it from David Drysdale's Capsicum for Linux patchset), and others, actually comes from (Capsicum), and why I am emphasizing to treat them as capabilities (and why I called FreeBSD's pdkill a mistake).

Toward race-free process signaling

Posted Dec 6, 2018 20:58 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

Perfect is the enemy of the good. I really don't want a good and useful feature to get bogged down in unrelated discussions of capsicum and various BSD APIs. What you're proposing can be added in the future and has no immediate use case, so adding it at this point would just endanger the whole feature.

Toward race-free process signaling

Posted Dec 6, 2018 22:05 UTC (Thu) by luto (guest, #39314) [Link] (2 responses)

I suggested roughly what you’re suggesting, and it got shot down. The problem is that a capability to manipulate a process that survives execve is problematic without NO_NEW_PRIVS.

Toward race-free process signaling

Posted Dec 6, 2018 22:37 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (1 responses)

Can you elaborate? It's not immediately clear to me how NO_NEW_PRIVS matters.

Toward race-free process signaling

Posted Dec 15, 2018 20:09 UTC (Sat) by droundy (subscriber, #4559) [Link]

I presume the idea is that I fork and get a kill fd. Then I execve a suid executable and now I can send signals to a root process, unless something were done to prevent the suid bit from having effect, or that would invalidate my kill fd.

Re: FreeBSD already committed this same mistake in their pdkill interface

Posted Dec 6, 2018 21:01 UTC (Thu) by kmeyer (subscriber, #50720) [Link]

Can you elaborate on what you think the mistake is?

Toward race-free process signaling

Posted Dec 6, 2018 21:29 UTC (Thu) by kjp (guest, #39639) [Link] (40 responses)

So pardon my ignorant question but...

Why can't an extra UUID be added to all processes, and a new kill_with_uuid(PID, UUID) syscall handle disambiguation?

Toward race-free process signaling

Posted Dec 6, 2018 21:35 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

UUIDs are ugly

Toward race-free process signaling

Posted Dec 6, 2018 21:57 UTC (Thu) by kjp (guest, #39639) [Link] (30 responses)

After reading the article more, the questions are:

1) Should the process uniqueness token be "internal" (a file descriptor that must never be closed/lost) or "external" (some opaque string/uuid that can be used by any program, like a pid is)
2) Should the kernel or userspace generate it, and how/when.

Typically if I start a transaction I care about, I assign my own client-generated uniqueness/idempotency token and send that to the server. So that would loosely translate into

    child_token = generate_my_uniqueness_token_string()
    child_pid = fork_with_token(..., child_token)
    ...
    kill(child_pid, child_token)

And there are no races. Assume that any UID is responsible for not colliding its own child_token strings with itself.

Toward race-free process signaling

Posted Dec 6, 2018 22:07 UTC (Thu) by kjp (guest, #39639) [Link] (28 responses)

As a further thought experiment - this reminds me of when AWS finally added the ability to set resource tags when you create EC2 instances. Previously, there was a race - you'd create instances first, get their ids, then tag them. Bad. Now you can just pass the tag(s) you want into the RunInstances call.

So is that the goal? Letting a unix user tag their child processes safely, so they can identify them later? Seems very useful, and should be orthogonal to "security" issues.
And to be clear, you can first generate that tag and write it to disk, and then create the child process. So even if you crash, your child can be cleaned up safely by some other monitoring process.

Toward race-free process signaling

Posted Dec 6, 2018 23:52 UTC (Thu) by dw (subscriber, #12017) [Link] (27 responses)

After reading your comment it's hard to take the other proposals seriously, they're ridiculously over-engineered. A kill2() that accepted an opaque, randomly generated per-process cookie stashed in the task structure that could be extracted somehow would be vastly simpler to implement and for users to understand.

Of course, having a 'process file descriptor' is overall much more generic, and perhaps those designs have aspirations for extending the functionality later, but this does not seem worth probably yet another 8kb of .text, possibly only in the name of being 'UNIXey'

Toward race-free process signaling

Posted Dec 7, 2018 7:42 UTC (Fri) by epa (subscriber, #39769) [Link] (18 responses)

So you are saying that we supplement the current process id (limited to 32767) with a 64-bit or 128-bit value that is unique for the lifetime of the system (until a reboot)?

Then all the existing system calls taking a process id get a ‘2’ version taking the longpid instead. All other semantics stay the same.

That does seem a much better way to address the issue (and perhaps others besides, eg pid namespaces for containers would no longer be necessary once user space migrates to the new API).

Toward race-free process signaling

Posted Dec 7, 2018 8:20 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (17 responses)

This won't work. Or more precisely, it'll have all the same drawbacks.

Consider the current use-case:
- list processes
- get process pid
- kill process by pid

The new use-case will be:
- list processes
- get process pid
- get long pid by pid
- kill process by long pid

The race condition is still there. You'll need to fix all the APIs to use long pids in the first place.

Toward race-free process signaling

Posted Dec 7, 2018 9:02 UTC (Fri) by epa (subscriber, #39769) [Link] (16 responses)

Indeed, every system call will need a version that returns a long pid. So the new fork() will return the long pid directly, and so on. There is no need for a separate and race-prone lookup from short pid to long pid (which is not a 1-1 mapping anyway).

Toward race-free process signaling

Posted Dec 7, 2018 9:05 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (15 responses)

But at this point it makes sense to just use file descriptors instead of long pids. File descriptors are way better for many reasons - they can be securely sent over Unix sockets, they can be inherited by subprocesses and so on.
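
For readers unfamiliar with descriptor passing, here is a compact sketch of the standard SCM_RIGHTS mechanism for sending a file descriptor over a Unix socket. It is generic and works for any descriptor; nothing in it is specific to process descriptors, and the function names are made up for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Buffer for one control message carrying one descriptor, aligned as
 * required for struct cmsghdr. */
union fd_control {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd over the Unix socket sock. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    assert(fd >= 0);

    char byte = 0; /* at least one byte of ordinary data must be sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_control ctl;
    struct msghdr msg;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor from sock. Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_control ctl;
    struct msghdr msg;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets its own descriptor number for the same open file, which is what would let a supervisor hand a process handle to a helper without any PID lookup on the receiving side.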

Toward race-free process signaling

Posted Dec 7, 2018 11:22 UTC (Fri) by epa (subscriber, #39769) [Link] (6 responses)

I guess the only thing you can't do with a file descriptor is type it in on the command line. So shell scripts, etc, would still be prone to race conditions.

(A numeric 64-bit pid can be sent over sockets and told to a subprocess, of course.)

Toward race-free process signaling

Posted Dec 7, 2018 14:01 UTC (Fri) by ebiederm (subscriber, #35028) [Link] (3 responses)

A 64bit pid is long enough you can't reliably type it on the command line, even 32bits are a problem.

This is part of the reason why pids which have a 32bit type are limited to 16bits by default.

Toward race-free process signaling

Posted Dec 7, 2018 17:18 UTC (Fri) by smurf (subscriber, #17840) [Link] (1 responses)

These days, that's the only reason to do this. People not running 10-year-old code are unlikely to be affected by >16-bit PIDs.

I habitually set maxpid to 99999. Anything unlikely to run >1000 processes, like Raspberry Pis, get 9999.

Toward race-free process signaling

Posted Dec 7, 2018 17:53 UTC (Fri) by zdzichu (subscriber, #17118) [Link]

I think the bigger maxpid, the better – safer. Short pids encourage manual typing, which is error-prone. Big pids kinda force copy-pasting, which is safer (modulo pid reuse).

Toward race-free process signaling

Posted Dec 7, 2018 19:46 UTC (Fri) by epa (subscriber, #39769) [Link]

Sorry for being unclear. I didn’t mean literally typing in the number (I would cut and paste anyway). I was illustrating the general point that a process id is just a number, with no special magic, and can be handled by any programming language including shell scripts. It can be saved to a file, passed on the command line, even sent over TCP/IP if necessary.

Existing code which works with 15-bit process ids could normally work on 64-bit ones with no change, or at most a change of type from int to long in strongly typed languages. File descriptors are great, but they form their own closed world and need a new set of APIs. They cannot just be treated as an opaque number or a string of text.

Toward race-free process signaling

Posted May 6, 2019 3:02 UTC (Mon) by cyphar (subscriber, #110703) [Link] (1 responses)

In many cases, /proc/self/fd/... is a neat way to "type an fd on the command-line".

Toward race-free process signaling

Posted May 6, 2019 12:22 UTC (Mon) by smurf (subscriber, #17840) [Link]

Your favorite shell's autocomplete mechanism should be able to understand PIDs too.

It's still somewhat dangerous to actually use that, though. The probability that mistyping the first four digits and pressing TAB gives you an entirely unrelated process shouldn't be underestimated.

Toward race-free process signaling

Posted Dec 7, 2018 16:03 UTC (Fri) by dw (subscriber, #12017) [Link] (7 responses)

My understanding is that this is an attempt to fix an edge case in code that does not keep track of its own children correctly. The problem is one of:

1) Child exits, crap parent kills unrelated process because it wasn't paying attention
2) /etc/init.d/postfix stop, crap init script kills unrelated process due to stale PID file.

No solution presented thus far actually solves case 1), the old API will continue to exist in perpetuity, and any new API will always only see limited uptake, due to portability or simple lack of effort to port everything over. There is a limit to the value in any solution, because it is unlikely to see revolutionary uptake. A simple solution therefore seems preferable.

The file descriptor solution does not meaningfully solve case 2): there is still a race for the init script to open /proc/blah/pid and somehow verify that the descriptor it received matches the daemon it is trying to kill, so some "is this really the process I want?" code is still necessary.

The FD solution creates a world of security pain that doesn't match the typical UNIX files model, because the kernel object in question can change its security identity over time.

The cookie-based solution does not entail updating every single API: the original problem is only about signal delivery, and thus only affects kill() and possibly clone().

A cookie-based solution allows the identifier to persist on disk easily. Consider two new system calls:

- pid_to_handle(pid_t pid, struct pid_handle *handle) -- accepts pid==0 or pid==child pid. In the 0 case, PID of current process returned. In remaining case, return -1 if PID is not a child of the current process.

- kill_by_handle(pid_t, struct pid_handle *); -- works identically to kill(), except handle must match. No other restriction placed on caller.

After calling clone(), pid_to_handle() is used by the parent prior to waitpid() to retrieve the handle. For daemonizing processes, it must be the child invoking it on itself as any handle the parent could receive would be for the intermediary daemonizing process that almost immediately died.
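The semantics of these two proposed calls can be modelled in a few lines. This is a hedged sketch, not an implementation: `CookieKernel` is a made-up stand-in for the kernel, the handle is assumed to be a (pid, cookie) pair, and a simple counter stands in for whatever unguessable cookie a real design would use.

```python
from itertools import count


class CookieKernel:
    """Toy model of the proposed pid_to_handle()/kill_by_handle() pair."""

    def __init__(self):
        self._cookies = count(1)   # assumption: a fresh cookie per process creation
        self.procs = {}            # pid -> (parent pid, cookie, name)

    def clone(self, parent, pid, name):
        # The caller picks the PID here so the example can force PID reuse.
        self.procs[pid] = (parent, next(self._cookies), name)
        return pid

    def exit(self, pid):
        del self.procs[pid]

    def pid_to_handle(self, caller, pid):
        # pid == 0 means "myself"; otherwise pid must be a child of the caller.
        target = caller if pid == 0 else pid
        entry = self.procs.get(target)
        if entry is None or (pid != 0 and entry[0] != caller):
            return None            # stands in for the -1 error return
        return (target, entry[1])

    def kill_by_handle(self, pid, handle):
        # Like kill(), except that the handle must match the live process.
        entry = self.procs.get(pid)
        if entry is None or handle != (pid, entry[1]):
            return False           # stale handle: the PID was recycled or is gone
        del self.procs[pid]
        return True
```

In this model a handle saved to disk stays harmless after the PID is reused: `kill_by_handle()` refuses it because the cookie no longer matches the process that now owns the PID.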

Toward race-free process signaling

Posted Dec 7, 2018 17:30 UTC (Fri) by smurf (subscriber, #17840) [Link] (6 responses)

> A cookie-based solution allows the identifier to persist on disk easily.

A pid-plus-verifiable-identifier approach solves this problem just as well.

> /etc/init.d/postfix stop, crap init script kills unrelated process due to stale PID file.

This is why sane init systems tend to not use PID files.

> After calling clone(), pid_to_handle() is used by the parent prior to waitpid() to retrieve the handle.

This entails a race. You should not be required to assume that your thread is the only one calling waitpid(-1). clone2() needs to return the handle atomically.

Toward race-free process signaling

Posted Dec 7, 2018 17:37 UTC (Fri) by dw (subscriber, #12017) [Link]

> This entails a race. You should not be required to assume that your thread is the only one calling waitpid(-1). clone2() needs to return the handle atomically.

That's a fair point, but multi-threaded software with competing threads calling waitpid(-1) is no less buggy IMHO than software with competing threads, say, closing random file descriptors, or creating new ones without CLOEXEC -- the problem is simply moved. It's just one of many single-thread-centric interfaces an MT app must give up. And particularly, it is a class of problem that is not fixed by modifying clone() -- an MT app exhibiting this behaviour has bigger problems than race-free child signalling.

Toward race-free process signaling

Posted Dec 8, 2018 2:12 UTC (Sat) by wahern (subscriber, #37304) [Link] (4 responses)

> > /etc/init.d/postfix stop, crap init script kills unrelated process due to stale PID file.
>
> This is why sane init systems tend to not use PID files.

If the service takes a POSIX lock on the PID file (rather than writing it out), the PID can be queried atomically. You can't *use* it atomically, but that's because the only way to atomically send a signal to an individual process is if you're the parent and aren't using SA_NOCLDWAIT.

If the child disassociates from the service manager then you either need to rely on process groups or cgroups. While process groups are atomic (a beneficial inheritance from legacy TTY and batch job management), the cgroups approach still involves reading PIDs from a file, which has the same TOCTTOU race.

Basically, on Linux I think it's still impossible to write a service manager that isn't susceptible to the classic PID file race while also being able to accurately signal individual wayward processes. (And to be fair, I don't think it's possible on any other Unix-like system, at least not using published and supported interfaces.) You could use cgroups and PID namespaces to minimize collateral damage, but it's still fundamentally a hack. You could use a seccomp policy to prevent disassociation from the process group, but you still couldn't target *individual* processes in the group.

To safely signal individual processes there's really no substitute for process descriptors. A larger PID namespace that doesn't recycle PIDs isn't any better, even as an expediency. In both cases you still need to add a bevy of new syscalls and additional bookkeeping in the kernel. While PIDs may seem easier to use from the shell, the shell is perfectly capable of juggling and passing around descriptors (e.g. exec 8</proc/PID). The necessary bookkeeping in the kernel isn't less for wider PIDs because, like with the shell, all the infrastructure for descriptors exists and is easily applied. The benefit of descriptors, however, is that it gives processes a handle to query process state, like exit status, as well as a channel for reliable delivery of lifetime events (e.g. fork) so that a service manager could manage process trees in a straight-forward, race-free manner. That may not happen immediately, but if you're going to add new syscalls, why pick the dead-end solution?

Toward race-free process signaling

Posted Dec 8, 2018 2:41 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> Basically, on Linux I think it's still impossible to write a service manager that isn't susceptible to the classic PID file race while also being able to accurately signal individual wayward processes.
You can do that with cgroups, but it does require some trickery:
- Put a process in cgroup.
- SIGSTOP it.
- Inspect the cgroup to make sure the process is still the correct one.
- Send the signal.
- SIGCONT it.

Toward race-free process signaling

Posted Dec 8, 2018 2:43 UTC (Sat) by dw (subscriber, #12017) [Link]

If you're willing to risk sending SIGSTOP to a random process, as done here, there is no value to cgroups or indeed any API change whatsoever.

Toward race-free process signaling

Posted Dec 8, 2018 8:39 UTC (Sat) by nopsled (guest, #129072) [Link] (1 responses)

No need to SIGSTOP or anything else, just use the freezer (which is coming for v2, patches have already been posted).

Toward race-free process signaling

Posted Dec 8, 2018 9:02 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

The last time I tried that (3-4 years ago) it resulted in unrecoverable system lockups. So I kinda hesitate to recommend it.

Toward race-free process signaling

Posted Dec 8, 2018 1:59 UTC (Sat) by kmeyer (subscriber, #50720) [Link] (7 responses)

If you just want to track and kill your child processes, FreeBSD's pdfork() and pdkill() do exactly that. The handle is an opaque unique integer (i.e., a file descriptor).

Toward race-free process signaling

Posted Dec 9, 2018 19:22 UTC (Sun) by nybble41 (subscriber, #55106) [Link] (6 responses)

> The handle is an opaque unique integer (i.e., a file descriptor).

File descriptors are only integers within the context of a single process; the integer is meaningless without the process's descriptor table. For tracking one's own child processes that works OK, but it makes it difficult to save the identifier to a file or send it to an unrelated process, which are both desirable use cases.

Personally, I like the suggestion to switch to monotonically increasing 64-bit process IDs, but with the constraint that at any time the least significant 32 bits of any new process ID must be unique and range from 1 to pid_max. Just skip any PIDs which overlap or would be out of range. Make the 64-bit PIDs start at 2**32 so that they can be distinguished from traditional PIDs, and have system calls accept both the full 64-bit PID or just the least significant 32 bits. The effect would be that you can still refer to processes exactly as you do now, or in a race-free way using larger integer IDs, using the same system calls. (This is basically a cross between "don't reuse PIDs" and "tag processes with GUIDs as well as traditional PIDs").
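The allocation rule described above can be sketched as follows. The class and method names are hypothetical, and the sketch only models the numbering scheme (monotonic 64-bit IDs starting at 2**32, low 32 bits unique among live processes and within 1..pid_max), not any actual kernel data structure.

```python
MASK32 = 0xFFFFFFFF


class LongPidAllocator:
    """Toy allocator for 64-bit PIDs whose low 32 bits act as the legacy PID."""

    def __init__(self, pid_max=32768):
        self.pid_max = pid_max
        self.next_id = 2**32      # long PIDs start above the traditional range
        self.live = {}            # low 32 bits -> full 64-bit PID

    def allocate(self):
        if len(self.live) >= self.pid_max:
            raise RuntimeError("no free PIDs")
        while True:
            low = self.next_id & MASK32
            if low == 0 or low > self.pid_max:
                # Out of range: skip ahead to the next value whose low bits are 1.
                high = self.next_id >> 32
                if low != 0:
                    high += 1
                self.next_id = (high << 32) | 1
                continue
            candidate = self.next_id
            self.next_id += 1
            if low not in self.live:   # skip IDs that overlap a live legacy PID
                self.live[low] = candidate
                return candidate

    def release(self, pid):
        del self.live[pid & MASK32]

    def resolve(self, pid):
        # Accept either a legacy PID (racy) or a full 64-bit PID (exact match).
        if pid < 2**32:
            return self.live.get(pid)
        full = self.live.get(pid & MASK32)
        return full if full == pid else None
```

With `pid_max=3`, allocating three processes, releasing the second and allocating again yields a new ID whose low bits are 2 again but whose full value is larger than any earlier one, so the old 64-bit ID no longer resolves while the legacy PID 2 does.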

Toward race-free process signaling

Posted Dec 10, 2018 21:25 UTC (Mon) by kmeyer (subscriber, #50720) [Link] (5 responses)

> File descriptors are only integers within the context of a single process;

Maybe you meant "unique" rather than "integers?"

> it makes it difficult to save the identifier to a file or send it to an unrelated process

Huh? E.g., http://poincare.matf.bg.ac.rs/~ivana/courses/ps/sistemi_k...

> Personally, I like the suggestion to switch to monotonically increasing 64-bit process IDs, but with the constraint that at any time the least significant 32 bits of any new process ID must be unique and range from 1 to pid_max.

Yeah, that is a pretty good proposal given the constraints Linux faces around not breaking the kernel ABI at all for any binary that exists today.

Toward race-free process signaling

Posted Dec 12, 2018 4:30 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (4 responses)

>> File descriptors are only integers within the context of a single process;
> Maybe you meant "unique" rather than "integers?"

That would also work, but no, I meant "integers". I suppose some would say that the file descriptor *is* the integer, and representationally that's often true, but to me the term refers conceptually to an active entry in the file descriptor table (a particular file/pipe/socket/device, set of flags, position, etc.), and the integer is just the index into the table. (Consequently, all file descriptors have unique integer identifiers within a process, but not all integers are file descriptors.) A process only has one descriptor table, so there is an isomorphism between the process's file descriptor table entries and their integer indices. Outside the process, however, you need some other way to identify the state that a file descriptor stands for; the index alone, without context, isn't enough.

>> it makes it difficult to save the identifier to a file or send it to an unrelated process
> Huh? [link to "Passing File Descriptors over UNIX Domain Sockets"]

I'm aware of FD passing over UNIX domain sockets, but that's only a partial solution. How exactly does one pass a file descriptor to or from a shell script using Unix-domain sockets, for example? Even from C it's more complex than just sending an identifier over *any* available communications channel. It also doesn't address the use case of serializing the identifier to a file, where there may not be any process running to hold the file descriptor open.

Toward race-free process signaling

Posted Dec 12, 2018 5:43 UTC (Wed) by kmeyer (subscriber, #50720) [Link] (3 responses)

> That would also work, but no, I meant "integers". I suppose some would say that the file descriptor *is* the integer, and representationally that's often true, but to me the term refers conceptually to an active entry in the file descriptor table (a particular file/pipe/socket/device, set of flags, position, etc.), and the integer is just the index into the table.

Ok — this might just be a terminology difference between FreeBSD and Linux. In FreeBSD, file descriptors are definitely integers. They index a per-process table of 'struct filedescent's named the 'fdescenttbl.' Each filedescent has a 'struct file' pointer, 'struct filecaps', flags, and a sequence number associated with it.

The 'struct file' tracks things like 'f_type' (DTYPE_VNODE for regular files, DTYPE_SOCKET, DTYPE_PIPE, DTYPE_KQUEUE, etc); 'fileops' (file-level vtable of operations); file-associated credentials; associated 'struct vnode' (inode in Linux terminology), if any; the file offset (for operations like read(2) or write(2) that don't take an explicit offset); etc.

So you can understand my confusion — we would call what you're talking about a 'filedescent' or just 'file'.

> all file descriptors have unique integer identifiers within a process

Ok, definitely 'filedescent' in our terminology, rather than 'file' (which would be shared by a dup(2)ed fd).

> Outside the process, however, you need some other way to identify the state that a file descriptor stands for; the index alone, without context, isn't enough.

Sure. That state can be passed between processes using unix domain sockets and control messages, though. That's not as frictionless as copying and pasting some pid number between arbitrary processes, but it's not like the credential is locked to the original process exclusively.

> How exactly does one pass a file descriptor to or from a shell script using Unix-domain sockets, for example?

It could readily be done with an extension builtin, or a helper program. (Or something crazy like ctypes.sh.) Yeah, it's higher friction than just passing around some integer. On the flip side, it works without increasing pid space to 64 bits.

> It also doesn't address the use case of serializing the identifier to a file, where there may not be any process running to hold the file descriptor open.

I'm not sure it's reasonable to serialize a process handle of any kind to a file and expect it to be meaningful later. Say your file is on NFS, or backed up to a remote system. Or even a local filesystem, but the machine has been rebooted. I'm not overly concerned with not handling that case.

Thanks for the discussion, it's been interesting and given me some food for thought.

Toward race-free process signaling

Posted Dec 12, 2018 22:34 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (2 responses)

> In FreeBSD, file descriptors are definitely integers. They index a per-process table of 'struct filedescent's named the 'fdescenttbl.'

So to unpack the abbreviations a bit, you're saying that in FreeBSD, "file descriptors" are integers which identify "file descriptor entry" structures in a per-process "file descriptor entry table". I can see how that would be confusing.

I'm not sure it's a FreeBSD vs. Linux thing, but to me including "entry" in the name of the type for the elements of an array seems a bit awkward. I would just say "file descriptor table" for the array, and the elements of the array ("file descriptor table entries") would simply be referred to as "file descriptors". If the "file descriptor" is a member of the current process's "file descriptor table", as is usually the case, then it can be referred to simply by its index in the table (its "file descriptor number"). One might casually refer to this as "passing a file descriptor" when one is really just passing the index of the file descriptor, much like "passing an object" usually just means passing the address of the object.

(BTW, note the terminology in that document you linked to: "Passing File Descriptors over UNIX Domain Sockets". It is not the *integer* which is being passed over the socket—the receiving process will most likely get a different integer—but rather the object which the integer refers to.)
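That point is easy to observe from Python 3.9 or later, where `socket.send_fds()` and `socket.recv_fds()` wrap the SCM_RIGHTS control message. The sketch below keeps both ends of a socketpair in one process purely for brevity; `pass_fd_demo()` is an illustrative name, not anything from the linked document.

```python
import os
import socket
import tempfile


def pass_fd_demo():
    """Pass a descriptor over a Unix socketpair and compare both ends."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as f, left, right:
        sent = f.fileno()
        # At least one byte of ordinary data must accompany the descriptor.
        socket.send_fds(left, [b"x"], [sent])
        _data, fds, _flags, _addr = socket.recv_fds(right, 1, 1)
        received = fds[0]
        try:
            same_file = os.path.sameopenfile(sent, received)
            # Write through the original; the received descriptor sees the
            # new offset because both refer to one open file description.
            os.write(sent, b"hello")
            offset_seen = os.lseek(received, 0, os.SEEK_CUR)
        finally:
            os.close(received)
    return sent, received, same_file, offset_seen
```

The receiver ends up with a different integer that refers to the same underlying object, which is why writing the number itself into a file or a pipe would not be enough.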

>> How exactly does one pass a file descriptor to or from a shell script using Unix-domain sockets, for example?
> It could readily be done with an extension builtin, or a helper program.

It could be, but now we're talking about adding dependencies on external programs which (a) haven't been written yet and (b) won't necessarily be present on every system. In the end you could just rewrite any shell script in C, but I wouldn't consider that a realistic solution to the problem of "how to do X in a shell script". The object is to create a script which works with existing systems and commonly available tools.

> On the flip side, it works without increasing pid space to 64 bits.

Along those lines, the only change that's really needed is for open references to /proc/PID directories (or their contents) to block PID reuse. As others have pointed out, there are ways to tag processes via their environment variables—as long as they are willing to cooperate—so any process could (1) open the PID directory, (2) check for the matching environment variable, and (3) send the signal if the environment matches, knowing that the PID won't be reused since the directory is open. The kernel could improve on this somewhat by implementing pdfork() and assigning unforgeable identifiers when processes are created, but it's not really necessary just to prevent races for most use cases.

Toward race-free process signaling

Posted Dec 15, 2018 16:14 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

POSIX already has terms for these things. The things open(2) returns are 'file descriptions': the file description is where things like the open flags and file offset reside: the file descriptor is an integer pointer to a file description, which also has flags of its own (currently only O_CLOEXEC). dup() and fork() copy the descriptor, but leave it referring to the same description. close() closes a descriptor, and if no descriptors are left referencing its description, the description is also closed.
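A short experiment illustrates the distinction, assuming a Unix-like system. At the fcntl() level the per-descriptor flag is spelled FD_CLOEXEC (O_CLOEXEC is the open(2) flag that requests it), and Python sets it by default on descriptors it creates. `descriptor_vs_description()` is just an illustrative helper name.

```python
import fcntl
import os
import tempfile


def descriptor_vs_description():
    """Return ((dup offset, independent offset), (cloexec on fd, cloexec on dup))."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd = os.open(tmp.name, os.O_RDWR)
        dup_fd = os.dup(fd)                      # new descriptor, same description
        other_fd = os.open(tmp.name, os.O_RDWR)  # new descriptor, new description
        try:
            os.write(fd, b"hello")
            # The file offset lives in the description, so the dup sees it move.
            offsets = (
                os.lseek(dup_fd, 0, os.SEEK_CUR),
                os.lseek(other_fd, 0, os.SEEK_CUR),
            )
            # Descriptor flags are per descriptor: clear close-on-exec on the dup only.
            os.set_inheritable(dup_fd, True)
            cloexec = tuple(
                bool(fcntl.fcntl(d, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
                for d in (fd, dup_fd)
            )
        finally:
            for d in (fd, dup_fd, other_fd):
                os.close(d)
    return offsets, cloexec
```

The duplicated descriptor shares the offset with the original, the independently opened one does not, and changing the close-on-exec flag on the duplicate leaves the original untouched.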

Toward race-free process signaling

Posted Dec 18, 2018 16:35 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

> the file descriptor is an integer pointer to a file description, which also has flags of its own (currently only O_CLOEXEC).

This is interesting but it really just adds another layer of indirection: file descriptor number (integer) -> file descriptor (object with flags) -> file description (object with flags & file offset) -> file. It's still true that a file descriptor is not simply an integer, since integers don't have O_CLOEXEC flags.

Toward race-free process signaling

Posted Dec 7, 2018 16:36 UTC (Fri) by pj (subscriber, #4506) [Link]

My first thought was similar: Instead of having to hack on /proc open() semantics, put a uuid into the path of the file to send signals to:

echo 1 > /proc/1234/uuid/u-u-i-d-h-e-r-e/kill

So it's accessible, and instead of saving the PID to send a signal to, you save the path to the kill file. If you need/want to preserve the racy behavior, it's still possible to do

echo 1 > /proc/1234/uuid/*/kill

...though of course discouraged. Make it easier to Do The Right Thing, not harder.

Toward race-free process signaling

Posted Dec 6, 2018 22:29 UTC (Thu) by excors (subscriber, #95769) [Link] (7 responses)

Or maybe just 64-bit PIDs (incrementing as normal, but with no practical need to worry about recycling)? Instead of adding weird new syscalls that take process fds, add pid64_t variants of everything that takes a pid_t now. On 64-bit archs I assume pid_t is passed to syscalls as a 64-bit int anyway, so you wouldn't even need to define new syscalls - just swap them all to accept pid64_t, then "#define pid_t pid64_t" in userspace to make use of the expanded range.

For compatibility with old applications, I suppose you could start PIDs at 2^32, and if the syscall is passed a value <2^32 then it must have come from an application with 32-bit pid_t, so do a pid_t->pid64_t lookup in some clever table. Guarantee that no concurrent processes will have the same PID mod 2^32, via some other cleverness. The lookup will only be wrong if two non-concurrent processes have the same PID mod 2^32, which is the race condition that already exists so it's not making anything worse.

(I'd be surprised if it was really that easy, though, and I assume people have already considered and rejected it?)

Toward race-free process signaling

Posted Dec 7, 2018 8:33 UTC (Fri) by smurf (subscriber, #17840) [Link] (6 responses)

PIDs are still part of the user interface. When I need to kill a process on my long-running server, I don't want to copy+paste (or, on a dumb terminal (these still exist, you know), remember and type) a 10-digit number. Not to mention my eyes glazing over when I start "ps".

That being said, clone-with-fd and kill-via-uniquetoken serve different purposes and IMHO should both be added. The former prevents a multithreaded program from getting confused about whether its child has been reaped by another thread; the latter prevents some unrelated process from killing the wrong task by mistake.

"kill-via-uniquetoken" doesn't even need another syscall – just a field in the task struct that "ps" displays. "kill" would then open /proc/pid/NNN, read the current token, and compare it with the one it thinks is correct before killing the process the usual way – the PID can't be raced now because the directory is still open.

Toward race-free process signaling

Posted Dec 7, 2018 16:20 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (1 responses)

Keeping the directory open does not reserve the PID.

Toward race-free process signaling

Posted Dec 7, 2018 17:23 UTC (Fri) by smurf (subscriber, #17840) [Link]

> Keeping the directory open does not reserve the PID.

I assumed from previous notes here that this is no longer true in current kernels. I didn't check, so if I'm mistaken, well, that's fixable. (Though somebody could DoS the system by locking every PID in existence, so this feature should probably be limited to processes the holder can actually affect.)

Toward race-free process signaling

Posted Dec 7, 2018 19:51 UTC (Fri) by epa (subscriber, #39769) [Link] (2 responses)

I expect the process id numbers would only reach ten digits if the system had been up long enough to spawn a billion processes. Tab completion for process ids would help.

Toward race-free process signaling

Posted Dec 10, 2018 11:04 UTC (Mon) by james (subscriber, #1325) [Link] (1 responses)

> Tab completion for process ids would help.

Would it?

$ ps aux | grep 'runawayprogram'
james    1720223467 ... runawayprogram

runawayprogram dies (and is reaped).
User enters (without pressing Enter)
```
$ kill -9 17202
```
User presses tab, and Bash finds only one process: 1720233467

What's the chance that the user notices before pressing enter?

Toward race-free process signaling

Posted Dec 11, 2018 8:21 UTC (Tue) by epa (subscriber, #39769) [Link]

Well, OK, maybe not. I don't think the problem gets any worse than if the process id is limited to 15 bits, in which case exactly the same id could end up belonging to a different process.

Currently kill(1) can take a pid or a process name. It should let you redundantly specify both, and (without any race condition) check that they match before sending the signal.

Toward race-free process signaling

Posted Dec 8, 2018 2:35 UTC (Sat) by wahern (subscriber, #37304) [Link]

> the latter prevents some unrelated process from killing the wrong task by mistake.

How does an unrelated process discover the token? By querying a global task list, which then returns an identifier. Because files (including open file references) are first-class citizens in Unix, even from the shell, it's not much more difficult as a technical matter to use and apply a descriptor object than it is a named identifier. For the simple use case of debugging or killing errant processes, the token approach might be simpler. But let's be honest: for those cases the current situation is good enough. When process management races are a real concern, it's rarely in situations where people are copy+pasting output from ps(1). The issue of races arises in the context of building tooling or otherwise programmatically doing process management work.

Finally, process descriptors permit userspace (e.g. the service manager) to implement unique process identifiers, at least in theory. Process descriptors provide a path to unique tokens. But simply providing unique tokens is a dead-end street.

Toward race-free process signaling

Posted Dec 6, 2018 22:05 UTC (Thu) by thiago (guest, #85680) [Link] (4 responses)

FreeBSD has a similar, but much more limited API: https://www.freebsd.org/cgi/man.cgi?query=pdkill

pid_t
pdfork(int *fdp, int flags);

int
pdgetpid(int fd, pid_t *pidp);

int
pdkill(int fd, int signum);

int
pdwait4(int fd, int *status, int options, struct rusage *rusage);

The pdkill(2) syscall is insufficient for the needs of realtime signals, since it doesn't allow for the siginfo structure. But if taskfd_send_signal(2) exists, pdkill can be implemented as a glibc function. The pdgetpid(2) call sounds nice, but I don't think we want it. First, because of namespaces (which FreeBSD doesn't have to deal with) and second because it would bind a legacy interface to a new technique.

For me, as a library maintainer, the important parts are the pdfork() and pdwait4() syscalls. They are not required to solve the problem at hand, though. The patch that josh and I created a few years ago tried to do that: pdfork() was implemented by a CLONE_FD parameter passed to a clone4(2) (an extended version of clone(2)) and the waiting was implemented by select(2)/poll(2) triggering, then reading a siginfo_t from the file descriptor.
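For readers on newer systems: later Linux kernels gained pidfd_open(2) (5.3) and pidfd_send_signal(2) (5.1), which Python 3.9+ exposes as `os.pidfd_open()` and `signal.pidfd_send_signal()`. These are not the CLONE_FD or taskfd_send_signal() interfaces discussed here, but the following sketch shows the general shape of signaling through a process file descriptor. The helper name is made up, and it falls back to returning None where the calls are unavailable.

```python
import os
import signal
import subprocess
import sys


def signal_via_pidfd():
    """SIGTERM a child through a pidfd; return its exit code, or None if unsupported."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    try:
        # The child has not been reaped yet, so its PID cannot have been
        # recycled between Popen() and pidfd_open().
        pidfd = os.pidfd_open(child.pid)
        try:
            signal.pidfd_send_signal(pidfd, signal.SIGTERM)
        finally:
            os.close(pidfd)
    except (AttributeError, OSError):
        # Old Python, old kernel, non-Linux system or a restrictive sandbox.
        child.kill()
        child.wait()
        return None
    # Popen reports death by signal N as a return code of -N.
    return child.wait()
```

Once the descriptor is open it keeps referring to that one process, so a signal sent through it cannot land on a later process that happens to receive the same PID.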

In other news, I still have that clone4() syscall patch and it's probably useful even without the CLONE_FD flag.

Toward race-free process signaling

Posted Dec 8, 2018 2:16 UTC (Sat) by kmeyer (subscriber, #50720) [Link] (2 responses)

For what it's worth, FreeBSD's pdwait4() was always aspirational — it was documented, but never implemented. I removed it from the documentation after years of it never getting implemented. It may happen one day, but for now it is vaporware.

One other thing FreeBSD has done in this space that I think was inspired by Linux cgroups is procctl(2) PROC_REAP_*[1]. This mechanism allows you to create a sort of process group that cannot be escaped. You can signal all live processes within the group from the parent ("reaper"); a reused pid does not get re-added to the reapgroup, and the reaper is responsible for wait(2)ing orphaned zombies that originate in the reaper group.

I am less familiar with the procctl REAP stuff, so I may be doing a poor job of explaining it. Sorry. :-)

[1]: https://www.freebsd.org/cgi/man.cgi?query=procctl&apr...
[2]: https://svnweb.freebsd.org/base?view=revision&revisio...

Toward race-free process signaling

Posted Dec 8, 2018 3:00 UTC (Sat) by wahern (subscriber, #37304) [Link]

Linux has prctl(2) PR_SET_CHILD_SUBREAPER

A subreaper fulfills the role of init(1) for its descendant
processes. When a process becomes orphaned (i.e., its
immediate parent terminates) then that process will be
reparented to the nearest still living ancestor subreaper.
Subsequently, calls to getppid() in the orphaned process will
now return the PID of the subreaper process, and when the
orphan terminates, it is the subreaper process that will
receive a SIGCHLD signal and will be able to wait(2) on the
process to discover its termination status.

-- http://man7.org/linux/man-pages/man2/prctl.2.html

Toward race-free process signaling

Posted Dec 21, 2018 10:30 UTC (Fri) by njs (subscriber, #40338) [Link]

Isn't EVFILT_PROCDESC basically the replacement for pdwait4? Except I guess it doesn't let you reap the process...

Toward race-free process signaling

Posted Dec 9, 2018 0:58 UTC (Sun) by brauner (subscriber, #109349) [Link]

Fwiw, if this patch lands I intend to pick yours up and would very much like your input.

Additional existing process match entropy

Posted Dec 10, 2018 1:32 UTC (Mon) by ewen (subscriber, #4772) [Link] (5 responses)

If the aim is just to avoid the "recorded PID a while ago, it exited, and a new process has started with the same PID" problem, the obvious solution (other than decreasing the chances of PID reuse, eg, by increasing the PID range beyond 15 bits...) is to use some more entropy from the process to ensure that you have found the *correct* process, for instance the *Parent* Process ID:

ewen@abbey:~$ cd /proc/$$
ewen@abbey:/proc/7244$ grep Pid status | grep -v Tracer
Pid: 7244
PPid: 7243
ewen@abbey:/proc/7244$

If we match on both Pid *and* PPid, then the chances of sending the signal to the "wrong version of Pid NN" is dramatically reduced. (The remaining use cases are basically (a) spawned by the same parent, or (b) reparented to PID 1 because parent exited; in the first case the parent process should *already* know if it has collected the exit code or not, and from memory the PID won't be reused until it has collected the exit code.)
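In userspace the comparison might look like the sketch below, which parses the same /proc/PID/status fields shown above. The function names are illustrative only. Because the check and a subsequent kill() remain two separate steps, this merely narrows the window rather than closing it.

```python
def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/PID/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def matches(status_text, pid, ppid):
    """True if the status text belongs to the expected (pid, ppid) pair."""
    fields = parse_status(status_text)
    return fields.get("Pid") == str(pid) and fields.get("PPid") == str(ppid)


def read_status(pid):
    # Linux-specific: read the live status file for a PID.
    with open(f"/proc/{pid}/status") as f:
        return f.read()
```

A caller would record both numbers when it first finds the process and later call something like `matches(read_status(pid), pid, ppid)` before signaling, treating a mismatch or a missing file as "the process is gone".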

Obviously this requires a new system call, and obviously while the old system call exists it could also be used to send signals ignoring these extra match parameters. But that seems to be true of every other proposal too. And using the PPid value requires very little change other than a new system call to accept the additional parameter. Ideally that new system call could optionally accept a bunch of other "match parameters", at least as a trivial extension later -- eg UID / GID / root / CWD all seem like potentially useful matches in some cases, that should be "slow changing" (and thus limited race on fetch / use cycles).

Ewen

Additional existing process match entropy

Posted Dec 10, 2018 17:59 UTC (Mon) by kjp (guest, #39639) [Link] (4 responses)

FWIW, I have a daemon that monitors 4,000 processes on a VM. It looks for custom UUID values in /proc/PID/environ.
That works great if you're calling fork() from a single threaded process. But if multithreaded, you can't set new environment vars just for your own thread+fork AFAIK.
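For illustration, the monitoring side of such a scheme might look roughly like the sketch below. It assumes a Linux /proc, and the function names are invented here; the variable name used to carry the UUID would be whatever the daemon chose. Note that /proc/PID/environ reflects the environment the process was started with, and that nothing stops the PID from being recycled between the scan and a later kill().

```python
import os

def parse_environ(raw):
    """Split the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        name, sep, value = entry.partition(b"=")
        if sep:
            env[os.fsdecode(name)] = os.fsdecode(value)
    return env

def find_tagged_pids(var_name, wanted_value):
    """Return the PIDs whose initial environment has var_name == wanted_value."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/environ", "rb") as f:
                env = parse_environ(f.read())
        except OSError:
            # The process exited, or we may not read its environment.
            continue
        if env.get(var_name) == wanted_value:
            matches.append(int(entry))
    return matches
```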

Additional existing process match entropy

Posted Dec 11, 2018 6:53 UTC (Tue) by smurf (subscriber, #17840) [Link] (3 responses)

Huh? You fork first and *then* set the uuid envvar. Still a race condition either way. If I were to write that sort of daemon I'd add a small API to it to *tell me* which process has a given UUID – or, even better, send the signals on my behalf.

Additional existing process match entropy

Posted Dec 11, 2018 18:03 UTC (Tue) by kjp (guest, #39639) [Link] (2 responses)

Yeah, total brain fart - I was just trying to say "I don't recommend the environ hacks as a general solution".

But the point is that there is a way to assign arbitrary string labels / tags to processes in Linux today, using the environ mechanism.
If only there were 1) a way to set the labels at clone/fork time (not exec), and 2) a way to make kill() match the extra labels.
Assuming we've all given up on just making pids 64 bit.

Additional existing process match entropy

Posted Dec 11, 2018 18:35 UTC (Tue) by kjp (guest, #39639) [Link] (1 responses)

If the /proc directory locking can temporarily prevent PID recycling, the problem could be decomposed:

1) Kernel adds a /proc/PID/longpid attribute (guaranteed unique by the kernel). Pass PID+longpid to kill_ex() to avoid a PID recycling race.
2) Kernel makes open(/proc/pid) prevent PID recycling
3) (optional) Kernel adds /proc/PID/tags, which is like /proc/PID/environ but set at clone() time. Only userspace sets/gets this field.

So now, an external process can scan /proc/*, look at "tags", and then get "longpid". Using open/openat to prevent races.

Additional existing process match entropy

Posted Dec 11, 2018 18:41 UTC (Tue) by kjp (guest, #39639) [Link]

(Really wish I could edit comments).

It looks like #1 is not even needed in some situations here. If you call kill() while you have the /proc/pid dir locked, there's no need for disambiguation. Hmm.
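A rough sketch of that pattern, with invented helper names and assuming a Linux /proc: the caller opens the /proc/PID directory once, then checks through that descriptor before calling kill(). The check relies on lookups through a /proc/PID directory descriptor failing once the original process has been reaped. Without the proposed kernel change (item 2 above), an open descriptor does not stop the PID from being recycled, so a small window remains between the check and kill().

```python
import os
import signal

def open_process_dir(pid):
    """Open /proc/<pid> as a directory fd, or return None if there is no such process."""
    try:
        return os.open(f"/proc/{pid}", os.O_RDONLY | os.O_DIRECTORY)
    except OSError:
        return None

def process_still_there(proc_dir_fd):
    """True while the process the directory fd was opened on has not been reaped."""
    try:
        fd = os.open("status", os.O_RDONLY, dir_fd=proc_dir_fd)
    except OSError:
        return False
    os.close(fd)
    return True

def signal_if_still_there(pid, proc_dir_fd, sig=signal.SIGTERM):
    """Send `sig` to `pid` only if the originally opened process still exists.

    Assumption: closing the window between this check and kill() needs the
    kernel to hold the PID while the directory is open, as proposed above.
    """
    if not process_still_there(proc_dir_fd):
        return False
    os.kill(pid, sig)
    return True
```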

Toward race-free process signaling

Posted Dec 12, 2018 5:14 UTC (Wed) by amworsley (subscriber, #82049) [Link]

Why can't you just use an existing file descriptor system call to send the signal?
e.g. write(fd, info, sizeof(siginfo_t)), where info is a siginfo_t *

It seems a natural analog of a process receiving signals via signalfd file descriptors.
