Missed some rows

Posted Apr 30, 2011 11:29 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
In reply to: Missed some rows by nicooo
Parent article: Poettering: Why systemd?

Actually, no. The problem of miskills due to PID wraparound is very well known.

Various 'enterprise' Unixes have had workarounds since forever, such as the ability to 'lock' the PID of a process (so it won't be reused), or locking a PID for several minutes after getpid() calls (so "ps | grep ... | xargs kill" won't kill some innocent process).

Missed some rows

Posted May 3, 2011 1:21 UTC (Tue) by wahern (subscriber, #37304)

That all seems so convoluted. The whole problem boils down to the size of the namespace and the familiar TOCTOU race condition. The cgroups solution works because it uses a different namespace with well-crafted rules, and really only works in the context of systemd, which is taking on a role--maintaining a persistent, unique, global namespace--part of which should be done in the kernel.

The easiest and cleanest general purpose solution would be to extend the PID namespace to 64-bits, or maybe even 128-bits. Problem solved. This is a common solution for when maintaining and communicating a consistent global state is not practically feasible, which is the case with the historical paradigm of process management on Unix.

I don't know why this has never been done. The existing 16-bit namespace is ridiculous. There should be a kernel compile-time option to increase the pid_t width. Then over the course of several years broken applications that make unwarranted assumptions about pid_t could be fixed. The vast majority of issues are probably with printf formatting; people usually cast pid_t to (int). If PIDs were chosen at random (as on OpenBSD) then the 31 or 32 bits shown would actually be useful, much like Git's truncated hash identifiers. So even most broken apps would only be half broken.

I realize it's a *huge* change, but it's simple and straightforward, the consequences are mostly foreseeable, and with open source software readily addressed by even casual C programmers. GCC could be instrumented to track pid_t conversions, and in a matter of weeks I bet Debian's build system would uncover the vast majority of issues. All of a sudden one of the ugliest Unix warts--that is, fundamentally broken in the context of common usage--disappears.

Missed some rows

Posted May 3, 2011 9:10 UTC (Tue) by leighbb (subscriber, #1205)

Just so that you are aware, you can actually enable a 22-bit pid by doing:

sysctl -w kernel.pid_max=4194304

Not as much as you were after but bigger than you thought :-)
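
For anyone who wants to check what their own machine allows, here is a minimal Python sketch. It assumes a Linux system exposing /proc/sys/kernel/pid_max; the helper names read_pid_max and pid_bits are made up for illustration.

```python
def pid_bits(pid_max):
    """Number of bits needed to represent PIDs in the range [0, pid_max)."""
    return (pid_max - 1).bit_length()


def read_pid_max(path="/proc/sys/kernel/pid_max"):
    """Read the current PID limit (Linux only; the path is the standard sysctl file)."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    limit = read_pid_max()
    print(f"pid_max={limit}, i.e. about {pid_bits(limit)} bits of PID space")
```

With the value from the sysctl command above, pid_bits(4194304) gives 22, while a limit of 32768 gives 15.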

Missed some rows

Posted May 3, 2011 13:13 UTC (Tue) by wahern (subscriber, #37304)

Thanks. I was completely unaware.

Missed some rows

Posted May 3, 2011 15:03 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)

Not really. Going to 32 bits for the PID namespace still won't solve this problem; it will just make it harder to trigger.

And larger PID lengths are way too clumsy for humans. That's definitely NOT good engineering.
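
To put a rough number on "harder to trigger", here is a back-of-the-envelope sketch in Python. It assumes PIDs are handed out sequentially and that something forks at a constant rate; both the function name seconds_until_wrap and the fork rate are hypothetical, chosen only for illustration.

```python
def seconds_until_wrap(pid_bits, forks_per_second):
    """Time for a sequential PID counter of the given width to wrap around once."""
    return (2 ** pid_bits) / forks_per_second


if __name__ == "__main__":
    # Hypothetical rate of 1024 forks per second, compared across PID widths.
    for bits in (15, 22, 32, 64):
        print(bits, "bits:", seconds_until_wrap(bits, 1024), "seconds")
```

At that assumed rate a 15-bit space wraps in 32 seconds and a 32-bit space in 4194304 seconds (about 48 days), so the window shrinks a lot but does not close.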

Besides, even with 128-bit PID length you'll still have problems with double-forked processes (which are reparented to init).

systemd nicely solves these problems.

Missed some rows

Posted May 3, 2011 18:17 UTC (Tue) by wahern (subscriber, #37304)

A larger PID wouldn't do everything that systemd does with cgroups. cgroups does two things: (1) provides a larger namespace (roughly 2^(8 * 255) possible names, AFAIU) to identify processes, and (2) handles inheritance. But a larger PID would solve in a backwards compatible fashion the one clear issue in Unix process management, the signal-PID race, which is more-or-less the same as the first thing above. Although I'm not familiar with cgroup usage, I think that there's still a race in adding a fresh process to a cgroup, so even systemd could benefit from a larger PID space.

It's really only an unresolvable issue when you have errant, buggy processes. Otherwise, a sophisticated daemon should have a domain socket which takes control messages. But I'm presuming that process management means being able to handle processes that aren't well behaved.

Missed some rows

Posted May 3, 2011 18:21 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)

Signals must die, they are a relic of ancient times.

>Although I'm not familiar with cgroup usage, I think that there's still a race in adding a fresh process to a cgroup, so even systemd could benefit from a larger PID space.

Nope. cgroups work at the kernel level and use proper locking, so PIDs won't be able to leak. Also, one can easily protect processes in a cgroup from an accidental kill (in fact, cgroups can be used as a complete lightweight virtualization solution).

Missed some rows

Posted May 4, 2011 5:03 UTC (Wed) by wahern (subscriber, #37304)

I'm confused then. Say I have a new process which I want to add to a cgroup. How do I assign the process to a cgroup? All the documentation I can find says to echo the PID to a cgroup control file. But if I'm using a PID--and I'm not the process with that PID--then I'm still subject to a race--the PID can become stale between acquiring the value and communicating it to the cgroup subsystem.

cgroup inheritance I can understand. A process forked from a process already assigned to a particular cgroup atomically inherits membership in the cgroup, just as it would atomically inherit a session id and process group id. But now, say, I want to reassign that process to a different cgroup by PID. It seems like there's the same problem as above. What am I missing?

Missed some rows

Posted May 4, 2011 5:44 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

That's a trick question.

You need to somehow have a unique process handle, which PID is definitely not. On Linux it can be done using the /proc/PID/ directory. The sequence would be:
1) Change current directory to /proc/PID
2) Look around and check that this PID is still the correct one. That's safe because if the process exits, its /proc/PID directory becomes empty - and stays that way.
3) Write to /proc/PID/cgroup.

Of course, it's better to create a process directly in the required group in the first place.
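
Steps 1 and 2 above can be sketched in Python roughly as follows. This is only an illustration under assumptions: it holds a directory file descriptor instead of calling chdir, it uses the start time from /proc/PID/stat as the "is this still the same process" check, and the helper names (open_proc_handle, parse_start_time, read_start_time) are invented here. It does not attempt step 3.

```python
import os


def open_proc_handle(pid, proc_root="/proc"):
    """Open /proc/<pid> as a directory fd; later lookups go through this fd."""
    return os.open(os.path.join(proc_root, str(pid)), os.O_RDONLY | os.O_DIRECTORY)


def parse_start_time(stat_text):
    """Extract starttime (field 22) from the text of a /proc/<pid>/stat file.

    The command name (field 2) may contain spaces and parentheses, so split
    after the last ')' and count from the state field (field 3).
    """
    fields = stat_text.rsplit(")", 1)[1].split()
    return int(fields[19])


def read_start_time(dir_fd):
    """Read 'stat' relative to the handle; raises OSError if the process is gone."""
    fd = os.open("stat", os.O_RDONLY, dir_fd=dir_fd)
    try:
        return parse_start_time(os.read(fd, 4096).decode())
    finally:
        os.close(fd)
```

A caller would record read_start_time() once, then re-check it (catching OSError) before acting; a changed value or an error means the original process has exited.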

Missed some rows

Posted May 4, 2011 8:26 UTC (Wed) by wahern (subscriber, #37304)

I thought /proc/$PID/cgroup was read-only; to add a process to a group you needed to write to /dev/cgroup/$TASK/tasks. In such case, you're left with a race condition. (I tried confirming or disproving this, but can't even get the example in cgroups.txt to work.)

My proposal was to make PID a unique quasi-handle the same way random UUIDs are unique.

Missed some rows

Posted May 4, 2011 19:31 UTC (Wed) by njs (subscriber, #40338)

> Say I have a new process which I want to add to a cgroup. How do I assign the process to a cgroup? All the documentation I can find says to echo the PID to a cgroup control file. But if I'm using a PID--and I'm not the process with that PID--then I'm still subject to a race--the PID can become stale between acquiring the value and communicating it to the cgroup subsystem.

In the above scheme, if you're the one who's spawning this new process that you want to end up in a cgroup, then you can do
1) fork
2) the child adds itself to the desired cgroup
3) the child calls exec()

That's race-free.
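
As a rough Python sketch of those three steps: the function name spawn_in_cgroup is made up, and tasks_path stands for whatever cgroup control file your system uses for membership (an assumption; the exact location depends on where the cgroup hierarchy is mounted).

```python
import os


def spawn_in_cgroup(tasks_path, argv):
    """Fork; the child writes its own PID to the cgroup file, then execs argv."""
    pid = os.fork()
    if pid == 0:
        # Child: only ever names its own PID, so there is no stale-PID window.
        try:
            with open(tasks_path, "w") as f:
                f.write(str(os.getpid()))
            os.execvp(argv[0], argv)
        finally:
            # Reached only if the write or the exec failed.
            os._exit(127)
    return pid
```

The parent gets the child's PID back and can waitpid() on it as usual; because the child joins the group before exec, the new program never runs outside it.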

Missed some rows

Posted May 4, 2011 21:08 UTC (Wed) by wahern (subscriber, #37304)

Sure. But the issue is handling arbitrary, non-well behaving processes. And AFAICT there's still no provably safe way to handle that on Unix systems. With only a 16-bit (or 15-bit, or 22-bit) PID space, it's trivial to write a program to sit around and wait to take advantage of a race. (I don't have an attacker mindset, but I wouldn't bet against the proposition that it could be a useful vector.)

Of course, "who cares" is a valid reply; we've been living with it for 40 years. But that response challenges the value added by systemd's reliance on esoteric Linux subsystems. For example, when we talk about how a service manager is so much better than a race-prone PID file, nobody ever considers that the race condition is easily avoided by not using root. If you create a user per daemon--_www, _ftp, etc--then even if you read a stale PID and signal the wrong process, as long as you're sending the signal with a service-delegated UID then it will never be delivered.
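
A small Python sketch of what that looks like from the sender's side. It assumes the caller is already running under the service's dedicated, unprivileged UID; the function name try_signal and its return strings are illustrative only.

```python
import os


def try_signal(pid, sig):
    """Send sig to pid and report what the kernel did with the request."""
    try:
        os.kill(pid, sig)
    except PermissionError:
        # EPERM: the target belongs to another user, so nothing was delivered.
        return "denied"
    except ProcessLookupError:
        # ESRCH: no process currently has that PID.
        return "gone"
    return "sent"
```

If the PID read from a stale PID file now belongs to some other user's process, an unprivileged sender gets "denied" rather than killing it. The same call made as root would succeed, which is the point of not using root.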

I never brought it up before because it's arguably not very elegant. I'm loath to defend PID files. But if we're going to replace them with something, I'd like it to be generic and tailored to the specific issue, rather than lauding some supposed panacean init replacement.

The past decade in Linux-land has seen a parade of sophisticated daemon services intended to patch over some clunky Unix interface (device management, process management, etc, etc). They each require application developers to change from portable POSIX patterns to using some new API or library or protocol. But they come and go like the wind. Worthy solutions tend to be so obviously beneficial that all the free unices eagerly adopt or mimic them.

Missed some rows

Posted May 5, 2011 1:06 UTC (Thu) by njs (subscriber, #40338)

I guess I don't understand what you mean by "managing arbitrary, non-well behaving processes".

IIUC, when systemd starts a service, that service gets stuck (reliably, and race-freely) into its own cgroup, from which it cannot escape. Then you can kill it or whatever reliably, even if it's badly behaved (spawning children that double-fork and end up as orphans, forking to a new PID every 100 ms, whatever you like).

If you're trying to go after a process that was started outside of a cgroup, then this doesn't work so well, but not much does. That process that keeps switching PIDs as quickly as possible can't easily be killed even if you have a collision-free PID space.

Missed some rows

Posted May 4, 2011 21:28 UTC (Wed) by mjthayer (guest, #39183)

> Actually, no. The problem with miskills due to PID wraparound are very well-known.

> Various 'enterprise' Unixes had workarounds since forever.

A workaround I implemented a while ago for "normal" Unixes was for the daemon to place an advisory lock on its pidfile. It only works on filesystems with that feature of course, but by checking that the file is locked before issuing your kill command you greatly reduce the race window.
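
A minimal Python sketch of that idea, not the original implementation: it assumes flock()-style advisory locks, and the names hold_pidfile and pidfile_is_live are invented for illustration.

```python
import fcntl
import os


def hold_pidfile(path):
    """Daemon side: lock the pidfile exclusively and record our PID in it.

    The returned fd must stay open for the daemon's lifetime; the kernel
    drops the lock automatically when the process exits.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    os.ftruncate(fd, 0)
    os.write(fd, str(os.getpid()).encode())
    return fd


def pidfile_is_live(path):
    """Checker side: True if some process still holds the lock on the pidfile."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
    except BlockingIOError:
        return True   # lock is held, so the daemon is still running
    else:
        return False  # we got the lock, so the pidfile is stale
    finally:
        os.close(fd)  # also releases any lock we acquired
```

A kill script would call pidfile_is_live() immediately before reading the PID and signalling. As noted above, this narrows the race window rather than eliminating it.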