Systemd vs. Docker

Posted Feb 26, 2016 20:41 UTC (Fri) by wahern (subscriber, #37304)
In reply to: Systemd vs. Docker by fandingo
Parent article: Systemd vs. Docker

cgroups do not provide a race-free way to kill all processes in the group. Read the systemd code. I did and was surprised. The relevant code is in cg_kill: https://github.com/systemd/systemd/blob/master/src/basic/...

What it does is iteratively read a /proc file which lists all the PIDs in a cgroup. It then walks the list and kill(2)s each PID. This is racy because a PID might have been recycled to a process outside the cgroup between when systemd read the proc file and when it calls kill(2). It's the exact same race you see with traditional PID files, although in this case, because systemd is PID 1, the window is smaller, and the race can only happen if the parent process is ignoring SIGCHLD. But if it does happen, things get worse: systemd skips PIDs that it has already killed, so if a PID is recycled into the same cgroup, cg_kill will loop endlessly.
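The shape of that loop, rendered as a simplified Python sketch (not systemd's actual C code; `read_cgroup_pids` and `kill_pid` are hypothetical stand-ins for reading the cgroup's PID listing and calling kill(2)):

```python
import os
import signal

def cg_kill_sketch(read_cgroup_pids,
                   kill_pid=lambda pid: os.kill(pid, signal.SIGKILL)):
    """Mirror the shape of systemd's cg_kill(): repeatedly list the
    cgroup's PIDs and signal each one, skipping PIDs already
    signalled, until a full pass finds nothing new to kill."""
    killed = set()
    while True:
        progress = False
        for pid in read_cgroup_pids():
            if pid in killed:
                # The problematic skip: if this PID was recycled into
                # the same cgroup, it now names a live, unkilled process.
                continue
            try:
                kill_pid(pid)
            except ProcessLookupError:
                # Exited between the listing and kill(2): the benign
                # side of the race.
                pass
            killed.add(pid)
            progress = True
        if not progress:
            return killed
```

The `killed` set is the troublesome part: nothing ties a PID in the listing to the process that was signalled earlier under the same number.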

I don't think Poettering ever claimed that cgroups allowed systemd to fix all PID races. But people assumed this because Poettering _did_ claim that systemd resolved PID-file races. (Important distinction.) But implicit in that characterization was that it only applied if processes were cooperative. systemd does not solve the problem of uncooperative daemons.

cgroups doesn't provide the necessary magic. Specifically, cgroups does not provide a way to atomically broadcast a signal to all members in a cgroup. That's what you would need to fix this problem, and that functionality doesn't exist.


to post comments

Systemd vs. Docker

Posted Feb 26, 2016 21:09 UTC (Fri) by smurf (subscriber, #17840) [Link] (7 responses)

It's a /sys file, or rather a cgroupfs file, but yeah.

Another way to fix this race would be to first stop all members of the cgroup in question, then check for each PID whether it's still in there. If so, really kill the process; otherwise send SIGCONT. But I agree that a kernel patch would be the best approach to this problem. In fact I wonder why that feature doesn't exist yet.
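That stop-then-recheck idea might look like the following sketch (a hypothetical illustration, not tested against a real cgroup; `cgroup_pids` is an assumed stand-in for reading the cgroup's task list):

```python
import os
import signal

def stop_check_kill(cgroup_pids):
    """SIGSTOP every listed PID so the survivors can't fork, re-read
    the membership, then SIGKILL the PIDs still in the cgroup and
    SIGCONT any stopped bystander."""
    stopped = set()
    for pid in cgroup_pids():
        try:
            os.kill(pid, signal.SIGSTOP)
            stopped.add(pid)
        except ProcessLookupError:
            pass  # already gone
    # Second listing: anything still here was stopped while a member.
    members = set(cgroup_pids())
    for pid in stopped:
        try:
            os.kill(pid, signal.SIGKILL if pid in members else signal.SIGCONT)
        except ProcessLookupError:
            pass
```

This narrows the window rather than closing it: a member can still fork between the first listing and its SIGSTOP arriving.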

Systemd vs. Docker

Posted Feb 27, 2016 0:04 UTC (Sat) by wahern (subscriber, #37304) [Link] (6 responses)

Stopping all the members of the cgroup is the same problem as sending a signal to all of them. But you don't need cgroups to send a signal to all of them. Traditional process groups already provide that. That's almost entirely their function, so a controlling process (e.g. a shell) can broadcast a signal to a tree of subprocesses.
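The process-group broadcast itself is a one-liner: kill(2) with a negated process-group ID, or killpg(). A minimal illustration (Python used here for brevity):

```python
import os
import signal
import subprocess

# Start a child in its own session (and therefore its own process
# group), the way a job-control shell would place a pipeline.
child = subprocess.Popen(["sleep", "60"], preexec_fn=os.setsid)

# Broadcast to the whole group -- the C equivalent is
# kill(-pgid, SIGTERM) or killpg(pgid, SIGTERM).
os.killpg(os.getpgid(child.pid), signal.SIGTERM)
child.wait()
```

The catch is that any process that calls setpgid() or setsid() leaves the group and escapes the broadcast.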

The problem with process groups is that forking daemons usually create a new process group, and systemd is making a superficial attempt to handle those uncooperative daemons. I'd guess the idea was that most existing forking daemons don't know anything about cgroups so aren't going to be changing their membership.

As for the kernel maintainers, I never followed the LWN coverage of the systemd debates very closely, but it was my understanding that they were resistant to patches to cgroups which reinvented the wheel of interfaces like process groups. Also, the cgroups data structures are supposedly awkward and inefficient (in part, I assume, because of the nesting semantics) and they were reluctant to allow them to become more deeply embedded in the fundamentals of process management. But maybe I totally misunderstood things.

One possible hack would be to use a seccomp filter to silently ignore attempts to create a new process group. Another might be PID namespaces, though I'm not very familiar with them.

Although, to be clear, IMO all of these options are trying to put lipstick on a pig. I'm not sure systemd _should_ be fixed. I'd rather see all the poorly written software fixed. Teach poorly written daemons to optionally run in the foreground so systemd or any other service manager doesn't have to use hacks to track them. And for daemons that fork new processes, work to improve their correctness and to not needlessly create new process groups. Basically, _subtract_ code, rather than add hundreds of thousands of lines of new code to the pile.

But things like systemd, Docker, etc, are all the rage these days. Apparently people prefer fixing problems with "full stack" solutions rather than submitting a 5-line diff. Whatever... just keep it all off my lawn :)

Systemd vs. Docker

Posted Feb 27, 2016 1:57 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

PID namespaces solve this problem entirely - you can just kill everything inside of it, without any chance to interfere with the parent namespace.

Process handles are another way to solve this.

Systemd vs. Docker

Posted Feb 27, 2016 7:00 UTC (Sat) by wahern (subscriber, #37304) [Link] (2 responses)

Process handles don't solve this problem unless you forbid the daemon from forking. But if the daemon wasn't forking subprocesses this race wouldn't exist.

Process handles are useful for passing off the management of a process. For example, traditional shell-based init.d scripts could acquire a process handle and pass it on to a service manager, in effect obviating the need for a PID file. The service manager wouldn't even need to be PID 1, yet the PID file race issue would be solved just as well (or just as incompletely, depending on your perspective).

It does look like PID namespaces solve this problem. The manual page says that when the PID 1 "init" process in the namespace terminates, all the processes in that namespace are SIGKILL'd. So you could just kill that one process without having to enumerate the PIDs, perhaps simply by closing its process handle. The problem with that solution--and all the others--is that it's not backwards compatible. Neither the software nor administrators may be expecting the process(es) to be running in a different PID namespace.

Systemd vs. Docker

Posted Feb 27, 2016 7:16 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

They do help if you can kill the daemon-spawned processes faster than they are created. Which you realistically can, even before considering stuff like PID limits.

Systemd vs. Docker

Posted Feb 27, 2016 8:31 UTC (Sat) by smurf (subscriber, #17840) [Link]

Process handles do solve the problem because as long as you hold such a handle, the process won't entirely go away, thus the PID won't get reassigned. Systemd and the kernel could simulate that easily -- opendir(/proc/PID). The problem is that adding even more overhead doesn't help when you try to catch a fork bomb. My SIGSTOP idea, however, just might. I'll have to test that idea further.

Systemd vs. Docker

Posted Mar 7, 2016 1:41 UTC (Mon) by cg909 (guest, #95647) [Link]

Using process groups would work for simple daemons, but not for services like sshd.

The main problem is that process groups always belong to a session. So every service that spawns user sessions would also need to break out of the process group.

If you use seccomp filters to ignore setpgrp() and setsid(), sshd would fail in spectacular ways: all processes in the process group would share the same controlling terminal, so every process spawned by sshd would receive SIGHUP when a session is closed. Also, anything spawning a shell might run into problems, as shells use process groups to separate tasks.

You'd need "super process groups" which may span multiple sessions and contain multiple process groups.

And this is what cgroups provide.

Systemd vs. Docker

Posted Jun 14, 2016 17:32 UTC (Tue) by davidlee (guest, #109327) [Link]

I agree with the sentiment that it is NOT Docker's fault that errant or poorly designed applications are being run inside of a container.

The "solution" I am using is called supervisord. If I need something controlled from inside the Docker container, I do it myself. I note that some of the previous comments were derisive about the kinds of init scripts folks like me might come up with. So what? I don't need their permission, nor do I need their acceptance. With four decades of script writing, I think I can write one that will do the job.

Yes, I could use systemd for those applications that install scripts it will use. I like the three-line example for Apache. But it was quite unique in that Apache installs SO MUCH STUFF that the example works. Perhaps Nagios might also work. Or Splunk. Or a myriad of other major applications which are mature enough to do so.

One of my recent docker containers was an interface with Dropbox. Nope, no three-liner there. It required too much to configure and set up. Actually, virtually every Docker container I have designed has required setup and configuration -- and, thank you very much, a carefully crafted startup script.

If I have a docker which needs to manage internal processes, I'll stick with solutions like supervisord (there are more options, but this is the one I have settled on). I'd rather use that on the few docker containers I need it than have the weight of systemd in every single docker container I generate. Imagine a busybox container with systemd running...

Systemd vs. Docker

Posted Feb 27, 2016 12:20 UTC (Sat) by paulj (subscriber, #341) [Link] (5 responses)

Oh, wow - that's very interesting code. Is that really what all the "cgroups are so much better for process control!" stuff boils down to? The emperor has no clothes, if that's true.

/me very interested to hear of rebuttals.

Systemd vs. Docker

Posted Feb 27, 2016 14:14 UTC (Sat) by smurf (subscriber, #17840) [Link] (1 response)

Yes, if your goal is to 100%-safely kill aberrant processes, this is not a 100% fail-safe solution.

However, it is still *way* better than anything any other init system is doing. Also, this method is able to identify and kill all processes which belong to the current task (again, assuming they're not maliciously forking around), which again is way better than anything else out there.

Systemd vs. Docker

Posted Feb 27, 2016 18:52 UTC (Sat) by cortana (subscriber, #24596) [Link]

Could the freezer cgroup controller be used to reliably kill processes? If everything in a cgroup is frozen then it can no longer fork, so you can then list the processes and kill them one by one.
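The v1 freezer sequence (write FROZEN to freezer.state, wait for the transition, signal the now-unable-to-fork tasks, then thaw) could be sketched like this; the cgroup path and the injectable `kill` parameter are illustrative assumptions:

```python
import os
import signal
from pathlib import Path

def freeze_and_kill(cgroup_dir,
                    kill=lambda pid: os.kill(pid, signal.SIGKILL)):
    """Freeze a v1 freezer cgroup (e.g. a directory under
    /sys/fs/cgroup/freezer/), signal every task while nothing can
    fork, then thaw so the queued signals take effect."""
    cg = Path(cgroup_dir)
    (cg / "freezer.state").write_text("FROZEN\n")
    # Freezing is asynchronous; poll until the transition completes.
    while (cg / "freezer.state").read_text().strip() != "FROZEN":
        pass
    pids = [int(tid) for tid in (cg / "tasks").read_text().split()]
    for pid in pids:
        kill(pid)
    (cg / "freezer.state").write_text("THAWED\n")
    return pids
```

Note that frozen tasks don't act on signals until the group is thawed, so the final THAWED write is what lets the kills complete.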

Systemd vs. Docker

Posted Feb 27, 2016 23:42 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

You can't escape a cgroup, so they're reliable. However, process killing races with process creation so it might be possible to create processes faster than systemd can kill them. In practice, it doesn't happen - I tried.

Forkbombs are more interesting - you CAN cause PID starvation by launching a forkbomb in an unconfined cgroup. https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt can help against it.

Another problematic case is PID races. SIGSTOP+SIGKILL does the job reliably: SIGSTOP can't be ignored, and it also forces the process to stick around.

Systemd vs. Docker

Posted Feb 28, 2016 3:43 UTC (Sun) by mchapman (subscriber, #66589) [Link] (1 response)

> Another problematic case is PID races. SIGSTOP+SIGKILL does the job reliably: SIGSTOP can't be ignored, and it also forces the process to stick around.

It might be possible for two or more cooperating processes to circumvent this by continually SIGCONTing each other, forking new processes along the way. cortana's suggestion of using the freezer controller seems like a better approach.

Systemd vs. Docker

Posted Feb 28, 2016 6:44 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

Yes, you need to SIGSTOP everything first. Though it can be susceptible to a livelock (if a group of processes SIGCONT every other process). Freezer would solve this, but it's not available everywhere.

It appears that process handles or PID namespaces are the only reliable options.

Systemd vs. Docker

Posted Feb 28, 2016 20:21 UTC (Sun) by justincormack (subscriber, #70439) [Link]

Can't you use the cgroup freezer https://www.kernel.org/doc/Documentation/cgroup-v1/freeze... to stop all the processes and then kill them all?

Systemd vs. Docker

Posted Mar 1, 2016 12:14 UTC (Tue) by nix (subscriber, #2304) [Link]

FWIW, this is the same problem debuggers have, but at a different level. Linux provides no way to determine the set of threads in a process in a racefree manner, so the only way to ptrace() every thread in the face of a victim that might be creating more is to keep looping through a readdir() of /proc/$pid/task, ptracing and stopping every new thread you see, until you do at least one complete iteration and there are no new ones. (gdb does a couple of extra loops in case the kernel was slow creating new dirs, which says a lot about the degree of confidence debugger developers have in this shaky old rattletrap of a design.)
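The fixed-point iteration can be sketched by enumerating one's own threads (here `on_new_thread` is a hypothetical stand-in for the ptrace-attach-and-stop a debugger would perform):

```python
import os

def enumerate_threads(pid, on_new_thread=lambda tid: None):
    """Chase /proc/<pid>/task to a fixed point: keep re-listing the
    directory until one full pass turns up no thread we haven't
    already seen."""
    seen = set()
    while True:
        tids = {int(t) for t in os.listdir(f"/proc/{pid}/task")}
        new = tids - seen
        if not new:
            return seen
        for tid in sorted(new):
            on_new_thread(tid)  # gdb would ptrace-attach and stop here
            seen.add(tid)
```

gdb's extra "paranoid" passes amount to requiring several consecutive stable iterations instead of one.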


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds