Systemd vs. Docker
Posted Feb 26, 2016 20:41 UTC (Fri)
by wahern (subscriber, #37304)
In reply to: Systemd vs. Docker by fandingo
Parent article: Systemd vs. Docker
What it does is iteratively read a /proc file which lists all the PIDs in a cgroup. It then walks the list and kill(2)s each PID. This is racy because a PID might have been recycled to a process outside the cgroup between when systemd read the proc file and when it calls kill(2). This is the exact same race you see with traditional PID files, although in this case, because systemd is PID 1, the window for the race is smaller, and the race can only happen if the parent process is ignoring SIGCHLD. But if this does happen, things get worse: systemd skips PIDs that it has already killed, so if a PID is recycled into the same cgroup, then cg_kill will loop endlessly.
I don't think Poettering ever claimed that cgroups allowed systemd to fix all PID races. But people assumed this because Poettering _did_ claim that systemd resolved PID-file races. (Important distinction.) But implicit in that characterization was that it only applied if processes were cooperative. systemd does not solve the problem of uncooperative daemons.
cgroups doesn't provide the necessary magic. Specifically, cgroups does not provide a way to atomically broadcast a signal to all members in a cgroup. That's what you would need to fix this problem, and that functionality doesn't exist.
Posted Feb 26, 2016 21:09 UTC (Fri)
by smurf (subscriber, #17840)
[Link] (7 responses)
Another way to fix this race would be to first stop all members of the cgroup in question, then check whether each PID is still in there. If so, really kill the process; otherwise send SIGCONT. But I agree that a kernel patch would be the best approach to this problem. In fact I wonder why that feature doesn't exist yet.
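A sketch of that stop-verify-kill dance; a stopped process cannot exit, so its PID cannot be recycled while you check (the cgroup membership test is stubbed out here — a real implementation would re-read the cgroup's PID list):

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical membership check; a real version would re-read the
 * cgroup's cgroup.procs file after the process is stopped. */
static bool still_in_cgroup(pid_t pid)
{
	(void)pid;
	return true;
}

/* Stop first, then verify, then kill or resume. */
static void stop_check_kill(pid_t pid)
{
	if (kill(pid, SIGSTOP) != 0)
		return;                 /* already gone */
	if (still_in_cgroup(pid))
		kill(pid, SIGKILL);     /* confirmed member: kill for real */
	else
		kill(pid, SIGCONT);     /* recycled PID: let it continue */
}
```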
Posted Feb 27, 2016 0:04 UTC (Sat)
by wahern (subscriber, #37304)
[Link] (6 responses)
The problem with process groups is that forking daemons usually create a new process group, and systemd is making a superficial attempt to handle those uncooperative daemons. I'd guess the idea was that most existing forking daemons don't know anything about cgroups so aren't going to be changing their membership.
As for the kernel maintainers, I never followed the LWN coverage of the systemd debates very closely, but it was my understanding that they were resistant to patches to cgroups which reinvented the wheel of interfaces like process groups. Also, the cgroups data structures are supposedly awkward and inefficient (in part, I assume, because of the nesting semantics) and they were reluctant to allow them to become more deeply embedded in the fundamentals of process management. But maybe I totally misunderstood things.
One possible hack would be to use a seccomp filter to silently ignore attempts to create a new process group. Another might be PID namespaces, though I'm not very familiar with them.
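The seccomp hack can be sketched like so: SECCOMP_RET_ERRNO with an "errno" of zero makes setpgid(2) return success without ever running, so a daemon believes it created a new process group but actually stayed in the old one. (No architecture check, for brevity; a hardened filter would validate seccomp_data.arch first.)

```c
#include <assert.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a filter that turns setpgid(2) into a silent no-op. */
static int neuter_setpgid(void)
{
	struct sock_filter filter[] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		         offsetof(struct seccomp_data, nr)),
		/* If it's setpgid, fall through; otherwise skip one. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_setpgid, 0, 1),
		/* errno 0 => the call "succeeds" without executing. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof filter / sizeof filter[0],
		.filter = filter,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

After installing this, setpgid(0, 0) returns 0 but the caller remains in its original process group — which, as noted below in the thread, breaks any program that genuinely needs process groups.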
Although, to be clear, IMO all of these options are trying to put lipstick on a pig. I'm not sure systemd _should_ be fixed. I'd rather see all the poorly written software fixed. Teach poorly written daemons to optionally run in the foreground so systemd or any other service manager doesn't have to use hacks to track them. And for daemons that fork new processes, work to improve their correctness, and to not needlessly create new process groups. Basically, _subtract_ code, rather than add hundreds of thousands of lines of new code to the pile.
But things like systemd, Docker, etc, are all the rage these days. Apparently people prefer fixing problems with "full stack" solutions rather than submitting a 5-line diff. Whatever... just keep it all off my lawn :)
Posted Feb 27, 2016 1:57 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Process handles are another way to solve this.
Posted Feb 27, 2016 7:00 UTC (Sat)
by wahern (subscriber, #37304)
[Link] (2 responses)
Process handles are useful for passing off the management of a process. For example, traditional shell-based init.d scripts could acquire a process handle and pass it on to a service manager, in effect obviating the need for a PID file. The service manager wouldn't even need to be PID 1, yet the PID file race issue would be solved just as well (or just as incompletely, depending on your perspective).
It does look like PID namespaces solve this problem. The manual page says that when the PID 1 "init" process in the namespace terminates, all the processes in that namespace are SIGKILL'd. So you could just kill that process without having to bother enumerating the PIDs, perhaps simply by closing its process handle. The problem with that solution--and all the others--is that it's not backwards compatible. Neither the software nor administrators may be expecting the process(es) to be running in a different PID namespace.
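For what it's worth, later kernels grew exactly this kind of process handle: pidfds (pidfd_send_signal in Linux 5.1, pidfd_open in 5.3). A signal sent through a pidfd can never land on a recycled PID, because the descriptor pins the identity of the original process. A sketch, using raw syscall numbers since older libcs lack wrappers:

```c
#include <assert.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif

/* Obtain a handle referring to exactly this incarnation of pid. */
static int pidfd_open_(pid_t pid)
{
	return (int)syscall(__NR_pidfd_open, pid, 0u);
}

/* Signal through the handle; if the original process is gone, this
 * fails rather than hitting whoever reused the PID. */
static int pidfd_kill(int pidfd, int sig)
{
	return (int)syscall(__NR_pidfd_send_signal, pidfd, sig, NULL, 0u);
}
```

A service manager holding such a handle needs no PID file and no PID-1 privileges to kill the right process.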
Posted Feb 27, 2016 7:16 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Feb 27, 2016 8:31 UTC (Sat)
by smurf (subscriber, #17840)
[Link]
Posted Mar 7, 2016 1:41 UTC (Mon)
by cg909 (guest, #95647)
[Link]
The main problem is that process groups always belong to a session. So every service that spawns user sessions would also need to break out of the process group.
If you use seccomp filters to ignore setpgrp() and setsid(), sshd would fail in spectacular ways: all processes in the process group would share the same controlling terminal, and so every process spawned by sshd would receive SIGHUP when a session is closed. Also, anything spawning a shell might run into problems, as shells use process groups to separate jobs.
You'd need "super process groups" which may span multiple sessions and contain multiple process groups.
And this is what cgroups provide.
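The session boundary is easy to demonstrate: once a child calls setsid(2), even its parent can no longer pull it back into one of its own process groups (a small self-contained check; the `demo` helper is illustrative):

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the kernel enforces the session boundary as described:
 * setsid() gives the child its own session and group, and the parent's
 * attempt to move it back across sessions fails with EPERM. */
static int demo(void)
{
	int pfd[2];
	if (pipe(pfd) != 0)
		return 0;

	pid_t child = fork();
	if (child == 0) {
		setsid(); /* new session, new group, no controlling tty */
		char ok = (getsid(0) == getpid() && getpgid(0) == getpid());
		(void)!write(pfd[1], &ok, 1);
		for (;;)
			pause();
	}

	char ok = 0;
	(void)!read(pfd[0], &ok, 1); /* wait for the child's setsid() */

	/* A process group cannot span sessions: moving the child into
	 * our own group is refused. */
	int moved = setpgid(child, getpgid(0));
	int err = errno;

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return ok == 1 && moved == -1 && err == EPERM;
}
```

cgroups have no such restriction, which is what lets them span sessions the way a "super process group" would.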
Posted Jun 14, 2016 17:32 UTC (Tue)
by davidlee (guest, #109327)
[Link]
The "solution" I am using is called supervisord. If I need something controlled from inside the Docker container, I do it myself. I note some previous comments made derisive remarks about the kinds of init scripts folks like me might come up with. So what? I don't need their permission, nor do I need their acceptance. With four decades of script writing, I think I can write one that will do the job.
Yes, I could use systemd for those applications that install scripts it will use. I like the three-line example for Apache. But it was quite unique in that Apache installs SO MUCH STUFF that the example works. Perhaps Nagios might also work. Or Splunk. Or a myriad of other major applications which are mature enough to do so.
One of my recent docker containers was an interface with Dropbox. Nope, no three-liner there. It required too much to configure and set up. Actually, virtually every Docker container I have designed has required setup and configuration -- and, thank you very much, a carefully crafted startup script.
If I have a docker which needs to manage internal processes, I'll stick with solutions like supervisord (there are more options, but this is the one I have settled on). I'd rather use that on the few docker containers I need it than have the weight of systemd in every single docker container I generate. Imagine a busybox container with systemd running...
Posted Feb 27, 2016 12:20 UTC (Sat)
by paulj (subscriber, #341)
[Link] (5 responses)
/me very interested to hear of rebuttals.
Posted Feb 27, 2016 14:14 UTC (Sat)
by smurf (subscriber, #17840)
[Link] (1 responses)
However, it is still *way* better than anything any other init system is doing. Also, this method is able to identify and kill all processes which belong to the current task (again, assuming they're not maliciously forking around), which again is way better than anything else out there.
Posted Feb 27, 2016 18:52 UTC (Sat)
by cortana (subscriber, #24596)
[Link]
Posted Feb 27, 2016 23:42 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Forkbombs are more interesting - you CAN cause PID starvation by launching a forkbomb in an unconfined cgroup. https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt can help against it.
Another problematic case is PID races. SIGSTOP+SIGKILL does the job reliably: SIGSTOP can't be ignored, and it also forces the process to stick around.
Posted Feb 28, 2016 3:43 UTC (Sun)
by mchapman (subscriber, #66589)
[Link] (1 responses)
It might be possible for two or more cooperating processes to circumvent this by continually SIGCONTing each other, forking new processes along the way. cortana's suggestion of using the freezer controller seems like a better approach.
Posted Feb 28, 2016 6:44 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link]
It appears that process handles or PID namespaces are the only reliable way.
Posted Feb 28, 2016 20:21 UTC (Sun)
by justincormack (subscriber, #70439)
[Link]
Posted Mar 1, 2016 12:14 UTC (Tue)
by nix (subscriber, #2304)
[Link]