All this work seems totally useless.
Changes coming for systemd and control groups
Posted Jun 24, 2013 22:36 UTC (Mon) by foom (subscriber, #14868)
I'd expect not (and thus what systemd does is likely safer), but I dunno.
Posted Jul 2, 2013 20:56 UTC (Tue) by zblaxell (subscriber, #26385)
Posted Jul 2, 2013 23:14 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
And you might note that traditional rc-scripts have nothing of this kind. I once had a very nice 3am trip to our datacenter to power-cycle a server that was stuck trying to stop BIND during a reboot.
Posted Jul 3, 2013 15:39 UTC (Wed) by zblaxell (subscriber, #26385)
Obviously every system based on rc-scripts has been able to run such a daemon for decades prior to systemd's existence. The page at the link you provided even has a link to such a daemon that is at least 14 years old. I recently had an unsolicited email conversation with one of that daemon's maintainers about their plans to extend the daemon to do more invasive application-specific aliveness checking (I thought the idea wasn't insane, but I probably wouldn't use it because watchdog daemons are trivial to implement while solutions to political problems arising from software integration in critical code paths are not).
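A watchdog daemon's core loop really is trivial: run the health checks, and pet the hardware timer only while they all pass. A minimal sketch in Python (the `pet` callback, check functions, and the `rounds` parameter are illustrative, not from any particular watchdog package; on a real system `pet` would write a byte to /dev/watchdog):

```python
import time

def watchdog_loop(pet, checks, interval=10.0, rounds=None):
    """Trivial watchdog core: each round, run every health check and
    pet the hardware watchdog only if all of them pass.  If a check
    fails we simply stop petting, and the hardware timer reboots the
    machine for us -- no shutdown code in the critical path at all.
    `rounds` limits iterations so the loop can be exercised in tests;
    a real daemon would run forever (rounds=None)."""
    n = 0
    while rounds is None or n < rounds:
        if all(check() for check in checks):
            pet()
        time.sleep(interval)
        n += 1
```

The design point matches the comment's argument: the daemon never tries to *do* anything on failure; inaction is the failure action, and the hardware does the rest.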
The more complicated the shutdown code is, the more likely it is to fail. Mostly-stateless server daemons like BIND are explicitly designed for, and respond well to, the famous, widely deployed, 20-year-old system-wide SIGTERM/pause/SIGKILL sequence. If we try to stop them using anything with more failure modes than exactly that sequence, then embarrassing failure is simply inevitable.
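For the record, that entire sequence fits in a few lines. A sketch in Python (function name and timeouts are illustrative; a real init written in C does the same three steps):

```python
import os
import signal
import time

def term_pause_kill(pid, grace=10.0, poll=0.1):
    """The classic shutdown sequence: SIGTERM, pause up to `grace`
    seconds for the process to exit on its own, then SIGKILL.
    Returns True if the process exited during the grace period."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # already gone
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 = existence probe
        except ProcessLookupError:
            return True
        # NB: if `pid` is our own unreaped child, this probe sees the
        # zombie as alive; a real init would use waitpid() instead.
        time.sleep(poll)
    try:
        os.kill(pid, signal.SIGKILL)  # grace period expired
    except ProcessLookupError:
        pass
    return False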
Less is better. systemd has lots of tactical cleverness in its implementation, but at the same time it gets the basic strategy wrong.
I have a test machine running systemd. On February 13, 2013 I executed the 'reboot' command and today I'm still waiting for it to finish. Interestingly, the rest of this particular system's function seems to be unimpaired--including systemctl and systemd services--and I still push software to it for testing regularly. I'm now tempted to take bets on how many years it will take for that machine to reboot. It happens to have two independent battery-backed power supplies... };->
Posted Jul 4, 2013 4:13 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
Posted Jul 4, 2013 13:02 UTC (Thu) by jubal (subscriber, #67202)
Posted Jul 4, 2013 17:17 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
Posted Jul 4, 2013 17:59 UTC (Thu) by zblaxell (subscriber, #26385)
The problem in the BIND case isn't SysVInit or rc-style scripts, and it's not systemd's prerogative to solve. The problem is someone put BIND code on the critical path for rebooting. That is the mistake that needs to be corrected. Repeat for daemons we might find in a thousand other packages with code that is spuriously placed where it does not belong.
Server daemons that have special state-preserving needs can have scripts that try to bring them down with a non-blocking timeout (or systemd can do it itself). In practice, such servers don't get rebooted intentionally so the extra code executes only under unusual conditions where criteria for success are strict, or routine conditions (i.e. supervised upgrade of the software) where the criteria for success are greatly relaxed. That means the code doesn't get a lot of field testing, and its worst-case behavior only shows up in situations that are already full of unrelated surprises.
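For what it's worth, systemd expresses exactly that non-blocking timeout declaratively. A sketch of the relevant unit directives (paths and values illustrative; the directive names are real systemd.service/systemd.kill options):

```ini
[Service]
ExecStart=/usr/sbin/named -f
# systemd sends SIGTERM by default at stop time (an ExecStop= command
# could be added for state-preserving shutdown), but never waits
# longer than this before giving up:
TimeoutStopSec=30
# When the timeout expires, fall back to SIGKILL for anything left
# in the service's control group:
SendSIGKILL=yes
KillMode=control-group
```

Whether that code path gets enough field testing to trust is, of course, the question raised above.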
If I'm responsible for an application, then servers are just buildings for my application to live in. I rearrange the interior walls and fixtures of the building for the convenience of my application. If I need to reboot the server, it's because that building is on fire and I need a new one. I'll try to rescue my application state first--asynchronously, and with application-specific tools. When I have finished that, I'll tell the server to reboot. With that reboot request I implicitly guarantee there is no longer state on the server that I care about losing--my application is not running any more, or its state is so badly broken that I've given up. It would be convenient to umount filesystems and clean up state outside of my application if possible (in threads or processes separate from the rebooting thread due to the high risk of failure) but it is never necessary. The only necessary code in this situation is a hard reboot. Anything in the reboot critical path that isn't rebooting is a bug.
If I'm responsible for the server, then applications are cattle and I might want robot cowboys to organize them. This case is the same as the previous case, since my server would effectively be running a single customer-hosting application. systemd makes some sense as that application--although still not necessarily as PID 1, and certainly not as the sole owner of a variety of important kernel features. If a customer stopped paying for service or did something disruptive, I might intentionally destroy their process state with SIGKILL and cgroups. My customer agreement would have the phrase "terminated at any time" sprinkled liberally throughout the text so that nobody can claim to be surprised.
Posted Jul 5, 2013 1:22 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
Bugs ALWAYS happen and they MUST be accounted for. That's why we use OS with memory protection and separate address spaces.
If I have to manually check all the scripts to make sure that an error in a BlinkKeyboardLightDaemon doesn't stop the entire boot process, then such a system has no place except in a trashcan.
Posted Jul 5, 2013 3:14 UTC (Fri) by zblaxell (subscriber, #26385)
I have no complaints about the way systemd starts processes. The insanity starts when it's time to keep processes alive or reboot the system.
If we are still talking about reboot, it sounds like your solution to the problem of having buggy software in a critical code path is to add even more software in the critical path to supervise the buggy software and contain the impact of known bugs without fixing them. Presumably you also run this in some sort of nested container so that if systemd is buggy, some higher level of supervision (maybe another systemd?) can detect the problem and execute even more code in response. That layer could be buggy too, so it's nested supervisor software all the way down? Just the first level (the one with the initial bug) sounds insane to me, and every level of recursion squares the insanity.
"Yo dawg, I heard you like software, so I put some software on your software so you can run code to run your code..."
My approach is to look at the unnecessary code, and realize that even if that code was utterly perfect, it would not do anything more useful than no code at all, but would use more time, space, and power to achieve the null result. The sane thing to do is identify such code and simply remove it.
Posted Jul 5, 2013 3:30 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
>If we are still talking about reboot, it sounds like your solution to the problem of having buggy software in a critical code path is to add even more software in the critical path to supervise the buggy software and contain the impact of known bugs without fixing them.
Yup. My solution is to put in place a fairly SMALL amount of carefully tested code that can cope with whatever crap is thrown at it.
Your solution is to build a house of cards, carefully checking each card, because we know that all bugs are easy to spot and fix.
Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds