Systemd as tragedy
Posted Jan 31, 2019 11:44 UTC (Thu) by freehck (guest, #91143)
Parent article: Systemd as tragedy
http://judecnelson.blogspot.com/2014/09/systemd-biggest-f...
Posted Jan 31, 2019 14:05 UTC (Thu)
by pizza (subscriber, #46)
Just because you're used to the pain of beating your head against a given wall doesn't mean that continuing to do so is a sound strategy.
Posted Jan 31, 2019 18:32 UTC (Thu)
by zblaxell (subscriber, #26385)
The former status quo (sysvinit, upstart, et al, in the last few years before systemd came along) was a series of disastrous regressions. It seemed every other upgrade introduced subtle (or not so subtle) changes in behavior that broke something important. Every time something broke, we would take over maintenance of that component, i.e. replace it with something that met our requirements, and disable the old thing. Eventually we locked down all our critical core services because upstream unapologetically broke every one of them at some point. Throughout this process we maintained compatibility with the former former status quo, i.e. decades-old management applications continued to work, and we'd still use most of the upstream initscripts (with suitable helper tools that solved their most obvious problems).
Transitioning to systemd is a tsunami of regressions. systemd comes with a massive reorganization of core services, fueled in part by the systemd community's encouragement of upstream developers to abandon compatibility with the former status quo (kudos to those developers who refused!). Problem areas that we previously identified and locked down need to be identified and fixed again because they now occur in new places. systemd also adds a bunch of new problems you only get to experience by running systemd on specific hardware/firmware combinations.
Some pieces of systemd are clearly better, but we can't run just those pieces without getting serious regressions at the same time. A few people are trying to break systemd into more easily consumable pieces, but so far that's not better, just a distinct set of regressions for us to manage. The former status quo was never fully integrated in the first place, so it's trivial to pick out fragments of working code (often you don't even need to compile anything).
At present, when we try to integrate systemd into an existing system image, that system image immediately stops working properly, so we don't do that. We have a lot of these images and they have indefinite life cycles, so we could be not using systemd for some decades (or until a catastrophic external event occurs, e.g. amd64 hardware stops being available and in-place upgrades cease to be possible).
Posted Jan 31, 2019 18:40 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
What are the regressions? Systemd can fall back to old SysV behavior if you stop using unit files.
Posted Feb 1, 2019 5:20 UTC (Fri)
by zblaxell (subscriber, #26385)
It depends on the method of upgrade. We can do a package manager install (e.g. something like 'apt-get install systemd'), or build and install systemd from source on a self-hosting system, or we can just build a completely new boot image with systemd and throw that on a VM or target device that previously ran a similar image without systemd. Each of these approaches fails in its own unique way.
So far:
* assorted filesystem failures (filesystems don't mount, get mounted rw, get mounted in the wrong place, get stuck because DNS isn't up yet, get stuck because network down, mount with wrong flags, mount in multiple places, expose wrong filesystems to chroots, doesn't fsck, fscks with wrong flags)
A lot of the above are "systemd has a knob we can turn to fix this, but the knob is in a different place from the one we previously turned." Some of those are fixable and we have fixed them, but the new failure modes just keep coming. Some of them aren't even systemd problems, they're problems with random upstream packages that have introduced behavior changes to accommodate systemd.
We haven't yet seen a system that can just upgrade in place from sysvinit to systemd. Even "install Debian jessie on wiped system previously running Debian squeeze" managed to find a way to fail: our earlier netinst/debootstrap-based system images don't include any packages that touch firmware, but systemd ships with power management turned on by default, and the machine got into a state that the power button couldn't get it out of. I had to take the machine apart to make it boot again.
> Systemd can fall back to old SysV behavior if you stop using unit files.
That's...different from the usual advice I get. Is the idea to start with no unit files and systemd as /sbin/init, then replace one legacy config file with a unit at a time? And how do I explain what I'm doing to apt?
Posted Feb 1, 2019 6:00 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
> * assorted network failures (firewall rules not established before devices up, devices up when they shouldn't be, devices not up when they should be, unwanted DHCP, unwanted autoconfigured IP addresses, firewall rules broken by network device name changes)
> * changes in error handling behavior (mount failure stops boot dead if the device is not noauto in fstab)
> * changes in power management behavior (responds to previously deconfigured inputs, suspend loops, triggers buggy firmware, ignores existing PM configuration)
> * package installation failures ('apt-get install random-package' blows up dependency resolver, packages conflict at runtime, init crashes during the upgrade and panics the kernel, system left unbootable)
> * missing dependency information is now a RC bug (systemd picks an execution order different from the order used for the previous decade)
> * miscellaneous data loss (cgroup service kill behavior for ssh, wipes /tmp on boot, backups fail due to not enough mounts or fill up disk because too many)
> * different cgroup controller organization (broke cgroup management code, caused a few host crashes and a lot of performance losses)
So to recap, you have a pile of crap that barely moves and breaks horribly when somebody breathes in its direction. I have no trouble believing that you have "regressions" with it.
Posted Feb 1, 2019 20:17 UTC (Fri)
by zblaxell (subscriber, #26385)
> This has nothing to do with systemd.
The "expose wrong filesystems to chroots" and "get mounted in the wrong place" failures are directly caused by systemd. Systemd makes / a shared mount, changing the kernel default behavior (originally the _only_ behavior), which is private. So an application that doesn't expect shared propagation does bind mounts with default options (at one point there were no options); when that application gets moved to a system running systemd, the bind mount suddenly appears in places it should not, or can be umounted from places it should not. If it's a non-bind mount, there can be problems with device usage at umount (extra references, can't free the block device to manipulate it).
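For anyone who wants to check which behavior a given machine actually has, the propagation state of each mount can be read from /proc/self/mountinfo. What follows is a minimal sketch, not anything taken from the setups described here: the function name `mount_propagation` is made up for illustration, and it assumes the field layout documented in proc(5), where optional tags such as `shared:N` or `master:N` sit between the mount options and a lone `-` separator.

```python
def mount_propagation(mountinfo_line):
    """Return (mount_point, propagation) for one /proc/self/mountinfo line.

    Hypothetical helper for illustration only. Assumed field layout:
    id parent major:minor root mount_point options [tags...] - fstype source superopts
    """
    fields = mountinfo_line.split()
    # The optional tags end at the first lone "-" after the six fixed fields.
    separator = fields.index("-", 6)
    mount_point = fields[4]
    # Tags look like "shared:1" or "master:2"; "unbindable" has no ":N" part.
    kinds = {tag.split(":")[0] for tag in fields[6:separator]}

    if "shared" in kinds and "master" in kinds:
        return mount_point, "shared+slave"
    if "shared" in kinds:
        return mount_point, "shared"
    if "master" in kinds:
        return mount_point, "slave"
    if "unbindable" in kinds:
        return mount_point, "unbindable"
    # No propagation tag at all means the mount is private.
    return mount_point, "private"
```

Calling it on every line of /proc/self/mountinfo shows whether / and the mounts below it are `shared` or `private` on the machine at hand, which is the difference the bind-mount problems above hinge on.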
> BTW, it can track network dependencies for mounting
We usually prefer it not to. Unless the machine has an important database it needs to flush to disk or controls an external device that has specific powerdown requirements, we normally just SIGKILL all processes and go straight to poweroff on shutdown, without considering or modifying the state of filesystems or network devices. We have to have the crash code path because we'll always be forced to use it if the host crashes. There's no point in having a separate, longer shutdown code path with more complexity, interacting subsystems, and failure modes to test, when the project has no feature (like the database state) that needs it.
> > * assorted network failures (firewall rules not established before devices up, devices up when they shouldn't be, devices not up when they should be, unwanted DHCP, unwanted autoconfigured IP addresses, firewall rules broken by network device name changes)
Sure, /lib/systemd/systemd probably didn't do that. It was more likely a unit file supplied by the distro with some unfortunate defaults set. But only systemd reads unit files, so here we are.
> > * changes in error handling behavior (mount failure stops boot dead if the device is not noauto in fstab)
If the choices have to be "inaccurate fstab emulation" and "no fstab emulation" I'd prefer the latter; however, last time I checked, we were talking about upgrades, so our initial condition here is that there's fstab and it has important stuff in it that systemd('s units) should do something about.
If there was a tool that did a one-time conversion from /etc/fstab to an equivalent systemd unit collection, and could report failure if the fstab contents couldn't be exactly represented by systemd (because nobody wants to replicate 20-years-old undocumented quirks of behavior)...it'd probably flag every fstab file we have, so maybe not so helpful.
If systemd didn't process /etc/fstab at all (i.e. it calls mount -av to handle fstab because fstab is a legacy config file for a legacy tool, and anyone who really wants systemd to handle their mounting isn't using fstab any more)...then installing systemd on a legacy system wouldn't change behavior so much. Sure, it'd be slower, but correct (assuming replicating the legacy configuration exactly is correct) is more important than fast.
> > * package installation failures ('apt-get install random-package' blows up dependency resolver, packages conflict at runtime, init crashes during the upgrade and panics the kernel, system left unbootable)
Sorry, I left out "random-package-that-depends-on-systemd-somehow." apt-get dist-upgrade is never completely trouble free, but when systemd is involved the failures can be spectacular. Very few Debian packages can trigger reboots when they fail--basically only packages that contain a pid 1 process.
> > * missing dependency information is now a RC bug (systemd picks an execution order different from the order used for the previous decade)
Yes it did. We had to dispose of Debian insserv too.
The insserv event made us pivot our private service manager from "a small watchdog that pings a couple of important services on a few critical machines" to "a standard thing that we install everywhere, that handles everything we need done right by itself, and then (maybe) runs an expurgated version of the distro's normal init as a service after that's done."
> > * miscellaneous data loss (cgroup service kill behavior for ssh, wipes /tmp on boot, backups fail due to not enough mounts or fill up disk because too many)
Only "wipes /tmp on boot" happens with SysV init, and it has an off switch that unit files supplied with systemd don't respect. The cgroup service kill behavior definitely doesn't come with SysV init (though there are more ways to get it than installing systemd).
The backup issues are a consequence of the earlier private/shared mount behavior changes. Sorry, I didn't mean to repeat.
> > * different cgroup controller organization (broke cgroup management code, caused a few host crashes and a lot of performance losses)
That one's only an issue if you were already using cgroups to wrangle services before systemd was installed on the system, and you were relying on resource constraints to prevent services from interfering with each other or the host in general. There are multiple layouts possible when mounting the cgroup fs, systemd just picked a different one than we did, and it's not possible for different layouts to coexist in the same kernel instance. We can adapt to systemd there, but the problem from an upgrade perspective is that we had to adapt to systemd there.
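The layout a process actually ends up in can be read from /proc/self/cgroup, which is one way for management code to notice that it has landed in a different hierarchy from the one it expects. A rough sketch, assuming the `hierarchy-ID:controller-list:path` line format documented in cgroups(7); `cgroup_layout` is a hypothetical helper name, not part of any existing tool.

```python
def cgroup_layout(proc_cgroup_text):
    """Summarize the contents of /proc/<pid>/cgroup.

    Hypothetical helper for illustration only. Returns (v1, unified):
    v1 maps each v1 controller (or named hierarchy such as "name=systemd")
    to its cgroup path; unified is the cgroup v2 path, or None if absent.
    """
    v1 = {}
    unified = None
    for line in proc_cgroup_text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only; the path may contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        if hierarchy_id == "0" and controllers == "":
            # The cgroup v2 (unified) hierarchy is reported as "0::/path".
            unified = path
        else:
            # v1 controllers mounted together share one comma-separated entry.
            for controller in controllers.split(","):
                v1[controller] = path
    return v1, unified
```

Comparing the result before and after an init change shows whether the controllers a management tool relies on are still grouped and mounted the way it assumes.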
> So to recap, you have a pile of crap that barely moves and breaks horribly when somebody breathes in its direction.
We have requirements, and people who measure whether we meet them. That makes us naturally inflexible (some might say "rigid") about the changes we can accept from upstream. We can wait a year or two (or ten!) while upstream works their issues out, or switch to another upstream with fewer issues.
We are now living on an island that isolated us from the init daemon wars. We built the island because Debian threw insserv at us, and it was cheaper to move to a mole-free hill than to play whack-a-mole long enough to fix Debian. We kind of like it here--it's a cheap place to live, we're not constantly playing whack-a-mole any more, our machines do their jobs and enable us to do ours despite all the crazy bad stuff that happens to them, and we can still use the important parts of Debian. Why would we leave? More people should live here.
If systemd is going to be another round of whack-a-mole like insserv then we'll just stay here on this nice island thanks. If Debian gets their packaging sorted so installing systemd doesn't break existing systems so hard, we might leave the island (the tiny amount of maintenance work we have to do on the island is cheap, but no work at all is cheaper). No rush--there's plenty of other stuff to do while we're waiting for that to happen.
> I have no trouble believing that you have "regressions" with it.
We have lots of cases of the form "triggering event: systemd install" -> "root cause: new behavior introduced by systemd, its default configuration, or its dependencies" -> "fix: update mature code for the first time in 10-20 years." Other cases end with a more direct reference to a previously fixed bug (like "sometimes xyz service is not running") that now has to be fixed again. If those are not "regressions" then I need a better thesaurus.
Posted Feb 3, 2019 13:48 UTC (Sun)
by judas_iscariote (guest, #47386)
So, it exposes bugs on your systems that were ignored by the other init systems.
> assorted network failures (firewall rules not established before devices up, devices up when they shouldn't be, devices not up when they should be, unwanted DHCP, unwanted autoconfigured IP addresses, firewall rules broken by network device name changes)
Systemd does not set the firewall up; another component does. And unless you are explicitly using systemd-networkd, it does not set up network interfaces other than "lo".
> changes in error handling behavior (mount failure stops boot dead if the device is not noauto in fstab)
As opposed to never noticing or getting ignored.
> changes in power management behavior (responds to previously deconfigured inputs, suspend loops, triggers buggy firmware, ignores existing PM configuration)
Yes, it does respond to previously deconfigured inputs. I do not know which suspend loops you are talking about, nor why you think bugs in firmware have anything to do with systemd. And what do you mean by existing "PM configuration"? Do you mean that runtime configuration set by a bunch of mostly useless shell scripts from pm-utils is no longer run? That is a distribution policy matter. Most if not all of the power management configuration consists of kernel knobs, and the kernel usually sets no policy whatsoever by default.
> package installation failures ('apt-get install random-package' blows up dependency resolver, packages conflict at runtime, init crashes during the upgrade and panics the kernel, system left unbootable)
Well, packages are not part of systemd, for a start.
> missing dependency information is now a RC bug (systemd picks an execution order different from the order used for the previous decade)
Missing dependency information makes obvious what, in the previous system, was that "funny" occurrence that caused a service to fail on a particular phase of the moon. As you probably already know, rc scripts are provided by the particular distribution you are using and are incompatible with others.
> miscellaneous data loss (cgroup service kill behavior for ssh, wipes /tmp on boot, backups fail due to not enough mounts or fill up disk because too many)
cgroup service kill behaviour is configurable, and better than it was before (i.e. there was none; if you were lucky, things started correctly and that was it). Wiping /tmp on boot is configurable and a distribution policy issue.
> different cgroup controller organization (broke cgroup management code, caused a few host crashes and a lot of performance losses)
Not sure what that means at all.
Posted Feb 5, 2019 16:49 UTC (Tue)
by zblaxell (subscriber, #26385)
We configured the other init to not have those problems. There's no packaged, working upgrade path from anything to systemd, so on upgrade every configuration knob we've ever set over the years instantly reverts to whatever the distro default is today, and that breaks a lot of stuff.
Some of the bugs are unique to systemd, i.e. if systemd didn't exist, we'd deterministically never have the problems. Other inits don't depend on non-default kernel behavior to the extent systemd does, so we hit problems that happen with test systems using systemd and nowhere else.
The original question was "what are the regressions?" so I just summarized the last decade of ways existing code can combine with systemd and fail (so far).
> > changes in error handling behavior (mount failure stops boot dead if the device is not noauto in fstab)
Those errors were intentionally ignored (by setting fsck pass field to 0, or using the 'bg' option for NFS). Absent and busted filesystems are a significant use case for us, so all the service applications we run handle it. 'mount -a' normally doesn't even try to mount if the device isn't present at all.
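To make the disagreement concrete, here is a small sketch that lists the fstab entries that would become boot-critical under the behavior described in this thread. It rests on two assumptions: that an entry without `noauto` is treated as required at boot (as reported above), and that systemd's `nofail` fstab option is the way to mark a mount as optional. The helper name `boot_critical_entries` is invented for illustration; this is not a conversion tool, and it ignores legacy signals such as a zero fsck pass field or the NFS `bg` option, which is precisely the information that gets lost.

```python
def boot_critical_entries(fstab_text):
    """Return mount points of fstab entries with neither noauto nor nofail.

    Hypothetical helper for illustration only. It looks solely at the
    options field; swap entries are skipped as out of scope.
    """
    critical = []
    for raw_line in fstab_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        if len(fields) < 4:
            continue  # too few fields to carry an options column
        mount_point, fstype = fields[1], fields[2]
        options = fields[3].split(",")
        if fstype == "swap":
            continue
        if "noauto" not in options and "nofail" not in options:
            critical.append(mount_point)
    return critical
```

Run against an existing /etc/fstab, every mount point it returns is one where an absent or broken device might stop the boot after the switch, even if the old setup had been tuned to carry on without it.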
Posted Feb 1, 2019 1:12 UTC (Fri)
by anselm (subscriber, #2796)
After 15 years of teaching Linux to new system administrators I can confidently state that IMHO the former status quo really, really sucked. There's nothing like explaining the setup to someone who isn't already invested in it that brings out just how terrible it really was. With hindsight it baffles me (a) how we managed to tolerate it for so long (30+ years!), and (b) how people can seriously believe that, in a world where systemd exists, it is still something worth using.
Posted Feb 1, 2019 1:25 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
For the first few years of my Linux experience I was treating init scripts like something magical. I remember cutting&pasting a script to run Jetty (a Java web server) as a service and spending a whole day trying to understand how to make it launch automatically.
Posted Sep 6, 2019 0:06 UTC (Fri)
by soes (guest, #134247)
Posted Feb 1, 2019 1:08 UTC (Fri)
by flussence (guest, #85566)
Yes, it's *good* that the LSB trash fire was put out, less so that the explosives used for it levelled the rest of the town.
Posted Feb 1, 2019 1:51 UTC (Fri)
by luya (subscriber, #50741)
Systemd is basically what Upstart would have become had Canonical properly revised its licensing clause as requested. Ask both Poettering and Remnant, the original author of Upstart (apologies if I misspelled the name).
OpenRC still relied on the shell-script method, which the majority of distributions did not want. The Linux kernel badly needed its own system management. While both Upstart and OpenRC tried, they weren't enough because of those shell scripts.
Posted Feb 4, 2019 6:14 UTC (Mon)
by flussence (guest, #85566)
Flowery prose can't hide your total lack of understanding. You speak of OpenRC in the past tense, ignoring the fact that multiple distros use it and that systemd is *still* playing catch-up (it doesn't even use cgroups in a secure way!)
Posted Feb 4, 2019 11:31 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
What exactly is more secure in OpenRC?
Posted Feb 7, 2019 0:29 UTC (Thu)
by luya (subscriber, #50741)
Posted Feb 13, 2019 8:19 UTC (Wed)
by flussence (guest, #85566)
Posted Feb 13, 2019 10:51 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
Posted Feb 1, 2019 5:25 UTC (Fri)
by mjg59 (subscriber, #23239)
Posted Feb 2, 2019 0:19 UTC (Sat)
by rgmoore (✭ supporter ✭, #75)
It would be easier to believe that Red Hat was ignoring Upstart if they hadn't adopted it as their init system for RHEL 6. Fedora also used it starting with Fedora 9. In practice, most of the major distributions first switched to Upstart (or something other than SysV init) and then moved to systemd because they found it did a better job of solving their problems.
Posted Feb 2, 2019 1:10 UTC (Sat)
by rahulsundaram (subscriber, #21946)
Indeed. Red Hat developers involved in Fedora looked at either developing a new system or adopting an existing alternative to sysvinit, and chose Upstart:
https://fedoraproject.org/wiki/FCNewInit
Red Hat specifically moved to Upstart in RHEL 6 and had no corporate plan to move away from it. Lennart originally developed systemd on his own time (despite discouragement from his manager, precisely because RHEL 6 had already committed to Upstart) after discussing it with Scott, the primary developer of Upstart. It is likely that if Canonical had given up its insistence on a CLA, Upstart would have morphed to adopt many of the ideas from systemd (which were themselves heavily based on launchd).
Red Hat, at least at that time, was heavily developer driven. There was no corporate plan to diss Canonical or anything like that. On the contrary, there were prominent Red Hat contributors running Debian and even contributing to it, and several of the major JBoss contributors were big Ubuntu users and fans. Development for the enterprise releases at that time was surprisingly loosely organized (fun fact: RHEL 6 was originally planning to ship both up2date and yum until I argued, fairly last minute, that it was pointless to do so, and won that debate merely by posting to the corresponding OS development list), and the resulting quality was often driven by passionate, heroic developers. Judging by the external signs, things have become a lot more organized and process driven, even in Fedora, but anyone thinking that systemd was somehow corporate driven is simply mistaken.
Posted Feb 4, 2019 19:15 UTC (Mon)
by jccleaver (guest, #127418)
And it's worth reiterating for folks that RHEL 6's Upstart implementation was mostly identical to the previous (traditional) init implementation, at least for anything that most admins had to deal with on a regular basis. (You could go your entire deployment without having to care how getty was started in early boot, for example, and no one was using /etc/inittab for starting things directly by that point anyway.)
Hard to say if a RHEL7 w/o systemd would have tried to move towards using more of upstart's dynamic features, but I doubt too much change ever would be done in RHEL6 when the static SysV-based method worked fine enough. Those that needed dynamic or monitored service management knew how to hook those *into* a SysV framework, which is why the PID1 vs PID2 debate now is so salient.
Posted Feb 5, 2019 22:08 UTC (Tue)
by johannbg (guest, #65743)
So the only person who could truly compare and criticize systemd from an Upstart perspective is the same man who wrote Upstart and recommended that another init system be written rather than his own being fixed (hence systemd was born)...
So people can contemplate that *fact* while they continue riding wishful-thinking ponies in circles, throwing not rocks but bricks in the glass house while worshipping the shell gods and chanting rhymes about glorious SysV, OpenRC and whatnot.
Bottom line: systemd can be criticized for many things (and arguably justly so), but for none of the things it has been criticized for in this thread.
Comparing legacy shell script based init systems with systemd is comparing apples to oranges...
Posted Feb 6, 2019 0:16 UTC (Wed)
by anselm (subscriber, #2796)
The main advantage of systemd is that it actually exists today. Over and over again we hear a lot about how System-V init or for that matter OpenRC, with just a few bits and pieces added to them, could be far better than systemd, but nobody seems to be prepared to do the drudge work to actually prove this by demonstration.
IOW, one Lennart Poettering who actually releases working code is better than ten people who complain about how bad systemd is, and fantasise about the great code somebody (not them) could write that would make init system XYZ obviously superior to systemd.
Posted Feb 6, 2019 23:28 UTC (Wed)
by johannbg (guest, #65743)
I might drop by the next BSD con to see where they are at with this: whether they have realized they need a service manager (Solaris came up with SMF what, ten or twenty years ago?) and, if so, whether they are discussing what they can learn from systemd and adapt to their own service manager and what they are going to leave out, or whether they are still in denial.
Posted Feb 11, 2019 22:55 UTC (Mon)
by oak (guest, #2786)
Posted Feb 4, 2019 19:08 UTC (Mon)
by jccleaver (guest, #127418)
The "former status quo" is still the "status quo" for RHEL6 users. And it still works just fine.
Honestly, this sounds like so much nonsense.
* assorted network failures (firewall rules not established before devices up, devices up when they shouldn't be, devices not up when they should be, unwanted DHCP, unwanted autoconfigured IP addresses, firewall rules broken by network device name changes)
* changes in error handling behavior (mount failure stops boot dead if the device is not noauto in fstab)
* changes in power management behavior (responds to previously deconfigured inputs, suspend loops, triggers buggy firmware, ignores existing PM configuration)
* package installation failures ('apt-get install random-package' blows up dependency resolver, packages conflict at runtime, init crashes during the upgrade and panics the kernel, system left unbootable)
* missing dependency information is now a RC bug (systemd picks an execution order different from the order used for the previous decade)
* miscellaneous data loss (cgroup service kill behavior for ssh, wipes /tmp on boot, backups fail due to not enough mounts or fill up disk because too many)
* different cgroup controller organization (broke cgroup management code, caused a few host crashes and a lot of performance losses)
This has nothing to do with systemd. BTW, it can track network dependencies for mounting
Nothing to do with systemd.
Don't use fstab.
Not sure about this.
Duh. Nothing to do with systemd.
Debian's insserv from the last decade has the same issue.
All happened with regular SysV init.
Not sure about that one.
> Nothing to do with systemd.
> Don't use fstab.
> Duh. Nothing to do with systemd.
> Debian's insserv from the last decade has the same issue.
> All happened with regular SysV init.
> Not sure about that one.
> As opposed to never noticing or getting ignored.
- Has the situation regarding how to design systemd dependencies become clearer? One of the troubles with systemd is the lack of a clear and concise explanation of how to properly design dependencies, especially so that they are resilient.
I personally have trouble understanding that little gem (and I have had exposure to more, and more varied, systems than I expect most people in this group have).
If that were the case, OpenRC would be widely adopted by the majority of distributions. In reality, as pointed out, its usage is limited mostly to Gentoo-based distributions.
Shell scripts used for init, and by extension for system management, do not cut it with modern technologies, for a reason.
