Systemd as tragedy

Posted Feb 1, 2019 6:00 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
In reply to: Systemd as tragedy by zblaxell
Parent article: Systemd as tragedy

> * assorted filesystem failures (filesystems don't mount, get mounted rw, get mounted in the wrong place, get stuck because DNS isn't up yet, get stuck because network down, mount with wrong flags, mount in multiple places, expose wrong filesystems to chroots, doesn't fsck, fscks with wrong flags)
This has nothing to do with systemd. BTW, it can track network dependencies for mounting.

> * assorted network failures (firewall rules not established before devices up, devices up when they shouldn't be, devices not up when they should be, unwanted DHCP, unwanted autoconfigured IP addresses, firewall rules broken by network device name changes)
Nothing to do with systemd.

> * changes in error handling behavior (mount failure stops boot dead if the device is not noauto in fstab)
Don't use fstab.

> * changes in power management behavior (responds to previously deconfigured inputs, suspend loops, triggers buggy firmware, ignores existing PM configuration)
Not sure about this.

> * package installation failures ('apt-get install random-package' blows up dependency resolver, packages conflict at runtime, init crashes during the upgrade and panics the kernel, system left unbootable)
Duh. Nothing to do with systemd.

> * missing dependency information is now a RC bug (systemd picks an execution order different from the order used for the previous decade)
Debian's insserv from the last decade has the same issue.

> * miscellaneous data loss (cgroup service kill behavior for ssh, wipes /tmp on boot, backups fail due to not enough mounts or fill up disk because too many)
All happened with regular SysV init.

> * different cgroup controller organization (broke cgroup management code, caused a few host crashes and a lot of performance losses)
Not sure about that one.

So to recap, you have a pile of crap that barely moves and breaks horribly when somebody breathes in its direction. I have no trouble believing that you have "regressions" with it.

Systemd as tragedy

Posted Feb 1, 2019 20:17 UTC (Fri) by zblaxell (subscriber, #26385)

> > * assorted filesystem failures (filesystems don't mount, get mounted rw, get mounted in the wrong place, get stuck because DNS isn't up yet, get stuck because network down, mount with wrong flags, mount in multiple places, expose wrong filesystems to chroots, doesn't fsck, fscks with wrong flags)

> This has nothing to do with systemd.

The "expose wrong filesystems to chroots" and "get mounted in the wrong place" failures are directly caused by systemd. Systemd makes / a shared mount, changing from the kernel default behavior (originally the _only_ behavior), which is private. So an application that doesn't expect shared propagation does bind mounts with default options (at one point there were no options), and when that application gets moved to a system running systemd, the bind mount suddenly appears in places it should not, or can be umounted from places it should not. If it's a non-bind mount, there can be problems with device usage at umount (extra references, can't free the block device to manipulate it).
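
For anyone who wants to check which behavior a given host has, here is a minimal sketch that classifies mounts by propagation type from the optional fields in /proc/self/mountinfo. The function names and the sample lines are made up for illustration; only the mountinfo field layout comes from the kernel's proc documentation.

```python
# Sketch: classify mounts as shared/slave/unbindable/private.
# SAMPLE is invented data in /proc/self/mountinfo format, not real output.

def propagation(mountinfo_line):
    """Return (mount_point, kind) for one mountinfo line."""
    fields = mountinfo_line.split()
    # Fields 0-5 are fixed; optional fields follow until a lone "-".
    sep = fields.index("-", 6)
    mount_point = fields[4]
    tags = [f.split(":")[0] for f in fields[6:sep]]
    if "shared" in tags:
        kind = "shared"      # mount events propagate to peers
    elif "master" in tags:
        kind = "slave"       # receives events, does not send them
    elif "unbindable" in tags:
        kind = "unbindable"
    else:
        kind = "private"     # no propagation tag at all
    return mount_point, kind


def report(mountinfo_text):
    """Classify every non-empty line of mountinfo-format text."""
    return [propagation(l) for l in mountinfo_text.splitlines() if l.strip()]


SAMPLE = """\
25 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
40 25 8:2 / /srv rw,relatime - ext4 /dev/sda2 rw
"""
```

On a Linux host, `report(open("/proc/self/mountinfo").read())` gives the real picture. A mount that is both shared and slave is reported as "shared" here, which is a simplification.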

> BTW, it can track network dependencies for mounting

We usually prefer it not to. Unless the machine has an important database it needs to flush to disk or controls an external device that has specific powerdown requirements, we normally just SIGKILL all processes and go straight to poweroff on shutdown, without considering or modifying the state of filesystems or network devices. We have to have the crash code path because we'll always be forced to use it if the host crashes. There's no point in having a separate, longer shutdown code path with more complexity, interacting subsystems, and failure modes to test, when the project has no feature (like the database state) that needs it.

> > * assorted network failures (firewall rules not established before devices up, devices up when they shouldn't be, devices not up when they should be, unwanted DHCP, unwanted autoconfigured IP addresses, firewall rules broken by network device name changes)
> Nothing to do with systemd.

Sure, /lib/systemd/systemd probably didn't do that. It was more likely a unit file supplied by the distro with some unfortunate defaults set. But only systemd reads unit files, so here we are.

> > * changes in error handling behavior (mount failure stops boot dead if the device is not noauto in fstab)
> Don't use fstab.

If the choices have to be "inaccurate fstab emulation" and "no fstab emulation" I'd prefer the latter; however, last time I checked, we were talking about upgrades, so our initial condition here is that there's fstab and it has important stuff in it that systemd('s units) should do something about.

If there were a tool that did a one-time conversion from /etc/fstab to an equivalent systemd unit collection, and could report failure if the fstab contents couldn't be exactly represented by systemd (because nobody wants to replicate 20-year-old undocumented quirks of behavior)...it'd probably flag every fstab file we have, so maybe not so helpful.

If systemd didn't process /etc/fstab at all (i.e. it calls mount -av to handle fstab because fstab is a legacy config file for a legacy tool, and anyone who really wants systemd to handle their mounting isn't using fstab any more)...then installing systemd on a legacy system wouldn't change behavior so much. Sure, it'd be slower, but correct (assuming replicating the legacy configuration exactly is correct) is more important than fast.
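
As a rough illustration of the kind of pre-upgrade check discussed above, the sketch below lists fstab entries that are neither noauto nor nofail, i.e. the ones where a mount failure could stop boot under the behavior described in this thread. It is not a conversion tool and does not try to model every fstab quirk. Treating nofail as an escape hatch is my assumption (it is not mentioned above), and the function name and sample fstab are hypothetical.

```python
# Sketch: flag fstab entries whose mount failure might block boot.
# Assumption: an entry is "boot critical" if it has neither "noauto"
# nor "nofail" in its options. Swap entries are out of scope here.

def boot_critical_entries(fstab_text):
    """Return mount points lacking both noauto and nofail."""
    flagged = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        fields = line.split()
        if len(fields) < 4:
            continue                      # malformed; skip rather than guess
        _spec, mount_point, fstype, options = fields[:4]
        if fstype == "swap":
            continue
        opts = set(options.split(","))
        if not opts & {"noauto", "nofail"}:
            flagged.append(mount_point)
    return flagged


# Invented example file, not taken from any real system.
SAMPLE_FSTAB = """\
# <spec>   <mount point> <type> <options>          <dump> <pass>
/dev/sda1  /             ext4   errors=remount-ro  0 1
/dev/sdb1  /data         ext4   defaults,nofail    0 2
/dev/sdc1  /backup       ext4   noauto             0 0
server:/x  /mnt/nfs      nfs    defaults           0 0
"""
```

Running `boot_critical_entries(open("/etc/fstab").read())` before an upgrade at least shows which entries deserve a second look.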

> > * package installation failures ('apt-get install random-package' blows up dependency resolver, packages conflict at runtime, init crashes during the upgrade and panics the kernel, system left unbootable)
> Duh. Nothing to do with systemd.

Sorry, I left out "random-package-that-depends-on-systemd-somehow." apt-get dist-upgrade is never completely trouble free, but when systemd is involved the failures can be spectacular. Very few Debian packages can trigger reboots when they fail--basically only packages that contain a pid 1 process.

> > * missing dependency information is now a RC bug (systemd picks an execution order different from the order used for the previous decade)
> Debian's insserv from the last decade has the same issue.

Yes it did. We had to dispose of Debian insserv too.

The insserv event made us pivot our private service manager from "a small watchdog that pings a couple of important services on a few critical machines" to "a standard thing that we install everywhere, that handles everything we need done right by itself, and then (maybe) runs an expurgated version of the distro's normal init as a service after that's done."

> > * miscellaneous data loss (cgroup service kill behavior for ssh, wipes /tmp on boot, backups fail due to not enough mounts or fill up disk because too many)
> All happened with regular SysV init.

Only "wipes /tmp on boot" happens with SysV init, and it has an off switch that unit files supplied with systemd don't respect. The cgroup service kill behavior definitely doesn't come with SysV init (though there are more ways to get it than installing systemd).

The backup issues are a consequence of the earlier private/shared mount behavior changes. Sorry, I didn't mean to repeat.

> > * different cgroup controller organization (broke cgroup management code, caused a few host crashes and a lot of performance losses)
> Not sure about that one.

That one's only an issue if you were already using cgroups to wrangle services before systemd was installed on the system, and you were relying on resource constraints to prevent services from interfering with each other or the host in general. There are multiple layouts possible when mounting the cgroup fs; systemd just picked a different one than we did, and it's not possible for different layouts to coexist in the same kernel instance. We can adapt to systemd there, but the problem from an upgrade perspective is that we had to adapt to systemd there.
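
To make "different layouts" concrete, here is a small sketch that reads text in /proc/<pid>/cgroup format (hierarchy-ID:controller-list:path) and reports how controllers are grouped into hierarchies. Both sample layouts are invented for illustration; neither is claimed to be what systemd or our own code actually mounts.

```python
# Sketch: compare how cgroup v1 controllers are grouped into hierarchies.
# Input is text in /proc/<pid>/cgroup format; samples below are hypothetical.

def hierarchy_layout(proc_cgroup_text):
    """Map hierarchy ID -> sorted tuple of controller names."""
    layout = {}
    for line in proc_cgroup_text.splitlines():
        if not line.strip():
            continue
        hier_id, controllers, _path = line.split(":", 2)
        names = tuple(sorted(c for c in controllers.split(",") if c))
        layout[int(hier_id)] = names
    return layout


def groupings(proc_cgroup_text):
    """Return the set of controller groupings, ignoring hierarchy IDs."""
    return {g for g in hierarchy_layout(proc_cgroup_text).values() if g}


# Hypothetical layout with controllers split across hierarchies.
LAYOUT_A = "4:cpu,cpuacct:/\n3:memory:/\n1:name=systemd:/\n"
# Hypothetical layout with everything co-mounted in one hierarchy.
LAYOUT_B = "1:cpu,cpuacct,memory:/\n"

same_layout = groupings(LAYOUT_A) == groupings(LAYOUT_B)
```

Code that expects the memory controller to be co-mounted with cpu (as in the second sample) will not find it there under the first layout, which is the kind of breakage described above.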

> So to recap, you have a pile of crap that barely moves and breaks horribly when somebody breathes in its direction.

We have requirements, and people who measure whether we meet them. That makes us naturally inflexible (some might say "rigid") about the changes we can accept from upstream. We can wait a year or two (or ten!) while upstream works their issues out, or switch to another upstream with fewer issues.

We are now living on an island that isolated us from the init daemon wars. We built the island because Debian threw insserv at us, and it was cheaper to move to a mole-free hill than to play whack-a-mole long enough to fix Debian. We kind of like it here--it's a cheap place to live, we're not constantly playing whack-a-mole any more, our machines do their jobs and enable us to do ours despite all the crazy bad stuff that happens to them, and we can still use the important parts of Debian. Why would we leave? More people should live here.

If systemd is going to be another round of whack-a-mole like insserv, then we'll just stay here on this nice island, thanks. If Debian gets their packaging sorted so installing systemd doesn't break existing systems so hard, we might leave the island (the tiny amount of maintenance work we have to do on the island is cheap, but no work at all is cheaper). No rush--there's plenty of other stuff to do while we're waiting for that to happen.

> I have no trouble believing that you have "regressions" with it.

We have lots of cases of the form "triggering event: systemd install" -> "root cause: new behavior introduced by systemd, its default configuration, or its dependencies" -> "fix: update mature code for the first time in 10-20 years." Other cases end with a more direct reference to a previously fixed bug (like "sometimes xyz service is not running") that now has to be fixed again. If those are not "regressions" then I need a better thesaurus.