Debian reconsiders init-system diversity
Posted Nov 18, 2019 4:20 UTC (Mon) by cas (guest, #52554)
In reply to: Debian reconsiders init-system diversity by xnox
Parent article: Debian reconsiders init-system diversity
Yes, they do.
That's why it pisses me off that on some of my servers running systemd, it can take tens of minutes - or longer - to shut down or reboot.
It often goes into a seemingly endless wait-loop - I say "seemingly" because I don't have hours or days to wait for the machine to finish shutting down, and frequently have to physically power-cycle it just so I can reboot it sometime today rather than tomorrow, or next year.
And "hours" is NOT an exaggeration - I have occasionally left machines in that wait-loop state for half a day or a day, just to see if they would eventually give up and reboot like I told them to. That patience is not rewarded; the wait is only ever resolved with a power cycle.
I have NFI why it does that, but if I watch the console while a server is shutting down, I see systemd waiting for something to stop - a service, a filesystem unmount, the deconfiguration of a network interface - and on each repeat it increases the timeout, from tens of seconds to several minutes to tens of minutes and even hours.
Why even bother having a timeout on a stop action if you're just going to repeat the wait with an ever-increasing timeout?
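For what it's worth, the knobs that are supposed to bound this do exist; a minimal sketch, assuming stock systemd paths (the unit name is a placeholder):

# Per-unit stop timeout (systemctl reports it in microseconds):
systemctl show -p TimeoutStopUSec example.service
# Global fallback, set via a drop-in:
mkdir -p /etc/systemd/system.conf.d
printf '[Manager]\nDefaultTimeoutStopSec=90s\n' \
    > /etc/systemd/system.conf.d/50-stop-timeout.conf
# Upstream's reboot.target also carries JobTimeoutSec=30min with
# JobTimeoutAction=reboot-force, which is supposed to force the issue:
systemctl show -p JobTimeoutUSec,JobTimeoutAction reboot.target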
I **never** had that problem with sysvinit. My machines still running sysvinit give everything a reasonable timeout to stop cleanly - and then reboot anyway. Which is **exactly** what a server **should** do when a zombie process or whatever won't die - otherwise it'll get stuck waiting forever.
In my experience, systemd does not cope well with error conditions, and takes ages to boot, shut down, or reboot if there are any problems - especially with drives or networks. And when shit like that happens, that's exactly the time you want the machine to reboot quickly so you can find the problem and fix it - not wait tens of minutes or hours for every reboot.
If everything goes well, systemd can boot in 5 seconds rather than 15. Amazing! That's really worth writing home about. If not, it might finish booting in hours rather than seconds.
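For completeness, the usual first stops for a slow boot - though, as I say below, they've told me nothing useful about my routing problem:

systemd-analyze blame            # time spent per unit
systemd-analyze critical-chain   # the units on the critical boot path
systemctl list-units --state=activating,failed   # what's still wedged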
---
And booting up with systemd is no better. On one of my machines, booting will sometimes trigger a seemingly endless loop of adding the same half-dozen routes to the routing table, ending up with MILLIONS of copies of those half-dozen routes (as shown by "ip route list > ip-route-list.txt" - the last time I tried that, ip-route-list.txt was over 15 million lines long, and it was only that "short" because I got bored of waiting for "ip route list" to finish and hit ^C to kill it). E.g.
# wc -l /root/ip-route-list.txt
15426932 /root/ip-route-list.txt
To stop it, I have to log in on the console, down the interface(s), and then manually bring them back up again. And then I have to manually restart the services that were bound to specific IP addresses (such as the ones I only want listening on an internal LAN address rather than a live internet address; the ones that listen on all addresses generally don't need to be manually restarted). Obviously, this is a completely shit thing to have to do on a server - it's supposed to boot up and start providing its services without manual intervention.
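For the record, the manual recovery amounts to roughly this (the interface and unit names here are placeholders for my real ones):

ip link set eth0 down    # downing the link also drops its routes
ip link set eth0 up
systemctl restart bind9 isc-dhcp-server   # the address-bound services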
Again, I have NFI why that happens. Neither systemd-analyze nor systemctl status nor journalctl has any useful information to offer.
At least with sysvinit I'd be able to find out which script had a broken loop. At worst, I could add debugging printf statements to likely-suspect scripts. With systemd, it just does what it wants to do when it wants to do it, and fuck you if you want to find out why or, Cthulhu forbid, override it.
I've got to the point with this infuriating bug that I'm going to P2V clone the entire system into a VM so that I can experiment enough to find out WTF is going on without having my server down for days on end. I'll be able to reboot the VM as often as I need without disturbing the other machines that depend on the services it provides (NFS, DNS cache, dhcpd, squid proxy, and more).
My current guess is that it'll probably turn out to be some stupid race condition - but I don't know where, or what's causing it.
And on another of my machines (my MythTV box at home): it boots perfectly with sysvinit, but whenever I try to boot it with systemd it fails to boot at all. Every single time.
Posted Nov 18, 2019 5:03 UTC (Mon) by sjj (guest, #2020)
Posted Nov 18, 2019 7:56 UTC (Mon) by cas (guest, #52554)
Right, I'll remember that in future. Your amazing and incredibly useful insight completely overshadows my 40+ years of experience in diagnosing and solving computer problems. Ignore problems, and don't expect software to work properly, otherwise you're just being "butthurt". Wonderful.
Or, here's a thought: you so strongly believe that systemd can do no wrong that any evidence that it's less than perfect triggers a defensive reaction of blowhard bullshit lest the cognitive dissonance blow your tiny mind.
BTW, there's little or no point in submitting systemd-related bug reports to Debian, because most of my fellow DDs have caught the "fuck you, you're doing it wrong, it's your fault" disease from systemd developers. Because, of course, systemd can do no wrong.
In any case, there's nothing useful to report because I don't know what the cause is yet. I have some vague suspicions about the source of the problem, but as far as reportable facts go, I don't have much more than "it's broken (sometimes)", which is about the most useless bug report possible... a complete waste of time for everyone concerned.
Have A Nice Day.
Posted Nov 18, 2019 8:40 UTC (Mon) by zdzichu (subscriber, #17118)
But if that's still not enough, I'd expect the maintainer to provide targeted debugging instructions. If you feel there's no point in submitting bug reports, I'd strongly suggest changing your provider. There are many distributions to choose from.
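For reference, the targeted instructions usually boil down to something like this (the kernel switches, the debug shell, and the signal are all documented systemd features):

# One verbose boot: append to the kernel command line:
#   systemd.log_level=debug systemd.log_target=kmsg printk.devkmsg=on
# For shutdown hangs, a root shell on tty9 that survives most of shutdown:
systemctl start debug-shell.service
# Turn up PID 1's own logging on a running system:
kill -RTMIN+22 1   # documented in systemd(1): sets the log level to debug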
Posted Nov 21, 2019 13:04 UTC (Thu) by mathstuf (subscriber, #69389)
At some point, this is still useful to file. Maybe the developers have seen it before and know where to point you for the fix and/or a discussion of one. Sitting in your corner with a puzzle you won't show anyone, complaining that it's "impossible", doesn't do much to convince anyone that it's actually an "impossible" problem. There's no shame in saying "well, crap, I'm out of ideas - anyone else have a thought about this problem?".
Posted Nov 18, 2019 12:10 UTC (Mon) by pizza (subscriber, #46)
Every time I've seen systemd stuck waiting on something, I've seen it reported to the console, and start/complete times are logged. I've not personally seen any shutdown task without a hard upper time limit (and that limit is configurable globally), but I suppose one could exist elsewhere, perhaps as part of some distro-, daemon-, or application-provided unit.
I might add that this is probably not a "systemd bug". The underlying problem is that something is not shutting down cleanly, and the system has been explicitly configured to wait indefinitely for it. In other words, it's most likely due to how your distribution (or software supplier, or admin) integrated the troublesome component.
Have you at least reported this bug? Clearly this is not desired behaviour.
Meanwhile, with respect to your IP routing bug: you didn't say how networking was configured. NetworkManager? systemd-networkd? Distro ifup/down scripts (and which distro)? A script that manually invokes ifconfig/ip tools? DHCP or static configuration? Are these routes statically/explicitly configured, or automagic with respect to the networking infrastructure? Details matter, and supplying them (here or as part of a bug report) doesn't require setting up VMs or whatnot.
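A hedged sketch of how to gather that much, assuming a Debian-ish system (any of these may simply report inactive, which is itself the answer):

systemctl is-active NetworkManager systemd-networkd networking
networkctl list       # only meaningful if systemd-networkd is running
nmcli device status   # only meaningful if NetworkManager is running
journalctl -b -u systemd-networkd -u NetworkManager --no-pager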
(FWIW, this sounds a lot like what happened when I wrote an incorrect NetworkManager hook script to selectively enable a VPN - it worked, except when it didn't, ending up with multiple VPN instances stepping on each other's toes, with ping times that would grow to 4-5s before the whole thing came crashing down and restarted itself.)
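(If it's useful, here's a hypothetical sketch of the kind of guard my script was missing. NetworkManager dispatcher scripts receive the interface and action as $1 and $2; "office-vpn" is a made-up connection name:)

#!/bin/sh
# /etc/NetworkManager/dispatcher.d/50-vpn (hypothetical example)
iface="$1"; action="$2"
[ "$action" = "up" ] || exit 0    # react only to interface-up events
[ "$iface" = "eth0" ] || exit 0   # and only for the uplink we care about
# Bail out if the VPN is already active, instead of stacking instances:
nmcli -t -f NAME connection show --active | grep -qx 'office-vpn' && exit 0
nmcli connection up office-vpn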
Posted Nov 21, 2019 13:07 UTC (Thu) by mathstuf (subscriber, #69389)
I agree. It sounds like a dependency-specification problem. I see it as akin to "the parallel build is busted, but the serial build works" - that means your rules aren't listing their dependencies properly. The right fix is to find the missing link and fix it, not to say "well, then just use `-j1` and be happy". Granted, systemd is at times tight-lipped about some of these things, and it takes some analysis of the service logs and of systemd's own event logs to figure out.
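Concretely, the analysis I mean is along these lines (the unit name is a placeholder):

systemctl list-dependencies example.service      # what it pulls in
systemctl show -p After,Before,Wants,Requires example.service
systemd-analyze critical-chain example.service   # its place in the boot ordering
journalctl -b -u example.service                 # that unit's own log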
