
Poettering: Revisiting how we put together Linux systems

Posted Sep 3, 2014 10:26 UTC (Wed) by cortana (subscriber, #24596)
In reply to: Poettering: Revisiting how we put together Linux systems by NightMonkey
Parent article: Poettering: Revisiting how we put together Linux systems

> Monit, Puppet, Chef, watchdog, and many other programs can do that simply defined task and do it well.

If they do it by running '/etc/init.d/foo status' then, no, they can't.
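
For illustration, a hedged sketch of what such a status target typically does under the hood (the pidfile path, service name, and function body here are made up, not taken from any real package's script):

    # Typical pidfile-based status check (illustrative sketch only)
    status() {
        pidfile=/var/run/foo.pid
        if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
            echo "foo is running"
            return 0
        fi
        echo "foo is not running"
        return 3    # LSB convention for "program is not running"
    }

A stale pidfile whose old PID has since been reused by an unrelated process makes a check like this report "running" long after foo has actually died, which is why a monitor built on top of it can't be trusted.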



Poettering: Revisiting how we put together Linux systems

Posted Sep 3, 2014 15:26 UTC (Wed) by NightMonkey (subscriber, #23051) [Link] (12 responses)

If your init scripts suck, then sure. In that case, you'd also have Nagios/Icinga checks making sure that N processes matching the daemon's name are running at all times. But this misses the overarching point I was making: there are better ways to achieve the goal than making your init system more fragile and brittle.

Poettering: Revisiting how we put together Linux systems

Posted Sep 3, 2014 15:46 UTC (Wed) by raven667 (subscriber, #5198) [Link] (11 responses)

The init system is not fragile; it is widely deployed and does not seem to be falling over, and most of the ongoing work is in the ancillary toolbox, like logind, not in PID 1. Init scripts universally suck, even compared to service monitors like daemontools from the last century. Claiming that periodic Nagios checks are a replacement for having a real parent/child relationship with started services is silly; that's the near-beer version of service monitoring, not the real thing. Capturing and logging stdout/stderr and the exit code from a service are great features that don't exist in a standard fashion with init scripts. Sure, you could hack some of that together on an individual script basis, but it's not a feature provided to you.
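
As a purely illustrative sketch of where those features live in a systemd unit (the "foo"/"food" names and paths are hypothetical, not from any real package):

    [Unit]
    Description=Example foo daemon (hypothetical)

    [Service]
    ExecStart=/usr/sbin/food --no-daemonize
    # systemd is the daemon's parent; on a crash it restarts the service
    # itself, with no external poller involved.
    Restart=on-failure
    # stdout/stderr are captured in the journal by default, and the exit
    # code or fatal signal is recorded and shown by "systemctl status".

    [Install]
    WantedBy=multi-user.target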

Poettering: Revisiting how we put together Linux systems

Posted Sep 3, 2014 16:09 UTC (Wed) by NightMonkey (subscriber, #23051) [Link] (10 responses)

Icinga/Nagios is not 'near-beer'. It has parent/child relationships, too. You are shouting "we must have a SYSTEM", when I don't know if that's the answer. People have just not used the tools correctly. This is just One More Thing To Go Wrong. And, yay, there will likely be an LPI exam level for figuring out what went wrong. ;)

More work is needed to make the relationship between users and developers LESS obscured, not more. When I reported a core Apache bug to the Apache developers in 1999, I had a fix in 24 hours, and so did everyone else. If I had instead just relied on some system to restart Apache, that bug might have gone unnoticed and unfixed, at least for longer.

Systems like this are a band-aid. Putting more and more complex systems in place as substitutes for bug fixing and more human-to-human communication is the problem.

Poettering: Revisiting how we put together Linux systems

Posted Sep 3, 2014 16:23 UTC (Wed) by raven667 (subscriber, #5198) [Link] (9 responses)

> Icinga/Nagios is not 'near-beer'. It has parent/child relationships, too

I'm not sure what you are talking about. Nagios check_procs, for example, just shells out to ps to walk /proc; it is not the parent of the services and doesn't have the same kind of ironclad handle on their execution state that something like runit or daemontools or systemd has.
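
To make the difference concrete (an illustrative sketch; the plugin path and the "named" process name are just examples):

    # A check_procs-style check samples the process table at that instant:
    /usr/lib/nagios/plugins/check_procs -C named -c 1:1
    #
    # A supervisor (runit, daemontools, systemd) is the daemon's actual
    # parent: the kernel delivers SIGCHLD and waitpid() returns the exit
    # status the moment the child dies, so nothing is inferred from ps.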

> Systems like this are a band-aid.

You are never going to fix every possible piece of software out in the world so that it never crashes. The first step is to admit that such a problem is possible; then you can go about mitigating the risks, not by building fragile systems that pretend the world is perfect and fall apart as soon as something doesn't go right, and especially not as some form of self-punishment to cause pain in order to force bug fixes.

Poettering: Revisiting how we put together Linux systems

Posted Sep 3, 2014 16:34 UTC (Wed) by NightMonkey (subscriber, #23051) [Link] (8 responses)

I am prepared to be wrong about my experience-based opinions. :)

I just think a lot of this is because of decisions made to keep the binary-distro model going.

I'm not interested in fixing all the softwares. :) I am interested in getting the specific software I need, use, and am paid to administer into working order. There are certainly many ways to skin that sad, sad cat. ;)

Poettering: Revisiting how we put together Linux systems

Posted Sep 4, 2014 22:05 UTC (Thu) by Wol (subscriber, #4433) [Link] (7 responses)

Well, can I just point you at an example (reported here on LWN first-hand ...) of a bog-standard bind install? (Dunno the ref; you'll have to search for it.)

The sysadmin did a "shutdown -r". The system (using init scripts) made the mistake of shutting the network down before it shut bind down. Bind - unable to access the network - got stuck in an infinite loop. OOPS!

The sysadmin, 3000 miles away, couldn't get to the console or the power switch, and with no network couldn't contact the machine ...

If a heavily-used program like bind can have such disasters lurking in its sysvinit scripts ...

And systemd would just have shut it down.

Cheers,
Wol

Poettering: Revisiting how we put together Linux systems

Posted Sep 4, 2014 23:02 UTC (Thu) by NightMonkey (subscriber, #23051) [Link] (6 responses)

That's a system architecture problem! Never put a system into production (or, hell, into 'staging') without remote power switches. And practice what happens when critical daemons go down (aka outage drills).

http://www.synaccess-net.com/

I don't care how "robust" the OS is. It's just being cheap that gets your organization into these kinds of situations (and that *is* an organizational problem, not just yours as the sysadmin.)

Poettering: Revisiting how we put together Linux systems

Posted Sep 4, 2014 23:15 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

Yeah, sure. So you practice everything (and we did test it), and it works fine in testing. You schedule a time when nobody is around, so you can upgrade the system without interruptions.

And then it breaks because of a race condition that happens in only about 10% of cases.

That sysadmin in this dramatization was me, and the offending script was http://anonscm.debian.org/cgit/users/lamont/bind9.git/tre... near line 92.

Of course, REAL Unix sysadmins must live in the server room, spending 100% of their time tweaking init scripts and auditing all the system code to make sure it can NEVER hang. NEVER. Also, they disable memory protection because their software doesn't need it.

Poettering: Revisiting how we put together Linux systems

Posted Sep 4, 2014 23:20 UTC (Thu) by NightMonkey (subscriber, #23051) [Link] (4 responses)

No need to take it personally. But, sad to say, you learned a good sysadmin lesson the hard way: never rely on one system for anything vital. :)

Again, the answer to system hangs (which are *inevitable* - this is *commodity hardware* we're talking about most of the time, not mainframes) is remote power booters. I don't like living in the DC, myself.

Poettering: Revisiting how we put together Linux systems

Posted Sep 4, 2014 23:33 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Lots of companies can't afford to build completely redundant infrastructure, especially if you have a fairly powerful server (that one was handling storage for CCTV streams and lots of other tasks). And it's non-trivial in any case.

> Again, the answer to system hangs (which are *inevitable* - this is *commodity hardware* we're talking about most of the time, not mainframes) is remote power booters.

Oh, I've witnessed mainframe hangups. Remote reboot is nice, but that server was from around 2006, so it didn't have IPMI, and the datacenter where it was hosted offered only manual reboots.

Poettering: Revisiting how we put together Linux systems

Posted Sep 5, 2014 4:09 UTC (Fri) by raven667 (subscriber, #5198) [Link] (2 responses)

> - never rely on one system for anything vital. :)

That sounds dangerously close to internalizing and making excuses for unreliable software rather than engineering better systems that work even in the crazy, imperfect world. Duct-taping an RPS to the side of a machine is in addition to, not a replacement for, making it work right in the first place.

Poettering: Revisiting how we put together Linux systems

Posted Sep 5, 2014 4:43 UTC (Fri) by NightMonkey (subscriber, #23051) [Link] (1 responses)

It might sound close, but it isn't. :)

I don't think the things you are describing are actually separate tasks. All software has bugs.

Poettering: Revisiting how we put together Linux systems

Posted Sep 5, 2014 14:45 UTC (Fri) by raven667 (subscriber, #5198) [Link]

Sure, all software has bugs, but the risk and impact are not equally distributed.

