Poettering: Revisiting how we put together Linux systems
Posted Sep 3, 2014 16:23 UTC (Wed)
by raven667 (subscriber, #5198)
In reply to: Poettering: Revisiting how we put together Linux systems by NightMonkey
Parent article: Poettering: Revisiting how we put together Linux systems
I'm not sure what you are talking about. Nagios check_procs, for example, just shells out to ps to walk /proc; it is not the parent of the services and doesn't have the same kind of iron-clad handle on their execution state that something like runit or daemontools or systemd has.
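For illustration, here is that difference in a hedged sketch: a Nagios-style check polls from the outside on a schedule, while a supervisor is the actual parent of the daemon. The service name and thresholds below are hypothetical.

    # Nagios object config: check_procs forks ps to walk /proc, so a
    # crash is only noticed on the next scheduled poll.
    define command {
        command_name  check_named
        command_line  $USER1$/check_procs -c 1:1 -C named
    }

    # A supervisor like systemd is the parent process: it receives
    # SIGCHLD the moment the daemon exits and can restart it at once.
    [Service]
    ExecStart=/usr/sbin/named -f
    Restart=on-failure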
> Systems like this are a band-aid.
You are never going to fix every possible piece of software out in the world so that it never crashes. The first step is to admit that such a problem is possible; then you can go about mitigating the risks. You don't do that by building fragile systems that pretend the world is perfect and fall apart as soon as something goes wrong, and especially not as some form of self-punishment, inflicting pain in order to force bug fixes.
Posted Sep 3, 2014 16:34 UTC (Wed)
by NightMonkey (subscriber, #23051)
[Link] (8 responses)
I just think a lot of this is because of decisions made to keep the binary-distro model going.
I'm not interested in fixing all the softwares. :) I am interested in getting the specific software I need, use and am paid to administer in working order. There are certainly many ways to skin that sad, sad cat. ;)
Posted Sep 4, 2014 22:05 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (7 responses)
The sysadmin did a "shutdown -r". The system (using init scripts) made the mistake of shutting the network down before it shut bind down. Bind - unable to access the network - got stuck in an infinite loop. OOPS!
The sysadmin, 3000 miles away, couldn't get to the console or the power switch, and with no network couldn't contact the machine ...
If a heavily-used program like bind can have such disasters lurking in its sysvinit scripts ...
And systemd would just have shut it down.
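A minimal sketch of why, assuming a unit roughly like the one below (not bind9's actual packaged unit): systemd puts a deadline on every stop job and escalates to SIGKILL, so a wedged daemon cannot hang the shutdown.

    [Service]
    ExecStart=/usr/sbin/named -f
    # On "systemctl stop" systemd sends SIGTERM; if the process is
    # still alive after TimeoutStopSec it gets SIGKILL, and the
    # shutdown transaction keeps moving.
    TimeoutStopSec=90s
    KillMode=mixed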
Cheers,
Wol
Posted Sep 4, 2014 23:02 UTC (Thu)
by NightMonkey (subscriber, #23051)
[Link] (6 responses)
I don't care how "robust" the OS is. It's being cheap that gets your organization into these kinds of situations (and that *is* an organizational problem, not just yours as the sysadmin).
Posted Sep 4, 2014 23:15 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
And then it breaks because of a race condition that happens in only about 10% of cases.
That sysadmin in this dramatization was me, and the offending script was http://anonscm.debian.org/cgit/users/lamont/bind9.git/tre... near line 92.
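A paraphrased sketch of that failure mode (illustrative only, not the actual Debian script): the stop action asks named to exit over its control channel and then waits for the pid with no timeout, so if named never exits, neither does the script.

    #!/bin/sh
    # simplified stop() from a sysvinit-style bind script
    stop() {
        rndc stop   # asks named to shut down over its control channel
        # with named wedged (here: the network already gone), this
        # loop spins forever and "shutdown -r" never completes
        while kill -0 "$(cat /var/run/named/named.pid)" 2>/dev/null; do
            sleep 1
        done
    }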
Of course, REAL Unix sysadmins must live in the server room, spending 100% of their time tweaking init scripts and auditing all the system code to make sure it can NEVER hang. NEVER. Also, they disable memory protection because their software doesn't need it.
Posted Sep 4, 2014 23:20 UTC (Thu)
by NightMonkey (subscriber, #23051)
[Link] (4 responses)
Again, the answer to system hangs (which are *inevitable* - this is *commodity hardware* we're talking about most of the time, not mainframes) is remote power booters. I don't like living in the DC, myself.
Posted Sep 4, 2014 23:33 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> Again, the answer to system hangs (which are *inevitable* - this is *commodity hardware* we're talking about most of the time, not mainframes) is remote power booters.

Oh, I've witnessed mainframe hangups. Remote reboot is nice, but that server was from around 2006, so it didn't have IPMI, and the datacenter where it was hosted offered only manual reboots.
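For what it's worth, on hardware that does have a BMC, the remote power cycle is a one-liner (hostname and credentials hypothetical):

    # power-cycle a hung server out-of-band via IPMI
    ipmitool -I lanplus -H bmc.example.com -U admin -P secret chassis power cycle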
Posted Sep 5, 2014 4:09 UTC (Fri)
by raven667 (subscriber, #5198)
[Link] (2 responses)
That sounds dangerously close to internalizing and making excuses for unreliable software rather than engineering better systems that work even in our crazy, imperfect world. Duct-taping an RPS to the side of a machine is in addition to, not a replacement for, making it work right in the first place.
Posted Sep 5, 2014 4:43 UTC (Fri)
by NightMonkey (subscriber, #23051)
[Link] (1 responses)
I don't think the things you are describing are actually separate tasks. All software has bugs.
Posted Sep 5, 2014 14:45 UTC (Fri)
by raven667 (subscriber, #5198)
[Link]