Poettering: Revisiting how we put together Linux systems
Posted Sep 3, 2014 16:23 UTC (Wed)
by raven667 (subscriber, #5198)
In reply to: Poettering: Revisiting how we put together Linux systems by NightMonkey
Parent article: Poettering: Revisiting how we put together Linux systems
I'm not sure what you are talking about. Nagios check_procs, for example, just shells out to ps to walk /proc; it is not the parent of the services and doesn't have the same kind of iron-clad handle on their execution state that something like runit or daemontools or systemd has.
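For illustration, here is that difference in a hedged sketch: a Nagios-style check polls from the outside on a schedule, while a supervisor is the actual parent of the daemon. The service name and thresholds below are hypothetical.

    # Nagios object config: check_procs forks ps to walk /proc, so a
    # crash is only noticed on the next scheduled poll.
    define command {
        command_name  check_named
        command_line  $USER1$/check_procs -c 1:1 -C named
    }

    # A supervisor like systemd is the parent process: it receives
    # SIGCHLD the moment the daemon exits and can restart it at once.
    [Service]
    ExecStart=/usr/sbin/named -f
    Restart=on-failure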
> Systems like this are a band-aid.
You are never going to fix every possible piece of software out in the world so that it never crashes. The first step is to admit that such a problem is possible; then you can go about mitigating the risks. You don't do that by building fragile systems that pretend the world is perfect and fall apart as soon as something goes wrong, and especially not as some form of self-punishment, inflicting pain in order to force bug fixes.
Posted Sep 3, 2014 16:34 UTC (Wed)
by NightMonkey (subscriber, #23051)
[Link] (8 responses)
I just think a lot of this is because of decisions made to keep the binary-distro model going.
I'm not interested in fixing all the softwares. :) I am interested in getting the specific software I need, use and am paid to administer in working order. There are certainly many ways to skin that sad, sad cat. ;)
Posted Sep 4, 2014 22:05 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (7 responses)
The sysadmin did a "shutdown -r". The system (using init scripts) made the mistake of shutting the network down before it shut bind down. Bind - unable to access the network - got stuck in an infinite loop. OOPS!
The sysadmin, 3000 miles away, couldn't get to the console or the power switch, and with no network couldn't contact the machine ...
If a heavily-used program like bind can have such disasters lurking in its sysvinit scripts ...
And systemd would just have shut it down.
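A minimal sketch of why, assuming a unit roughly like the one below (not bind9's actual packaged unit): systemd puts a deadline on every stop job and escalates to SIGKILL, so a wedged daemon cannot hang the shutdown.

    [Service]
    ExecStart=/usr/sbin/named -f
    # On "systemctl stop" systemd sends SIGTERM; if the process is
    # still alive after TimeoutStopSec it gets SIGKILL, and the
    # shutdown transaction keeps moving.
    TimeoutStopSec=90s
    KillMode=mixed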
Cheers,
Wol
Posted Sep 4, 2014 23:02 UTC (Thu)
by NightMonkey (subscriber, #23051)
[Link] (6 responses)
I don't care how "robust" the OS is. It's being cheap that gets your organization into these kinds of situations (and that *is* an organizational problem, not just yours as the sysadmin).
Posted Sep 4, 2014 23:15 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
And then it breaks because of a race condition that happens in only about 10% of cases.
That sysadmin in this dramatization was me, and the offending script was http://anonscm.debian.org/cgit/users/lamont/bind9.git/tre... near line 92.
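A paraphrased sketch of that failure mode (illustrative only, not the actual Debian script): the stop action asks named to exit over its control channel and then waits for the pid with no timeout, so if named never exits, neither does the script.

    #!/bin/sh
    # simplified stop() from a sysvinit-style bind script
    stop() {
        rndc stop   # asks named to shut down over its control channel
        # with named wedged (here: the network already gone), this
        # loop spins forever and "shutdown -r" never completes
        while kill -0 "$(cat /var/run/named/named.pid)" 2>/dev/null; do
            sleep 1
        done
    }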
Of course, REAL Unix sysadmins must live in the server room, spending 100% of their time tweaking init scripts and auditing all the system code to make sure it can NEVER hang. NEVER. Also, they disable memory protection because their software doesn't need it.
Posted Sep 4, 2014 23:20 UTC (Thu)
by NightMonkey (subscriber, #23051)
[Link] (4 responses)
Again, the answer to system hangs (which are *inevitable* - this is *commodity hardware* we're talking about most of the time, not mainframes) is remote power booters. I don't like living in the DC, myself.
Posted Sep 4, 2014 23:33 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> Again, the answer to system hangs (which are *inevitable* - this is *commodity hardware* we're talking about most of the time, not mainframes) is remote power booters.

Oh, I've witnessed mainframe hangups. Remote reboot is nice, but that server was from around 2006, so it didn't have IPMI, and the datacenter where it was hosted offered only manual reboots.
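For what it's worth, on hardware that does have a BMC, the remote power cycle is a one-liner (hostname and credentials hypothetical):

    # power-cycle a hung server out-of-band via IPMI
    ipmitool -I lanplus -H bmc.example.com -U admin -P secret chassis power cycle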
Posted Sep 5, 2014 4:09 UTC (Fri)
by raven667 (subscriber, #5198)
[Link] (2 responses)
That sounds dangerously close to internalizing and making excuses for unreliable software rather than engineering better systems that work even in our crazy, imperfect world. Duct-taping an RPS to the side of a machine is in addition to, not a replacement for, making it work right in the first place.
Posted Sep 5, 2014 4:43 UTC (Fri)
by NightMonkey (subscriber, #23051)
[Link] (1 responses)
I don't think the things you are describing are actually separate tasks. All software has bugs.
Posted Sep 5, 2014 14:45 UTC (Fri)
by raven667 (subscriber, #5198)
[Link]