Hm, the former of those is interesting. `Why yes, you *can* reduce your
error rate: just hire more sysadmins!'
Pair admining? Test driven administration? XA (eXtreme Admin'ing)?
Posted Jan 25, 2008 21:00 UTC (Fri) by AnswerGuy (guest, #1256)
"Just" hiring more sysadmins clearly wouldn't reduce your error rate. By itself it would
almost certainly *increase* the number of systems administration errors.
However, I have been mulling over the application of agile development methods to systems
administration. (In fact I even gave a brief talk on that idea at last year's LinuxWorld in
Consider some of the possibilities:
Test driven administration:
Configure monitoring for that new server before you configure the server. Alarms should go
off; response procedures should be executed ... a service window should be scheduled (with
estimated date of completion), which should defer further alarms from that source. (Same
applies to each service that's to be deployed). Now you know that the monitoring is doing
something useful. When the monitoring shows the service "going green" then you know you have
configured the service correctly (with respect to the monitoring system --- i.e. DNS or other
directory services, IP addressing, routing, etc). (If you find a corner case --- where
monitoring gives a false "green" status --- try to improve the monitoring to more closely
model a service's *correct* functionality).
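As a rough sketch of the "monitor first" idea above: the check below is written before the service exists, so it should report "red" at first and turn "green" only once the service is actually reachable. The function names, the choice of a plain TCP connect as the health test, and the two-second timeout are all assumptions made for illustration; a real monitoring system would probe much more than an open port.

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or name not resolvable.
        return False


def service_status(host: str, port: int) -> str:
    """Map the check result onto the red/green vocabulary used above."""
    return "green" if check_tcp(host, port) else "red"


# Hypothetical usage: run this before configuring the new server and
# expect "red"; keep running it while configuring until it says "green".
# print(service_status("newserver.example.com", 25))
```

A port that merely accepts connections is exactly the sort of false "green" mentioned above, so the next refinement would be a check that exercises the protocol itself.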
Integrate imaging and system restoration. Image a system, configure it, back up the
configuration and initial (test) data, then create a new imaging profile to facilitate
automated re-imaging of the system with automated restore of the configuration and data. Then
wipe the system and re-image it using that profile. Repeat until the system's complete
configuration and data are restored automatically. THEN put the system into production.
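One way to decide when the wipe/re-image/restore loop above is "done" is to compare checksums of the files as they were backed up against the files found after the automated restore. This is only a sketch under assumptions: the function names are hypothetical, SHA-256 over whole files is an arbitrary choice, and it ignores ownership, permissions and anything that is not a regular file.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def manifest(root: Path) -> Dict[str, str]:
    """Map each regular file under root (relative path) to its SHA-256."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            result[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def restore_differences(expected: Dict[str, str],
                        actual: Dict[str, str]) -> List[str]:
    """List every file that is missing, changed, or unexpected."""
    problems = []
    for name, digest in expected.items():
        if name not in actual:
            problems.append(f"missing: {name}")
        elif actual[name] != digest:
            problems.append(f"changed: {name}")
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected: {name}")
    return sorted(problems)
```

An empty list from `restore_differences` after a fully automated re-image would be the signal that the profile is ready; anything else means another pass through the loop.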
There are a number of other ideas in a similar vein. One of them is that we might want to
institute a policy ... for critical production servers ... of having our admins work in pairs
(perhaps over a shared GNU screen session) where one of the admins types each command, then
the other confirms that it's safe/correct and hits [Enter] when they both concur. (Better
admins among us have learned to pause before hitting [Enter] when working "live" on mission
critical servers ... take a deep breath ... re-read that command ... perhaps try the "echo" or
"--dry-run" version of it first ... consider the risks ... and *THEN* (maybe) hit [Enter].
But even the best of us gets in a hurry, gets flustered or tired, or just experiences cognitive
(In my case I was an electrician for years before embarking on my IT career --- working with
potentially live wiring offers similar lessons with potentially lethal and immediately painful
consequences for any lapse in due care! And yes, despite all that I did occasionally get
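The paired-admin rule described above (one person types, the other confirms) could also be approximated in a script rather than a shared screen session. The sketch below is an assumption-laden stand-in, not the screen-based procedure itself: `run_if_confirmed` and `ask_on_terminal` are hypothetical names, and the second admin's approval is modelled as a simple callback.

```python
import shlex
import subprocess
from typing import Callable, List


def run_if_confirmed(argv: List[str],
                     confirm: Callable[[str], bool],
                     runner=subprocess.run):
    """Run argv only if the second admin approves the exact command.

    Returns the runner's result, or None if the command was rejected.
    """
    shown = shlex.join(argv)  # shell-quoted form of what would be run
    if not confirm(shown):
        return None
    return runner(argv, check=True)


def ask_on_terminal(command: str) -> bool:
    """Prompt the second admin; only a literal 'yes' counts as approval."""
    answer = input(f"Second admin, run this?\n  {command}\n[yes/no] ")
    return answer.strip().lower() == "yes"


# Hypothetical usage:
# run_if_confirmed(["rsync", "--dry-run", "-a", "src/", "dst/"], ask_on_terminal)
```

Passing the command as a list and displaying it with `shlex.join` means the reviewer sees the same arguments that will be executed, with no shell expansion in between.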
Posted Jan 26, 2008 3:42 UTC (Sat) by giraffedata (subscriber, #1954)
"Just" hiring more sysadmins clearly wouldn't reduce your error rate.
It would if the error rate you're using is disk errors per year per sysadmin, which is what we were talking about.
It underscores the point that there are lots of error rates you can define, and you have to pay attention to your denominators.
Nonetheless, your ideas about reducing errors per something by improving system administration methods are interesting.
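The denominator point can be made concrete with a little arithmetic. The figures below are invented purely for illustration: the total number of disk errors per year is held fixed, and only the head count changes.

```python
def errors_per_admin(total_errors_per_year: float, admins: int) -> float:
    """Disk errors per year per sysadmin: the total divided by head count."""
    return total_errors_per_year / admins


total = 12  # hypothetical disk errors per year, unaffected by hiring

before = errors_per_admin(total, 3)  # three admins on staff
after = errors_per_admin(total, 6)   # after hiring three more

print(before, after)  # 4.0 2.0
```

The per-sysadmin rate halves even though exactly as many disks failed, which is why the choice of denominator has to be stated along with the rate.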
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds