"Just" hiring more sysadmins clearly wouldn't reduce your error rate. By itself it would
almost certainly *increase* the number of systems administration errors.
However, I have been mulling over the application of agile development methods to systems
administration. (In fact I even gave a brief talk on that idea at last year's LinuxWorld in
Consider some of the possibilities:
Test driven administration:
Configure monitoring for that new server before you configure the server. Alarms should go
off; response procedures should be executed ... a service window should be scheduled (with an
estimated date of completion), which should defer further alarms from that source. (The same
applies to each service that's to be deployed.) Now you know that the monitoring is doing
something useful. When the monitoring shows the service "going green" you know you have
configured the service correctly (at least as far as the monitoring system can tell --- i.e.
DNS or other directory services, IP addressing, routing, etc.). (If you find a corner case ---
where monitoring gives a false "green" status --- improve the monitoring so that it more
closely models the service's *correct* functionality.)
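As a rough illustration of "monitoring first", here is a minimal Python sketch. It assumes the
monitor is nothing more than a TCP connect check; check_tcp() and red_then_green() are
hypothetical names made up for this example, not part of any real monitoring package. The
check is run before anything listens on the port (it should report "red") and again once a
stand-in "service" is listening (it should report "green").

```python
import socket


def check_tcp(host, port, timeout=2.0):
    """Return "green" if a TCP connection to host:port succeeds, else "red"."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "green"
    except OSError:
        # Refused, unreachable, and timed-out connections all count as "red".
        return "red"


def red_then_green():
    """Hypothetical demo: run the check before and after a local listener exists."""
    # Ask the OS for a free port, then release it so nothing is listening yet.
    with socket.socket() as probe:
        probe.bind(("127.0.0.1", 0))
        port = probe.getsockname()[1]

    # The "service" is not deployed yet, so the alarm should fire.
    before = check_tcp("127.0.0.1", port, timeout=0.5)

    # Stand-in for "configuring the service": start listening on that port.
    with socket.socket() as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind(("127.0.0.1", port))
        server.listen(1)
        after = check_tcp("127.0.0.1", port, timeout=0.5)

    return before, after
```

Note that a bare connect check is exactly the kind of monitor that can give a false "green":
an open port says nothing about whether the service answers correctly. A check that makes a
real request and inspects the reply would model correct functionality more closely.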
Integrate imaging and system restoration. Image a system, configure it, back up the
configuration and initial (test) data, then create a new imaging profile to facilitate
automated re-imaging of the system with automated restore of the configuration and data. Then
wipe the system and re-image it using that profile. Repeat until the system's complete
configuration and data are restored automatically. THEN put the system into production.
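One way to decide when the "repeat until" loop is finished is to compare checksums taken before
the wipe with checksums taken after the re-image. The sketch below is only an assumption about
how that comparison might look; manifest(), restore_diff() and restore_is_complete() are
hypothetical helpers, and the directories you feed them (say, a copy of /etc plus the test
data) are up to you.

```python
import hashlib
from pathlib import Path


def manifest(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def restore_diff(baseline, restored):
    """Compare two manifests; list files that are missing, extra, or changed."""
    shared = baseline.keys() & restored.keys()
    return {
        "missing": sorted(baseline.keys() - restored.keys()),
        "extra": sorted(restored.keys() - baseline.keys()),
        "changed": sorted(name for name in shared if baseline[name] != restored[name]),
    }


def restore_is_complete(baseline, restored):
    """True only when the restored tree matches the baseline exactly."""
    return not any(restore_diff(baseline, restored).values())
```

Take manifest() of the configured system before wiping it, keep the result somewhere off the
box, and compare it with manifest() of the re-imaged system. Anything reported as "missing" or
"changed" is something the imaging profile or the restore job still has to learn about.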
There are a number of other ideas in a similar vein. One of them is that we might want to
institute a policy ... for critical production servers ... of having our admins work in pairs
(perhaps over a shared GNU screen session) where one of the admins types each command, then
the other confirms that it's safe/correct and hits [Enter] when they both concur. (The better
admins among us have learned to pause before hitting [Enter] when working "live" on
mission-critical servers ... take a deep breath ... re-read that command ... perhaps try the
"echo" or "--dry-run" version of it first ... consider the risks ... and *THEN* (maybe) hit
[Enter]. But even the best of us get in a hurry, get flustered or tired, or just experience cognitive
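
The "one types, the other hits [Enter]" rule can also be approximated in a small wrapper. What
follows is a sketch under my own assumptions, not a description of any existing tool:
guarded_run() and ask() are hypothetical names. The wrapper shows every approver the exact
command line (quoted the way a shell would see it) and runs nothing unless all of them agree.

```python
import shlex
import subprocess


def guarded_run(argv, approvals, runner=subprocess.run):
    """Run argv only when every approver accepts the exact command shown.

    approvals is a list of callables (hypothetical, one per admin in the pair);
    each receives the rendered command line and returns True or False.
    """
    rendered = shlex.join(argv)  # quote arguments exactly as a shell would need
    for approve in approvals:
        if not approve(rendered):
            return None  # someone objected, so nothing is executed
    return runner(argv)


def ask(name):
    """Build an approver that prompts one named admin on the terminal."""
    def approve(rendered):
        reply = input(f"{name}: run `{rendered}`? [y/N] ")
        return reply.strip().lower() == "y"
    return approve
```

For example, guarded_run(["rsync", "--dry-run", "-a", "src/", "dst/"], [ask("typist"),
ask("reviewer")]) would show both admins the dry-run form of the command first; the real run
is then a second, separately approved call.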
(In my case I was an electrician for years before embarking on my IT career --- working with
potentially live wiring offers similar lessons with potentially lethal and immediately painful
consequences for any lapse in due care! And yes, despite all that I did occasionally get