"Just" hiring more sysadmins clearly wouldn't reduce your error rate. By itself it would almost certainly *increase* the number of systems administration errors.

However, I have been mulling over the application of agile development methods to systems administration. (In fact I even gave a brief talk on that idea at last year's LinuxWorld in San Francisco.) Consider some of the possibilities:

Test-driven administration: Configure monitoring for that new server before you configure the server. Alarms should go off; response procedures should be executed ... a service window should be scheduled (with an estimated date of completion), which should defer further alarms from that source. (The same applies to each service that's to be deployed.) Now you know that the monitoring is doing something useful. When the monitoring shows the service "going green" then you know you have configured the service correctly (with respect to the monitoring system --- i.e. DNS or other directory services, IP addressing, routing, etc). (If you find a corner case --- where monitoring gives a false "green" status --- try to improve the monitoring to more closely model a service's *correct* functionality.)

Integrate imaging and system restoration: Image a system, configure it, back up the configuration and initial (test) data, then create a new imaging profile to facilitate automated re-imaging of the system with automated restore of the configuration and data. Then wipe the system and re-image it using that profile. Repeat until the system's complete configuration and data are restored automatically. THEN put the system into production.

There are a number of other ideas along similar veins. One of them is that we might want to institute a policy ... for critical production servers ... of having our admins work in pairs (perhaps over a shared GNU screen session) where one of the admins types each command, then the other confirms that it's safe/correct and hits [Enter] when they both concur.
(The better admins among us have learned to pause before hitting [Enter] when working "live" on mission critical servers ... take a deep breath ... re-read that command ... perhaps try the "echo" or "--dry-run" version of it first ... consider the risks ... and *THEN* (maybe) hit [Enter]. But even the best of us get in a hurry, get flustered or tired, or just experience cognitive hiccoughs.)

(In my case I was an electrician for years before embarking on my IT career --- working with potentially live wiring offers similar lessons, with immediately painful and potentially lethal consequences for any lapse in due care! And yes, despite all that I did occasionally get zapped!)

JimD
Copyright © 2017, Eklektix, Inc.