Posted May 20, 2011 15:27 UTC (Fri) by nix (subscriber, #2304)
Parent article: Scale Fail (part 2)
I've actually embraced single points of failure where there is no affordable alternative (e.g. on my home network). But if I'm going to have a SPOF, I make it the *only* SPOF. Thus I have a central server with a RAID array, holding my home directories and most of my computing power. If that machine dies, I'm screwed -- but failures are rare and there is no alternative: there's no way I can afford *another* huge expensive central server just in case the first one fails, and distributed filesystems aren't good enough to let me do the same thing with fewer than three or four not-very-much-smaller systems. So I embrace the SPOF and just make damn sure there is a site-replacement warranty. It *will* fail and I *will* have downtime -- but that will cost less than avoiding the SPOF would.
But for other things (e.g. domestic Internet access), where avoiding the SPOF is easy and failures are common, I avoid it like hell.
For corporations past the startup phase, with more than a few people relying on their services and no longer horribly cash-strapped, retaining SPOFs once they have been identified is foolishness. The problem is often identifying the bloody things before they strike, and making sure they don't creep back in afterwards. They can be very hard to spot :(