Scale Fail (part 2)

Posted May 20, 2011 15:27 UTC (Fri) by nix (subscriber, #2304)
Parent article: Scale Fail (part 2)

I've actually embraced single points of failure where there is no affordable alternative (e.g. on my home network). But if I'm going to have a SPOF, I make it the *only* SPOF. Thus I have a central server with a RAID array, holding my home directories and most of my computing power. If that machine dies, I'm screwed -- but failures are rare and there is no alternative: there's no way I can afford *another* huge expensive central server just in case the first one fails, and distributed filesystems aren't good enough to let me do the same thing with fewer than three or four not-very-much-smaller systems. So I embrace the SPOF and just make damn sure there is a site-replacement warranty. It *will* fail and I *will* have downtime -- but that will cost less than avoiding the SPOF would.

But for other things (e.g. domestic Internet access), where avoiding the SPOF is easy and failures are common, I avoid it like hell.

For corporations past the startup phase, with more than a few people relying on their services and no longer horribly cash-strapped, retaining SPOFs once identified is foolishness. The problem is often identifying the bloody things before they strike, and making sure they don't creep back in afterwards. They can be very hard to spot :(



Scale Fail (part 2)

Posted May 20, 2011 20:28 UTC (Fri) by b7j0c (subscriber, #27559)

indeed. embracing SPoF at some level is fundamental to getting on with things.

everything has downtime. there are no 100% solutions. your bank's site will be down. the stock market suspends trading. power to your house fails. no water comes out of the faucet.

remember amazon's last big outage? what was the date? bet you can't tell me unless you look it up. the public moves on; there isn't much to be gained by engineering a path around these scenarios.

engineering around certain types of failure states is pointless: you create a huge opportunity cost, because those resources could have gone to new features that make your service more attractive. being pathological about availability is unrealistic and can be deadly to a business.

Scale Fail (part 2)

Posted May 25, 2011 16:58 UTC (Wed) by baldridgeec (guest, #55283)

Wasn't their last big outage something like 4 days before your comment?
