LCA: Andrew Tanenbaum on creating reliable systems
Posted Jan 18, 2007 9:59 UTC (Thu) by filipjoelsson (subscriber, #2622)
In reply to: LCA: Andrew Tanenbaum on creating reliable systems by oak
Parent article: LCA: Andrew Tanenbaum on creating reliable systems
> "Fault tolerance" should be used only on a system which you do not
Which is pretty much any end user system.
Sure, I'm a gentooer as well as a programmer - I can easily browse around for a patch in Bugzilla, or whip up something on my own. But my wife can't, my brothers can't (engineers all), my parents can't. So, in order to let up on the helpdesk in computer matters (i.e. me), fault tolerance would be much appreciated. Let the professionals run without fault tolerance, and give the world some stability!
LCA: Andrew Tanenbaum on creating reliable systems
Posted Jan 18, 2007 10:44 UTC (Thu) by oak (subscriber, #2786)
The effort for making things more fault tolerant could be spent on making them more bug-free instead. The problem is that in the long run, the end result could be just a more fault-tolerant system, but not a more stable one, because bugs aren't found promptly and fixed. Most bugs are found by users, not developers.
LCA: Andrew Tanenbaum on creating reliable systems
Posted Jan 18, 2007 16:22 UTC (Thu) by mrfredsmoothie (subscriber, #3100)
It is not either/or.
LCA: Andrew Tanenbaum on creating reliable systems
Posted Jan 18, 2007 18:23 UTC (Thu) by emkey (guest, #144)
Making a system fault tolerant would in theory mask all bugs. Fixing a bug fixes ONE bug. Thus fault tolerance is a much better short- to mid-term investment.

Also, debugging problems is potentially much easier in the fault-tolerant model. For example, many bugs can cause a system to become unresponsive, which makes it nearly impossible to gather data that might help in identifying and solving the problem. With a fault-tolerant system you could optionally enter some sort of debugging environment when a particular component failed. This could greatly reduce the amount of time needed to fix problems.
LCA: Andrew Tanenbaum on creating reliable systems
Posted Jan 18, 2007 18:55 UTC (Thu) by oak (subscriber, #2786)
Good points, but I've seen "fault tolerance" implementations which make the system less responsive[1] and/or obliterate the traces of the actual fault[2]. :-)

[1] Windows virus scanning software repeatedly starting some crashing service, so that opening any application window takes >20 minutes.

[2] Linux software restarting the crashed service, where the restart itself changes the hardware state that caused the original crash and results in a different crash. You could fix the constant service restarts only by examining the hardware state at the first fault.

So, I would say that if fault tolerance is done, great care needs to be taken that it really helps in finding and fixing the bugs as well (by notifying the user about the fault, saving data about the fault state, allowing debugging of the fault when it happens, etc.), not just in hiding them. And this code should be fairly simple, to make sure that it actually works; more complicated code is always harder to maintain and usually contains more bugs...
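
The requirements in the comment above (notify the user, save data about the fault state before restarting, keep the recovery code simple) can be illustrated with a rough sketch. This is not taken from MINIX or from any system mentioned in the thread; the `Supervisor` class, its `max_restarts` parameter and the `fault_log` list are hypothetical names chosen for illustration, and a plain Python callable stands in for a service.

```python
import time
import traceback


class Supervisor:
    """Hypothetical sketch: restart a failing component, but keep the evidence."""

    def __init__(self, component, max_restarts=3):
        self.component = component        # callable standing in for a service
        self.max_restarts = max_restarts  # cap, so a crash loop cannot run forever
        self.fault_log = []               # saved fault reports, oldest first

    def run(self):
        failures = 0
        while True:
            try:
                return self.component()
            except Exception as exc:
                failures += 1
                # Record the fault state BEFORE restarting, so the first
                # fault is not overwritten by whatever fails later.
                self.fault_log.append({
                    "attempt": failures,
                    "error": repr(exc),
                    "traceback": traceback.format_exc(),
                    "time": time.time(),
                })
                # Notify the user instead of silently masking the bug.
                print(f"component failed (attempt {failures}): {exc!r}")
                if failures > self.max_restarts:
                    # Give up and surface the error rather than restart forever.
                    raise
```

The restart cap is one possible guard against the endless-restart case described in footnote [1], and writing to `fault_log` before the restart is aimed at the lost-evidence case in footnote [2]. A real implementation would presumably persist the log and capture far more state than a traceback.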
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.