
LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 18:55 UTC (Thu) by oak (subscriber, #2786)
In reply to: LCA: Andrew Tanenbaum on creating reliable systems by emkey
Parent article: LCA: Andrew Tanenbaum on creating reliable systems

Good points, but I've seen "fault tolerance" implementations which
make the system less responsive[1] and/or obliterate the traces of
the actual fault[2]. :-)

[1] Windows virus-scanning software repeatedly starting some crashing
service, so that opening any application window takes >20 minutes
[2] Linux software restarting a crashed service, where the restart itself
changes the hardware state that caused the original crash and results in
a different crash. The constant service restarts could be fixed only by
examining the hardware state at the first fault

So, I would say that if fault tolerance is done, great care needs to be
taken that it also really helps in finding and fixing the bugs (by
notifying the user about the fault, saving data about the fault state,
allowing debugging of the fault when it happens, etc.), not just in
hiding them. And this code should be fairly simple, to ensure that it
actually works; more complicated code is always harder to maintain and
usually contains more bugs...
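
To make the point above concrete, here is a minimal sketch of a restart
wrapper that records each fault before restarting and gives up after a
fixed number of attempts. It is only an illustration: the function name
supervise, the fault_dir parameter, the record fields and the
fault-N.json file names are all made up for this example, and it only
captures the exit code and stderr, not any hardware state.

```python
import json
import subprocess
import sys
import time
from pathlib import Path


def supervise(cmd, fault_dir, max_restarts=3):
    """Run cmd; after each failure save a fault record, then restart.

    Hypothetical helper: stops after max_restarts restarts and returns
    the list of fault records that were written.
    """
    fault_dir = Path(fault_dir)
    fault_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for attempt in range(max_restarts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return records  # clean exit, nothing more to do
        # Save evidence of this fault before a restart can disturb it.
        record = {
            "attempt": attempt,
            "time": time.time(),
            "returncode": result.returncode,
            "stderr": result.stderr[-4000:],  # keep only the tail
        }
        path = fault_dir / f"fault-{attempt}.json"
        path.write_text(json.dumps(record, indent=2))
        records.append(record)
        # Notify the user instead of silently hiding the failure.
        print(f"service failed (exit {result.returncode}); "
              f"details saved in {path}", file=sys.stderr)
    # Bounded restarts: stop rather than loop forever on a crashing service.
    print(f"giving up after {len(records)} failures", file=sys.stderr)
    return records
```

The restart limit is there to avoid the endless restart loop of [1], and
writing the record before the next attempt is meant to keep the first
fault visible, as in [2]. The loop is deliberately short, in line with
keeping this kind of code simple.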



Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.