
LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 8:35 UTC (Thu) by oak (guest, #2786)
In reply to: LCA: Andrew Tanenbaum on creating reliable systems by drag
Parent article: LCA: Andrew Tanenbaum on creating reliable systems

> Gnome-session can restart applications that crash and such.

This wasn't much of a consolation when I tried to run Ubuntu on
a system that didn't have enough memory. Nautilus was killed by the
kernel's OOM killer and was always restarted, and as a result the
computer was unusable. If it hadn't been continuously restarted,
the system would have been usable. (Moral: if it fails too many
times in a row, let it rest in peace.)
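
A minimal Python sketch of that moral, not how gnome-session actually
works: a supervisor that restarts a program after an abnormal exit but
gives up once it has failed too often within a time window. The names
supervise, max_failures and window are made up for illustration.

```python
import subprocess
import time


def supervise(argv, max_failures=3, window=60.0,
              clock=time.monotonic, run=subprocess.call):
    """Restart argv after an abnormal exit, but give up after
    max_failures failures within `window` seconds.

    Returns the list of exit codes that were observed.
    """
    failures = []  # timestamps of recent abnormal exits
    codes = []
    while True:
        code = run(argv)  # blocks until the program exits
        codes.append(code)
        if code == 0:
            return codes  # clean exit: nothing to restart
        now = clock()
        # Forget failures that have fallen out of the window.
        failures = [t for t in failures if now - t < window]
        failures.append(now)
        if len(failures) >= max_failures:
            return codes  # too many crashes in a row: let it rest in peace
```

On POSIX, subprocess.call returns a negative number when the child was
killed by a signal, so an OOM kill (SIGKILL) shows up as -9 and counts
as a failure. The sketch restarts immediately; a real supervisor would
probably also want a delay between attempts.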


> The concept was that applications at any point should be always at a
> state where they can instantly crap out and recover later.

But you can still lose data...

By the way, in my limited experience, if there's a "reliability"
feature which papers over software faults, fixing those faults gets
delayed (or sometimes never happens at all) because "everything" works
"well enough" and debugging & fixing things is costly.

"Fault tolerance" should be used only on a system which you do not
expect/cannot fix or update.



LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 9:59 UTC (Thu) by filipjoelsson (subscriber, #2622) [Link]

> "Fault tolerance" should be used only on a system which you do not
> expect/cannot fix or update.

Which is pretty much any end user system.

Sure, I'm a gentooer as well as a programmer - I can easily browse around for a patch in bugzilla, or whip up something on my own. But my wife can't, my brothers can't (engineers all), my parents can't. So, in order to ease the load on the helpdesk in computer matters (i.e. me), fault tolerance would be much appreciated. Let the professionals run without fault tolerance, and give the world some stability!

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 10:44 UTC (Thu) by oak (guest, #2786) [Link]

The effort spent on making things more fault tolerant could be spent
on making them more bug-free instead.

The problem is that in the long run, the end result could be just
a more fault tolerant system, but not a more stable one, because bugs
aren't found promptly and fixed. Most of the bugs are found by
users, not developers.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 16:22 UTC (Thu) by mrfredsmoothie (guest, #3100) [Link]

It is not either/or.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 18:23 UTC (Thu) by emkey (guest, #144) [Link]

Making a system fault tolerant would in theory mask all bugs. Fixing a bug fixes ONE bug. Thus fault tolerance is a much better short to mid term investment. Also, debugging problems is potentially much easier in the fault tolerant model. For example, many bugs can cause a system to become unresponsive. It is thus nearly impossible to gather data that might help in identifying and solving the problem. With a fault tolerant system you could optionally enter some sort of debugging environment when a particular component failed. This could greatly reduce the amount of time needed to fix problems.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 18:55 UTC (Thu) by oak (guest, #2786) [Link]

Good points, but I've seen "fault tolerance" implementations which
make the system less responsive[1] and/or obliterate the traces of
the actual fault[2]. :-)

[1] Windows virus scanning software repeatedly starting some crashing
service so that opening any application window takes >20 minutes
[2] Linux SW restarting the crashed service, an act which changes the
HW state that caused the original crash and results in a different
crash. You could fix the constant service restarts only by examining
the HW state at the first fault

So, I would say that if fault tolerance is done, great care needs
to be taken that it also really helps in finding and fixing the bugs
(by notifying the user about the fault, saving data about the fault
state, allowing debugging of the fault when it happens etc.), not just
in hiding them. And this code should be fairly simple, to make sure
that it actually works; more complicated code is always harder to
maintain and usually contains more bugs...
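
As a rough Python sketch of "notify the user and save data about the
fault state" (the function name run_and_record and the JSON record
layout are invented for this example, not taken from any existing
tool):

```python
import json
import subprocess
import sys
import time


def run_and_record(argv, log_path):
    """Run argv once. On an abnormal exit, append a fault record to
    log_path and tell the user, so a later restart cannot hide it.

    Returns the record, or None if the program exited cleanly.
    """
    proc = subprocess.run(argv, stderr=subprocess.PIPE, text=True)
    if proc.returncode == 0:
        return None
    record = {
        "argv": argv,
        "returncode": proc.returncode,  # negative: killed by a signal (POSIX)
        "time": time.time(),
        "stderr_tail": proc.stderr.splitlines()[-20:],  # last 20 lines
    }
    # One JSON object per line, appended so earlier faults are kept.
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    print(f"{argv[0]} failed (code {proc.returncode}); "
          f"details saved in {log_path}", file=sys.stderr)
    return record
```

The record is written before any restart is attempted, so the first
fault is still on disk even if a restart changes the state that caused
it. This only captures the exit code and stderr; things like the HW
state in [2] would need something more specific.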

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 14:27 UTC (Thu) by pphaneuf (guest, #23480) [Link]

I remember, a very long time ago, Mac OS ("classic") used to be very stable compared to the Windows of the time. And yet, when you looked at the software architecture, you couldn't help but think this thing ought to fall apart and crash all the time (no memory protection, cooperative multitasking, bounded memory arena, no virtual memory, etc.). But somehow, it didn't?

Turns out the reason was quite simple. Failures were so spectacular that developers had no choice but to write their software carefully, because when it crashed on them, they had to reboot their entire development environment!

Also, users would tend to notice quickly when their system became less stable, would correlate it to some software they installed recently, then would stop using it, or at least would whine about it all the time. So buggy software would just tend not to catch on, because people kicked it off after it crashed their whole system a few times, and they'd tell fellow users to steer clear.

So yes, these are difficult questions. In my opinion, it'd be nice if those automatic recovery features would still notify the user of their action, and try to make the culprit clear, so that there would be some motivation for users to adjust their software usage toward more reliable software, or at least whine on their blogs. ;-)

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 20, 2007 1:07 UTC (Sat) by bluefoxicy (guest, #25366) [Link]

That whole argument is silly. Fault tolerant systems don't COME TO A SCREECHING HALT when they have a fault. When the file system driver dies on Minix, it comes back and life goes on. On Linux, the world stops.

Notice that you can keep going after a disk/FS driver crashes? Know what else you can do? Make logs of the state of the driver at the time of the crash (ever core dump a file system?). Linux can do this with kexec and some tricks, although you could still suffer data loss from other applications or manage to critically damage the FS.

What else is interesting is that drivers are all small and isolated. The only information you need is the state of the driver, and the driver is entirely self-contained. To debug a component, you debug that component; you don't have to worry about the blurred, gray lines between drivers and VFS and such. Things are easier to chew in small bites.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds