LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 5:46 UTC (Thu) by elanthis (guest, #6227)
In reply to: LCA: Andrew Tanenbaum on creating reliable systems by drag
Parent article: LCA: Andrew Tanenbaum on creating reliable systems

"I mean, seriously, who wouldn't want to spend an extra 50 bucks on their computer to make up for the 10% drop in performance if that makes the computer much, much more reliable?"

The problem with that sentiment, and the whole article, is that it focuses solely on the kernel.

I don't think I've had more than 3 or 4 Linux failures in my life, and most of those were when using very new drivers (or NVIDIA).

I have had X crash or lock, various GNOME and KDE components crash or lock, various regular applications crash and lock more times than I can possibly count. Definitely into the triple digits, if not quadruple by now.

If you take Tanenbaum's suggestion to heart, the 5-10% "penalty" of the micro-kernel design is irrelevant, because you won't just be swapping in a micro-kernel underneath the bloated, unreliable layers we've built on top of Linux. You'll be building an entire new system, bottom to top, with less bloat and more reliability. Will that total system have a 5-10% penalty over my current system? I doubt it. You can't even *begin* to speculate, because there are just far, far too many variables to really judge that.



LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 6:25 UTC (Thu) by drag (subscriber, #31333)

Well, in application cases it's probably simpler.

Gnome-session can restart applications that crash and such.

For a while when I logged out of gnome I didn't bother 'logging out', I'd just ctrl-alt-backspace and kill X.

Worked fine for me. And it was much quicker, and guess what? Logging in afterwards seemed a bit quicker too.

Wasn't there an article somewhere that dealt with 'crash-proof software' of some sort? (I can't recall it well enough to find it.)

The concept was that applications should at any point be in a state where they can instantly crap out and recover later. Like an OS where at any point you could sync (truly sync), then kill -9 everything. Next time you reboot, everything is back to where you left it.

The other part of the theory is that it allows for much faster shutdowns and reboots. Typically, software that has these capabilities is able to recover a session faster than it is able to create a new one, ironically.

That seems to be the userland counterpart to this microkernel reliability work and that other article, "KHB: Recovering Device Drivers: From Sandboxing to Surviving": http://lwn.net/Articles/217119/
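
One way to picture the "kill -9 at any point and recover later" idea is an application that only ever replaces its saved state atomically. The following is a minimal sketch under that assumption, not something taken from the article being recalled; `save_state` and `load_state` are hypothetical names, and the write-to-temp-file-then-rename pattern is just one common way to do it.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Atomically replace the file at `path` with a JSON dump of `state`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        # On POSIX the rename is atomic: a reader sees either the old
        # file or the new one, never a half-written mix.
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)  # do not leave a stray temp file behind
        except FileNotFoundError:
            pass
        raise


def load_state(path, default=None):
    """Return the last saved state, or `default` if nothing was saved yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

If the process is killed before the rename, the next start simply loads the previous state; anything changed since the last save is still lost.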

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 8:35 UTC (Thu) by oak (subscriber, #2786)

> Gnome-session can restart applications that crash and such.

This wasn't much of a consolation when I tried to run Ubuntu on
a system that didn't have enough memory. Nautilus was killed by the
kernel OOM killer and was always restarted, and as a result the computer
was unusable. If it hadn't tried to continuously restart
Nautilus, the system would have been usable. (Moral: if it fails
too many times in a row, let it rest in peace.)
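
The "let it rest in peace" rule can be sketched as a restart loop that gives up after several quick failures in a row. This is only an illustration and not how gnome-session works: `supervise`, `start`, `max_quick_failures` and `min_uptime` are hypothetical names, and a Python exception stands in for a process being killed.

```python
import time


def supervise(start, max_quick_failures=3, min_uptime=5.0, clock=time.monotonic):
    """Restart `start` after each crash, but stop after too many quick crashes.

    `start` is a hypothetical callable that runs the service; it raises on a
    crash and returns normally on a clean exit. Returns (status, starts).
    """
    quick_failures = 0
    starts = 0
    while True:
        began = clock()
        starts += 1
        try:
            start()
            return ("exited", starts)  # clean exit: nothing to restart
        except Exception:
            uptime = clock() - began
        if uptime >= min_uptime:
            quick_failures = 0  # it ran for a while, so the count starts over
        else:
            quick_failures += 1
        if quick_failures >= max_quick_failures:
            return ("gave up", starts)  # let it rest in peace
```

A real supervisor would also want to tell the user that it gave up, and why.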


> The concept was that applications at any point should be always at a
> state where they can instantly crap out and recover later.

But you can still lose data...

Btw., in my limited experience, if there's a "reliability"
feature which papers over software faults, fixing those faults will
be delayed (or sometimes never happen at all) because "everything" works
"well enough" and debugging & fixing things is costly.

"Fault tolerance" should be used only on a system which you do not
expect/cannot fix or update.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 9:59 UTC (Thu) by filipjoelsson (subscriber, #2622)

> "Fault tolerance" should be used only on a system which you do not
> expect/cannot fix or update.

Which is pretty much any end user system.

Sure, I'm a gentooer as well as a programmer - I can easily browse around for a patch in bugzilla, or whip up something on my own. But my wife can't, my brothers can't (engineers all), my parents can't. So, in order to ease the load on the helpdesk in computer matters (i.e. me), fault tolerance would be much appreciated. Let the professionals run without fault tolerance, and give the world some stability!

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 10:44 UTC (Thu) by oak (subscriber, #2786)

The effort spent on making things more fault tolerant could be spent on
making them more bug-free instead.

The problem is that in the long run, the end result could be just
a more fault-tolerant system, but not a more stable one, because bugs
aren't found promptly and fixed. Most of the bugs are found by
users, not developers.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 16:22 UTC (Thu) by mrfredsmoothie (subscriber, #3100)

It is not either/or.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 18:23 UTC (Thu) by emkey (guest, #144)

Making a system fault tolerant would in theory mask all bugs. Fixing a bug fixes ONE bug. Thus fault tolerance is a much better short- to mid-term investment. Also, debugging problems is potentially much easier in the fault-tolerant model. For example, many bugs can cause a system to become unresponsive, which makes it nearly impossible to gather data that might help in identifying and solving the problem. With a fault-tolerant system you could optionally enter some sort of debugging environment when a particular component failed. This could greatly reduce the amount of time needed to fix problems.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 18:55 UTC (Thu) by oak (subscriber, #2786)

Good points, but I've seen "fault tolerance" implementations which
make the system less responsive[1] and/or obliterate the traces of
the actual fault[2]. :-)

[1] Windows virus scanning software repeatedly starting some crashing
service so that opening any application window takes >20 minutes
[2] Linux SW restarting the crashed service, where the restart itself changes
the system HW state that caused the original crash and results in a different
crash. You could fix the constant service restarts only by examining
the HW state for the first fault.

So, I would say that if fault tolerance is done, great care would need
to be taken that it really also helps in finding and fixing the bugs
(by notifying the user about the fault, saving data about the fault state,
allowing debugging of the fault when it happens, etc.), not just hiding
them. And this code should be fairly simple to ensure that it actually
works; more complicated code is always harder to maintain and usually
contains more bugs...

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 14:27 UTC (Thu) by pphaneuf (guest, #23480)

I remember, a very long time ago, Mac OS ("classic") used to be very stable compared to the Windows of the time. And yet, when you looked at the software architecture, you couldn't help but think this thing ought to fall apart and crash all the time (no memory protection, cooperative multitasking, bounded memory arena, no virtual memory, etc.). But somehow, it didn't?

Turns out the reason was quite simple. Failures were so spectacular that developers had no choice but to write their software carefully, because when it crashed on them, they had to reboot their entire development environment!

Also, users would tend to notice quickly when their system became less stable, would correlate it to some software they installed recently, then would stop using it, or at least would whine about it all the time. So buggy software would just tend not to catch on, because people kicked it off after it crashed their whole system a few times, and they'd tell fellow users to steer clear.

So yes, these are difficult questions. In my opinion, it'd be nice if those automatic recovery features would still notify the user of their action, and try to make the culprit clear, so that there would be some motivation for users to adjust their software usage toward more reliable software, or at least whine on their blogs. ;-)

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 20, 2007 1:07 UTC (Sat) by bluefoxicy (guest, #25366)

That whole argument is silly. Fault tolerant systems don't COME TO A SCREECHING HALT when they have a fault. When the file system driver dies on Minix, it comes back and life goes on. On Linux, the world stops.

Notice that you can keep going on after disk/FS driver crashes? Know what else you can do? Make logs of the state of the driver at crash (ever core dump a file system?). Linux can do this with kexec and some tricks, although you still could suffer data loss from other applications or manage to critically damage the FS.

What else is interesting is that the drivers are all small and isolated. The only information you need is the state of the driver, and the driver depends only on itself. To debug a component, you debug that component; you don't have to worry about the blurred, gray lines between drivers and VFS and such. Things are easier to chew in small bites.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 11:11 UTC (Thu) by nix (subscriber, #2304)

The crashproof software stuff was another Val Henson special: Failure-oblivious computing.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 8:24 UTC (Thu) by mingo (subscriber, #31122)

> If you take Tanenbaum's suggestion to heart, the 5-10% "penalty" of the micro-kernel design is irrelevant, because you won't just be swapping in a micro-kernel underneath the bloated, unreliable layers we've built on top of Linux. You'll be building an entire new system, bottom to top, with less bloat and more reliability. Will that total system have a 5-10% penalty over my current system? I doubt it. You can't even *begin* to speculate, because there are just far, far too many variables to really judge that.

Yes. The other cost is not performance but flexibility of design and /flexibility of bugfixes/. Both matter very much. You don't win more reliability by making bugs harder to fix. A 'monolithic' kernel's state might be harder to debug, but you've got everything in one place - if you need to change a few drivers to fix a bug in a core infrastructure API - no problem, you just do it. If you need to expose a data structure to another subsystem - no problem.

In a microkernel design you have explicit, documented, relied-on APIs (which are more like ABIs) between subsystems, making both the ad-hoc sharing of information and fast fixing of those interfaces a lot more cumbersome.

Furthermore, if there's a failure in any of the subsystems, I definitely do not want to hide this fact by having a "restart and try again" feature. I really want to achieve a bug-free kernel, not a kernel that appears bug-free.

My opinion is that we'll win far more reliability by concentrating on transparent debugging facilities (static ones such as Sparse and dynamic ones such as [plug alert] lockdep), than via limiting the basic flexibility of the kernel's design. I'd rather burn CPU time on running with lockdep enabled to find deadlocks, than to slow down and hinder /all/ kernel development by forcibly isolating components from each other.

Also, there are some areas and subsystems where isolation wins us /more/ flexibility: for example filesystems. But here Linux already has FUSE, which is an /optional/ feature to write filesystems in user-space. NTFS-3G has already proven (by being leagues better than the in-kernel ntfs driver) that at least for that type of filesystem, and in that stage of its lifecycle, development was faster and more flexible in user-space.

Anyway ... we'll see how this works out. I have a huge amount of respect for Mr. Tanenbaum, his books are great and I am sure he is having tons of fun with Minix - and I definitely agree with him that reliability is the #1 challenge of modern OS design. Diversity of opinion and diversity of approach do not bother me; they will only enrich the end result.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 25, 2007 15:26 UTC (Thu) by tjc (subscriber, #137)

> Furthermore, if there's a failure in any of the subsystems, i definitely do not want to hide this fact by having a "restart and try again" feature.

My understanding is that MINIX 3 will log server/driver crashes and email the developer if so configured. I can't remember if I read this somewhere here, or in one of the whitepapers.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 27, 2007 22:50 UTC (Sat) by pascal.martin (guest, #2995)

Minix will log server/driver crashes? To disk? Even if the disk driver crashed? :-)

Let's assume the disk driver was restarted. What happens if the disk driver crashes again, because of the activity caused by the crash log? 8-)

That may seem silly, but I have seen similar "death trap" problems in real life.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 29, 2007 15:22 UTC (Mon) by tjc (subscriber, #137)

Well yes, there is some chance of that happening, but there's also some chance that you will be hit by a bus and killed before you read this post.

I expect the logging system works in enough cases to be a benefit.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 31, 2007 22:50 UTC (Wed) by tjc (subscriber, #137)

I just found this bit of information in the paper "Reorganizing UNIX for Reliability":

> If crashes reoccur, a binary exponential backoff protocol could be used to prevent bogging down the system with repeated recoveries.

Unfortunately, no specifics are given. It sounds like something from Star Trek TNG.

Data: "Captain, I could use a binary exponential backoff protocol to restart the warp engines."

Picard: "Very good Mr. Data -- make it so!"

http://www.minix3.org/doc/ACSAC-2006.pdf

exponential backoff

Posted Feb 1, 2007 12:54 UTC (Thu) by robbe (guest, #16131)

Exponential backoff is a standard technique used, for example by mail
servers, in the face of transient failures: after the n-th consecutive
error, wait f * k^(n-1) seconds, then retry. Suitable values for f and k
depend on the application -- k is often 2 -> binary exponential backoff.

Example with f = 300, i.e. 5 minutes (a viable value for SMTP):

* First try ... fails
* Wait 5 minutes
* Second try ... fails
* Wait 10 minutes
* Third try ... fails
* Wait 20 minutes
* Fourth try ... fails
* Wait 40 minutes
* Fifth try ...
etc.

It would work the same for OS-component restart, of course with values
for f in the milliseconds.
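
The schedule above can be sketched in a few lines. This is only an illustration: `backoff_delays` and `retry_with_backoff` are hypothetical names, not anything from the MINIX 3 paper or from a mail server. Failures are counted from 0 here, so the wait after failure number i is f * k**i, which matches the 5, 10, 20, 40 minute example.

```python
import time


def backoff_delays(f, k=2, retries=4):
    """Waits before each retry: f, f*k, f*k**2, ... (one entry per retry)."""
    return [f * k ** i for i in range(retries)]


def retry_with_backoff(operation, f, k=2, max_tries=5, sleep=time.sleep):
    """Call operation() until it succeeds, waiting f * k**i after failure i."""
    for i in range(max_tries):
        try:
            return operation()
        except Exception:
            if i == max_tries - 1:
                raise  # out of tries: let the caller see the fault
            sleep(f * k ** i)


# The SMTP-style schedule from the example, in minutes:
# [d / 60 for d in backoff_delays(300)]  ->  [5.0, 10.0, 20.0, 40.0]
```

For restarting an OS component, `f` would be given in fractions of a second (say 0.01) rather than 300. Re-raising after the last try keeps the fault visible instead of hiding it forever.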

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 12:21 UTC (Thu) by lysse (guest, #3190)

I thought QNX had settled the whole "microkernels are a performance sink" question a long time ago?

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 13:41 UTC (Thu) by RobSeace (subscriber, #4435)

I'm not sure it really has... Or, if it has, then it settled it in the
affirmative, in my mind, at least...

I've written commercial code running under versions of QNX from 2.x to 4.x
for many, many years now, and let me tell you: it doesn't even come close to
living up to its hype... Speed? Sure, local IPC via Send()/Receive()/Reply()
is nice and quick, and the same remotely isn't bad, either... But, that's
about the only thing it's got going for it, speedwise... Compare normal
real-world-used standard IPC interfaces, such as pipes or TCP/IP, and the
situation changes dramatically, because they're all welded on almost as an
afterthought, and built on top of that fast SRR messaging, but adding more
layers to go through (and often multiple user-space processes that need to
be communicated with in order to get things done), rather than being first-class
interfaces in their own right... And, reliability? I've seen FAR more
examples of QNX crashing in various unpleasant ways than I ever have seen
from Linux (or pretty much anything outside of a Microsoft OS)... Sure,
most stuff is just a user-space app; but, if your "Dev" app or your "Fsys"
app goes away, you end up pretty well screwed, just as badly as if it were
part of the kernel itself...

QNX has lots of nifty features, and it's great for some specific uses...
For embedded systems, it's probably perfect... But, for a normal server or
desktop/workstation computer for normal everyday use, it's absolutely horrible,
and Linux has it outclassed by miles in every possible area I can think of...
This coming from someone whose main workstation ran QNX 4.x for many years,
so I'm not just making stuff up... When I switched my main dev environment
to Linux, it was absolute nirvana in comparison... Maybe it's because I
come from a Unix background, and QNX is just enough like Unix to make you
frustrated at all the ways it's NOT like Unix... But, whatever it is, I
find myself a LOT happier programming under Linux, that's for sure... (And,
when Linux differs from standard Unix stuff, it's usually in a much more
pleasant and superior way, rather than a frustratingly annoying way... ;-))

(And, note: I have no experience with the newest incarnations of QNX;
"Neutrino", or whatever it is they're calling it these days... 4.2x was the
last version I ever dealt with... So, maybe all my complaints are baseless
these days... *shrug* I've heard they've moved to gcc instead of that lousy
Watcom compiler, so that'd be ONE major improvement right there...)

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 17:52 UTC (Thu) by JoeBuck (subscriber, #2330)

But the X server (at least large parts of it) is like an extended kernel. It runs as root, accesses the hardware directly, without going through the kernel, so bugs have the same ability to toast your system as the kernel does.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 20:54 UTC (Thu) by jamesh (guest, #1159)

Well, the Minix setup basically required all IO port access to go through the kernel, and the policies for each daemon would say what it was allowed to access.

As for intelligent hardware like a modern GPU that can DMA to arbitrary memory locations, his solution was to use the IOMMU to limit where the device could write to. It wasn't clear whether they've implemented use of the IOMMU like this yet.

I've got no idea what impact this would have on performance of graphics operations.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 22:42 UTC (Thu) by nix (subscriber, #2304)

So Minix I/O port access always involves at least two ring transitions
*per port I/O*?

Given the timing-sensitive nature of much port I/O that strikes me as both
wildly impractical and somewhat dangerous.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 24, 2007 14:01 UTC (Wed) by kleptog (subscriber, #1183)

On i386 hardware, it's possible to grant an unprivileged process access to particular I/O ports, without having to do any ring transitions. On Linux it's the ioperm() function call.

These days people use memory-mapped I/O so mmap() is what you mostly need.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Feb 2, 2007 13:46 UTC (Fri) by willy (subscriber, #9762)

Yes, but Minix explicitly doesn't do ioperm; it really does call down to the microkernel to do I/O port accesses. He talked about how 'evil' mmapped I/O was.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 18, 2007 20:14 UTC (Thu) by eklitzke (subscriber, #36426)

> The problem with that sentiment, and the whole article, is that it focuses solely on the kernel. I don't think I've had more than 3 or 4 Linux failures in my life, and most of those were when using very new drivers (or NVIDIA). I have had X crash or lock, various GNOME and KDE components crash or lock, various regular applications crash and lock more times than I can possibly count. Definitely into the triple digits, if not quadruple by now.

I tend to agree with you here. The kernel is very stable -- I've only had one real, bona fide kernel oops in the past 18 months or so (I think it was pdflush that crashed it). And I can't even begin to count how many times X has totally locked up the system (usually after starting a misbehaving Gnome application). But that just means that those applications just need to implement a fault tolerant model as well. It's totally unacceptable that an application can cause X to lock up the whole computer. If X was self-healing that would be spectacular.

A lot of the most modular pieces of software on my system (I am thinking particularly of Postfix and Apache) are also the most stable. TCP/IP is another example of a modular (well, layered) system that is particularly resilient to failure. Certainly this level of modularity isn't needed in all cases, but for any really critical software I think that taking some lessons from the microkernel model is a great idea.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 19, 2007 20:39 UTC (Fri) by dark (subscriber, #8483)

Still, I'd happily give up 90% of my computing power in exchange for a
reliable system. That'll put us back about 7 years in terms of hardware
development. I was happy enough in 2000, I can live with that :-)

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds