LWN.net

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 25, 2007 15:26 UTC (Thu) by tjc (subscriber, #137)
In reply to: LCA: Andrew Tanenbaum on creating reliable systems by mingo
Parent article: LCA: Andrew Tanenbaum on creating reliable systems

> Furthermore, if there's a failure in any of the subsystems, i definitely do not want to hide this fact by having a "restart and try again" feature.
My understanding is that MINIX 3 will log server/driver crashes and email the developer if so configured. I can't remember if I read this somewhere here, or in one of the whitepapers.



LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 27, 2007 22:50 UTC (Sat) by pascal.martin (subscriber, #2995)

Minix will log server/driver crashes? To disk? Even if the disk driver crashed? :-)

Let's assume the disk driver was restarted. What happens if the disk driver crashes again, because of the activity caused by the crash log? 8-)

That may seem silly, but I have seen similar "death trap" problems in real life.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 29, 2007 15:22 UTC (Mon) by tjc (subscriber, #137)

Well yes, there is some chance of that happening, but there's also some chance that you will be hit by a bus and killed before you read this post.

I expect the logging system works in enough cases to be a benefit.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 31, 2007 22:50 UTC (Wed) by tjc (subscriber, #137)

I just found this bit of information in the paper "Reorganizing UNIX for Reliability":

> If crashes reoccur, a binary exponential backoff protocol could be used to prevent bogging down the system with repeated recoveries.

Unfortunately, no specifics are given. It sounds like something from Star Trek TNG.

Data: "Captain, I could use a binary exponential backoff protocol to restart the warp engines."

Picard: "Very good Mr. Data -- make it so!"

http://www.minix3.org/doc/ACSAC-2006.pdf

exponential backoff

Posted Feb 1, 2007 12:54 UTC (Thu) by robbe (guest, #16131)

Exponential backoff is a standard technique used, for example, by mail
servers in the face of transient failures: after the n-th consecutive
error, wait f * k^(n-1) seconds, then retry. Suitable values for f and k
depend on the application; k is often 2, hence "binary" exponential backoff.

Example with f = 300, i.e. 5 minutes (a viable value for SMTP):

* First try ... fails
* Wait 5 minutes
* Second try ... fails
* Wait 10 minutes
* Third try ... fails
* Wait 20 minutes
* Fourth try ... fails
* Wait 40 minutes
* Fifth try ...
etc.
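
The schedule above can be reproduced with a short Python sketch. The name
backoff_delays and its parameters are made up for illustration and are not
taken from any mail server; the exponent is counted from zero so that the
first wait equals f.

```python
from itertools import islice


def backoff_delays(f, k=2):
    """Yield successive wait times f, f*k, f*k**2, ... (in seconds)."""
    delay = f
    while True:
        yield delay
        delay *= k


# Waits after the first four failed tries, with f = 300 s and k = 2.
waits = list(islice(backoff_delays(300), 4))
print([w // 60 for w in waits])  # minutes: [5, 10, 20, 40]
```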

It would work the same for OS-component restart, of course with values
for f in the milliseconds.
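
For the component-restart case, a minimal sketch might look like the
following. This is not how MINIX 3 does it (the paper gives no specifics):
restart_with_backoff, start, the 10 ms base delay and the retry limit are
all hypothetical. The sleep function is passed in so the example can record
the delays instead of actually waiting.

```python
import time


def restart_with_backoff(start, f=0.010, k=2, max_tries=5, sleep=time.sleep):
    """Call start() until it succeeds, waiting f, f*k, f*k**2, ... seconds
    between attempts. Returns the number of tries used, or None if every
    attempt failed. All names and defaults here are hypothetical."""
    delay = f
    for attempt in range(1, max_tries + 1):
        try:
            start()
            return attempt
        except Exception:
            if attempt == max_tries:
                return None  # give up rather than retry forever
            sleep(delay)
            delay *= k


# Simulated driver that crashes on its first three starts.
crashes = iter([True, True, True, False])


def flaky_driver():
    if next(crashes):
        raise RuntimeError("driver crashed")


slept = []  # record the requested delays instead of sleeping
tries = restart_with_backoff(flaky_driver, sleep=slept.append)
print(tries, slept)
```

Capping the number of tries is one possible answer to the "death trap"
worry raised earlier in the thread: a component that keeps crashing is
eventually left down instead of being restarted indefinitely.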

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.