LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 27, 2007 22:50 UTC (Sat) by pascal.martin (guest, #2995)
In reply to: LCA: Andrew Tanenbaum on creating reliable systems by tjc
Parent article: LCA: Andrew Tanenbaum on creating reliable systems

Minix will log server/driver crashes? To disk? Even if the disk driver crashed? :-)

Let's assume the disk driver was restarted. What happens if the disk driver crashes again, because of the activity caused by the crash log? 8-)

That may seem silly, but I have seen similar "death trap" problems in real life.



LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 29, 2007 15:22 UTC (Mon) by tjc (subscriber, #137)

Well yes, there is some chance of that happening, but there's also some chance that you will be hit by a bus and killed before you read this post.

I expect the logging system works in enough cases to be a benefit.

LCA: Andrew Tanenbaum on creating reliable systems

Posted Jan 31, 2007 22:50 UTC (Wed) by tjc (subscriber, #137)

I just found this bit of information in the paper "Reorganizing UNIX for Reliability":

If crashes reoccur, a binary exponential backoff protocol could be used to prevent bogging down the system with repeated recoveries.

Unfortunately, no specifics are given. It sounds like something from Star Trek TNG.

Data: "Captain, I could use a binary exponential backoff protocol to restart the warp engines."

Picard: "Very good, Mr. Data -- make it so!"

http://www.minix3.org/doc/ACSAC-2006.pdf

exponential backoff

Posted Feb 1, 2007 12:54 UTC (Thu) by robbe (guest, #16131)

Exponential backoff is a standard technique used, for example, by mail
servers in the face of transient failures: after the n-th consecutive
error, wait f * k^(n-1) seconds, then retry. Suitable values for f and k
depend on the application -- k is often 2, hence "binary" exponential backoff.

Example with k = 2 and f = 300 seconds, i.e. 5 minutes (a viable value for SMTP):

* First try ... fails
* Wait 5 minutes
* Second try ... fails
* Wait 10 minutes
* Third try ... fails
* Wait 20 minutes
* Fourth try ... fails
* Wait 40 minutes
* Fifth try ...
etc.

It would work the same way for restarting OS components, though with
values for f in the millisecond range.
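
For what it's worth, the schedule above can be sketched in a few lines of Python. This is only an illustration of the general technique, not anything taken from Minix or from the paper, which gives no specifics. The names backoff_delay and retry_with_backoff, the max_tries cutoff, and the injectable sleep argument are all made up for this example.

```python
import time


def backoff_delay(n, f=300, k=2):
    """Seconds to wait after the n-th consecutive failure (n starts at 1).

    f is the initial delay and k the multiplier; both defaults are just
    the values from the SMTP example, not a recommendation.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return f * k ** (n - 1)


def retry_with_backoff(attempt, max_tries=5, f=300, k=2, sleep=time.sleep):
    """Call attempt() until it returns True or max_tries is used up.

    Returns the number of the try that succeeded, or None if all failed.
    The sleep argument is a hypothetical hook so the waiting can be
    replaced (for instance in a test) without actually sleeping.
    """
    for n in range(1, max_tries + 1):
        if attempt():
            return n
        if n < max_tries:
            # Wait f, f*k, f*k^2, ... seconds between consecutive failures.
            sleep(backoff_delay(n, f, k))
    return None
```

With the defaults, the waits between five failed tries come out as 300, 600, 1200 and 2400 seconds, matching the 5/10/20/40 minute list. For something like a driver restart one would presumably pass a much smaller f (say 0.001 for one millisecond) and probably cap the maximum delay as well, but that is guesswork on my part.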
