
KHB: Failure-oblivious computing

Posted Jun 29, 2006 9:29 UTC (Thu) by job (guest, #670)
Parent article: KHB: Failure-oblivious computing

Isn't there a risk that this is effectively bug hiding, so no one will fix them?

KHB: Failure-oblivious computing

Posted Jun 30, 2006 9:22 UTC (Fri) by oak (guest, #2786)

Note that this is not a general solution for "fixing" buggy programs, but a way to increase the reliability of programs in situations where:
  • Uptime / not crashing
  • (Performance / speed)
are more important than the program working correctly.

This might be the case where the handled data is either:

  • Redundant
  • Not written, just read and sent somewhere else (another machine or process)
  • You don't care about the data as much as about the rest of the service ("good enough" data reliability is satisfactory)

Even in those kinds of situations I would expect this feature to be enabled only after the software:

  • Has finished its development phase and has been deployed in place(s) where it's hard to update (e.g. set-top boxes)
  • Has been pretty thoroughly tested in an environment where similar bugs cause the program to e.g. dump core

I would say that for this to be generally useful, the following should be possible:

  • Switching the program, without re-compiling, to terminate/dump core instead
  • Keeping the program fast enough even with this run-time configurability

I'm pretty sure administrators will still want to be able to debug the problems they encounter.
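To make the run-time switch concrete, here is a minimal sketch in C of bounds-checked accessors that either continue (invalid reads return a made-up value, invalid writes are discarded) or dump core, depending on an environment variable. All names here (FO_POLICY, fo_read, fo_write, and so on) are hypothetical and chosen for illustration; a real failure-oblivious system would have the compiler insert such checks rather than rely on hand-written helpers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Number of invalid accesses seen so far (hypothetical bookkeeping). */
unsigned long fo_fault_count = 0;

/* Returns non-zero when the hypothetical FO_POLICY environment variable
 * is set to "abort". It is consulted only when a fault happens, so the
 * lookup adds no cost to valid accesses. */
int fo_should_abort(void)
{
    const char *v = getenv("FO_POLICY");
    return v != NULL && strcmp(v, "abort") == 0;
}

/* Log the invalid access, then either terminate or return to the caller. */
void fo_report(const char *what, size_t index, size_t len)
{
    fo_fault_count++;
    fprintf(stderr, "invalid %s: index %zu, length %zu\n", what, index, len);
    if (fo_should_abort())
        abort();                 /* terminate / dump core for debugging */
}

/* Checked read: an out-of-bounds read yields the manufactured value 0. */
int fo_read(const int *buf, size_t len, size_t index)
{
    if (index >= len) {
        fo_report("read", index, len);
        return 0;
    }
    return buf[index];
}

/* Checked write: an out-of-bounds write is silently discarded. */
void fo_write(int *buf, size_t len, size_t index, int value)
{
    if (index >= len) {
        fo_report("write", index, len);
        return;
    }
    buf[index] = value;
}

/* Demonstration: with FO_POLICY unset, both faults are logged and
 * execution continues with the valid data untouched. */
int fo_demo(void)
{
    int data[4] = {1, 2, 3, 4};
    fo_write(data, 4, 10, 99);   /* discarded */
    assert(data[3] == 4);        /* neighbouring data is intact */
    return fo_read(data, 4, 10); /* manufactured value */
}
```

Running the same binary with FO_POLICY=abort in the environment would stop at the first invalid access instead, which is the behaviour an administrator would want while debugging. Because the policy is checked only on the error path, valid accesses pay only for the bounds check itself.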

The more you value the data the program handles, the less you want it to continue after there's some problem in handling the data. Compare for example a program that manipulates / writes the same data / files constantly (e.g. database server) to a program that acts as a filter for data that's different each time (e.g. mail server) or doesn't write it at all (e.g. www-server).

KHB: Failure-oblivious computing

Posted Jun 30, 2006 9:40 UTC (Fri) by oak (guest, #2786)

Btw. an example of a common open source library that by default doesn't
terminate an application which has an error is GLib (used by Gtk
and many other projects). By default it just logs GLib Warnings
and Critical errors to the console and lets the program hobble along.

In a GUI environment end-users don't see these messages as they don't
(usually) start programs from the console. These errors can be turned
into abort() calls (i.e. program termination) with an environment
variable (G_DEBUG=fatal-warnings).
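The log-and-continue pattern can be sketched as below. This is not GLib code: the macro RETURN_VAL_IF_FAIL, the function names and the SKETCH_DEBUG variable are hypothetical stand-ins, loosely modelled on GLib's g_return_val_if_fail() checks and its G_DEBUG variable.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count of failed precondition checks (hypothetical bookkeeping). */
int critical_count = 0;

/* Non-zero when the hypothetical SKETCH_DEBUG environment variable
 * contains "fatal-criticals". */
int checks_are_fatal(void)
{
    const char *v = getenv("SKETCH_DEBUG");
    return v != NULL && strstr(v, "fatal-criticals") != NULL;
}

/* Log the failed check; abort only if the environment asks for it. */
void log_critical(const char *func, const char *expr)
{
    critical_count++;
    fprintf(stderr, "CRITICAL: %s: assertion '%s' failed\n", func, expr);
    if (checks_are_fatal())
        abort();
}

/* Precondition check: on failure, log and return a fallback value
 * so that the caller keeps running. */
#define RETURN_VAL_IF_FAIL(expr, val)          \
    do {                                       \
        if (!(expr)) {                         \
            log_critical(__func__, #expr);     \
            return (val);                      \
        }                                      \
    } while (0)

/* Example library function guarded by the check. */
size_t name_length(const char *name)
{
    RETURN_VAL_IF_FAIL(name != NULL, 0);
    return strlen(name);
}

/* Demonstration: a bad argument is logged and the program hobbles along. */
int demo_continue_after_error(void)
{
    size_t n = name_length(NULL);  /* logged, returns the fallback 0 */
    assert(n == 0);
    return critical_count;
}
```

The caller receives a plausible but meaningless value (a length of 0), which is exactly why such errors stay invisible to someone who never looks at the console.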

The GLib default behavior allows a program to corrupt its internal data
structures e.g. through double-frees. However, the Gnome apps have seemed
to work fairly OK although the developers haven't always had time to
fix all of those errors, so I guess they had fixed the most problematic
ones before release. ;-)
