Posted Jun 30, 2006 9:22 UTC (Fri) by oak (subscriber, #2786)
Note that this is not a general solution for "fixing" buggy
programs, but a way to increase the reliability of programs in
situations where:
  * uptime / not crashing
  * (performance / speed)
are more important than the program working correctly.
This might be the case where the handled data is:
  * redundant
  * not written, just read and sent somewhere else
    (another machine or process)
  * something you don't care about as much as the rest of the service
    ("good enough" data reliability is satisfactory)
Even in those kinds of situations I would assume this feature
to be enabled only after the software:
  * has finished its development phase and has been deployed in
    place(s) where it's hard to update (e.g. set-top boxes)
  * has been pretty thoroughly tested in an environment where similar
    bugs cause the program to e.g. dump core
I would say that for this thing to be generally useful, the
following should be possible:
  * changing the program, without re-compiling, to terminate/dump core
    instead
  * this run-time configurability still being fast enough
...as I'm pretty sure administrators will still want to be able to
debug the problems they will encounter.
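To make the run-time switch concrete, here is a minimal sketch in Python, not taken from the article. Everything in it is an assumption made for illustration: the environment variable name OBLIVIOUS_POLICY, the Buffer class, and the use of SystemExit as a stand-in for terminating/dumping core. The policy is read once and cached as a boolean, so each bad access only costs one extra branch.

```python
import os

# Hypothetical variable name, chosen for this sketch only.
POLICY_ENV = "OBLIVIOUS_POLICY"


def load_policy(environ=os.environ):
    """Read the policy once at startup: 'continue' (default) or 'abort'."""
    value = environ.get(POLICY_ENV, "continue").strip().lower()
    return value if value in ("continue", "abort") else "continue"


class Buffer:
    """Fixed-size buffer whose out-of-bounds handling is chosen at run time."""

    def __init__(self, data, policy):
        self._data = list(data)
        # Cache the decision as a bool so each bad access costs one branch.
        self._abort = policy == "abort"
        self.suppressed = 0  # number of errors that were papered over

    def read(self, index):
        if 0 <= index < len(self._data):
            return self._data[index]
        if self._abort:
            # Stand-in for terminating/dumping core in a native program.
            raise SystemExit(f"invalid read at index {index}")
        self.suppressed += 1
        return 0  # manufactured value, in the failure-oblivious spirit
```

With the variable unset, Buffer([1, 2, 3], load_policy()).read(10) quietly returns 0; with OBLIVIOUS_POLICY=abort the same call stops the program, which is what an administrator would want while debugging.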
The more you value the data the program handles, the less you want
it to continue after there's some problem in handling the data.
Compare for example a program that manipulates / writes the same
data / files constantly (e.g. a database server) to a program that
acts as a filter for data that's different each time (e.g. a mail
server) or doesn't write it at all (e.g. a www-server).
KHB: Failure-oblivious computing
Posted Jun 30, 2006 9:40 UTC (Fri) by oak (subscriber, #2786)
Btw. an example of a common open source library that by default doesn't
terminate an application which has an error is GLib (used by Gtk
and many other projects). By default it just logs GLib Warnings
and Critical errors to the console and lets the program hobble along.
In a GUI environment end-users don't see these messages as they don't
(usually) start programs from the console. These errors can be turned
into abort() (i.e. program termination) with an environment variable.
The GLib default behavior allows a program to corrupt its internal data
structures e.g. through double-frees. However, the Gnome apps have seemed
to work fairly OK although the developers haven't always had time to
fix all of those errors, so I guess they had fixed the most problematic
ones before release. ;-)
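The "log and hobble along, unless an environment variable says otherwise" pattern can be sketched roughly as below. This is a Python imitation for illustration only and does not call GLib: the names check, list_remove, FatalCritical and SKETCH_FATAL_CRITICALS are all made up here. In GLib itself the switch is, as far as I know, the G_DEBUG variable (e.g. G_DEBUG=fatal-criticals).

```python
import sys

FATAL_ENV = "SKETCH_FATAL_CRITICALS"  # hypothetical; not a GLib variable


class FatalCritical(Exception):
    """Stand-in for abort(); a C program would terminate here."""


def check(condition, message, environ, stream=sys.stderr):
    """Return True if condition holds; otherwise log, and maybe 'abort'."""
    if condition:
        return True
    # Goes to the console, which GUI users normally never see.
    print(f"CRITICAL: {message}", file=stream)
    if environ.get(FATAL_ENV) == "1":
        raise FatalCritical(message)
    return False


def list_remove(items, value, environ):
    """Remove value from items; on a bad call, warn and carry on."""
    if not check(value in items, f"{value!r} not in list", environ):
        return  # hobble along: the caller never learns about the bug
    items.remove(value)
```

By default a bad call only leaves a line on stderr and the caller continues with unchanged data; with the (hypothetical) variable set to 1 the first such error stops the program at the point where it happened.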