Editors note: this is the second in the Kernel Hacker's Bookshelf
series by Valerie Henson; if you missed it, the first article is over here.
Computer programs have bugs. As programmers, we know that this is
inevitable, given the trade-off in time and money against creating a
perfect system. Systems with nearly-zero bug counts exist (e.g., the
Shuttle software, only 17 bugs in 420,000 lines of code over the last
11 releases) but they require vast amounts of work to achieve this
level of correctness, work that is completely unjustifiable for most
programs (such as desktop operating systems). But we're programmers,
it's our job to replace time and money with smart ideas.
What would happen if when a program had a memory error - and it
detected that error, ignored it, and drove happily on, oblivious to
the failure? You would expect that this would result in horrible
errors and obscure crashes. But what if it worked - or even made
things better? For example, failing to check the size of a memory
copy operation can result in a buffer overflow attack. Could we do
something clever that would both paper over the memory error and keep
the application running, more or less on track?
Martin Rinard and a few of his colleagues got to wondering about this
question and decided to test it - and found that the answer was yes,
you can automatically handle memory bugs in a better, safer way than
either ignoring the bug or terminating the program. I first heard of
their technique, Failure-Oblivious
, at their talk at OSDI 2004
. The talk
was quite lively; if there was a "Most Laughs per Minute" award,
Martin Rinard would have won it.
The explanation of how failure-oblivious computing is implemented
might seem utterly crazy, but stick with me. Remember, the amazing
thing about failure-oblivious computing is that when you implement it,
it works! (At least for quite a few useful applications.) The basic
idea is to detect memory errors - out-of-bound reads, out-of-bound
writes - and instead of killing the program, handle otherwise fatal
errors by turning them into relatively benign bugs. Detecting the
memory errors requires a "safe-C compiler" - a C compiler that adds
run-time memory access checks.
Safe-C compilers (and languages that always check memory accesses)
have been around for a long time. When they detect a memory error,
the process gets a segmentation fault, and usually exits shortly
thereafter. In failure-oblivious computing, the application never
even knows the memory error happened. In the case of an out-of-bounds
write, the write is silently thrown away and execution continues.
Handling out-of-bounds reads is slightly harder. In this case, a
made-up value is manufactured and returned.
How do you pick which value to return? Two observations lie behind
the answer. First, 0 and 1 are the most common values in computation.
Second, sometimes the program is looking for a particular value before
returning, such as searching for a particular ASCII character in a
string, or iterating through a loop 100 times. The result is a series
of return values that looks something like this:
0, 1, 2, 0, 1, 3, 0, 1, 4,...
So you throw away invalid writes, and make up stuff to return for
invalid reads. Crazy, right? But crazy like a fox.
Why does it work?
Failure-oblivious computing is targeted at a particular class of
applications, ones with short error-propagation distances - in other
words, applications that have relatively short execution paths which
return without affecting much global state. This includes a rather
useful class of applications, such as web servers, mail servers, and
mail readers. It does not include applications like scientific
modeling software, in which one wrong value can fatally corrupt the
final answer. Software programs which handle incoming requests and
return to a waiting state, or have many independent threads of
execution are good candidates for failure-oblivious computing.
Another reason failure-oblivious computing works is because memory
errors are transformed into input errors. Since the programs have to
deal with invalid or malicious input already, often the result is an
anticipated error, one the program knows how to deal with cleanly.
For example, a buffer overflow attack on Sendmail uses a malformed,
too-long, illegal email address to overwrite some other part of the
program's memory. This technique silently discards the writes that go
beyond the buffer, and Sendmail continues on to check the validity of
the input - whether or not it's a correctly formed email address.
Answer: No, so throw it away and go on to the next request. At this
point, Sendmail is back in known territory and the error has stopped
A limitation of this technique is the cost of memory bounds checking.
Applications that need to access memory frequently will probably not be
good candidates for this technique. However, applications that are
limited by I/O time, or only need to complete before the human user
notices a delay, won't be much impacted by the cost. Indeed, humans
can't detect delays below about 100 milliseconds - an eternity in
Failure-oblivious computing in practice
Rinard and his co-authors evaluated failure-oblivious computing with
versions of several commonly used open source applications with known
buffer overflow attacks: Sendmail, Pine, Apache, and Midnight
Commander. They ran three versions of each program: an unaltered
version, one using just safe-C compilation, and one transformed into a
failure-oblivious program. In each case, the failure-oblivious
version performed acceptably (sometimes better), did not create any
new bugs, and did not suffer any security breaches.
One example was the Pine mail reader. It had a bug in processing the
"From" field for display in the message index. It needed to add a '\'
character in front of certain characters, but allocated a too-small
buffer to copy it into. Some "From" fields could overflow the buffer
and cause the program to segfault and die. The safe-C version of the
program dies as well, because all it can do is detect the buffer
overflow. The failure-oblivious version threw away the writes beyond
the end of the buffer, and then went on to behave exactly correctly!
The length of the "From" field displayed in the index is shorter than
the length of the buffer, so the fact that it was truncated too early
is unobservable. When the user reads a particular message, a
different code path correctly displays the "From" field. Now an email
message that would cause Pine to die every time it was started could
be correctly displayed and handled.
The performance of failure-oblivious Pine was 1.3 to 8 times slower
times on certain tasks, but the total elapsed time to respond to user
input was still in the low milliseconds range. For interactive use,
the slowdown is acceptable. In the case of the Apache server bug, the
performance of the failure-oblivious server was actually better than
either of the other two versions. The higher performance was due to
the fact that the bug would kill an Apache thread each time it was
encountered, incurring the overhead of creating a replacement thread.
The failure-oblivious version did not have the overhead of constantly
killing and restarting threads and could server requests much faster.
Especially exciting is the use of failure-oblivious computing for
widely used network servers, such as Apache and Sendmail. The paper
has in-depth examinations of how buffer overflow bugs are prevented
and indeed ignored by the failure-oblivious versions of these and
What failure-oblivious computing means for Linux
Linux has a huge variety of techniques for improving system security
in the face of bugs. SELinux, various stack protection schemes,
capabilities - all these techniques help cut down but don't eliminate
security problems. Failure-oblivious computing would fill one niche,
and in some cases will be the best solution due to the ability to
continue running after a normally-fatal memory error. Wouldn't it be
nice if, when everyone else is suffering from some brand-new zero-day
attack, your system is not only secure but still up and running?
More importantly, this paper teaches the value of experimentation with
obviously crazy ideas. Even after seeing the talk and reading the
paper and talking to the author, I still find it a little
mind-boggling that failure-oblivious computing works. Even more fun
is understanding why it works - a good reason to read the full paper
yourself. I am certain that computers (and computer science) will
continue to surprise us for many years to come.
[Do you have a favorite textbook or systems paper? Of course you do.
Send your suggestions to:
val dot henson at gmail dot com
Valerie Henson is a Linux kernel
developer working for Intel. Her interests include file systems,
networking, women in computing, and walking up and down large
mountains. She is always looking for good systems programmers, so
send her some email and introduce yourself.]
to post comments)