If you're a programmer, you've reviewed a lot of code - at minimum,
your own code (or at least we hope so). It doesn't take a lot of code
reviewing before you start recognizing familiar bugs - failure to drop
a lock on the error exit path, dereferencing a pointer just after it's
been proven to be null, forgetting to mark a buffer dirty. Before
long, the sense of deja vu is overpowering. You might even begin to
entertain the sneaking suspicion that half of code review work could
be done by a trained chimpanzee, a 10-line script, or someone from

(Some of) The Solutions
As it turns out, that suspicion is correct. A lot of software errors
can be found automatically, in fact, surprisingly automatically. The
automatic checking we'll discuss falls into two main categories:
static and dynamic. Static checking runs on the source code and
doesn't require integration with a running system. It is often better
at exploring all execution paths, but often explores impossible
execution paths (resulting in false positives), and usually can't deal
with things like function pointers. Dynamic checking runs on a live
system, which produces more accurate results but requires more
invasive techniques and usually can't explore as many execution paths
(though fusion with model-checking techniques can work around this;
see eXplode later in this article). The good news is that automatic
error checking techniques are compatible; we can use them all and get
the best of all worlds.
In this article, we'll review several papers describing some of the
most practical and promising approaches, all from Dawson Engler's
research group at Stanford. Many LWN readers will already be familiar
with metacompilation (known as "the Stanford checker" in kernel
circles) at a high level, but the approach rewards deeper study.
Another approach, named EXE, uses symbolic execution, in which an
instrumented program self-generates complex error-triggering inputs.
We'll also look at eXplode, a light-weight system inspired by
model-checking which quickly and efficiently checks file systems and
other software for correct crash recovery. All of these approaches
are compatible with the Linux kernel (requiring varying amounts of
code modification, though generally little) and have found many
real-world bugs resulting in system panics, data corruption, or
security holes.
Finally, we'll quickly review a variety of existing open source tools
for automatically error-checking programs. With any luck, in a few
years' time we'll have scripts doing the trained chimpanzee code
review work instead of Linux kernel maintainers.
We'll start with one of the most intellectually intriguing approaches,
using code instrumentation and symbolic execution to automatically
generate complex test inputs that trigger serious bugs. The paper is
"Automatically generating malicious disks using symbolic execution",
by Junfeng Yang, Can Sar, Paul Twohey, Cristian Cadar, and Dawson
Engler, which appeared in IEEE Security and Privacy 2006. (A longer,
more detailed paper on the topic is "EXE: Automatically Generating
Inputs of Death", by Cristian Cadar, Vijay Ganesh, Peter Pawlowski,
David Dill, and Dawson Engler, which appeared in ACM Computer and
Communications Security 2006.) The basic idea is that you
begin executing the program with a "symbolic" input. As the program
runs, the EXE system uses compiled-in instrumentation to keep track of
the tests done on the input data. These tests create constraints on
what the input data can be. Once the system has a set of constraints,
it tries to solve them and come up with a set of allowed inputs. It
then checks the allowed inputs to figure out if they will cause one of
a known set of errors, such as dividing by zero, allowing access to
arbitrary memory locations, triggering an assertion, etc.
In this paper, the authors apply the system to the Linux file system
mount code for ext2, ext3, and JFS. In this case, the system starts
out with a symbolic representation of all possible disk images
("inputs"), and gradually whittles down the set of allowed disk
images at each point in the mount code, based on tests such as:
if (sbi->s_frag_size == 0)
It then checks all disk images allowed at any particular point to see
if any of them causes one of the bugs the system can detect. For
example, the statement:
sbi->s_frags_per_block = sb->s_blocksize / sbi->s_frag_size;
would be flagged as triggering a divide-by-zero error were it not for
the prior check pruning out all inputs with sbi->s_frag_size equal to
zero.
The advantage of this approach over simply generating random inputs is
that random testing can't go very deep into the code paths, because a
random input will nearly always fail during the first few
input checks. For example, random input testing for the file system
mount code would almost always fail out at the check of the superblock
magic number. Another pleasant quality of this approach is that it
generates test inputs that trigger the bug detected by the system.
Many other automatic error checkers are plagued by false positives;
this system hands you the exact input that triggers the suspected
bug. It can accurately be described as an error test case generator
as well as an error checker. The prospect is enough to make a systems
programmer salivate.
The next paper is "eXplode: a Lightweight, General System for Finding
Serious Storage System Errors", by Junfeng Yang, Can Sar, and Dawson
Engler, which appeared in OSDI 2006. eXplode tests file systems (and
more complex storage
software stacks) by generating all possible disks that could be the
result of a crash, and then automatically checking them using
verification programs, such as fsck and programs that check for
"correct" file system topology (e.g., the existence of the path
"/a/b/" after creating and properly syncing it). The sequence of
events leading up to an incorrect disk is recorded through some
minor, not terribly intrusive instrumentation. Some minor
modifications to Linux are needed to deterministically replay a
sequence of events; mainly, the execution order of threads must be
maintained, which they approximate using thread priorities. They also
modify Linux to make certain error cases (such as memory allocation
failure) more common.
eXplode works for more than just file systems: it also handles
databases on top of file systems, file systems on top of RAID,
software configuration systems, or any combination of the above. This
is due to the stackable, modular nature of the routines for creating,
mutating, and checking disks. Each layer in the storage stack fills
out the following routines:
- init: one-time initialization, such as formatting a file system partition or creating a fresh database.
- mount: set up the storage system so that operations can be performed on it.
- unmount: tear down the storage system; used by eXplode to clear
the storage system's state so it can explore a different one.
- recover: repair the storage system after an eXplode-simulated crash.
- threads: return the thread IDs for the storage system's kernel threads (to help control non-determinism).
The client code must also provide routines that mutate the storage
system (such as by creating a file) and that check the file system for
correctness, above and beyond the recover routine. When running,
eXplode (1) calls the init() routine for each element in the stack in
order, (2) calls the mount() routines, (3) runs the mutate routine,
forking children at "choice points" - places where execution could go
in one direction or another, (4) at appropriate points, generates all
possible crash disks (due to incomplete and/or reordered writes), runs
the recover routines, and then runs the checker routine, and (5)
repeats steps 3 and 4 until the user gets bored.
A lot of hard work is needed to make this execute quickly and explore
"interesting" parts of the state space, but the results are quite
good, and a big improvement over their earlier system, FiSC.
Sections 7 through 9 of the eXplode paper describe many of the
interesting (and sometimes amusing) bugs eXplode found in Linux and
various software running on Linux, such as Berkeley DB and Subversion.
One of the least pleasant is a bug in the way ext2 handles fsync().
From the paper:
The ext2 bug is a case where an implementation error points out a
deeper design problem. The bug occurs when we: (1) shrink a file "A"
with truncate and (2) subsequently creat, write, and fsync a second
file "B." If file B reuses the indirect blocks of A freed via
truncate, then following a crash e2fsck notices that A's indirect
blocks are corrupt and clears them, destroying the contents of B. (For
good measure it then notices that A and B share blocks and "repairs" B
by duplicating blocks from A). Because ext2 makes no guarantees about
what is written to disk, fundamentally one cannot use fsync to safely
force a file to disk, since the file can still have implicit
dependencies on other file system state (in our case if it reuses an
indirect blocks for a file whose inode has been cleared in memory but
not on disk).
While it is well known that ext2 makes very few guarantees about the
state of the file system, it is surprising that even an fsync() does
not make any guarantees about the state of the file system on disk.
eXplode also found an error in JFS, which does make fairly strong
guarantees, in which an fsync()'d file could lose all its data when a
directory inode is reused as a file inode.
One of the primary goals of eXplode is ease of use and extension to
new systems with only minor effort. The eXplode system runs on a
live, running Linux kernel instance with only minor modifications.
These modifications could be trivially rewritten to be configurable as
a compile-time option (CONFIG_EXPLODE?), making them a reasonable
candidate for integration in the mainline kernel. The checking
interface allows programmers to check new systems (pretty much
anything that runs on Linux and stores data on disks) by writing only
a few lines of code. While the current interface uses C++, it seems
relatively easy to add other front ends using C or shell scripts. The
authors are considering open sourcing the code and are very interested
in hearing more from kernel developers about how to make eXplode more
attractive for everyday use.
Our final paper is "Bugs as Deviant Behavior: A General Approach to
Inferring Errors in Systems Code", by Dawson Engler, David Yu Chen,
Seth Hallem, Andy Chou, and Benjamin Chelf, which appeared in SOSP
2001. The basic idea is to create a framework for static code
analysis which allows programmers to write extremely simple
descriptions of rules that code should follow. Most readers will be
familiar with the simpler applications of this work from the many bug
reports produced by the Stanford checker and reviewed on the
linux-kernel mailing list. This paper goes beyond that level of code
analysis and describes a statistical approach to
inferring relationships between functions and variables, looking for
deviations from the norm, and then ranking and ordering the results so
that the deviations most likely to yield bugs are near the top of the
list. For example, the system can infer relationships such as "only
modify variable X in between calls to spin_lock(Y) and spin_unlock(Y)"
- without writing a rule that explicitly lays out this relationship.
It could almost be described as meta-meta-compilation - the system not
only checks the rules automatically, it infers the rules as well.
A more recent paper, "From Uncertainty to Belief: Inferring the
Specification Within", by Ted Kremenek, Paul Twohey, Godmar Back,
Andrew Y. Ng, and Dawson Engler, which appeared in OSDI 2006, pushes
these ideas even further with a technique that is
capable of inferring more complex rules using a combination of
statistical inference, compiler analysis, and machine learning. For a
system such as Linux where lines of code far outweigh lines of
documentation, this approach has great merit. I find myself doing a
human version of this statistical analysis every time I attempt to use
an undocumented network driver framework function.
A fun footnote is the set of slides from a talk entitled "Weird
things that surprise academics trying to commercialize a static
checking tool". Check out the slides entitled "Myth: the C (or C++)
language exists" and "No, your tool is broken: that's not a bug."
What does this mean for Linux?
A lot of great, practical ideas for automatically finding errors are
coming out of research these days. The existing implementations may
not be practical or available for Linux (for example, the
metacompilation work has been commercialized and will remain closed
source for the indefinite future), but this work can often inspire
useful (though usually not as complete) open source implementations.
On the static code analysis side, both sparse and smatch implement some useful
checks. sparse is already integrated into the kernel build system; smatch,
unfortunately, appears to have stalled.
Annotations like __must_check are producing voluminous (and sometimes
mystifying) compiler warnings. A lot of checks are integrated
directly into gcc, but this requires a programmer with knowledge of
gcc and a fairly long release cycle turnaround time before the check
becomes available. The general-purpose nature of these checks also
means that they sometimes generate many false positives, especially on
systems software, and have to be explicitly turned off again. A
framework that allows gcc to be extended with metacompilation style
checks without requiring recompilation of gcc might be more helpful.
When it comes to dynamic code analysis, Linux has quite a few special
purpose error checkers which can be configured in or out of the
kernel, or turned on and off at boot time. One of the most exciting
is lockdep, the lock
correctness validator written by Ingo Molnar and Arjan van de Ven. It
observes lock acquisition during runtime, and looks for invalid or
inconsistent use of locks (such as reacquiring a lock that is already
held, or acquiring locks in a different order in different places).
Even nicer would be a generic framework for implementing dynamic code
checkers.
File system testers are coming back into vogue. fsx is a file system
stress tester that does a bunch of known-stressful operations to a
file system and checks the results. fsfuzz is one of many useful
tools for randomly altering file systems to trigger bugs.
There are many other useful automatic testing/chimpanzee-replacement
tools; I encourage you to describe your favorites in the comments.