KHB: Automating bug hunting
(Some of) The Solutions
As it turns out, that suspicion is correct. A lot of software errors can be found automatically, in fact, surprisingly automatically. The automatic checking we'll discuss falls into two main categories: static and dynamic. Static checking runs on the source code and doesn't require integration with a running system. It is often better at exploring all execution paths, but often explores impossible execution paths (resulting in false positives), and usually can't deal with things like function pointers. Dynamic checking runs on a live system, which produces more accurate results but requires more invasive techniques and usually can't explore as many execution paths (though fusion with model-checking techniques can work around this; see eXplode later in this article). The good news is that automatic error checking techniques are compatible; we can use them all and get the best of all worlds.In this article, we'll review several papers describing some of the most practical and promising approaches, all from Dawson Engler's research group at Stanford. Many LWN readers will already be familiar with metacompilation (known as "the Stanford checker" in kernel circles) at a high level, but the approach rewards deeper study. Another approach, named EXE, uses symbolic execution, in which an instrumented program self-generates complex error-triggering inputs. We'll also look at eXplode, a light-weight system inspired by model-checking which quickly and efficiently checks file systems and other software for correct crash recovery. All of these approaches are compatible with the Linux kernel (requiring more or less code modification but generally less) and have found many real-world bugs resulting in system panic, data corruption, or security holes.
Finally, we'll quickly review a variety of existing open source tools for automatically error-checking programs. With any luck, in a few years' time we'll have scripts doing the trained chimpanzee code review work instead of Linux kernel maintainers.
The Papers
We'll start with one of the most intellectually intriguing approaches, using code instrumentation and symbolic execution to automatically generate complex test inputs that trigger serious bugs. The paper is Automatically generating malicious disks using symbolic execution, by Junfeng Yang, Can Sar, Paul Twohey, Cristian Cadar, and Dawson Engler, and appeared in IEEE Security and Privacy 2006. (Another longer, more detailed paper on the topic is EXE: Automatically Generating Inputs of Death, by Cristian Cadar, Vijay Ganesh, Peter Pawlowski, David Dill, and Dawson Engler and appeared in ACM Computer Communications and Security 2006). The basic idea is that you begin executing the program with a "symbolic" input. As the program runs, the EXE system uses compiled-in instrumentation to keep track of the tests done on the input data. These tests create constraints on what the input data can be. Once the system has a set of constraints, it tries to solve them and come up with a set of allowed inputs. It then checks the allowed inputs to figure out if they will cause one of a known set of errors, such as dividing by zero, allowing access to arbitrary memory locations, triggering an assertion, etc.In this paper, the authors apply the system to the Linux file system mount code for ext2, ext3, and JFS. In this case, the system starts out with a symbolic representation of all possible disk images ("inputs"), and gradually whittles away allowed disk images at each point in the mount code, based on actions such as:
if (sbi->s_frag_size == 0)
goto cantfind_ext2;
It then checks all disk images allowed at any particular point to see
if any of them causes one of the bugs the system can detect. For
example, the statement:
sbi->s_frags_per_block = sb->s_blocksize / sbi->s_frag_size;
Would be flagged as triggering a divide by zero error without the
prior check pruning out all inputs with sb->s_frag_size equal to zero.
The advantage of this approach over simply generating random inputs is that random error generation can't go very deep in testing code paths because the random input will nearly always fail during the first few input checks. For example, random input testing for the file system mount code would almost always fail out at the check of the superblock magic number. Another pleasant quality of this approach is that it generates test inputs that trigger the bug detected by the system. Many other automatic error checkers are plagued by false positives; this system hands you the exact input that triggers the supposed bug. It can be accurately described as a error test case generating system in addition to an error checking system. The prospect is enough to make a systems programmer salivate.
The next paper is eXplode: a Lightweight, General System for Finding Serious Storage System Errors, by Junfeng Yang, Can Sar, and Dawson Engler, which appeared in OSDI 2006. eXplode tests file systems (and more complex storage software stacks) by generating all possible disks that could be the result of a crash, and then automatically checking them using verification programs, such as fsck and programs that check for "correct" file system topology (e.g., the existence of the path "/a/b/" after creating and properly syncing it). The sequence of events leading up to an incorrect disk is recorded through some minor, not terribly intrusive instrumentation. Some minor modifications to Linux are needed to deterministically replay a sequence of events; mainly, the execution order of threads must be maintained, which they approximate using thread priorities. They also modify Linux to make certain error cases (such as memory allocation failure) more common.
eXplode works for more than just file systems, it also works for databases on top of file systems, file systems on top of RAID, software configuration systems, or any combination of the above. This is due to the stackable, modular nature of the routines for creating, mutating, and checking disks. Each layer in the storage stack fills out the following routines:
- init: one-time initialization, such as formatting a file system partition or creating a fresh database.
- mount: set up the storage system so that operations can be performed on it.
- unmount: tear down the storage system; used by eXplode to clear the storage system's state so it can explore a different one.
- recover: repair the storage system after an eXplode-simulated crash.
- threads: return the thread IDs for the storage system's kernel threads (to help control non-determinism).
A lot of hard work is needed to make this execute quickly and explore "interesting" parts of the state space, but the results are quite good, and a big improvement over their earlier system, FiSC. Sections 7 through 9 of the eXplode paper describe many of the interesting (and sometimes amusing) bugs eXplode found in Linux and various software running on Linux, such as Berkeley DB and Subversion. One of the least pleasant is a bug in the way ext2 handles fsync(). From the paper:
While it is well known that ext2 makes very few guarantees on the state of the file system, it is surprising that even an fsync() call does not make any guarantees about the state of file system on disk. eXplode also found an error in JFS, which does make fairly strong guarantees, in which an fsync()'d file could lose all its data when a directory inode is reused as a file inode.
One of the primary goals of eXplode is ease of use and extension to new systems with only minor effort. The eXplode system runs on a live, running Linux kernel instance with only minor modifications. These modifications could be trivially rewritten to be configurable as a compile-time option (CONFIG_EXPLODE?), making them a reasonable candidate for integration in the mainline kernel. The checking interface allows programmers to check new systems (pretty much anything that runs on Linux and stores data on disks) by writing only a few lines of code. While the current interface uses C++, it seems relatively easy to add other front ends using C or shell scripts. The authors are considering open sourcing the code and are very interested in hearing more from kernel developers about how to make eXplode more attractive for everyday use.
Our final paper is Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code, by Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf, and appeared in SOSP 2001. The basic idea is to create a framework for static code analysis which allows programmers to write extremely simple descriptions of rules that code should follow. Most readers will be familiar with the simpler applications of this work from the many bug reports produced by the Stanford checker and reviewed on the linux-kernel mailing list. This paper goes above and beyond this level of code analysis and describes on a statistical approach to inferring relationships between functions and variables, looking for deviations from the norm, and then ranking and ordering the results so that the deviations most likely to yield bugs are near the top of the list. For example, the system can infer relationships such as "only modify variable X in between calls to spin_lock(Y) and spin_unlock(Y)" - without writing a rule that explicitly lays out this relationship. It could almost be described as meta-meta-compilation - the system not only checks the rules automatically, it infers the rules automatically.
A more recent paper, From Uncertainty to Belief: Inferring the Specification Within, by Ted Kremenek, Paul Twohey, Godmar Back, Andrew Y. Ng, and Dawson Engler, which appeared in OSDI 2006, pushes these ideas even further with a technique that is capable of inferring more complex rules using a combination of statistical inference, compiler analysis, and machine learning. For a system such as Linux where lines of code far outweigh lines of documentation, this approach has great merit. I find myself doing a human version of this statistical analysis every time I attempt to use an undocumented network driver framework function.
A fun footnote is the slides from a talk entitled Weird things that surprise academics trying to commercialize a static checking tool. Check out the slides entitled "Myth: the C (or C++) language exists" or "No, your tool is broken: that's not a bug."
What does this mean for Linux?
A lot of great, practical ideas for automatically finding errors are coming out of research these days. The existing implementations may not be practical or available for Linux (for example, the metacompilation work has been commercialized and will remain closed source for the indefinite future), but this work can often inspire useful (though usually not as complete) open source implementations.On the static code analysis side, both sparse and smatch implement some useful checks. sparse is already integrated into the kernel build system; smatch, unfortunately, appears to have stalled. Annotations like __must_check are producing voluminous (and sometimes mystifying) compiler warnings. A lot of checks are integrated directly into gcc, but this requires a programmer with knowledge of gcc and a fairly long release cycle turnaround time before the check becomes available. The general-purpose nature of these checks also means that they sometimes generate many false positives, especially on systems software, and have to be explicitly turned off again. A framework that allows gcc to be extended with metacompilation style checks without requiring recompilation of gcc might be more helpful.
When it comes to dynamic code analysis, Linux has quite a few special purpose error checkers which can be configured in or out of the kernel, or turned on and off at boot time. One of the most exciting is lockdep, the lock correctness validator written by Ingo Molnar and Arjan van de Ven. It observes lock acquisition during runtime, and looks for invalid or inconsistent use of locks (such as reacquiring locks or acquiring locks in a different order). Even nicer would be a generic framework for implementing dynamic code checkers, perhaps using part of the SystemTap framework.
File system testers are coming back into vogue. fsx is a file system stress tester that does a bunch of known-stressful operations to a file system and checks the results. fsfuzz is one of many useful tools for randomly altering file systems to expose bugs.
There are many other useful automatic testing/chimpanzee-replacement tools; I encourage you to describe your favorites in the comments. Happy debugging!
| Index entries for this article | |
|---|---|
| GuestArticles | Aurora (Henson), Valerie |
