
KHB: Automating bug hunting

November 21, 2006

This article was contributed by Valerie Henson

If you're a programmer, you've reviewed a lot of code - at minimum, your own code (or at least we hope so). It doesn't take a lot of code reviewing before you start recognizing familiar bugs - failure to drop a lock on the error exit path, dereferencing a pointer just after it's been proven to be null, forgetting to mark a buffer dirty. Before long, the sense of déjà vu is overpowering. You might even begin to entertain the sneaking suspicion that half of code review work could be done by a trained chimpanzee, a 10-line script, or someone from marketing.

(Some of) The Solutions

As it turns out, that suspicion is correct. A lot of software errors can be found automatically - surprisingly automatically, in fact. The automatic checking we'll discuss falls into two main categories: static and dynamic. Static checking runs on the source code and doesn't require integration with a running system. It is better at exploring many execution paths, but it also explores impossible execution paths (resulting in false positives), and usually can't deal with things like function pointers. Dynamic checking runs on a live system, which produces more accurate results but requires more invasive techniques and usually can't explore as many execution paths (though fusion with model-checking techniques can work around this; see eXplode later in this article). The good news is that automatic error checking techniques are compatible; we can use them all and get the best of all worlds.

In this article, we'll review several papers describing some of the most practical and promising approaches, all from Dawson Engler's research group at Stanford. Many LWN readers will already be familiar with metacompilation (known as "the Stanford checker" in kernel circles) at a high level, but the approach rewards deeper study. Another approach, named EXE, uses symbolic execution, in which an instrumented program self-generates complex error-triggering inputs. We'll also look at eXplode, a light-weight system inspired by model checking which quickly and efficiently checks file systems and other software for correct crash recovery. All of these approaches are compatible with the Linux kernel (requiring varying amounts of code modification, though generally little) and have found many real-world bugs resulting in system panic, data corruption, or security holes.

Finally, we'll quickly review a variety of existing open source tools for automatically error-checking programs. With any luck, in a few years' time we'll have scripts doing the trained chimpanzee code review work instead of Linux kernel maintainers.

The Papers

We'll start with one of the most intellectually intriguing approaches, using code instrumentation and symbolic execution to automatically generate complex test inputs that trigger serious bugs. The paper is Automatically generating malicious disks using symbolic execution, by Junfeng Yang, Can Sar, Paul Twohey, Cristian Cadar, and Dawson Engler, which appeared in IEEE Security and Privacy 2006. (Another longer, more detailed paper on the topic is EXE: Automatically Generating Inputs of Death, by Cristian Cadar, Vijay Ganesh, Peter Pawlowski, David Dill, and Dawson Engler, which appeared in ACM Computer and Communications Security 2006.) The basic idea is that you begin executing the program with a "symbolic" input. As the program runs, the EXE system uses compiled-in instrumentation to keep track of the tests done on the input data. These tests create constraints on what the input data can be. Once the system has a set of constraints, it tries to solve them and come up with a set of allowed inputs. It then checks the allowed inputs to figure out if they will cause one of a known set of errors, such as dividing by zero, allowing access to arbitrary memory locations, triggering an assertion, etc.

In this paper, the authors apply the system to the Linux file system mount code for ext2, ext3, and JFS. In this case, the system starts out with a symbolic representation of all possible disk images ("inputs"), and gradually whittles away allowed disk images at each point in the mount code, based on actions such as:

    if (sbi->s_frag_size == 0)
	goto cantfind_ext2;
It then checks all disk images allowed at any particular point to see if any of them causes one of the bugs the system can detect. For example, the statement:

    sbi->s_frags_per_block = sb->s_blocksize / sbi->s_frag_size;
would be flagged as triggering a divide-by-zero error if the prior check had not pruned out all inputs with sbi->s_frag_size equal to zero.

The advantage of this approach over simply generating random inputs is that random input generation can't go very deep in testing code paths, because the random input will nearly always fail during the first few input checks. For example, random input testing for the file system mount code would almost always fail out at the check of the superblock magic number. Another pleasant quality of this approach is that it generates test inputs that trigger the bug detected by the system. Many other automatic error checkers are plagued by false positives; this system hands you the exact input that triggers the supposed bug. It can be accurately described as an error test case generating system in addition to an error checking system. The prospect is enough to make a systems programmer salivate.

The next paper is eXplode: a Lightweight, General System for Finding Serious Storage System Errors, by Junfeng Yang, Can Sar, and Dawson Engler, which appeared in OSDI 2006. eXplode tests file systems (and more complex storage software stacks) by generating all possible disks that could be the result of a crash, and then automatically checking them using verification programs, such as fsck and programs that check for "correct" file system topology (e.g., the existence of the path "/a/b/" after creating and properly syncing it). The sequence of events leading up to an incorrect disk is recorded through some minor, not terribly intrusive instrumentation. A few modifications to Linux are needed to deterministically replay a sequence of events; mainly, the execution order of threads must be maintained, which they approximate using thread priorities. They also modify Linux to make certain error cases (such as memory allocation failure) more common.

eXplode works for more than just file systems: it also works for databases on top of file systems, file systems on top of RAID, software configuration systems, or any combination of the above. This is due to the stackable, modular nature of the routines for creating, mutating, and checking disks. Each layer in the storage stack fills out the following routines:

  • init: one-time initialization, such as formatting a file system partition or creating a fresh database.
  • mount: set up the storage system so that operations can be performed on it.
  • unmount: tear down the storage system; used by eXplode to clear the storage system's state so it can explore a different one.
  • recover: repair the storage system after an eXplode-simulated crash.
  • threads: return the thread IDs for the storage system's kernel threads (to help control non-determinism).
The client code must also provide routines that mutate the storage system (such as by creating a file) and that check the file system for correctness, above and beyond the recover routine. When running, eXplode (1) calls the init() routines for each element in the stack in order, (2) calls the mount() routines, (3) runs the mutate routine, forking children at "choice points" - places where execution could go in one direction or another, (4) at appropriate points, generates all possible crash disks (due to incomplete and/or reordered writes), runs the recover routines, and then runs the checker routine, and (5) repeats steps 3 and 4 until the user gets bored.

A lot of hard work is needed to make this execute quickly and explore "interesting" parts of the state space, but the results are quite good, and a big improvement over their earlier system, FiSC. Sections 7 through 9 of the eXplode paper describe many of the interesting (and sometimes amusing) bugs eXplode found in Linux and various software running on Linux, such as Berkeley DB and Subversion. One of the least pleasant is a bug in the way ext2 handles fsync(). From the paper:

The ext2 bug is a case where an implementation error points out a deeper design problem. The bug occurs when we: (1) shrink a file "A" with truncate and (2) subsequently creat, write, and fsync a second file "B." If file B reuses the indirect blocks of A freed via truncate, then following a crash e2fsck notices that A's indirect blocks are corrupt and clears them, destroying the contents of B. (For good measure it then notices that A and B share blocks and "repairs" B by duplicating blocks from A). Because ext2 makes no guarantees about what is written to disk, fundamentally one cannot use fsync to safely force a file to disk, since the file can still have implicit dependencies on other file system state (in our case if it reuses an indirect blocks for a file whose inode has been cleared in memory but not on disk).

While it is well known that ext2 makes very few guarantees on the state of the file system, it is surprising that even an fsync() call does not make any guarantees about the state of the file system on disk. eXplode also found an error in JFS, which does make fairly strong guarantees, in which an fsync()'d file could lose all its data when a directory inode is reused as a file inode.

One of the primary goals of eXplode is ease of use and extension to new systems with only minor effort. The eXplode system runs on a live, running Linux kernel instance with only minor modifications. These modifications could be trivially rewritten to be configurable as a compile-time option (CONFIG_EXPLODE?), making them a reasonable candidate for integration in the mainline kernel. The checking interface allows programmers to check new systems (pretty much anything that runs on Linux and stores data on disks) by writing only a few lines of code. While the current interface uses C++, it seems relatively easy to add other front ends using C or shell scripts. The authors are considering open sourcing the code and are very interested in hearing more from kernel developers about how to make eXplode more attractive for everyday use.

Our final paper is Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code, by Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf, which appeared in SOSP 2001. The basic idea is to create a framework for static code analysis which allows programmers to write extremely simple descriptions of rules that code should follow. Most readers will be familiar with the simpler applications of this work from the many bug reports produced by the Stanford checker and reviewed on the linux-kernel mailing list. This paper goes above and beyond this level of code analysis and describes a statistical approach to inferring relationships between functions and variables, looking for deviations from the norm, and then ranking and ordering the results so that the deviations most likely to yield bugs are near the top of the list. For example, the system can infer relationships such as "only modify variable X in between calls to spin_lock(Y) and spin_unlock(Y)" - without anyone writing a rule that explicitly lays out this relationship. It could almost be described as meta-meta-compilation - the system not only checks the rules automatically, it infers the rules automatically.

A more recent paper, From Uncertainty to Belief: Inferring the Specification Within, by Ted Kremenek, Paul Twohey, Godmar Back, Andrew Y. Ng, and Dawson Engler, which appeared in OSDI 2006, pushes these ideas even further with a technique that is capable of inferring more complex rules using a combination of statistical inference, compiler analysis, and machine learning. For a system such as Linux where lines of code far outweigh lines of documentation, this approach has great merit. I find myself doing a human version of this statistical analysis every time I attempt to use an undocumented network driver framework function.

A fun footnote is the slides from a talk entitled Weird things that surprise academics trying to commercialize a static checking tool. Check out the slides entitled "Myth: the C (or C++) language exists" or "No, your tool is broken: that's not a bug."

What does this mean for Linux?

A lot of great, practical ideas for automatically finding errors are coming out of research these days. The existing implementations may not be practical or available for Linux (for example, the metacompilation work has been commercialized and will remain closed source for the indefinite future), but this work can often inspire useful (though usually not as complete) open source implementations.

On the static code analysis side, both sparse and smatch implement some useful checks. sparse is already integrated into the kernel build system; smatch, unfortunately, appears to have stalled. Annotations like __must_check are producing voluminous (and sometimes mystifying) compiler warnings. A lot of checks are integrated directly into gcc, but this requires a programmer with knowledge of gcc and a fairly long release cycle turnaround time before the check becomes available. The general-purpose nature of these checks also means that they sometimes generate many false positives, especially on systems software, and have to be explicitly turned off again. A framework that allows gcc to be extended with metacompilation style checks without requiring recompilation of gcc might be more helpful.

When it comes to dynamic code analysis, Linux has quite a few special purpose error checkers which can be configured in or out of the kernel, or turned on and off at boot time. One of the most exciting is lockdep, the lock correctness validator written by Ingo Molnar and Arjan van de Ven. It observes lock acquisition during runtime, and looks for invalid or inconsistent use of locks (such as reacquiring locks or acquiring locks in a different order). Even nicer would be a generic framework for implementing dynamic code checkers, perhaps using part of the SystemTap framework.

File system testers are coming back into vogue. fsx is a file system stress tester that does a bunch of known-stressful operations to a file system and checks the results. fsfuzz is one of many useful tools for randomly altering file systems to expose bugs.

There are many other useful automatic testing/chimpanzee-replacement tools; I encourage you to describe your favorites in the comments. Happy debugging!



KHB: Automating bug hunting

Posted Nov 22, 2006 15:57 UTC (Wed) by nix (subscriber, #2304) [Link]

I note with some amusement that the fascinating eXplode paper describes at one point testing `an expensive commercial [version control] system we did not have source code for, denoted as expENSiv (its license precludes naming it directly)', and then proceeds to demonstrate the utter futility of such ludicrous license clauses by describing that version control system well enough to enable anyone to google up the real name of the version control system in about ten seconds. (The necessary description was only a few words, and I can't see how they could have mentioned *anything* about it without including that description.)

I shall not mention the name of that version-control system myself since the mention of ridiculous license terms in association with a version control system itself provides enough information to enable readers of LWN to unambiguously determine which version control system it is. The terms are particularly ridiculous because one of the bugs they found in it they also found in Subversion and several other version control systems as well, so would hardly serve as a reason not to buy it. (The ridiculous license terms, on the other hand, *would* be a reason not to buy it, as would its proprietary nature.)

KHB: Automating bug hunting

Posted Nov 23, 2006 0:47 UTC (Thu) by bronson (subscriber, #4806) [Link]

Does anybody even use ExpENSiv anymore? Their recovery tool requires running "-r check -f"... the dangers of idiotic license restrictions combined with command-line idioms. :)

ExpENSiv

Posted Nov 26, 2006 20:32 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

It would be nice if someone would say here what product it is, anyway. Just for fun, I looked at the paper and tried to google some of the words it uses in connection with the product, but came up dry. There are lots of LWN readers who don't have enough background in code control systems to figure it out.

ExpENSiv

Posted Nov 27, 2006 5:06 UTC (Mon) by JoeBuck (subscriber, #2330) [Link]

Just Google "-r check -f".

ExpENSiv

Posted Nov 29, 2006 3:10 UTC (Wed) by giraffedata (subscriber, #1954) [Link]

BitKeeper, then.

The quotes make a lot of difference in Googling.

ExpENSiv

Posted Nov 28, 2006 22:29 UTC (Tue) by roelofs (guest, #2599) [Link]

There are lots of LWN readers who don't have enough background in code control systems to figure it out.

Apparently it's a proprietary revision-control product about which LWN readers heard a great deal, particularly in early 2005. ;-)

Greg

_must_check

Posted Nov 23, 2006 8:26 UTC (Thu) by ldo (subscriber, #40946) [Link]

I personally don't understand the point of _must_check. I would prefer to see a GCC option that caused all calls to functions that return non-void results to produce compile-time errors unless the result is

  • used in an expression (including an assignment), or
  • explicitly cast to void.

Thus, for example, a call to unlink(2) like this:

unlink(filename);

would be illegal. If you really wanted to ignore the result, you would have to write:

(void)unlink(filename);

which would make it clear to any reader that a function result was being thrown away.

_must_check

Posted Nov 23, 2006 11:09 UTC (Thu) by gnb (subscriber, #5132) [Link]

That would be fine if APIs were designed with that in mind, but as things
stand you probably need to be more selective because of commonly used
functions that return values no-one cares about. The most obvious
user-space example is printf. A quick look at google codesearch suggests
that about 0% of the world's programming population checks the return
value (understandably). So a global option probably wouldn't get used
much due to the amount of code that would need "fixing". Having a
case-by-case "no, really _must_ check this one" is messier but likely
to meet less resistance.

_must_check

Posted Nov 23, 2006 15:12 UTC (Thu) by nix (subscriber, #2304) [Link]

Also, (void)blah is *really ugly*. To me casts to void look *wrong*. They only really make sense in the context of C++ templates (for the same reason that returning `values of type void' can sometimes make sense there).

Re: _must_check

Posted Nov 24, 2006 1:30 UTC (Fri) by ldo (subscriber, #40946) [Link]

>Also, (void)blah is *really ugly*. To me casts to void look *wrong*.

It's either

(void)unlink(filename);

or

ignore_unwanted_result = unlink(filename);

Which would you prefer?

Re: _must_check

Posted Nov 24, 2006 21:58 UTC (Fri) by nix (subscriber, #2304) [Link]

Since the whole point of __attribute__((warn_unused_result)) is that it
should be applied only to functions where it is nearly always a mistake to
ignore the result at all, the question is academic.

The problem with (in effect) adding that attribute to every function is
that it *would* require one ugly workaround or another, and thus would
encourage using such workarounds even for those functions where it *is* an
error to ignore the result. This would eliminate a large part of the point
of warn_unused_result, and reduce net security.

(I've seen exactly this happen on codebases that frequently get attacked
by IMHO flawed lint tools that *do* emit such warnings as you propose.
(void)foo() crops up whenever foo()'s result is ignored, *even when
ignoring that result is in fact a bad idea*.)

Re: _must_check

Posted Nov 30, 2006 2:05 UTC (Thu) by ldo (subscriber, #40946) [Link]

>Since the whole point of __attribute__((warn_unused_result)) is that
>it should be applied only to functions where it is nearly always a
>mistake to ignore the result at all, the question is academic.

On the contrary, the question is far from academic. This business of having to ignore return values by assigning them to rubbish variables (as opposed to simply casting them to void) has already infected the Linux kernel--and indeed, led to bugs there. A clean, uniform solution is needed.

Re: _must_check

Posted Nov 28, 2006 4:30 UTC (Tue) by xoddam (subscriber, #2322) [Link]

How about
#define do (void)
?

Re: _must_check

Posted Nov 28, 2006 23:56 UTC (Tue) by nix (subscriber, #2304) [Link]

That particular choice of name kinda wrecks do/while loops, don't you
think?

Re: _must_check

Posted Nov 29, 2006 2:03 UTC (Wed) by xoddam (subscriber, #2322) [Link]

Oh yes. I'll get my coat.

_must_check

Posted Dec 2, 2006 22:02 UTC (Sat) by jzbiciak (✭ supporter ✭, #5246) [Link]

(void)strcpy(buf, "That's just plain silly.\n");

for (i = 0; i < 100; (void)i++)
    (void)printf("%s", buf);

(void)fflush(stdout);


I think if you required (void) casts on all ignored return values, you'd quickly find yourself adding them blindly, and/or probably going (void)just this (void)side of (void)insane. (void)(void)(VOID)

_must_check

Posted Dec 2, 2006 22:03 UTC (Sat) by jzbiciak (✭ supporter ✭, #5246) [Link]

Oh, and imagine requiring that for C++. Recall that operator= returns the assigned value.... (void)a = b; Ok, maybe that's a bit over-the-top, but really, where do you draw the line?

KHB: Automating bug hunting

Posted Nov 26, 2006 5:14 UTC (Sun) by error27 (subscriber, #8346) [Link]

I've kinda started working on smatch again. It uses sparse instead of gcc as a backend now. You can see the pre-patches here, but it's pretty rough and sucky...
http://smatch.sourceforge.net/sparse/

excessive focus on Engler's group

Posted Nov 27, 2006 7:43 UTC (Mon) by JoeBuck (subscriber, #2330) [Link]

Dawson Engler and group are brilliant, and Coverity is a useful tool, but there's a hell of a lot of other useful research out there.

While I'm normally the last to say anything good about Microsoft, their SLAM project is scary-impressive, and they've snapped up a lot of world-class formal verification researchers to work on it. Their device driver checkers appear to be more powerful and rigorous than Coverity's in that they yield proofs (not that there are no bugs, but that a whole series of properties are always satisfied on all code paths). And now they are making their tools available to Windows driver writers outside Microsoft.

Also, many techniques that Valerie appears to credit Engler's group for actually originated in the hardware verification world, particularly symbolic simulation, hybrid symbolic/concrete state space exploration and constraint solving to create testcases with desired properties. But this is to be expected, because the standard for hardware perfection is much higher, and bugs can be much more costly (after Intel lost $470 million on the Pentium bug, they started spending a hell of a lot more money on advanced formal verification techniques).

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds