By Jonathan Corbet
January 22, 2010
Taras Glek works for Mozilla, but he is not a browser hacker; instead, he
works on GCC and other tools aimed at making the browser development
process better. It is, he says, a good job. While carrying out his
duties, Taras has been able to put a new GCC feature to work in ways which
may prove to be useful well beyond Mozilla.
Development tools are important; they can help us to produce software more
quickly and with far fewer problems. Unfortunately, Taras says, we are
stuck in the stone age of software development, using tools from the
1970s. Our code base is growing, though, to the point that developers
often cannot understand the entirety of even a single application. We need
some way to amplify our capabilities so that we can continue to make more
powerful applications; static analysis tools can bring some of those
capabilities.
Static analysis, in essence, treats the code as data which is then the
subject of further analysis. It has often been seen as a backwater, an
area of primarily academic interest. When static analysis tools have found their way
into more common use, it has generally been for their ability to find
certain classes of bugs. But there's more that can be done with these
tools: finding API abuse, generating library bindings, improved code base
visualization, and more. Static analysis has been put to use with Mozilla
to find dead code; thousands of lines of code have been found to be
completely unused, despite the fact that engineers were putting their time
into maintaining them.
The Mozilla project has an especially strong need for good tools. It is a
huge code base (1.7 million lines of C++ and 1 million lines of
JavaScript); humans just do not scale to that amount of code. This code
base is under constant optimization work, so refactorings are frequent.
Without some help, keeping this code in good condition is a major challenge.
Much of Taras's work seems to be aimed at mitigating some of the pains that
come with C++ development. One of those pains is that the language is just
about impossible to parse; the parser must actually instantiate types
before it can complete its job. So anybody who wants to analyze C++ code
must first find a decent parser for it. The available options are
limited. The LLVM compiler is promising, but it's going to be another year
or two before it's really ready for prime time. The Elsa tool can be used, but it's
essentially unmaintained and not really guaranteed to be correct.
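One well-known illustration of why a C++ parser cannot work from syntax alone is sketched below. It is not taken from the talk, and the names `Traits`, `value`, and `p` are hypothetical. The same shape of statement, `X::value * p`, is a multiplication when `value` names a constant and a pointer declaration when it names a type; the compiler can only tell which by instantiating the template.

```cpp
#include <cassert>

// Hypothetical trait: in the general case, `value` is a constant...
template <typename T>
struct Traits {
    static const int value = 2;
};

// ...but in this specialization, `value` is a type.
template <>
struct Traits<char> {
    typedef int value;
};

int p = 3;

int parsed_as_expression()
{
    // Traits<int>::value is a constant, so this multiplies it by the global p.
    return Traits<int>::value * p;
}

bool parsed_as_declaration()
{
    // Traits<char>::value is a type, so the same shape of statement declares
    // a local `int *p` that hides the global p.
    Traits<char>::value * p = nullptr;
    return p == nullptr;
}

// Small self-check that exercises both readings.
void run_parsing_demo()
{
    assert(parsed_as_expression() == 6);
    assert(parsed_as_declaration());
}
```

A tool that only tokenizes the source sees two nearly identical lines here; telling them apart requires the type information that a full compiler front end computes.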
The one other option - one which is known to have a complete C++ parser -
is GCC. But the GCC code has a bit of a nasty reputation, so Taras started
off using Elsa for his work. Eventually, though, he turned back to GCC for something
more solid, and hasn't looked back - the hairiness of GCC has, perhaps, been
exaggerated. But, more to the point, the upcoming GCC 4.5 release is,
he says, "the most exciting release ever." The reason for that is the
long-delayed addition of the plugin API, which became possible once the runtime library license
exemption finally went into place. With this API, analysis code can
easily hook into the compiler and inspect code at whatever stage of the
process best suits its needs.
Beyond plugins, GCC has a few other features which make it suitable for
static analysis work. The ability to attach attributes to objects in the
compiled code makes it easy to pass hints through to later processing
steps. The new pass manager brings a relatively modern structure to a
compiler which did not originally have one. And the GIMPLE intermediate
representation provides much of the rest of what's needed for code which
needs to inspect other code.
There are a few interesting plugins in the works.
One of them is the LLVM compiler, which can be plugged in to perform the
back-end functions for GCC. Another is milepost,
which uses a brute-force approach to figure out the optimal settings of the
command-line flags for a specific body of code. Then, there are "the
hydras," which are Taras's work.
These plugins take an interesting approach, in that the actual
analysis work is done in JavaScript scripts. The idea was originally seen
as amusing - "wouldn't it be fun to put SpiderMonkey into GCC?" - but it
has actually worked out well. JavaScript is a relatively nice, concise
language which makes it easy to implement the needed capabilities.
The first plugin is Dehydra, so named
because the control flow graph in Mozilla somewhat resembles a Hydra
monster. Dehydra produces a JSON-like representation of the objects found
in a C++ program; individual JavaScript scripts can then use this
representation to analyze the program. The Treehydra plugin,
instead, provides a JavaScript interface to the GIMPLE representation of
the program; it can be used for more traditional sorts of static analysis
tasks.
One of the pains that come with large C++ programs is that simply finding
code can be difficult. It's not always clear which method will be invoked
in a specific situation, even in the absence of things like macro tricks.
To help with this problem, Dehydra has been used as the base of a source browsing tool
called DXR; it's like
LXR, but with a great deal of semantic
information thrown in. DXR users
can find types defined by macros, look up parent class information, and so
on. There's also a call graph tool which can find all the callers of a
specific method; that's important in C++, where overloading can make
grep thoroughly unusable for this kind of task.
It is, Taras says, "Eclipse-like stuff," except that, unlike Eclipse,
it scales to a Mozilla-size code base.
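To see why a textual search falls short, consider the following minimal sketch. The classes `Channel` and `SecureChannel` and their methods are hypothetical, invented only for illustration. A grep for `send(` matches every call site regardless of which overload is chosen, and a grep for `flush(` cannot say which override actually runs.

```cpp
#include <cassert>

// Hypothetical classes, used only to illustrate the lookup problem.
struct Channel {
    virtual ~Channel() {}
    int send(int)          { return 1; }   // chosen for an int argument
    int send(double)       { return 2; }   // chosen for a double argument
    int send(const char *) { return 3; }   // chosen for a string literal
    virtual int flush()    { return 10; }
};

struct SecureChannel : Channel {
    int flush() override   { return 20; }
};

int flush_through_base(Channel &c)
{
    // The call site names Channel::flush, but the dynamic type decides
    // which implementation runs.
    return c.flush();
}

// Small self-check: identical-looking call sites reach different code.
void run_lookup_demo()
{
    Channel plain;
    SecureChannel secure;
    assert(plain.send(42) == 1);
    assert(plain.send(4.2) == 2);
    assert(plain.send("hi") == 3);
    assert(flush_through_base(plain) == 10);
    assert(flush_through_base(secure) == 20);
}
```

Resolving each of these calls takes the same overload and type information the compiler already has, which is presumably why a compiler-based browser can answer questions that grep cannot.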
Various other tools have been written. The final.js script (a
dozen lines of code which can be seen on this
page) looks
for C++ methods tagged with the "final" attribute; any attempt
to override those methods will result in a compilation error. It is, in
other words, a port of the Java final keyword to C++. A checker
which might be interesting in other environments - including the kernel -
is flow.js, which can add a constraint that all exits from a
function must flow through a specific label. Consider this common kernel
pattern:
    if (something wrong)
        goto out;
    /* Do some real work */
  out:
    release_locks();
    free_memory();
    cancel_self_destruct();
    return something;
It's a common mistake to add a return statement to the middle of a
function like this, shorting out the cleanup code; flow.js can
catch errors like that at compile time.
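The mistake described above can be made concrete with a small sketch. It shows only the bug, not how flow.js is configured or invoked, which the article does not describe. The lock counter and all function names are hypothetical stand-ins.

```cpp
#include <cassert>

int locks_held = 0;   // hypothetical stand-in for real lock state

void take_locks()    { ++locks_held; }
void release_locks() { --locks_held; }

// Correct version: every exit flows through the "out" label.
int do_work(bool something_wrong)
{
    int result = -1;
    take_locks();
    if (something_wrong)
        goto out;
    result = 0;             /* Do some real work */
out:
    release_locks();
    return result;
}

// Buggy version: a return added later bypasses the cleanup code.
int do_work_buggy(bool something_wrong, bool take_shortcut)
{
    int result = -1;
    take_locks();
    if (something_wrong)
        goto out;
    if (take_shortcut)
        return 0;           // BUG: skips "out", so the lock is never released
    result = 0;
out:
    release_locks();
    return result;
}

// Small self-check showing the leaked lock.
void run_cleanup_demo()
{
    locks_held = 0;
    assert(do_work(false) == 0 && locks_held == 0);
    assert(do_work(true) == -1 && locks_held == 0);
    assert(do_work_buggy(false, true) == 0);
    assert(locks_held == 1);   // the lock leaked by the early return
    locks_held = 0;
}
```

The buggy function compiles without complaint and returns a plausible value, which is why a rule that every exit must pass through a given label is worth enforcing mechanically.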
Additional modules include must-override.js, which can mark
methods which must be overridden (but which cannot be virtual);
outparams.js, which ensures that any output function parameters
have been set on a successful return from the function; and
stack.js, which enforces a requirement that specific classes only
be instantiated on the stack, since the garbage collector is not prepared
to deal with them. Taras is also working on a checker for variables which
shadow class members - a mistake which GCC does not catch now.
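As an illustration of the shadowing mistake, here is a minimal sketch; the `Account` class is hypothetical, and the code demonstrates only the bug itself, not how such a checker would be written.

```cpp
#include <cassert>

// Hypothetical class illustrating a local variable that shadows a member.
struct Account {
    int balance;

    Account() : balance(0) {}

    void deposit_buggy(int amount)
    {
        // Declares a new local named `balance`; the member is never updated.
        int balance = this->balance + amount;
        (void)balance;   // silence unused-variable warnings
    }

    void deposit(int amount)
    {
        balance += amount;   // updates the member as intended
    }
};

// Small self-check: the buggy deposit silently does nothing.
void run_shadow_demo()
{
    Account a;
    a.deposit_buggy(5);
    assert(a.balance == 0);
    a.deposit(5);
    assert(a.balance == 5);
}
```

The stray `int` turns an assignment into a declaration, and the code still reads naturally, so the error is easy to miss in review.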
For the time being, this work is mostly used within the Mozilla project,
though Taras would clearly like to see users from the wider community. He
looks forward to a day when libraries are distributed with a plugin which
ensures that the library is being used correctly. Another nice feature
would be a distribution-wide DXR, enabling cross-package source browsing.
For now, though, we have a set of tools that serves as a good proof of the
concept that GCC plugins can be used for static analysis.