LWN.net Logo

Making static analysis easier

By Jonathan Corbet
June 22, 2011
It has been clear for some years now that static analysis tools have the potential to greatly increase the quality of the software we write. Computers are well placed to analyze source code and look for patterns which could indicate bugs. The "Stanford Checker" (later commercialized by Coverity) found a great many defects in a number of free software code bases. But within the free software community itself, the tools we have written are relatively scarce and relatively primitive. That situation may be coming to an end, though; we are beginning to see the development of frameworks which could become the basis for a new set of static analysis tools.

The key enabling changes have been happening in the compiler suites. Compilers must already perform a detailed analysis of the code in order to produce optimized binaries; it makes sense to make use of that analysis for other purposes as well. Some of that already happens in the compiler itself; GCC and LLVM can produce a much wider set of warnings than was once possible. These warnings are a good start, but much more can be done. That is especially true if projects can create their own analysis tools for project-specific checks; projects of any size tend to have invariants and rules of their own which go beyond the rules for the source language in general.

The FSF was, for years, hostile to the idea of making it easy to plug analysis modules into GCC, fearing that a plugin mechanism would enable the creation of proprietary modules. After some years of deliberation, the FSF rewrote the license exception for its runtime modules in a way that addressed the proprietary module worries; since then, GCC has had plugin module support. The use of that feature has been relatively low, so far, but there are signs that the situation may be beginning to change.

An early user of the plugin mechanism was the Mozilla project, which created two modules (Dehydra and Treehydra) to enable the writing of analysis code in JavaScript. These tools have seen some use within Mozilla, but development there seems to have slowed to a halt. The mailing list is moribund and the software does not appear to have seen an update in some time.

An alternative is GCC MELT. This project provides a fairly comprehensive plugin which allows the writing of analysis code in a Lisp-like language. This code is translated to C and turned into a plugin which can be invoked by the compiler. MELT is extensively documented; there are also slides from a couple of tutorials on its use.

MELT seems to be a capable system, but there do not appear to be a lot of modules written for it in the wild. One does not need to look at the documentation for long to understand why; the "basic hints" start with: "You first need to grasp GCC main internal representations (notably Tree & Gimple & Gimple/SSA)." MELT author Basile Starynkevitch's 130-slide presentation on MELT [PDF] does not get past the introductory GCC material until slide 85. MELT, in other words, requires a fairly deep understanding of GCC; it's not something that an outsider can pick up quickly. The lack of easy examples to work from is not helpful either.

More recently, David Malcolm has announced the release of a new framework which enables the creation of plugins as Python scripts which run within the compiler. His immediate purpose is to create tools for the development of the Python system itself; the most significant checker in his code tries to ensure that object reference counts are managed properly. But he sees the tool as potentially being useful for a number of other projects and even for prototyping new features for GCC itself.

At a first glance, David's gcc-python-plugin mechanism suffers from the same difficulty as MELT - the initial learning curve is steep. It is also a very young and incomplete project; David has, by his own admission, only brought out the functionality he had immediate need for. The analysis code seems more approachable, though, and the mechanism for running scripts directly in the compiler seems more natural than MELT's compile-from-Lisp approach. It may be that this plugin will attract more users and developers than MELT as a result.

Or it may just be that your editor, being rather more proficient in Python than in Lisp, naturally likes the Python-based solution better.

In any case, one conclusion is clear: writing static analysis plugins for GCC is currently too hard; even capable developers who approach the problem will need to dedicate a significant chunk of time to understanding the compiler before they can begin to achieve anything in this area. The efforts described above are a big step in the right direction, but it seems clear that they are the foundations upon which more support code must be built. It's hard to say when it will reach the tipping point that inspires a flood of new analysis code, but it's easy to say that we are not there yet.

GCC is not where all the action is, though; there is also an interesting static analysis tool which has been built with the LLVM clang compiler. Documentation of this tool is scarce, but it appears to be capable of detecting some kinds of memory leaks, null pointer dereferences, the computation of unused values, and more. Some patches have been posted to add a plugin feature to this tool, but they do not seem to have proceeded very far yet.

Back in May, John Smith ran the checker on several open source projects to see what kind of results would emerge. Those results have been posted on the net; they show the kind of potential problems that can be found and the nice HTML output that the checker can create. Some of the warnings are clearly spurious - always a problem with static analysis tools - but others seem worth looking into. In general, the clang static analyzer seems, like the other tools mentioned here, to be in a relatively early state of development. Things are moving fast, though; this tool is worth keeping an eye on.

Actually, that is true of the static analysis area in general. The lack of good analysis tools has been a bit of a mystery - given the number of developers we have, one would think that a few would scratch that particular itch. Your editor would not have minded living in a world with one less version control system but with better analysis tools. But the nature of free software development is that people work on problems that interest them. As the foundations of our static analysis tools get better, one can hope that more developers will find those foundations interesting to build on. The entire development community will benefit from the results.


(Log in to post comments)

Standalone static analysers

Posted Jun 23, 2011 11:33 UTC (Thu) by epa (subscriber, #39769) [Link]

You might have mentioned long-standing tools such as 'splint', an evolution of the classic 'lint' program that also supports annotations to help track null pointers and other bugs; Linus's 'sparse' checker, and BLAST among others.

Admittedly, there are so many it's hard to know where to start! There might be room for a 'mother of all static checkers' tool which runs all of them on a piece of code and collects together the results.

Making static analysis easier

Posted Jun 23, 2011 19:19 UTC (Thu) by JohnMorris (subscriber, #73531) [Link]

A quick sampling of the reports showed lots of "Argument with 'nonnull' attribute passed null", all of which (in the admittedly small sample) were false alarms.

Static analysis is hard, I guess...

Making static analysis easier

Posted Jun 25, 2011 3:08 UTC (Sat) by error27 (subscriber, #8346) [Link]

The thing of it is, that you always end up with 100% false positives because people fix the real bugs and mess up your statistics.

Making static analysis easier

Posted Jun 23, 2011 22:14 UTC (Thu) by ortalo (subscriber, #4654) [Link]

NB: In the 2nd part of the articles on "Undefined behaviour" at LLVM blog (at http://blog.llvm.org/2011/05/what-every-c-programmer-shou...) some additional interesting tools for static analysis surrounding LLVM are mentionned by the author (Chris Lattner), especially KLEE (http://klee.llvm.org/) and C-Semantics (http://code.google.com/p/c-semantics/).
AFAIR I found KLEE approach pretty interesting; but I don't know these tools applicability.

Making static analysis easier

Posted Jun 23, 2011 23:07 UTC (Thu) by dave_malcolm (subscriber, #15013) [Link]

Thanks for mentioning my GCC plugin. FWIW I just blogged about it, with a bit more information (and the closest this gets to screenshots) here:
http://dmalcolm.livejournal.com/5931.html

One of the other appealing things about using Python for this is that it makes it relatively easy to add a UI.

PyPy actually has an OpenGL-based GUI embedded in its compiler, for debugging. I don't know if I want to go that far, but there are plenty of possibilities.

As for other static analysis approaches, I looked at a few others, listed here:
https://fedoraproject.org/wiki/Features/StaticAnalysisOfC...
but the GCC plugin approach was most appealing to me, given my own preferences for Python, and that I'm using GCC anyway.

Debian Automated Code Analysis, other tools

Posted Jun 24, 2011 9:41 UTC (Fri) by guus (subscriber, #41608) [Link]

I have used several static analysis tools, and they all have their own pros and cons. Cppcheck is quite a nice program that has the advantage of doing static analysis for every possible combination of #ifdefs in your code. Coccinelle, mentioned on LWN before, is another great tool.

One should check out all the tools mentioned in the original article and the comments. However, that is quite a lot of work. Coverity allows you to submit software and have them check it, but it is a commercial service. Debian has recently started the Debian Automated Code Analysis project (http://qa.debian.org/daca/), which runs (for now a small selection of) static analysis tools on ALL the software that is in Debian.

Making static analysis easier

Posted Jun 24, 2011 12:39 UTC (Fri) by dps (subscriber, #5725) [Link]

I would actually like to get into static analysis but would need a few good pointers and have to spend at least a month in an academic library. I know formal methods but not data flow analysis or theorem proving. I suspect many developers lack either some of the assumed knowledge or access to academic literature. The work that came before Coverity was done in academia.

Unfortunately I currently lack the time and starting points. I may also lack some of the assumed knowledge because nobody knows all of mathematics.

Making static analysis easier

Posted Jun 24, 2011 20:13 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

I think if I was interested in technology to stop coding errors, I'd look into higher level languages instead. The static code checker verifies that the coder has implemented some paradigm in a low level language to achieve some high level result. A higher level language would let you state what result you want explicitly.

Of course, the static checker has the advantage that you can use it to improve the huge body of existing low-level code.

Making static analysis easier

Posted Jun 24, 2011 20:15 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

... the initial learning curve is steep.

The learning curve is shallow. A learning curve isn't a hill you climb; it's a graph of productivity vs time.

Making static analysis easier

Posted Jun 24, 2011 20:17 UTC (Fri) by corbet (editor, #1) [Link]

You know, that's pedantic beyond the call of duty. My "learning curve" is amount of learning required against time; it's steep. That aligns with the common use of this term, and you know it.

Making static analysis easier

Posted Jun 24, 2011 20:39 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

No, I don't know that and I don't believe it. I know in common usage, "steep learning curve" means you have to learn a lot to use it, but I don't think most people who use it that way visualize any graph at all. In particular, "amount of learning required against time" doesn't fit the context here. The amount of learning required for these tools changes rapidly with time?

It's because the term is used by so many to refer to something that is hard to learn that I commented. I think many readers will appreciate knowing that that usage is completely inconsistent with the original, and logical, sense of the term. I certainly would, if not for the fact that the first time I ever heard the term was from an industrial engineer who drew out the graph for me.

Making static analysis easier

Posted Jun 24, 2011 22:04 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

if you think of the learning curve with the vertical axis being the amount of learning needed, and the horizontal axis being productivity, then a steep learning curve == having to learn a lot before a tool is useful makes perfect sense

and this matches how the phrase is commonly used.

Making static analysis easier

Posted Jun 24, 2011 22:09 UTC (Fri) by nix (subscriber, #2304) [Link]

Indeed. e.g. "Emacs: A learning curve like a plumb line."

Plumb lines are vertical, not horizontal. Steep learning curves are hard to climb and hard to learn. I have never heard anyone use 'steep learning curve' to indicate that something was *easy* to learn. This smacks of overanalysis and hypercorrection on giraffedata's part. (The 'original sense of the term' is irrelevant to this, as is logic: this is language, where what matters is customary use and the final arbiters are not rational adults but generations of young children. Also, 'hacker' is unreclaimable, much though we may wish otherwise.)

steep learning curve

Posted Jun 25, 2011 0:14 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

this is language, where what matters is customary use

Many people believe this; many don't. I'm talking only to the latter.

Similarly, many people say all that matters in language is that people know what you mean. Others find a lot more to language than that (actually hardly anybody truly believes it's only about being understood; they put effort into "correct" spelling and verb conjugation, for example. But some believe it more than others).

steep learning curve

Posted Jun 25, 2011 15:38 UTC (Sat) by bronson (subscriber, #4806) [Link]

Wonderful. Another one-man effort to change language by correcting people on lwn. What on earth is the motivation here?

This reminds me of the "two times as fast" means "three times faster" person.

I guess you can attempt to change common usage all you want. Just please don't waste time correcting people on LWN when it's PERFECTLY CLEAR what they mean?

steep learning curve

Posted Jun 25, 2011 15:48 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

Another one-man effort to change language by correcting people on lwn.

Nope. That's not what you're seeing.

steep learning curve

Posted Jun 27, 2011 15:28 UTC (Mon) by nye (guest, #51576) [Link]

>Similarly, many people say all that matters in language is that people know what you mean

Even if you don't believe that at all, surely you must accept that having people know what you mean is at least *an important part*?

And surely you must see that being the one person in all the world to invert the meaning of a commonly used term because it doesn't fit your mental model of the term does not help that goal?

And then surely it must occur to you that if your mental model of the term is not in line with everyone else's, then perhaps it is *your* model which is incorrect?

steep learning curve

Posted Jun 27, 2011 15:39 UTC (Mon) by giraffedata (subscriber, #1954) [Link]

Even if you don't believe that at all, surely you must accept that having people know what you mean is at least *an important part*?

Yes. Consequently, I would not write "shallow learning curve" in a sentence like this. But I wouldn't write "steep learning curve" either. I would probably opt for plainness and say something like, "it takes a while to get up to speed" or a somewhat more scientific "high learning cost."

But in a context where I thought the actual learning curve would help explain things, I would talk about the true shape of the curve and if the audience wasn't industrial engineers (I think they're called "operations researchers" now), I would make sure I explained the terminology first.

Making static analysis easier

Posted Jun 24, 2011 22:11 UTC (Fri) by nybble41 (subscriber, #55106) [Link]

You do realize that the assignment of axis in a graph is arbitrary? The idea of a "steep learning curve" results from plotting the required effort on the vertical axis and productivity on the horizontal axis. Plotted the other way, of course, the exact same curve is "shallow" rather than "steep".

You were taught an older version of the phrase, and there is nothing wrong with that, but in public writing it is more important to consider how the majority of your audience will interpret your words. As you said yourself, "in common usage, 'steep learning curve' means you have to learn a lot to use it." That means that using the phrase in any other sense, without further explanation, will lead to a failure in communication which you could have easily prevented.

learning curve

Posted Jun 25, 2011 0:22 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

I see that one can construct a graph for which "steep learning curve" makes sense, but I still don't believe people who use the term are actually thinking of such a graph, because I don't know of, and have a hard time imagining, any scientific analysis of learning that uses that graph. For one thing, "effort" is a tough metric. It would probably have to be reduced to time, and people nearly universally put time on the horizontal axis.

I believe people say "steep learning curve" for the same reason they say "I could care less" (which means, though not logically, "I don't care"): they heard or thought they heard someone say that sequence of sounds in a certain context.

learning curve

Posted Jun 25, 2011 3:08 UTC (Sat) by nybble41 (subscriber, #55106) [Link]

> I still don't believe people who use the term are actually thinking of such a graph

I agree. Most people don't think of things in mathematical terms. The typical mental image is probably that of climbing a hill, where horizontal movement reflects progress toward a goal (e.g. mastery of a program or technique), and vertical movement corresponds to the energy put into learning it. The slope would thus be the energy (effort/experience/time) required to achieve the next level of proficiency.

> people nearly universally put time on the horizontal axis

"Universally" may go a bit far, but in many cases that does makes sense, particularly where one is plotting samples gathered at regular intervals. As a rule of thumb, where there is a distinction, parameters (questions) are plotted on the horizontal axis and results / data (answers) on the vertical, and where time is involved it is usually the parameter. However, when the question answered by the graph is "How much time is required to achieve a given level of proficiency?", proficiency is the parameter and time is the result.

learning curve

Posted Jun 25, 2011 3:35 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

The typical mental image is probably that of climbing a hill,

That's what I've always assumed, although in a somewhat simpler derivation: a steep hill is a difficult one. If you had to go to a library to learn how to use a static code analyzer, which one would you prefer: the one at the top of a gentle incline, or the one at the top of a steep hill. I think I've even heard the metaphorical "I don't want to have to climb the learning curve again."

I have no doubt that in the 50s when people were starting to scientifically analyze business processes, the people on the fringes kept hearing "steep learning curve" in comparisons between easy and difficult process changes and just unconsciously connected "steep" with the difficult one. Of course, the industrial engineers using the term aren't even interested in how hard it is on the workers; they care about how many widgets per hour the workers make and how quickly that rises as they figure things out.

I also note that a "steep price" is a high one. I can only guess at the origin of that.

Making static analysis easier

Posted Jun 25, 2011 0:14 UTC (Sat) by jschrod (subscriber, #1646) [Link]

Maybe, just maybe, you should learn about what language is about. Have you heard about it, human communication? You know, humans as in these blobs of flesh that walk around you? Well, maybe not.

While Jon knows a lot about human communication, it's his hob -- your knowledge is quite obviously more of the infantile kind. A tipp: Look up the meaning of "colloquial" in a dictionary. It just might, like, open a whole new world for you.

Making static analysis easier

Posted Jun 25, 2011 15:42 UTC (Sat) by bronson (subscriber, #4806) [Link]

> A learning curve ... is a graph of productivity vs time

No, that would be called a productivity curve.

learning curves

Posted Jun 25, 2011 15:59 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

No, that would be called a productivity curve.

This particular time vs productivity graph is called that (and it is called that -- I'm not making this up) because it represents the productivity change due to learning only. And BTW, it doesn't even mean all learning, e.g. the time you spend reading the manual before you try to use the tool isn't included. The paper that introduced the term (I don't know when it was; I haven't seen it) was about the fact that people just get faster at a task as they continue doing it, because they figure it out as they go.

There is most probably another graph called something like "fatigue curve" showing how productivity goes down over time as a shift progresses.

Making static analysis easier

Posted Jun 30, 2011 20:09 UTC (Thu) by tarasglek (guest, #65350) [Link]

Thanks for mentioning my static analysis efforts. Things aren't quite as dead as they seam.

I wrote a blog post inspired by this article on what went right/wrong with Dehydra/Treehydra.

http://blog.mozilla.com/tglek/2011/06/30/dehyratreehydra-...

Making static analysis easier (GCC MELT)

Posted Jun 30, 2011 20:55 UTC (Thu) by bstarynk (guest, #63409) [Link]

Thanks for mentioning GCC MELT (I am its main author, Basile Starynkevitch). See http://gcc-melt.org/ for more.
(I do know that MELT is still lacking documentation).

Pierre Vittet is actually coding some simple warnings within a Google Summer of Code project, and he is using MELT.

I just added a small analysis pass, coded in MELT, to analyze MELT run-time (the manually coded parts). And it did found some bugs.

Strangely, nobody mentionned Frama-C, http://frama-c.com, a powerful free (LGPL licence, coded in Ocaml) self-sufficient static analyzer for critical C code (you can formalize the specification of your code within special comments in some first-order logic formalism called ACSL, and the tool proves that your code follows the formalized specification). As far as I know, Frama-C don't accept all the languages accepted by GCC (C++, Ada, Fortran, Go, ...) but only C. Frama-C is developed by colleagues at CEA LIST.

Using GCC as a basis for static analysis has pros & conses. The pros is essentially taking profit of the huge GCC infrastructure (more than five millions lines of code). The cons, is GCC complexity (which explains its power), which you should tame: to use GCC internal powerful machinery (in C plugins, with python, with MELT, ...) you will need to understand it, and that means understanding GCC pass management and GCC internal representations [Gimple, Tree]. You can't take profit of GCC without understanding its internals! And indeed, GCC was not designed as a library at first; it is mostly a powerful compiler, not a library to process source code!

MELT aims at making a bit less hard the development of such GCC based simple static analysis. You still need to understand (partly) the internals of GCC, but MELT offers you a higher level language to code, in more concise way, your analyzer. (MELT is not a competitor to Frama-C).

Also, static analysis is a very hard subject by itself (and the theories behind like model checking and abstract interpretation are very difficult by themselves; algorithms are often non-linear, so heuristics are essential in practice). This explain why such tools don't flourish. And the economical incentive for such tools is much lower than the industries (processor makers!) needing a powerful GCC compiler.

I also would like more free static analyzers to appear, but that will take time (and money).

Coding some static analysis is not only a matter of Python vs Ocaml vs Lispy dialects like MELT. It is a difficult subject. MELT aims to make that a bit less difficult (and Python could also help, however AFAIK it does not provide a powerful pattern matching ability, which is essential when operating on internal representation of source code, as known by GCC).

If some people wanted to code specialized static analyzers in MELT for free software (e.g. to help developers using or coding GTK, Qt, Linux kernel, MySQL engine, ...) I would be delighted to help them. One of the major issues is that you need to know very well the target free software (Gtk etc...) to specify and code some analyzers for them. And you also need to understand (perhaps not in full details) the internals of GCC to use it (thru MELT, or even thru a Python wrapper).

Regards

Making static analysis easier

Posted Jul 1, 2011 8:31 UTC (Fri) by Piervit (guest, #76167) [Link]

Hello,

As bstarynk said, I am working for a Google Summer Of Code on a plugin for GCC. I am using MELT for this. If you are interrested I will make a conference about my plugin and MELT at the RMLL/LSM (Libre Software Meeting, http://2011.rmll.info/?lang=en). I will speak French but my slides will be in English.

I knew nothing about functional languages and Lisp before discovering MELT (this was before starting the GSOC). As it is a new DSL, I feel lispy syntax is pertinent because it is pretty simple syntax, only using parenthesis (I could really simply create a syntax file for my text editor). This is subtle because such a simple syntax has a lot of expressiveness.

Sometimes MELT is hard to understand, but this comes mainly from how it is linked to GCC, and I guess (without knowing much) that it would not be easier to use an existing script language. You have to make a clear difference between data from MELT and from GCC, but you will have the same issue with a script language.

The reality is also that I spent many time understanding GCC, because even if you use MELT or python to provide some abstractions, you have to get a good idea of the internal representation of GCC and of how pass are run.

About the plugin, I am developing, the idea is to offer a customizable tools for simple statical analysis: A number of tests will be available (such as detecting that we test the value returned by a function call, or detecting that a call to a precise function implies a call to another given function in a same function body....). When compiling a project the user may choose to a some tests with related parameters (to know which function should be tested).

My project repository: https://gitorious.org/talpo/talpo

I feel such project would have been difficult to write in C (or I would have need more than GSOC time).

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds