
LCA: Static analysis with GCC plugins

By Jonathan Corbet
January 22, 2010
Taras Glek works for Mozilla, but he is not a browser hacker; instead, he works on GCC and other tools aimed at making the browser development process better. It is, he says, a good job. While carrying out his duties, Taras has been able to put a new GCC feature to work in ways which may prove to be useful well beyond Mozilla.

Development tools are important; they can help us to produce software more quickly and with far fewer problems. Unfortunately, Taras says, we are stuck in the stone age of software development, using tools from the 1970's. Our code base is growing, though, to the point that developers often cannot understand the entirety of even a single application. We need some way to amplify our capabilities so that we can continue to make more powerful applications; static analysis tools can bring some of those capabilities.

Static analysis, in essence, treats the code as data which is then the subject of further analysis. It has often been seen as a backwater, an area of primarily academic interest. When static analysis tools have found their way into more common use, it has generally been in their ability to find certain classes of bugs. But there's more that can be done with these tools: finding API abuse, generating library bindings, improved code base visualization, and more. Static analysis has been put to use with Mozilla to find dead code; thousands of lines of code have been found to be completely unused, despite the fact that engineers were putting their time into maintaining it.

The Mozilla project has an especially strong need for good tools. Its code base is huge (1.7 million lines of C++ and 1 million lines of JavaScript); humans just do not scale to that amount of code. This code base is under constant optimization work, so refactorings are frequent. Without some help, keeping this code in good condition is a major challenge.

Much of Taras's work seems to be aimed at mitigating some of the pains that come with C++ development. One of those pains is that the language is just about impossible to parse; the parser must actually instantiate types before it can complete its job. So anybody who wants to analyze C++ code must first find a decent parser for it. The available options are limited. The LLVM compiler is promising, but it's going to be another year or two before it's really ready for prime time. The Elsa tool can be used, but it's essentially unmaintained and not really guaranteed to be correct.

The one other option - one which is known to have a complete C++ parser - is GCC. But the GCC code has a bit of a nasty reputation, so Taras started off using Elsa for his work. Eventually, though, he turned back to GCC for something more solid, and hasn't looked back - the hairiness of GCC has, perhaps, been exaggerated. But, more to the point, the upcoming GCC 4.5 release is, he says, "the most exciting release ever." The reason for that is the long-delayed addition of the plugin API, which became possible once the runtime library license exemption finally went into place. With this API, analysis code can easily hook into the compiler and inspect code at whatever stage of the process best suits its needs.
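For reference, a plugin is just a shared object that GCC loads at startup (with -fplugin=); the sketch below shows roughly what a minimal GCC 4.5 plugin looks like, with the actual analysis callback left empty and the file name chosen for illustration:

    /* A minimal sketch of a GCC plugin; built against GCC's plugin headers
     * and loaded with -fplugin=./check.so.  The real analysis would live in
     * the callback body. */
    #include "gcc-plugin.h"
    #include "plugin-version.h"

    /* GCC refuses to load plugins that do not declare this symbol. */
    int plugin_is_GPL_compatible;

    /* Called once per translation unit, after it has been parsed. */
    static void on_finish_unit(void *gcc_data, void *user_data)
    {
        /* walk the trees / GIMPLE here and emit diagnostics */
    }

    int plugin_init(struct plugin_name_args *info,
                    struct plugin_gcc_version *version)
    {
        if (!plugin_default_version_check(version, &gcc_version))
            return 1;    /* built against a different GCC version */
        register_callback(info->base_name, PLUGIN_FINISH_UNIT,
                          on_finish_unit, NULL);
        return 0;
    }

PLUGIN_FINISH_UNIT is only one of the available events; others let a plugin register new attributes or insert its own passes into the pass manager.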

Beyond plugins, GCC has a few other features which make it suitable for static analysis work. The ability to attach attributes to objects in the compiled code makes it easy to pass hints through to later processing steps. The new pass manager brings a relatively modern structure to a compiler which did not originally have one. And the GIMPLE intermediate representation provides much of the rest of what's needed for code which needs to inspect other code.
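As a rough illustration of the attribute mechanism (the attribute and macro names here are hypothetical, not Dehydra's actual spelling): a plugin can register its own attribute at startup via the PLUGIN_ATTRIBUTES event, after which GCC attaches the annotation to the declaration rather than warning about an unknown attribute, and the analysis can look it up later.

    /* Hypothetical annotation macro: it only expands to the attribute when
     * the code is being built with the analysis plugin loaded, so ordinary
     * builds are unaffected. */
    #if defined(__GNUC__) && defined(ANALYSIS_PLUGIN)
    # define MUST_OVERRIDE __attribute__((must_override))
    #else
    # define MUST_OVERRIDE
    #endif

    class Handler {
    public:
        /* The plugin could later report any concrete subclass of Handler
         * that fails to override handle(). */
        MUST_OVERRIDE virtual void handle(int event);
    };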

There are a few interesting plugins in the works. One of them is the LLVM compiler, which can be plugged in to perform the back-end functions for GCC. Another is milepost, which uses a brute-force approach to figure out the optimal settings of the command-line flags for a specific body of code. Then, there are "the hydras," which are Taras's work. These plugins take an interesting approach, in that the actual analysis work is done in JavaScript scripts. The idea was originally seen as amusing - "wouldn't it be fun to put Spidermonkey into GCC?" - but it has actually worked out well. JavaScript is a relatively nice, concise language which makes it easy to implement the needed capabilities.

The first plugin is Dehydra, so named because the control flow graph in Mozilla somewhat resembles a Hydra monster. Dehydra produces a JSON-like representation of the objects found in a C++ program; individual JavaScript scripts can then use this representation to analyze the program. The Treehydra plugin, instead, provides a JavaScript interface to the GIMPLE representation of the program; it can be used for more traditional sorts of static analysis tasks.

One of the pains that come with large C++ programs is that simply finding code can be difficult. It's not always clear which method will be invoked in a specific situation, even in the absence of things like macro tricks. To help with this problem, Dehydra has been used as the base of a source browsing tool called DXR; it's like LXR, but with a great deal of semantic information thrown in. DXR users can find types defined by macros, look up parent class information, and so on. There's also a call graph tool which can find all the callers of a specific method; that's important in C++, where overloading can make grep thoroughly unusable for this kind of task. It is, Taras says, "Eclipse-like stuff," except that, unlike Eclipse, it scales to a Mozilla-size code base.

Various other tools have been written. The final.js script (a dozen lines of code which can be seen on this page) looks for C++ methods tagged with the "final" attribute; any attempt to override those methods will result in a compilation error. It is, in other words, a port of the Java final keyword to C++. A checker which might be interesting in other environments - including the kernel - is flow.js, which can add a constraint that all exits from a function must flow through a specific label. Consider this common kernel pattern:

    if (something wrong)
    	goto out;
    /* Do some real work */

  out:
    release_locks();
    free_memory();
    cancel_self_destruct();
    return something;

It's a common mistake to add a return statement to the middle of a function like this, shorting out the cleanup code; flow.js can catch errors like that at compile time.
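Concretely, the mistake looks like this (a sketch in terms of the pattern above):

    if (something wrong)
    	goto out;
    /* ... */
    if (some other problem)
        return -1;	/* bug: bypasses the cleanup below */
    /* Do some real work */

  out:
    release_locks();
    free_memory();
    cancel_self_destruct();
    return something;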

Additional modules include must-override.js, which can mark methods which must be overridden (but which cannot be virtual); outparams.js, which ensures that any output parameters have been set on a successful return from the function; and stack.js, which enforces a requirement that specific classes only be instantiated on the stack, since the garbage collector is not prepared to deal with them. Taras is also working on a checker for variables which shadow class members - a mistake which GCC does not catch now.
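For illustration, the shadowing mistake such a checker would flag looks like this (a hypothetical class):

    #include <cstddef>

    class Cache {
        size_t size;
    public:
        void resize(size_t new_size) {
            size_t size = new_size;  /* accidentally shadows the member;
                                        this->size is never updated, and
                                        GCC accepts it without complaint */
        }
    };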

For the time being, this work is mostly used within the Mozilla project, though Taras would clearly like to see users from the wider community. He looks forward to a day when libraries are distributed with a plugin which ensures that the library is being used correctly. Another nice feature would be a distribution-wide DXR, enabling cross-package source browsing. For now, though, we have a set of tools that serves as a good proof of the concept that GCC plugins can be used for static analysis.

Index entries for this article
Conference: linux.conf.au/2010



LCA: Static analysis with GCC plugins

Posted Jan 22, 2010 21:14 UTC (Fri) by Trelane (subscriber, #56877) [Link]

Interesting article! I look forward to having more analysis tools available. Hopefully, some of these are incorporated/cloned by GNU. :)

LCA: Static analysis with GCC plugins

Posted Jan 22, 2010 22:01 UTC (Fri) by jreiser (subscriber, #11027) [Link] (7 responses)

> a checker for variables which shadow class members - a mistake which GCC does not catch now.

That may violate the coding policy of a particular project. However, it is 100% legal C++ which has legitimate uses, particularly in machine-assisted manipulation of large bodies of code and in adapting to restricted environments.

LCA: Static analysis with GCC plugins

Posted Jan 23, 2010 0:22 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (6 responses)

A great many things that are a bad idea in general are, sometimes, useful or even necessary. All the "low hanging fruit" has been picked - stuff like "printf("%d and %d", number, string)" is now captured without any special tools. So you're left with things that might be correct, but probably aren't.

Even some of that gets warned about; some classic beginners' mistakes with precedence in C will get reported as warnings by compilers these days. If you really wanted that precedence, you need merely add parentheses making it clear (to the compiler AND the poor maintenance programmer coming along after you) to shut up the warning.

LCA: Static analysis with GCC plugins

Posted Jan 23, 2010 23:04 UTC (Sat) by ikm (guest, #493) [Link] (2 responses)

> So you're left with things that might be correct, but probably aren't.

C++ is naturally quite context-dependent. And shadowing in it is quite useful -- you don't need to come up with some crazy variable names just because all the saner ones are already used. I shadow stuff all the time and it never really backfired. And I do enjoy nice, concise, context-dependent names. So I don't share the "probably aren't" part.

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 10:34 UTC (Mon) by epa (subscriber, #39769) [Link] (1 responses)

I guess the tool could give you a warning for the unintentional cases, and when you really do want to shadow a class member you can flag it with some new keyword or magic comment:

int x; /* shadow */

LCA: Static analysis with GCC plugins

Posted Feb 4, 2010 12:09 UTC (Thu) by adobriyan (subscriber, #30858) [Link]

> int x; /* shadow */

Don't overload comments.

#ifdef __CHECKER__
#define __shadow ...
#else
#define __shadow
#endif

int __shadow x;

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 17:42 UTC (Mon) by iabervon (subscriber, #722) [Link] (2 responses)

I'd actually say that the goal is finding things that are possibly correct but misleading. It is well-defined what happens if you shadow a variable in C++, but someone who encounters a variable reference two pages from the start of a method that matches one of the method arguments is likely to overlook the fact that there's a shadowing declaration on the middle page.

Similarly, the precedence of shifts is unintuitive when combined with addition and subtraction; a line that combines these without parentheses is likely to either get the wrong value, or to confuse people.
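For instance, '+' binds more tightly than '<<', which is rarely what the spacing suggests:

    unsigned encode(unsigned x) {
        return x << 2 + 1;   /* parses as x << (2 + 1), i.e. x << 3,
                                not (x << 2) + 1 */
    }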

Static analysis tools can do things better than language definitions in this area, too. A static analysis tool can decide whether to warn about something based on the textual size of the scope in which a variable is shadowed, representing an approximation of the likelihood of people not seeing the shadowing declaration, which is not something the language definition should be trying to estimate.

Shadowing a class member: wrong?

Posted Jan 29, 2010 16:35 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

I'd say an even more basic goal is to provide the coder with tools to assist in his personal coding style. If you like to use local variables that shadow class members, fine. Don't turn on this checker. But if you avoid that because you find you do it wrong too often, turn on the checker.

Personally, I wouldn't use this, but would really like to see a checker that catches every time I refer to a class member implicitly (i.e., without "this->"). To prevent confusion about what is a class member and what is local, I observe a discipline of always referring to the former with this->, but get burned sometimes when I neglect to define a local variable and implicitly get a class member instead.
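For example (a hypothetical class):

    class Widget {
        int count;
    public:
        void update();
    };

    void Widget::update() {
        /* Meant to declare a new local ("int count = 0;") but forgot the
         * declaration; this compiles cleanly and quietly assigns to the
         * member this->count instead. */
        count = 0;
    }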

Shadowing a class member: wrong?

Posted Jan 29, 2010 17:55 UTC (Fri) by iabervon (subscriber, #722) [Link]

Obviously, it has to be on a project scale instead of an individual scale; it doesn't help that you yourself never make a mistake based on variable shadowing if it's not common in a particular project and other people make mistakes because of that code.

In general, it's probably best to either always use "this->" or never use it (and never shadow members) within a given codebase. Either way works, but any mixture leads to patches which can't be reviewed easily (that is, you can't tell from the function header and 3 lines of context whether a patch is right).

LCA: Static analysis with GCC plugins

Posted Jan 22, 2010 23:24 UTC (Fri) by jonabbey (guest, #2736) [Link] (2 responses)

FindBugs is a great tool for static analysis in the Java environment. Having something similar in C++ would be tremendous.

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 18:18 UTC (Mon) by dh (subscriber, #153) [Link] (1 responses)

Thanks for this extremely useful tip. I wasn't aware of that tool so far,
but it seems truly amazing!

Best regards,
Dirk

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 20:17 UTC (Mon) by jonabbey (guest, #2736) [Link]

Yeah, I've gotten a lot of great help from FindBugs in my own Java project (Ganymede).

Kind of frightening how many bad and incorrect practices FindBugs can catch, actually.

It'll sober you right up. ;-)

MELT

Posted Jan 23, 2010 6:46 UTC (Sat) by patrick_g (subscriber, #44470) [Link] (2 responses)

For GCC 4.5 there is also the MELT plugin (Middle End Lisp Translator) from Basile Starynkevitch.
With MELT you can implement optimisations or static analysis passes in a LISP dialect, and the plugin generates C code at the end.
The tutorial is very interesting and there is also a how-to on writing a pass in MELT.

MELT

Posted Jan 23, 2010 17:44 UTC (Sat) by nix (subscriber, #2304) [Link]

MELT is an absolutely awesome tool. If only its memory consumption was
lower... but then Moore's Law (still very much in effect for RAM) will
render that unimportant soon enough.

MELT

Posted Feb 6, 2010 21:19 UTC (Sat) by bstarynk (guest, #63409) [Link]

I am the author of MELT (a GCC plugin infrastructure which provides a
higher-level lisp-like language to code GCC extensions in). See
http://gcc.gnu.org/wiki/MiddleEndLispTranslator and download the MELT
branch.

MELT does indeed need resources. The big bottleneck is not directly MELT
(the translation from MELT to C is quite quick - the entire MELT translator
is 35KLOC of MELT code which translates to nearly 600KLOC of generated C
code; the translation itself takes about 20 seconds and uses a reasonable
amount of RAM, less than 300MB); it is the compilation time of the C code
generated by MELT (and since MELT is written in MELT, you need to compile
the generated C code of the MELT translator; that may take more than an
hour if you compile that code with -O2).

MELT data (in particular closures and objects) is initialized by
straightforward but large C code; there is no serialization or reading of
data files for that. So the initialization routine of each MELT module is a
huge sequential C routine (typically 50KLOC) which essentially fills a big
structure (of thousands of fields). And the compilation of that particular
big routine is stressing the GCC compiler (in particular when optimized
with -O2).

However, I made very recently some significant progress on that issue. The
biggest generated C code file was about 6Mbytes a few weeks ago, but is now
only 3Mbytes, notably because I split its original MELT version into two
files. So if you tried MELT a month ago, you could try again!

I am also working on an alternative, non-lispy, syntax to MELT. I hope that
an infix syntax might attract more people to MELT.

Feel free to ask me more questions about MELT, perhaps on the
gcc@gcc.gnu.org mailing list (with MELT in the subject line) or directly by
email to me.

Regards.

PS. I will probably release MELT [as a gcc 4.5 plugin] as soon as gcc 4.5
is released. But you can get MELT using subversion with
svn co svn://gcc.gnu.org/svn/gcc/branches/melt-branch GCC-MELT
as a GCC branch, or you could get a 4.5 snapshot and a few files from the
MELT branch & compile it as a gcc-4.5 plugin

--
Basile Starynkevitch (the GCC MELT guy).

LCA: Static analysis with GCC plugins

Posted Jan 23, 2010 12:03 UTC (Sat) by ahornby (subscriber, #3366) [Link] (3 responses)

"Unlike eclipse it scales to a mozilla-size code base"

I can't comment on Eclipse's C++ or JavaScript modes, but the Java side definitely scales to large projects. I use it on 3-4 million line code bases regularly. The static analysis is amazing, and people use it all the time to find type hierarchies etc. - it's right there in the IDE when you right-click on a type.

An amazing tool. I used to use xemacs with all the customization, speedbar, tags etc (even wrote some modes for in house languages). Eclipse blows that away. Also massively better than vs.net - visual studio still doesn't have the static analysis or compile as you type abilities of eclipse.

Having this stuff inside your dev environment is worth its weight in gold.

LCA: Static analysis with GCC plugins

Posted Jan 23, 2010 14:38 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

" Also massively better than vs.net - visual studio still doesn't have the
static analysis or compile as you type abilities of eclipse."

VisualStudio 2010 has an online C++ error checker; it uses the EDG front-end
internally to check for errors. And it also has static analysis for C++.

LCA: Static analysis with GCC plugins

Posted Jan 24, 2010 13:10 UTC (Sun) by marcH (subscriber, #57642) [Link] (1 responses)

> An amazing tool. I used to use xemacs with all the customization, speedbar, tags etc (even wrote some modes for in house languages). Eclipse blows that away.

Eclipse has a full-featured Java compiler (JDT Core, originated in IBM VisualAge, a Smalltalk-based IDE). Everything the main article says about GCC plugins, Eclipse already has it for Java.

Programming languages are complicated. You cannot really afford to duplicate the parser in every tool. Even when you are a LISP wizard. So compiler plugins are the way to go.

> Having this stuff inside your dev environment is worth its weight in gold.

Agreed 200%

Plus refactoring

Posted Feb 3, 2010 23:33 UTC (Wed) by man_ls (guest, #15091) [Link]

Refactoring is a joy to work with in Eclipse. Move methods around, abstract a behavior, implement a subclass... Most Java code would be really rigid without this, but with Eclipse it turns into soft clay.

LCA: Static analysis with GCC plugins

Posted Jan 23, 2010 17:47 UTC (Sat) by marcH (subscriber, #57642) [Link] (11 responses)

> So anybody who wants to analyze C++ code must first find a decent parser for it. The available options are limited.

Non-free options exist; see for instance:
http://www.mathworks.com/products/polyspaceclientc/descri...

But it is nice to see free options making progress.

LCA: Static analysis with GCC plugins

Posted Jan 24, 2010 1:00 UTC (Sun) by njs (subscriber, #40338) [Link] (10 responses)

There exist exactly 3 real C++ parsers: Microsoft, GNU, and EDG. (Mathworks is almost certainly licensing from EDG.) Including proprietary options does broaden the field, but it's still pretty limited...

That said, it would certainly have been more short-term efficient for Mozilla to just license EDG for their instrumentation tools. I'm glad they didn't.

LCA: Static analysis with GCC plugins

Posted Jan 24, 2010 14:02 UTC (Sun) by marcH (subscriber, #57642) [Link] (7 responses)

> There exist exactly 3 real C++ parsers: Microsoft, GNU, and EDG.

Wow, this is a bold statement :-) Is C++ such a small world?

What about Sun compiler for instance? EDG lists it as a different "dialect".

> Mathworks is almost certainly licensing from EDG.

... by ruling out the two others?

In any case this is where the backend comes from:
http://pagesperso-orange.fr/Christele.Faure/AlainDeutsch....
This product existed years before Mathworks bought it.

LCA: Static analysis with GCC plugins

Posted Jan 24, 2010 20:34 UTC (Sun) by JoeBuck (subscriber, #2330) [Link] (3 responses)

Sun's compiler accepts a non-standard dialect, which they froze long ago and won't fix based on backward-compatibility arguments. They provide two different "standard libraries": the default, broken one and a better one based on STLPort. Getting modern C++ code to pass Sun's compiler is a real chore.

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 0:59 UTC (Mon) by roc (subscriber, #30627) [Link] (2 responses)

IBM's C++ compiler also has its own front end.

Only 3 real C++ parsers

Posted Jan 29, 2010 16:47 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

> IBM's C++ compiler also has its own front end.

Meaning what (I don't know what a front end is)? Does it use one of the three stated real C++ parsers? Does it have its own parser which, like Sun's, isn't real because it can't parse standard C++?

Only 3 real C++ parsers

Posted Jan 30, 2010 18:21 UTC (Sat) by nix (subscriber, #2304) [Link]

Front end: the thing which takes a high-level language (e.g. C++) and
yields a language-independent, relatively machine-independent
representation for optimization and translation into machine-dependent
form. (In recent versions of GCC, this intermediate representation is
GIMPLE).

(There is no formal name for this machine-independent part that I know of,
but I've always heard it referred to as the 'middle-end'. A thousand
topologists may scream in pain, but language doesn't need to make
sense. :) )

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 10:24 UTC (Mon) by dgm (subscriber, #49227) [Link] (1 responses)

>> There exist exactly 3 real C++ parsers: Microsoft, GNU, and EDG.

>Wow, this is a bold statement :-) Is C++ such a small world?

Well, how many _complete_ Java parsers exist? How many C#? Python? Perl?

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 15:30 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

I know at least 5 Java _compilers_. Granted, for Java 1.4, not 1.5. But still...

C++ parsers: A small world

Posted Jan 25, 2010 14:26 UTC (Mon) by dwheeler (guest, #1216) [Link]

Yes, it really is a small world for *real* C++ parsers. You might be able to add a few, but C++ is really painful to parse. Thus, most people try to use a pre-existing parser rather than rolling their own.

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 13:16 UTC (Mon) by epa (subscriber, #39769) [Link] (1 responses)

According to http://lwn.net/Articles/370843/, Microsoft licenses EDG's C++ parser, so that means only two exist.

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 13:59 UTC (Mon) by jwakely (subscriber, #60262) [Link]

IIUC they license it for tools within VS, but their compiler uses their own parser, not the EDG one.

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 14:03 UTC (Mon) by jwakely (subscriber, #60262) [Link] (6 responses)

> It's a common mistake to add a return statement to the middle of a function like this, shorting out the cleanup code; flow.js can catch errors like that at compile time.

It's a dumb mistake to use goto like that in C++; the cleanup code should go in a destructor. Removing that sort of fail from the mozilla code would make flow.js unnecessary.

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 14:20 UTC (Mon) by clugstj (subscriber, #4020) [Link] (5 responses)

That example was referring to kernel code, which is in "C" and won't soon be translated to "C++".

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 14:38 UTC (Mon) by jwakely (subscriber, #60262) [Link] (4 responses)

Oops, my mistake. I misread it as saying flow.js was used for checking that idiom in Mozilla code, and could also be used for other code such as the kernel. The idiom makes perfect sense in C.

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 14:45 UTC (Mon) by jwakely (subscriber, #60262) [Link] (1 responses)

LCA: Static analysis with GCC plugins

Posted Feb 10, 2010 3:44 UTC (Wed) by bzbarsky (guest, #63464) [Link]

Yeah, those are all code that used to be C code, then was converted to compiling as C++ with minimal changes made in the process. Now it's starting to use C++ idioms as needed to make the code more understandable, etc.

multiple returns in C

Posted Jan 28, 2010 3:46 UTC (Thu) by thedevil (guest, #32913) [Link] (1 responses)

Actually there's a better way to do this even in C:
static int foobar_return_handler(void* some_pointer, int destruct, int result)
{
  if (some_pointer)
    free(some_pointer);
  if (destruct)
    do_destruct();
  return result;
}

int foobar()
{
  void* some_pointer = malloc(sizeof(some_type));
  if (!some_pointer)
    return foobar_return_handler(0, 1, -1);
  int foo_result = foo(some_pointer);
  if (foo_result < 0)
    return foobar_return_handler(some_pointer, 1, foo_result);
  int bar_result = bar(some_pointer, foo_result);
  if (bar_result < 0)
    return foobar_return_handler(some_pointer, 1, bar_result);
  return foobar_return_handler(some_pointer, 0, bar_result);
}

multiple returns in C

Posted Jan 28, 2010 10:04 UTC (Thu) by dgm (subscriber, #49227) [Link]

Another idiom I often use is separating resource allocation from actual computation:
int foobar1 (some_type * some_pointer)
{
  int result = foo (some_pointer);
  if (result < 0)
    result = bar (some_pointer, result);
  return result;
}

int foobar()
{
  int result = -1;
  void * some_pointer = malloc (sizeof (some_type));
  if (some_pointer) {
    result = foobar1 (some_pointer);
    free (some_pointer);
  }
  return result;
}

LCA: Static analysis with GCC plugins

Posted Jan 25, 2010 21:03 UTC (Mon) by davide.del.vento (guest, #59196) [Link]

Note that milepost uses machine learning, the exact opposite of the brute-force approach that you mentioned. And maybe you want to link its own website, namely http://ctuning.org/, instead of a third-party address.

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 26, 2010 5:13 UTC (Tue) by pjm (guest, #2080) [Link] (13 responses)

The article fails to point out an important limitation of using a gcc plugin (or other gcc output) for static analysis: the analysis result is only valid for *this build*, and is not in general valid even for different builds for the same platform, and certainly not guaranteed for when other people build the software from source with a different compiler.

For example, different compilation options can lead to different __builtin_constant_p results (and hence different code paths and different analysis results); and obviously different #include file contents or different architectures can lead to different analysis results.

This isn't a mere theoretical concern: an example that's already bitten the Mozilla folk is a bug caused by argument evaluation order being unspecified in C[*1], and indeed differing in practice among platforms of interest to Mozilla.

This limitation may be fine for proprietary software and other projects where most users use original-developer-supplied binaries, whereas most open-source projects must take gcc-based analysis results with a grain of salt.

That's not to say that gcc-based static analysis isn't useful for source-based projects (and Mozilla use it despite being aware of the limitation): it can still point to cases where a bad thing (buffer overflow or whatever) is possible, but it can't show that something is impossible. Thus, for example, it can't with certainty say that a section of code is dead code.

[*1] Ref: Section ‘Function calls’ in the C99 spec, para #10.

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 26, 2010 11:06 UTC (Tue) by marcH (subscriber, #57642) [Link] (11 responses)

> an example that's already bitten the Mozilla folk is a bug caused by argument evaluation order being unspecified in C[*1], and indeed differing in practice among platforms of interest to Mozilla.

Surely you can plug into the gcc parser while ignoring gcc's evaluation order? And consider all possible orders in your analysis.

By the way two different results are also allowed for this:
( x-- ) * ( x++ )

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 26, 2010 11:54 UTC (Tue) by pjm (guest, #2080) [Link] (10 responses)

(i) If I understand the suggestion correctly, then the results would still be specific to a single preprocessor output. Making an analysis valid for all possible <stdio.h> etc. contents involves much more work. Though I'll grant that making an analysis specific to a single set of preprocessor outputs is better than making an analysis specific to a single build.

(ii) By using less of gcc, there's more work for the analysis code to do. For example, you need to work out what all the valid evaluation orders are, and all the possibilities for other things unspecified in C. Which brings us to:

(iii) ‘x-- * x++’ is actually undefined rather than unspecified[*1], so it would be a bug for such code to exist at all; but yes, there are many things that the C spec describes as being unspecified (i.e. having more than one allowable behaviour).

[*1] Ref: Section ‘Expressions’ in the C99 spec, para #2. Or para #3 of section ‘Expressions’ in the C++ spec, if we're talking about C++ code such as Mozilla.

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 26, 2010 12:54 UTC (Tue) by marcH (subscriber, #57642) [Link] (6 responses)

> If I understand the suggestion correctly, then the results would still be specific to a single preprocessor output.

Yes, I was only commenting on the evaluation order example. I feel like this specific example does not really make your argument shine.

> By using less of gcc, there's more work for the analysis code to do. For example, you need to work out what all the valid evaluation orders are, and all the possibilities for other things unspecified in C.

Yes. But I do not think you can seriously escape from working out all the possible evaluation orders, can you? Apart from switching to a better specified language (or to one discouraging side-effects). Because even gcc could change its evaluation order from one release to the next.

> x-- * x++; is actually undefined rather than unspecified[*1]

Oops, sorry. It is also too convoluted. So maybe this one is unspecified but defined:

( x++ ) * x

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 26, 2010 14:14 UTC (Tue) by pjm (guest, #2080) [Link] (1 responses)

> But I do not think you can seriously escape from working out all the possible [behaviours of unspecified constructs], can you?

We should clarify that one needn't explicitly enumerate the possible behaviours, but yes, the fact that there are many valid behaviours for a given set of .c/.cc files is ultimately why analysis using gcc plugins doesn't in general give reliable results for Free Software (in the sense more precisely described in my initial post).

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 26, 2010 21:22 UTC (Tue) by paulj (subscriber, #341) [Link]

A static analyser should be able to warn just fine about multiple stores to
an object with no intervening sequence point (aliases excepted). Such code is
inherently buggy.

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 26, 2010 22:16 UTC (Tue) by Azazel (subscriber, #3724) [Link] (3 responses)

> > x-- * x++; is actually undefined rather than unspecified[*1]
>
> Oups, sorry. It is also too convoluted. So maybe this one is unspecified but defined:
>
> ( x++ ) * x

Also undefined, I believe. Consider:

x++ * x

is equivalent to:

tmp = x
x = x + 1
tmp * x

However, C and C++ do not guarantee the order in which the last two occur, so the
final result may be:

tmp * tmp

or

tmp * (tmp + 1)

or something else entirely (e.g., nasal daemons) if the compiler is clever
enough to notice.

Az.

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 26, 2010 23:11 UTC (Tue) by marcH (subscriber, #57642) [Link] (2 responses)

> tmp * tmp
> or
> tmp * (tmp + 1)
> or
> something else entirely

I think it is NOT "something else entirely", but either the first or the second. This is what I meant by "unspecified but defined".

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 28, 2010 4:35 UTC (Thu) by BenHutchings (subscriber, #37955) [Link] (1 responses)

Evaluation of an expression that involves reading and writing the same memory location, or writing to it twice, has undefined behaviour, except where (1) a single operation combines reading and writing (e.g. ++) or (2) the two operations are ordered by a 'sequence point'. (Note that 'sequence points' are not actually points in a sequence, but define a partial ordering relation.)

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 28, 2010 9:59 UTC (Thu) by marcH (subscriber, #57642) [Link]

My bad, I trusted Paragraph 7.12 "Order of evaluation" in Harbison and Steele's venerable "C Reference manual". Quoting it:
> but the effect will be as if it chose one argument, evaluated fully, then chose another argument,
> [...]
> similar holds for binary expressions

Search "but the effect will", page 254 in: http://www.careferencemanual.com/errata.htm

Anyway I doubt there is any good reason to ever throw a side-effect into a larger expression. I find this discussion entertaining, but actually relevant to bad programmers only.

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 27, 2010 16:19 UTC (Wed) by jzbiciak (guest, #5246) [Link] (2 responses)

> (ii) By using less of gcc, there's more work for the analysis code to do. For example, you need to work out what all the valid evaluation orders are,

You just need to scan the list once and work out if any of the arguments have side effects. If two arguments have side effects on objects that may alias (and I imagine you can consult GCC's alias analysis to get a "YES/MAYBE/NO" answer), throw a warning.

That doesn't require generating and analyzing all N! potential orderings among N arguments. It just requires an O(M^2) loop to check pairwise for all pairs, where M is the number of arguments with side effects. M is generally rather small, too.

And I'd hope GCC has functions to help you find the arguments with side effects.

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 28, 2010 0:15 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> If two arguments have side effects on objects that may alias throw a warning.

I think a warning should be thrown even if just one argument has a side-effect on any other.

But to be honest a side-effect in argument passing should not even pass code reviews in the first place. Any side-effect deserves at least a (practically free) temporary variable.

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 28, 2010 16:20 UTC (Thu) by jzbiciak (guest, #5246) [Link]

> I think a warning should be thrown even if just one argument has a side-effect on any other.

You know, that's closer to what I really meant. I guess I got in a hurry and didn't completely think through what I was suggesting. Basically, if any argument looks like it could potentially modify any other argument, flag it.

> But to be honest a side-effect in argument passing should not even pass code reviews in the first place. Any side-effect deserves at least a (practically free) temporary variable.

I agree. There was a similar (but not identical) situation that bit me once when porting DOOM to one of our DSPs at work for a demo. In the case of DOOM, it wasn't argument order to a function call, but rather order of evaluation of terms in an expression.

The weird behavior I was seeing was that the built-in demo (which serves as something like a test case) wouldn't run correctly because the player and/or bad guys wouldn't take exactly the same course they did when the demo was recorded. The DOOM engine is supposed to be deterministic--if you run through the course in exactly the same way every time, then everything should behave identically.

DOOM has a "random number" function (P_Random()) that just returns the next unsigned byte out of a table of random bytes, keeping a pointer to which one is next. It resets this pointer at the start of the game. This allows the engine to ask for a "random" number and get a deterministic stream each time. That function was fine. The problem was with how it was being called.

There were several places in the code where the engine wanted a signed random number that had more of a triangular distribution. To get that, it would do this: P_Random() - P_Random(). That idiom appears in about a dozen places in the code.

You can guess what happened. GCC would make the two function calls in left-to-right order. Our DSP's compiler, however, made them in right-to-left order. *d'oh*

Same class of problem (getting burned by side effects where order isn't defined), just a different manifestation. It was this particular case I had in mind when I made my original comment. That said, your point is still quite valid. foo(x, ++x) is no better than foo(++x, ++x).

Caveat: GCC-based analysis unreliable for Free Software

Posted Jan 27, 2010 7:10 UTC (Wed) by njs (subscriber, #40338) [Link]

Anyway, AFAICT a lot of what they want to do isn't fancy provably correct elimination of dead code, etc., but "simple" rewriting (e.g., to tear out pointless abstraction layers), where just getting a real AST is enough to get started (though macros etc. will still make your life real interesting).


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds