
A static-analysis framework for GCC

By Jake Edge
December 4, 2019

One of the features of the Clang/LLVM compiler that has been rather lacking for GCC may finally be getting filled in. In a mid-November post to the gcc-patches mailing list, David Malcolm described a new static-analysis framework for GCC that he wrote. It could be the starting point for a whole range of code analysis for the compiler.

According to the lengthy cover letter for the patch series, the analysis runs as an interprocedural analysis (IPA) pass on the GIMPLE static single assignment (SSA) intermediate representation. State machines are used to represent the code parsed and the analysis looks for places where bad state transitions occur. Those state transitions represent constructs where warnings can be emitted to alert the user to potential problems in the code.

There are two separate checkers that are included with the patch set: malloc() pointer tracking and checking for problems in using the FILE * API from stdio. There are also some other proof-of-concept state machines included: one to track sensitive data, such as passwords, that might be leaked into log files and another to follow potentially tainted input data that is being used for array indexes and the like.

The malloc() state machine, found in sm-malloc.cc (which is added by this patch), looks for typical problems that can occur with pointers returned from malloc(): double free, null dereference, passing a non-heap pointer to free(), and so on. Similarly, one of the patches adds sm-file.cc for the FILE * checking. It looks for double calls to fclose() and for the failure to close a file.
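To make the idea of "bad state transitions" concrete, here is a toy model in Python. It is only a sketch and is not how the GCC patches are implemented: the state names, the event names (such as "addr_of_local" and "null_check"), and the check_trace() function are all invented for illustration, and it follows a single straight-line path rather than doing any interprocedural analysis.

```python
# Toy model only: NOT GCC's implementation. All names here are hypothetical.
from enum import Enum, auto


class PtrState(Enum):
    START = auto()      # nothing known about the pointer yet
    UNCHECKED = auto()  # returned by malloc(), not yet compared against NULL
    NONNULL = auto()    # returned by malloc() and known to be non-NULL
    NON_HEAP = auto()   # known to point at non-heap memory (e.g. a local)
    FREED = auto()      # already passed to free()


def check_trace(events):
    """Walk (operation, pointer_name) pairs along one code path and
    return a warning for each bad state transition encountered."""
    state = {}
    warnings = []
    for op, ptr in events:
        current = state.get(ptr, PtrState.START)
        if op == "malloc":
            state[ptr] = PtrState.UNCHECKED
        elif op == "addr_of_local":
            state[ptr] = PtrState.NON_HEAP
        elif op == "null_check":
            # Models continuing down the branch where the pointer is non-NULL.
            if current is PtrState.UNCHECKED:
                state[ptr] = PtrState.NONNULL
        elif op == "deref":
            if current is PtrState.UNCHECKED:
                warnings.append(f"possible NULL dereference of '{ptr}'")
        elif op == "free":
            if current is PtrState.FREED:
                warnings.append(f"double free of '{ptr}'")
            elif current is PtrState.NON_HEAP:
                warnings.append(f"free of non-heap pointer '{ptr}'")
            else:
                state[ptr] = PtrState.FREED
    return warnings
```

Feeding it the path malloc, dereference, free, free for a pointer p yields two warnings: a possible NULL dereference and a double free. Adding a "null_check" event before the dereference and dropping the second free yields none.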

The handling of diagnostic output required additional features to support the new types of warnings. For example, in order to provide more easily interpreted warnings, the code path leading to a detected problem needs to be determined and stored. That information, including the warning triggered and the locations in the code (both line number and position in the line) that trigger the warning, will be recorded so that it can be displayed by the compiler. There are examples in the cover letter as well as links to some colorized output such as this one (also seen below).

[Warning message]

Beyond that, Malcolm also extended the diagnostic facility to allow more metadata to be added to the warnings. In particular, he linked them to entries in the Common Weakness Enumeration (CWE) list, but other kinds of metadata could also be associated with the diagnostic message. In a terminal that is capable of it, the CWE number is a clickable link to the entry's web page (e.g. CWE-690).

The analyzer pass is invoked with the ‑‑analyzer command-line option; there are options to turn on or off the individual warnings as well. It is implemented, currently, as a GCC "in-tree" plugin—one that would be distributed with GCC itself. But Richard Biener suggested that it might be better to simply build the analyzer into GCC—with a configuration option to disable it. He also mentioned rewriting the GCC plugin API, perhaps with an eye toward plugins that could work with both GCC and LLVM, but that is clearly a much longer term project.

Malcolm said that he chose to use the plugin API in part as a way to indicate the immaturity of the code:

My reasoning here is that the analyzer is middle-end code, but isn't as mature as the rest of the middle-end (but I'm working on getting it more mature).

I want some way to label the code as a "technology preview", that people may want to experiment with, but to set expectations that this is a lot of new code and there will be bugs - but to make it available to make it easier for adventurous users to try it out.

[...] I went down the "in-tree plugin" path by seeing the analogy with frontends, but yes, it would probably be simpler to just build it into GCC, guarded with a configure-time variable. It's many thousand lines of non-trivial C++ code, and associated selftests and DejaGnu tests.

In its current state, the analyzer adds roughly 2.5% to the GCC code base, but that did not deter Jakub Jelinek and Biener from preferring it to simply be built into GCC. Malcolm seems favorably disposed, as well, so that switch will be coming. In the meantime, he has posted an update to the patch set to fix some link-time optimization (LTO) compatibility issues.

The "Rationale" section of the cover letter describes the motivation and goals behind the analyzer project:

There's benefit in integrating a checker directly into the compiler, so that the programmer can see the diagnostics as he or she works on the code, rather than at some later point. I think that if the analyzer can be made sufficiently fast that many people would opt-in to deeper but more expensive warnings. (I'm aiming for 2x compile time as my rough estimate of what's reasonable in exchange for being told up-front about various kinds of pointer snafu).

Overall, the reaction has been quite positive to the idea; there has been some code review going on in the thread as well. As Eric Gallager noted, there have been lots of user requests over the years for warnings of the sort that the analyzer could produce. At this point, it looks like there is a way forward to address that missing feature in GCC. With luck, a few years down the road ‑‑analyzer will be widely used in the free-software world, which can only help produce better code for our projects.




A static-analysis framework for GCC

Posted Dec 5, 2019 9:16 UTC (Thu) by error27 (subscriber, #8346)

The malloc test looks like how I used to write Smatch checks ten years ago. It's a mistake. One of the comments raises that question: "// TODO: or should this be a different state machine?" The answer is yes, you want lots of little checks.

There are too many states. You don't need start state, or end state. The state transitions are actually not interesting at all. What you do need is a &merged state. In Smatch merging groups of states is where all the magic happens.

First create a module to record all the values of all the variables. The malloc() test is tracking conditions to try to tell when the pointer is NULL vs non-NULL, which makes every check too complicated. The value tracking module will be complicated but it's re-used by everything.

In Smatch there are two automatic states which every check inherits: &undefined and &merged. Then most checks will just have one state after that. There would be three separate checks for unchecked malloc, double frees, and memory leaks. If we are dereferencing a variable and the value tracker says it can be NULL and the malloc check says it is &malloced, then print a warning about an unchecked malloc. The &freed check is even easier. If we're dereferencing or freeing a variable and it's &freed on any path, then complain.
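A rough sketch of that design, written as a toy Python model. None of this is Smatch's actual API: the function names are invented, each variable's state is modeled as a set of possible states so that "merged" is simply a set with more than one member, and the value tracker is stood in for by a plain set of variable names that may be NULL.

```python
# Hypothetical sketch; these names are illustrative and are not Smatch's API.
UNDEFINED = "undefined"
MALLOCED = "malloced"
FREED = "freed"


def merge_states(left, right):
    """Join two per-path {variable: set of possible states} maps where
    control flow merges. A variable missing on one path is UNDEFINED there."""
    return {
        var: left.get(var, {UNDEFINED}) | right.get(var, {UNDEFINED})
        for var in left.keys() | right.keys()
    }


def unchecked_malloc_on_deref(var, malloc_states, can_be_null):
    """Unchecked-malloc check, run at a dereference of var.
    can_be_null stands in for the shared value tracker."""
    return MALLOCED in malloc_states.get(var, {UNDEFINED}) and var in can_be_null


def freed_on_deref_or_free(var, freed_states):
    """Freed check, run at a dereference or free() of var: complain if the
    variable was freed on any of the merged paths."""
    return FREED in freed_states.get(var, {UNDEFINED})
```

Each check keeps only its own small state map, and the NULL versus non-NULL question is answered once, by the value tracker, rather than inside every check.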

Smatch has a leak checker but it's too conservative so it misses a lot of bugs. I guess in this case you really would want to re-use the allocation and free information but I would still make it a separate check. I would export an is_freed() function from the freed check. Then the approach would be at the end of parsing a function 1) Are we on an error path (use the value tracker for this)? Smatch is pretty kernel centric but you could have a project specific hook for this. 2) If so, iterate through all the allocated pointers. 3) Complain if they are non-NULL and not is_freed().
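The three steps could be sketched like this, again as a toy Python model rather than anything taken from Smatch. The find_leaks() name and its parameters are assumptions made for illustration; is_non_null stands in for the value tracker and is_freed for the function exported by the freed check.

```python
# Hypothetical sketch of the end-of-function leak check described above.
def find_leaks(on_error_path, allocated, is_non_null, is_freed):
    """Return the allocated pointers to complain about at function exit.

    on_error_path: result of step 1 (project-specific in practice)
    allocated:     names of pointers allocated in this function
    is_non_null:   callable standing in for the value tracker
    is_freed:      callable standing in for the freed check's export
    """
    # Step 1: only report on error paths.
    if not on_error_path:
        return []
    # Steps 2 and 3: iterate over the allocated pointers and keep those
    # that are non-NULL and have not been freed.
    return [ptr for ptr in allocated if is_non_null(ptr) and not is_freed(ptr)]
```

Keeping this as its own check means the allocation and free information is reused without growing the other checks.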

Smatch is ten years ahead of where GCC is at this point. But you could just copy Smatch in just a year or two because I took so many wrong turns and because I am a slow typist...

A static-analysis framework for GCC

Posted Dec 5, 2019 12:04 UTC (Thu) by roblucid (guest, #48964)

This is such a generous awesome post to read, I do hope cross-fertilisation of ideas and experience benefits bi-directionally!

A static-analysis framework for GCC

Posted Dec 6, 2019 11:59 UTC (Fri) by jezuch (subscriber, #52988)

I'm a sucker for static analysis, so even though I don't use C or C++ anymore, it's a huge YES from me!

A question, though: how many warnings does it report on gcc's code base itself? :)


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds