|
|
Log in / Subscribe / Register

Improvements to GCC's -fanalyzer option

By Jonathan Corbet
September 23, 2021

LPC
For the second year in a row, the GNU Tools Cauldron (the annual gathering of GNU toolchain developers) has been held as a dedicated track at the online Linux Plumbers Conference. For the 2021 event, that track started with a talk by David Malcolm on his work with the GCC -fanalyzer option, which provides access to a number of static-analysis features. Quite a bit has been happening with -fanalyzer and more is on the way with the upcoming GCC 12 release, including, possibly, a set of checks that have already found at least one vulnerability in the kernel.

When GCC is invoked with -fanalyzer, it runs a module that creates an "exploded graph" combining information on the state of the program's control and data flow. That state includes an abstract representation of memory contents, known constraints on the values of variables, and information like whether the code might be running in a signal handler. The analyzer then uses this graph to try to explore all of the interesting paths through the code to see what might happen.

The GCC 10 release added 15 new warnings for potential errors like freeing memory twice, using memory after freeing it, unsafe calls within signal handlers, possible writing of sensitive data to a log file, and more. The [David Malcolm] double-free detection was the motivating factor that led to the development of many of those checks. Five more warnings came with GCC 11, including the ability to check whether memory obtained from one allocator is not accidentally freed to a different one and a check for undefined behavior in bit shifts. Also in GCC 11 is support for plugins that extend the analyzer. One example plugin can be found in the test suite; it checks for misuse of the Python global interpreter lock (GIL).

Recently, Malcolm was wondering what he should focus on for GCC 12. He had originally thought that he would add C++ support, but then concluded that he would rather improve the functionality for C instead. Before coming to that conclusion, he had implemented new and delete support in GCC 11 and was looking at exception handling as the next challenge, but that turned out to be "quite involved". Meanwhile, Google Summer of Code student Ankur Saini had added support for virtual functions. So some progress had been made, but Malcolm's work will continue to be focused on C for now.

There were a couple of problems for C that drew his attention, one of which is buffer-overflow detection. He has an experimental implementation that can capture the size of dynamic allocations in symbolic form and issue diagnostics about reads and writes that might go beyond that size. But determining when those warnings should be generated is hard; the code as it is now produces far too many false positives ("a wall of noise") and is not really useful.

It occurred to him, though, that there could be a way to find one specific class of problems: places where an attacker is able to influence whether an access is valid or not. That leads to the problem of taint detection and determining where the trust boundaries in the program are, which turns out to be a hard problem for arbitrary C code. It might, though, be possible to annotate specific programs and get useful diagnostics. His attention went to the kernel, which has a well-defined trust boundary and an API for moving data across that boundary. By annotating functions like copy_from_user() and system-call handlers, it might be possible to find code that does not properly sanitize user-supplied data.

GCC has an attribute (access) that describes how data moves through a particular variable or function. Malcolm added two new values (untrusted_read and untrusted_write) for that attribute to mark data that is read from (or written to) an untrusted location. Thus, for example, the data read into the kernel by copy_from_user() would be marked as untrusted_read. He added a new tainted attribute for functions as well; it indicates that all arguments to that function should be treated as untrusted. By tweaking one macro in the kernel headers, he was able to mark all of the system-call handlers in the kernel with this tainted attribute. Similar things can be done with, for example, callbacks for kernel-generated filesystems.

With those annotations, there are two categories of problems that the analyzer can detect: information leaks and using tainted data. Information leaks happen when uninitialized data is written back to user space. This case is relatively easy to detect — or, at least, he thought it would be. As an example of this type of problem, Malcolm brought up CVE-2017-18549, a driver bug that wrote random stack data back to user space. The uninitialized data that was written in this case was padding within a structure that had otherwise been fully initialized; the analyzer was able to find this problem. Getting this to work required refactoring the handling of uninitialized-data tracking; it was not a small task.

A similar problem comes about in code that reads data from user space, modifies it, then copies the result back, possibly to a different location. If the read fails, the kernel may be working with uninitialized data, which it will then duly write to user space. Handling this required bifurcating the analysis to handle the case when copy_from_user() fails. Once that was done, the analyzer also gained the ability to handle realloc(), which has three possible outcomes.

The tainted-data case comes about when, for example, a user-supplied value is used as an array index. It is harder to detect but also seems more important, since vulnerabilities of this type can often be exploited to compromise the kernel. Consider, for example, CVE-2011-0521, where kernel code would read a signed "size" value, check it against the maximum allowable value, then use it without checking for a negative value. The improved analyzer is able to catch this case.

He is still working on a prototype implementation of this functionality to show to the world; as part of that, he has developed the world's worst kernel module, which contains as many problems as he can come up with. Making the analyzer work with the full kernel, though, is complicated by the fact that the kernel uses a lot of inline assembly code. He has added some basic handling for this, but it doesn't look at the actual opcodes.

He's been running the result on upstream kernels, and has found one real vulnerability already; it has been reported but is not yet fixed or disclosed. For this reason, Malcolm's latest work still lives in an internal company repository; it finds vulnerabilities and he doesn't want to release a zero-day-finding tool until the problems it turns up have been fixed. Much of the rest of the work is in the GCC 12 trunk now. He hopes to be able to finish this work and upstream it by the end of the GCC 12 stage 1 period (this page describes the GCC development-cycle stages).

More information on this project can be found in the GCC wiki.

Index entries for this article
ConferenceLinux Plumbers Conference/2021


to post comments

Improvements to GCC's -fanalyzer option

Posted Sep 23, 2021 23:17 UTC (Thu) by ncm (guest, #165) [Link]

Quality reportage like this keeps me renewing my subscription.

Improvements to GCC's -fanalyzer option

Posted Sep 23, 2021 23:57 UTC (Thu) by ncm (guest, #165) [Link] (9 responses)

On the topic of C++, detecting use of op[] on a moved-from std::vector object, or op*/op-> on a moved-from std::unique_ptr would very welcome. Numerous other always-wrong operations on moved-from library objects may be caught, each a potential bug.

There are even more such to be found on user-defined objects with library-object members. I.e., a user-defined object with a std::unique_ptr member and compiler-generated move of the member is in the same condition as the moved-from std::unique_ptr. The user-defined, (commonly) inline delegating members that use it can be looked through, bypassing the abstraction.

Static analysis for C++ is arguably more valuable than for C. While C is overwhelmingly more likely to contain bugs you would hope to catch, C users are even more unlikely to use your analyzer, or to act on its advice: If they cared that much about correctness, they would have abandoned C long ago. (E.g., even Sqlite developers, who care much than average, famously suppress compiler warnings.) And, since C++ and its Standard Library implicitly provide the compiler enormously more information about the programmer's intentions, it has that much more to work with.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 2:40 UTC (Fri) by Paf (subscriber, #91811) [Link] (7 responses)

“ While C is overwhelmingly more likely to contain bugs you would hope to catch, C users are even more unlikely to use your analyzer, or to act on its advice: If they cared that much about correctness, they would have abandoned C long ago. (E.g., even Sqlite developers, who care much than average, famously suppress compiler warnings.)”

This is such a strange series of statements. I work on a large-ish public C project which uses warn-as-error. This is standard in the other C large projects I’ve touched on as well, and the kernel just started doing it too. If it’s true the SQLite people are suppressing warnings (in any general sense, there may be some specific one they object to), then that’s both terrible and unusual.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 3:32 UTC (Fri) by ncm (guest, #165) [Link] (3 responses)

Congratulations on heeding warnings. Would that your colleagues all did, too. That the Linux kernel "just started" hints at the more common pattern. Don't know about the BSDs.

SQLite has a manifesto you may consult. As noted, they do care much [more] than average, but have their own way of acting on it.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 5:38 UTC (Fri) by ncm (guest, #165) [Link] (2 responses)

... and, I would be remiss not to mention that you may switch to compiling with a C++ compiler anytime, with typically just a few minutes' work per 10 or 50 kloc. Then you can begin modernizing at leisure to claim the extra static analysis C++ already does; probably starting in the more actively evolving parts of your system, deleting now-redundant utility apparatus as you go. Gcc and gdb did this some time ago, without drama, and have reaped serious benefit since.

The same was just done in the Transmission torrent client: https://news.ycombinator.com/item?id=28503546

This would not mean that a library, if that is what you have, would then only be usable by C++ clients. A C++ compiler is perfectly able to generate libraries with the same, old C ABI.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 11:45 UTC (Fri) by ballombe (subscriber, #9523) [Link] (1 responses)

I agree with this advice except that C++ is moving away from C compatibility.

Improvements to GCC's -fanalyzer option

Posted Sep 27, 2021 0:58 UTC (Mon) by ncm (guest, #165) [Link]

Rather, C is moving away from C++ backward-compatibility, although they recently abandoned "variable-length arrays", the most incompatible of the recent moves.

But it is still quite easy to upgrade even bleeding-edge C to C++. And, of course, the overwhelming majority of C code is very far from that.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 3:57 UTC (Fri) by wahern (subscriber, #37304) [Link] (2 responses)

I just downloaded the latest release of SQLite: https://www.sqlite.org/2021/sqlite-autoconf-3360000.tar.gz

The default build on macOS does not pass any -W flags, either to enable or disable diagnostics. However, adding -Wall doesn't result in any diagnostic messages, either.

My guess is that they likely develop with diagnostics enabled, but refrain from enabling them in release builds. Compilers, including GCC and clang, have a somewhat checkered history when it comes to changing the scope of diagnostics, as well as adding new diagnostics to aggregated sets like -Wall, resulting in false positives and, when using -Werror or something similar, build failures. IOW, enabling diagnostics in release builds poses forward compatibility issues.

If a project is as well maintained as SQLite, seeing constant development using up to date toolchains, there's little reason to enable diagnostics in release builds. You're just inviting a stream of patches to "quiet" those diagnostics with obfuscating code. See, e.g., the fopen suppression hack mentioned elsethread despite the -fanalyzer diagnostic being a false positive, at least as described. (If fopen could return stdin, the environment would need to also support balanced fclose calls, either by making fclose a noop or using a reference counter. It's common for static analyzers to fail to find and prove the code path that balances fopen/fclose, malloc/free, etc, especially in the absence of annotations. A language like Rust only works because it leans heavily on some very strict lexical rules, making flow analysis in the resulting graph more tractable.)

Most projects aren't as well maintained as SQLite, though. -Wall is a good default, and the stream of mostly nuisance diagnostic fixes is the price you pay for relying on the internet to give you notice when compiler diagnostics have changed (improved and/or regressed--not mutually exclusive qualities).

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 5:49 UTC (Fri) by ncm (guest, #165) [Link]

We don't need to guess; they explained their policy at some length right in their FAQ: https://sqlite.org/faq.html#q17 . Their policy has actually changed, somewhat; constant pestering from users about warnings has led them to silence the warnings they get in their own builds, anyway.

C makes it hard enough to deduce programmers' true intent that warnings there are a hit-or-miss affair. More-modern languages offer more scope, and often allow what would at best be a warning in C to be a hard error instead.

Improvements to GCC's -fanalyzer option

Posted Dec 11, 2021 4:25 UTC (Sat) by bartoc (guest, #124262) [Link]

Yeah, that's what they do, and I kinda agree with it, users generally aren't going to bother fixing your warnings, so as long as you're compiling with them it seems OK for the "released" build scripts to compile without. Making them errors in releases is _insanely annoying_ because it's not forward compatible, compilers add new warnings all the time.

There's a reason build systems use -isystem for dependencies, and a reason that even msvc (after many years of resisting it) added an equivalent flag.

Improvements to GCC's -fanalyzer option

Posted Sep 27, 2021 5:02 UTC (Mon) by kccqzy (guest, #121854) [Link]

Detecting the use of moved-from objects in C++ has already been implemented in ClangTidy: https://clang.llvm.org/extra/clang-tidy/checks/bugprone-u... It actually caught a few genuine bugs for me. It's not part of Clang itself but rather a separate tool, but then in codebases I work on it's run automatically.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 1:58 UTC (Fri) by dougg (guest, #1894) [Link] (10 responses)

Gave it a spin on around 100k CLOC and it found two issues. One I'm prepared to concede; the other: it said this was a leak:

if (fp && (stdin != fp))
fclose(fp);

Earlier it that function if the user gave '-' as the filename, stdin was placed in fp. Otherwise fp=fopen() was called with a separate error patch if fopen() failed. Interesting, that is a leak iff fopen() returns stdin! Can that happen? Not sure but I fixed it with a new boolean called del_fp which is only set to true if fopen() succeeds.

That was with gcc (Ubuntu 11.1.0-1ubuntu1~21.04) 11.1.0 , looking forward to gcc 12. Good work.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 6:57 UTC (Fri) by zuki (subscriber, #41808) [Link] (8 responses)

> that is a leak iff fopen() returns stdin! Can that happen?

No.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 13:40 UTC (Fri) by felixfix (subscriber, #242) [Link] (6 responses)

Been a long time since I mucked about with C, but if you close stdin, could the next open() return stdin? Or are the first three sacrosanct even if not already open?

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 14:31 UTC (Fri) by zuki (subscriber, #41808) [Link] (5 responses)

If you close file descriptor 0, the next time a file is opened, you would indeed get fd 0. So if you close fd 0 and call fopen(), you'll get a file object which uses fd 0. Since this file object has fd 0, it "is" stdin in the sense that C code would try to read from fd 0 if instructed to read stdin. But this file object still wouldn't be equal to stdin the variable. So the original cleanup code was safe.

(Or in other words: you can have multiple FILE objects using the same file descriptor. This is unlikely to work well, but it's possible. Those objects have different locations in memory and compare as different. And fopen() always returns a new object. So that object can never compare equal to stdin.)

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 14:46 UTC (Fri) by excors (subscriber, #95769) [Link] (3 responses)

> If you close file descriptor 0, the next time a file is opened, you would indeed get fd 0. So if you close fd 0 and call fopen(), you'll get a file object which uses fd 0. Since this file object has fd 0, it "is" stdin in the sense that C code would try to read from fd 0 if instructed to read stdin. But this file object still wouldn't be equal to stdin the variable. So the original cleanup code was safe.

What if you call 'fclose(stdin);'? C99 says "The value of a pointer to a FILE object is indeterminate after the associated file is closed (including the standard text streams)", implying that closing the standard text streams is allowed, and I don't see anything else to forbid it. Then the expression "stdin != fp" is comparing fp to an indeterminate value, so I guess it's undefined behaviour; but if it wasn't undefined, and you assumed some reasonable compiler/library implementation where pointers are just numerical addresses, a subsequent call to 'fp = fopen("foo.txt");' might return a FILE* whose value is numerically equal to the now-closed stdin.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 14:54 UTC (Fri) by zuki (subscriber, #41808) [Link] (2 responses)

This is true. But the OP said:
> Earlier it that function if the user gave '-' as the filename, stdin was placed in fp.
> Otherwise fp=fopen() was called with a separate error pat[h] if fopen() failed.
I'm just saying is that the original code was safe.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 21:20 UTC (Fri) by khim (subscriber, #9252) [Link] (1 responses)

It's wasn't correct according to the standard. Indeed, if you close an stdin then next time fopen is called it's valid for it to return address of stdin.

Most implementations allocate stdin/stdout/stderr statically, though. Thus it can not happen.

P.S. Of course if you talk to compiler developers then they would bring pointer provenance rules to that discussion and because noone knows what these rules are the debate of whether it's even allowed to access stdin to compare it to fp after it was closed would become something to spend whole scientifc articles on.

Improvements to GCC's -fanalyzer option

Posted Sep 25, 2021 0:38 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

> It's wasn't correct according to the standard.

That's an overstatement. It was incorrect, *if and only if* there is at least one code path, anywhere in the program, which actually closes stdin. As long as no such code path exists, then there's nothing wrong with assuming the value of stdin never changes. Closing stdin is poor code style (as it breaks functions like scanf), especially in library code (you don't own the standard streams, those are the property of the application code), and redirecting with freopen is safe according to POSIX (i.e. the FILE pointer remains valid, unless the open fails), so it's entirely plausible that no such code path exists, even in a large and complex application.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 20:20 UTC (Fri) by dskoll (subscriber, #1630) [Link]

In glibc, it looks like stdin, stdout and stderr are in statically-allocated variables. See the source. So I don't think fopen can ever return a value that compares the same as stdin.

The static variables themselves are defined in libio/stdfiles.c

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 19:03 UTC (Fri) by kreijack (guest, #43513) [Link]

> > that is a leak iff fopen() returns stdin! Can that happen?

> No.

I am not sure about that. Behind the return value of fopen(2) there is a malloc [1]. So if you fclose(stdin); and reopen a file with fopen(3), I think that it is *possible* that the latest freed region is returned.

bionic libc, doesn't free the memory region during the fclose; it still use this region for a possible next fopen.
glibc class free(3); so it is possible that a subsequent malloc(3) returns the same region.

It is implementation dependent. For the bionic libc, to me it seems that it is always true (for *stdin*). For glibc it seems no.

[1] the bionic libc does that; the same it is true for glibc

Improvements to GCC's -fanalyzer option

Posted Oct 8, 2021 13:44 UTC (Fri) by HelloWorld (guest, #56129) [Link]

if (fp && (stdin != fp)) fclose(fp);

Earlier it that function if the user gave '-' as the filename, stdin was placed in fp. Otherwise fp=fopen() was called with a separate error patch if fopen() failed. Interesting, that is a leak iff fopen() returns stdin! Can that happen? Not sure but I fixed it with a new boolean called del_fp which is only set to true if fopen() succeeds.

That's an ugly, error-prone way of doing that sort of thing. You should get rid of that conditional entirely by factoring out the code that does stuff with the file into a function. And then do this:
void doStuff(FILE *fp);

…

if (strcmp(filename, "-") == 0) {
  doStuff(stdin);
} else {
  FILE *fp = fopen(filename, "r");
  if (fp != NULL) {
    doStuff(fp);
    fclose(fp));
  } else {
    /* handle error */
  }
}
(error handling for fclose(3) omitted for brevity)

I'm pretty sure -fanalyzer wouldn't produce a false positive if you write it this way.

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 5:12 UTC (Fri) by pabs (subscriber, #43278) [Link]

Will this feature come with something like LLVM's scan-build? Or will folks have to add -fanalyzer to the right *FLAGS variables and build manually?

Improvements to GCC's -fanalyzer option

Posted Sep 24, 2021 12:34 UTC (Fri) by dskoll (subscriber, #1630) [Link]

Great article. I wasn't aware of -fanalyzer. I ran it on my Mailmunge project with gcc 10.2.1 and it found one issue... a code path that could result in snprintf being called from a signal handler. It was a pretty easy fix.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds