
OLS: GCC: present and future

The GNU Compiler Collection (GCC) is a fundamental part of our free operating system. Licenses may make the software free, but it's GCC which lets us turn that software into something our computers can run. GCC's strengths and weaknesses will, thus, influence the quality of a Linux system in a big way. GCC is, however, an opaque tool for many Linux users - and for many developers as well. It is a black box, full of compiler magic, which, one hopes, just works. For those interested in looking a little more deeply into GCC, however, Diego Novillo's OLS talk was a welcome introduction.

According to Diego, GCC has been at a bit of a turning point over the last couple of years. On one hand, the software is popular and ubiquitous. On the other, it is a pile of 2.2 million lines of code, initially developed by "people who didn't know about compilers" (a comment clearly intended as a joke), and showing all of its 15 years of age. The code is difficult to maintain, and even harder to push forward. Compiler technology has moved on in many ways, and GCC sometimes has a hard time keeping up.

The architecture of GCC has often required developers to make changes throughout the pipeline. But the complexity of the code is such that nobody is really able to understand the entire pipeline. There are simply too many different tasks being performed. Recent architectural improvements are changing that situation, however, providing better isolation between the various pipeline stages.

GCC has a steering committee for dealing with "political stuff." There is, at any given time, one release manager whose job is to get the next release together; it is, says Diego, a thankless job. Then, there is a whole set of maintainers who are empowered to make changes all over the tree. The project is trying to get away from having maintainers with global commit privileges, however. Since building a good mental model of the entire compiler is essentially impossible, it is better to keep maintainers within their areas of expertise.

The (idealized) development model works in three stages. The first two months are for major changes and the addition of significant new features. Then, over the next two months, things tighten up, with the focus on stabilization and the occasional addition of small features. Finally, in the last two months, only bug fixes are allowed. This is, Diego says, "where everybody disappears," and the release manager is forced to chase down developers and nag them into fixing bugs. Much of the work in this stage is driven by companies with an interest in the release.

In the end, this ideal six-month schedule tends not to work out quite so well in reality. But, says Diego, the project is able to get "one good release" out every year.

GCC development works out of a central Subversion repository with many development branches. Anybody wishing to contribute to GCC must assign copyright to the Free Software Foundation.

The compiler pipeline looks something like this:

  1. Language-specific front ends are charged with parsing the input source and turning it into an internal language called "Generic." The Generic language is able to represent programs written in any language supported by GCC.

  2. A two-stage process turns Generic into another language called Gimple. As part of this process, the program is simplified in a number of ways. Statements are rewritten so that expressions have no side effects and each statement performs, at most, one assignment; quite a few temporary variables are introduced to bring this change about. Eventually, by the time the compiler has transformed the program into "low Gimple," all control structures have been reduced to if tests and gotos (a sketch of this lowering follows the list).

  3. At this point, the various SSA ("static single assignment") optimizers kick in. There are, according to Diego, about 100 passes made over the program at this stage. The flow of data through the program is analyzed and used to perform loop optimizations, some vectorization tasks, constant propagation, and so on (a constant-propagation sketch also follows the list). Much more information on SSA can be found in this LWN article from 2004.

  4. After all this work is done, the result is a form of the program expressed in "register transfer language" or RTL. RTL was originally the only internal language used by GCC; over time, the code which uses RTL is shrinking, while the work done at the SSA level is growing. The RTL representation is used to do things like instruction pipelining, common subexpression elimination, and no end of machine-specific tasks.

  5. The final output from gcc is an assembly language program, which can then be fed to the assembler.
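
To make step 2 concrete, here is the promised sketch of the lowering. The low Gimple shown is an approximation, not actual compiler output: the temporary name D.1283 and the labels are invented for illustration, and the exact dump format varies between GCC releases. On a recent GCC, the real thing can be inspected by compiling with the -fdump-tree-gimple option.

    /* Source: a simple counting loop with a compound expression. */
    int count_down(int n)
    {
        int total = 0;
        while (n > 0) {
            total += n * n;
            n--;
        }
        return total;
    }

    /* An approximation of the low Gimple form.  The while loop is
       gone, reduced to if tests and gotos, and the compound
       expression has been split so that each statement performs at
       most one assignment, using the compiler-generated temporary
       D.1283 (all names here are invented): */
    count_down (n)
    {
        int total;
        int D.1283;

        total = 0;
        goto test;
      body:
        D.1283 = n * n;
        total = total + D.1283;
        n = n - 1;
      test:
        if (n > 0) goto body; else goto done;
      done:
        return total;
    }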
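
Step 3 is also easier to picture with an example. The SSA form below is illustrative rather than real dump output - the variable versions and PHI notation follow standard SSA convention, not any particular GCC release - but it shows why constant propagation becomes nearly mechanical once every variable has exactly one assignment. On a real build, the final result can be compared against the source with something like "gcc -O2 -fdump-tree-optimized".

    /* Source: everything here can be computed at compile time. */
    int f(void)
    {
        int x = 4;
        int y;
        if (x > 3)
            y = x * 2;
        else
            y = 0;
        return y;
    }

    /* Illustrative SSA form: each variable is assigned exactly once,
       so every use traces back to a single definition.

         x_1 = 4;
         if (x_1 > 3) goto L1; else goto L2;
       L1: y_1 = x_1 * 2; goto L3;
       L2: y_2 = 0;
       L3: y_3 = PHI <y_1, y_2>;
           return y_3;

       Constant propagation substitutes 4 for x_1, the condition folds
       to true, dead code elimination removes the L2 arm, and the
       whole function collapses to "return 8;". */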

The effect of recasting GCC into the above form is a compiler which is more modular and easier to work with.

Future plans were touched on briefly. There is currently a great deal of interest in static analysis tools; the GCC developers would like to support that work, but they do not want to weigh down the compiler itself with a large pile of analysis code. So they will likely implement a set of hooks which allow third-party tools to get the information they need from the compiler. Inevitably, it was asked what sort of license those tools would need to have to be able to use the GCC hooks; evidently, no answer to that question exists yet.

Another area of interest is link-time optimization and the ability to deal with multiple program units as a whole. There is also work happening on dynamic compilation - compiling to byte codes which are then interpreted by a just-in-time compiler at run time. Much more information on current GCC development can be found on the GCC wiki.

This session was highly informative. Unfortunately, its positioning on the schedule (in the first Saturday morning slot, when many of those who participated in the previous evening's whiskey tasting event were notably absent) may have reduced attendance somewhat. This was, however, a talk worth getting up for.

Index entries for this article
Conference: Linux Symposium/2006



GCC evolution

Posted Jul 26, 2006 4:57 UTC (Wed) by Kluge (subscriber, #2881)

No mention of things like the use of LLVM as a backend, I take it?

re: LLVM

Posted Jul 27, 2006 9:18 UTC (Thu) by dank (guest, #1865)

I wasn't there, but for those who are wondering about the status of LLVM and GCC, the most recent posts I know of are http://gcc.gnu.org/ml/gcc/2006-03/msg00706.html (as of March, copyright assignment papers hadn't yet been signed) and http://gcc.gnu.org/ml/gcc-patches/2006-06/msg00153.html (as of June, the rival LTO proposal is moving ahead).

Can somebody who was at the lto/llvm discussion at the gcc summit comment?

And for those who are wondering what LLVM and LTO are, see the presentations at http://www.gelato.org/community/gelato_meeting.php?id=ICE...

re: LLVM

Posted Jul 27, 2006 14:51 UTC (Thu) by Kluge (subscriber, #2881)

Thanks. I hadn't heard of the LTO proposal.

OLS: GCC: present and future

Posted Jul 27, 2006 18:10 UTC (Thu) by smoogen (subscriber, #97)

I wonder how hard it would be to just use Generic directly to code programs. Probably would just be an exercise in insanity.

OLS: GCC: present and future

Posted Jul 31, 2006 12:06 UTC (Mon) by dnovillo (guest, #36710)

We have discussed the notion of reading the various ILs and being able to start compilation from an arbitrary stage in the pipeline. This would get us closer to implementing some degree of unit testing.

For instance, suppose someone finds a bug in dead code elimination that only happens when the IL contains a specific stream of instructions with a specific combination of memory references.

Instead of trying to recreate that IL pattern out of source code, we would be able to create it and feed it to DCE directly. That would eliminate random changes in the first N - 1 passes that may paper over the original bug.

GCC: Nobody has the "big" picture

Posted Aug 3, 2006 8:29 UTC (Thu) by forthy (guest, #1525)

Thanks for highlighting that nobody understands the big picture of GCC. Looking at GCC's sources and at comments from the developers, I had already gotten that impression. The main problem is that, whatever phase you look at, GCC is doing it at the wrong time. For example, it combines instructions before it reorders control flow, so combinations that only become possible after the reordering can never happen. The mapping to actual instructions happens far too early, anyway.

Or: for years we have been struggling with -fno-reorder-blocks not eliminating cross-jumps (a severe pessimization for implementing VMs). The bug gets fixed, gets closed, and then pops up again. Meanwhile, it has been marked as a regression several times, and taken off the regression list again, only to pop up once more after a short while.

So the "joking" comment that it was written by people who didn't know anything about compilers is not completely wrong. Since GCC is so big and so full of legacy, it's probably better to write a new compiler suite from scratch. Maybe using an implementation language that is better suited to the problem than C (GCC's coding style is currently C-trying-to-emulate-a-badly-written-Lisp-system; Lisp would have been the right choice for this particular architecture, but that's because RMS is a Lisp guy). Or at least use an abstraction layer that's appropriate for the implementation language (if it must be C).


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds