OLS: GCC: present and future
According to Diego, GCC has been at a bit of a turning point over the last couple of years. On one hand, the software is popular and ubiquitous. On the other, it is a pile of 2.2 million lines of code, initially developed by "people who didn't know about compilers" (that comment clearly intended as a joke), and showing all of its 15 years of age. The code is difficult to maintain, and even harder to push forward. Compiler technology has moved forward in many ways, and GCC is sometimes having a hard time keeping up.
The architecture of GCC has often required developers to make changes throughout the pipeline. But the complexity of the code is such that nobody is really able to understand the entire pipeline. There are simply too many different tasks being performed. Recent architectural improvements are changing that situation, however, providing better isolation between the various pipeline stages.
GCC has a steering committee for dealing with "political stuff." There is, at any given time, one release manager whose job is to get the next release together; it is, says Diego, a thankless job. Then, there is a whole set of maintainers who are empowered to make changes all over the tree. The project is trying to get away from having maintainers with global commit privileges, however. Since building a good mental model of the entire compiler is essentially impossible, it is better to keep maintainers within their areas of expertise.
The (idealized) development model works in three stages. The first two months are for major changes and the addition of major new features. Then, over the next two months, things tighten up, with the focus shifting to stabilization and the occasional addition of small features. Finally, in the last two months, only bug fixes are allowed. This is, Diego says, "where everybody disappears" and the release manager is forced to chase down developers and nag them into fixing bugs. Much of the work in this stage is driven by companies with an interest in the release.
In the end, this ideal six-month schedule tends to not work out quite so well in reality. But, says Diego, the project is able to get "one good release" out every year.
GCC development works out of a central Subversion repository with many development branches. Anybody wishing to contribute to GCC must assign copyright to the Free Software Foundation.
The compiler pipeline looks something like this:
- Language-specific front ends are charged with parsing the input
source and turning it into an internal language called "Generic."
The Generic language is able to represent programs written in any
language supported by GCC.
- A two-stage process turns Generic into another language called
Gimple. As part of this process, the program is simplified in a
number of ways. All statements are rewritten to get to a point where
there are no side effects; each statement performs, at most, one
assignment. Quite a few temporary variables are introduced to bring
this change about. Eventually, by the time the compiler has
transformed the program into "low Gimple," all control structures have
been reduced to if tests and gotos.
- At this point, the various SSA ("static single assignment") optimizers
kick in. There are, according to Diego, about 100 passes made over
the program at this point. The flow of data through the program is
analyzed and used to perform loop optimizations, some vectorization
tasks, constant propagation, etc. Much more information on SSA can be
found in this LWN article
from 2004.
- After all this work is done, the result is a form of the program
expressed in "register transfer language" or RTL. RTL was originally
the only internal language used by GCC; over time, the code which uses
RTL is shrinking, while the work done at the SSA level is growing.
The RTL representation is used to do things like instruction
pipelining, common subexpression elimination, and no end of
machine-specific tasks.
- The final output from gcc is an assembly language program, which can then be fed to the assembler.
The effect of recasting GCC into the above form is a compiler which is more modular and easier to work with.
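The Gimple and SSA steps in the list above can be illustrated with a toy sketch. This is not GCC code: Python's `ast` module stands in for a front end, and the names `lower`, `to_ssa`, and `show`, along with the `t1`-style temporaries and `x_1`-style version suffixes, are assumptions made up for this example. It handles only straight-line assignments, so the merge-point machinery that real SSA needs for branches and loops does not appear.

```python
import ast
import itertools

# Binary operators the toy lowering understands.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def lower(source):
    """Flatten assignments so each statement performs at most one operation.

    Returns (dest, left, op, right) tuples; op and right are None for copies.
    Temporaries are named t1, t2, ... (hypothetical naming, and a real
    compiler would make sure they cannot clash with user variables).
    """
    counter = itertools.count(1)
    out = []

    def operand(node):
        # Names and numeric constants are already simple operands.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return repr(node.value)
        # A nested operation is computed into a fresh temporary first.
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left, right = operand(node.left), operand(node.right)
            temp = f"t{next(counter)}"
            out.append((temp, left, OPS[type(node.op)], right))
            return temp
        raise ValueError("unsupported expression in this sketch")

    for stmt in ast.parse(source).body:
        if not (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            raise ValueError("only simple 'name = expression' is supported")
        dest, value = stmt.targets[0].id, stmt.value
        if isinstance(value, ast.BinOp) and type(value.op) in OPS:
            left, right = operand(value.left), operand(value.right)
            out.append((dest, left, OPS[type(value.op)], right))
        else:
            out.append((dest, operand(value), None, None))
    return out


def to_ssa(stmts):
    """Rename variables so that every name is assigned exactly once."""
    version = {}

    def use(name):
        # Constants (and the None placeholders) pass through unchanged.
        if name is None or not name.isidentifier():
            return name
        # Version 0 marks a value that was live on entry.
        return f"{name}_{version.get(name, 0)}"

    result = []
    for dest, left, op, right in stmts:
        left, right = use(left), use(right)  # read the old versions first
        version[dest] = version.get(dest, 0) + 1
        result.append((f"{dest}_{version[dest]}", left, op, right))
    return result


def show(stmts):
    """Render statement tuples as readable text."""
    return [f"{d} = {l}" if op is None else f"{d} = {l} {op} {r}"
            for d, l, op, r in stmts]


stmts = lower("x = a + b * (c - d)\nx = x * 2")
print("\n".join(show(stmts)))
print("\n".join(show(to_ssa(stmts))))
```

The first listing shows the nested expression broken into single-operation statements through temporaries; the second shows the two assignments to `x` becoming distinct names, which is what lets the optimizers follow the flow of data without tracking reassignment.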
Future plans were touched on briefly. There is currently a great deal of interest in static analysis tools. The GCC folks would like to support that work, but they do not want to weigh down the compiler with a large pile of static analysis tools. So they will likely implement a set of hooks which allow third party tools to get the information they need from the compiler. Inevitably, it was asked what sort of license those tools would need to have to be able to use the GCC hooks; evidently no answer to that question exists yet, however.
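To make the hook idea concrete, here is a purely illustrative sketch of how an external tool might observe a compiler's intermediate results without living inside the compiler. Nothing here reflects an actual GCC interface, which according to the talk had not yet been designed; the `Pipeline` class, `add_hook`, and the two toy passes are all hypothetical names invented for this example.

```python
from collections import defaultdict


class Pipeline:
    """A toy pass pipeline with observer hooks (hypothetical design)."""

    def __init__(self, passes):
        self.passes = passes            # ordered (name, function) pairs
        self.hooks = defaultdict(list)  # pass name -> observer callbacks

    def add_hook(self, pass_name, callback):
        """Ask to be called after the named pass has run."""
        if pass_name not in dict(self.passes):
            raise KeyError(f"unknown pass: {pass_name}")
        self.hooks[pass_name].append(callback)

    def run(self, program):
        for name, transform in self.passes:
            program = transform(program)
            for callback in self.hooks[name]:
                # Observers get an immutable snapshot, so a tool can
                # inspect the program but cannot change the compilation.
                callback(name, tuple(program))
        return program


def drop_nops(program):
    """Toy pass: remove 'nop' statements."""
    return [stmt for stmt in program if stmt != "nop"]


def drop_self_copies(program):
    """Toy pass: remove statements of the form 'x = x'."""
    kept = []
    for stmt in program:
        dest, sep, value = stmt.partition(" = ")
        if sep and dest == value:
            continue
        kept.append(stmt)
    return kept


# A stand-in "static analysis tool" that only records statement counts.
seen = []


def record(pass_name, snapshot):
    seen.append((pass_name, len(snapshot)))


pipeline = Pipeline([("drop_nops", drop_nops),
                     ("drop_self_copies", drop_self_copies)])
pipeline.add_hook("drop_nops", record)
pipeline.add_hook("drop_self_copies", record)
result = pipeline.run(["a = 1", "nop", "a = a", "b = a"])
print(result, seen)
```

In a design like this the analysis code stays entirely outside the pipeline, which matches the stated goal of not weighing down the compiler itself.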
Another area of interest is link-time optimization and the ability to deal with multiple program units as a whole. There is also work happening on dynamic compilation - compiling to byte codes which are then translated to machine code by a just-in-time compiler at run time. Much more information on current GCC development can be found on the GCC wiki.
This session was highly informative. Unfortunately, its positioning on
the schedule (in the first Saturday morning slot, when many of those who
participated in the previous evening's whiskey tasting event were notably
absent) may have reduced attendance somewhat. This was, however, a talk
worth getting up for.
Index entries for this article | |
---|---|
Conference | Linux Symposium/2006 |
GCC evolution
Posted Jul 26, 2006 4:57 UTC (Wed) by Kluge (subscriber, #2881)
No mention of things like the use of LLVM as a backend, I take it?

re: LLVM
Posted Jul 27, 2006 9:18 UTC (Thu) by dank (guest, #1865)
I wasn't there, but for those who are wondering about the status of LLVM and GCC, the most recent posts I know of are
http://gcc.gnu.org/ml/gcc/2006-03/msg00706.html (as of March, copyright assignment papers hadn't yet been signed)
http://gcc.gnu.org/ml/gcc-patches/2006-06/msg00153.html (as of June, the rival LTO proposal is moving ahead)
Can somebody who was at the lto/llvm discussion at the gcc summit comment?
And for those who are wondering what LLVM and LTO are, see the presentations at http://www.gelato.org/community/gelato_meeting.php?id=ICE...

re: LLVM
Posted Jul 27, 2006 14:51 UTC (Thu) by Kluge (subscriber, #2881)
Thanks. I hadn't heard of the LTO proposal.

OLS: GCC: present and future
Posted Jul 27, 2006 18:10 UTC (Thu) by smoogen (subscriber, #97)
I wonder how hard it would be to just use Generic directly to code programs. Probably would just be an exercise in insanity.

OLS: GCC: present and future
Posted Jul 31, 2006 12:06 UTC (Mon) by dnovillo (guest, #36710)
We have discussed the notion of reading the various ILs and being able to start compilation from an arbitrary stage in the pipeline. This would get us closer to implementing some degree of unit testing.
For instance, someone finds a bug in dead code elimination that only happens when the IL shows a specific stream of instructions with a specific combination of memory references.
Instead of trying to recreate that IL pattern out of source code, we would be able to create it and feed it to DCE directly. That would eliminate random changes in the first N - 1 passes that may paper over the original bug.

GCC: Nobody has the "big" picture
Posted Aug 3, 2006 8:29 UTC (Thu) by forthy (guest, #1525)
Thanks for highlighting that nobody understands the big picture of GCC. When looking at GCC's sources and comments from the developers, I already had that impression. The main problem here is: Whatever phase you look at, GCC is doing it at the wrong time. E.g. it combines instructions before it reorders control flow (so that combinations which only become possible after the control flow reorder can't happen). The mapping to actual instructions happens way too early, anyway.
Or: for years, we have been struggling with -fno-reorder-blocks not eliminating cross-jumps (a severe pessimization for implementing VMs). The bug gets fixed, gets closed, and then pops up again. Meanwhile, it has been marked as a regression several times, and taken out of regression again, only to pop up again after a short while.
So the "joking" comment that it was written by people who don't know anything about compilers is not completely wrong. Since GCC is so big and so full of legacy, it's probably better to write a new compiler suite from scratch. Maybe using an implementation language that is better suited to the problem than C (GCC's coding style is C-trying-to-emulate-a-badly-written-Lisp-system ATM. Lisp would have been the right choice for this particular architecture, but that's because RMS is a Lisp guy). Or at least use an abstraction layer that's appropriate for the implementation language (if it must be C).