OLS: GCC: present and future
[Posted July 24, 2006 by corbet]
The GNU Compiler Collection (GCC) is a
fundamental part of our free operating system. Licenses may make the
software free, but it's GCC which lets us turn that software into something
our computers can run. GCC's strengths and weaknesses will, thus,
influence the quality of a Linux system in a big way. GCC is, however, an
opaque tool for many Linux users - and for many developers as well.
It is a black box, full of compiler magic, which, one hopes, just works.
For those interested in looking a little more deeply into GCC, however,
Diego Novillo's OLS talk was a welcome introduction.
According to Diego, GCC has been at a bit of a turning point over the last
couple of years. On one hand, the software is popular and ubiquitous. On
the other, it is a pile of 2.2 million lines of code, initially
developed by "people who didn't know about compilers" (that comment clearly
intended as a joke), and showing all of
its 15 years of age. The code is difficult to maintain, and even harder to
push forward. Compiler technology has moved forward in many ways, and GCC
is sometimes having a hard time keeping up.
The architecture of GCC has often required developers to make changes
throughout the pipeline. But the complexity of the code is such that
nobody is really able to understand the entire pipeline. There are simply
too many different tasks being performed. Recent architectural
improvements are changing that situation, however, providing better
isolation between the various pipeline stages.
GCC has a steering committee for dealing with "political stuff." There is,
at any given time, one release manager whose job is to get the next release
together; it is, says Diego, a thankless job. Then, there is a whole set
of maintainers who are empowered to make changes all over the tree. The
project is trying to get away from having maintainers with global commit
privileges, however. Since building a good mental model of the entire
compiler is essentially impossible, it is better to keep maintainers within
their areas of expertise.
The (idealized) development model works in three stages. The first two
months are for major changes and the addition of major new features. Then,
over the next two months, things tighten down and focus on stabilization
and the occasional addition of small features. Finally, in the last two
months, only bug fixes are allowed. This is, Diego says, "where everybody
disappears" and the release manager is forced to chase down developers and
nag them into fixing bugs. Much of the work in this stage is driven by
companies with an interest in the release.
In the end, this ideal six-month schedule tends to not work out quite so
well in reality. But, says Diego, the project is able to get "one good
release" out every year.
GCC development works out of a central Subversion repository with many
development branches. Anybody wishing to contribute to GCC must assign
copyrights to the Free Software Foundation.
The compiler pipeline looks something like this:
- Language-specific front ends are charged with parsing the input
source and turning it into an internal language called "Generic."
The Generic language is able to represent programs written in any
language supported by GCC.
- A two-stage process turns Generic into another language called
Gimple. As part of this process, the program is simplified in a
number of ways. All statements are rewritten to get to a point where
there are no side effects; each statement performs, at most, one
assignment. Quite a few temporary variables are introduced to bring
this change about. Eventually, by the time the compiler has
transformed the program into "low Gimple," all control structures have
been reduced to if tests and gotos.
- At this point, the various SSA ("static single assignment") optimizers
kick in. There are, according to Diego, about 100 passes made over
the program at this point. The flow of data through the program is
analyzed and used to perform loop optimizations, some vectorization
tasks, constant propagation, etc. Much more information on SSA can be
found in this LWN article from 2004.
- After all this work is done, the result is a form of the program
expressed in "register transfer language" or RTL. RTL was originally
the only internal language used by GCC; over time, the code which uses
RTL is shrinking, while the work done at the SSA level is growing.
The RTL representation is used to do things like instruction
pipelining, common subexpression elimination, and no end of
machine-specific tasks.
- The final output from gcc is an assembly language program, which can
then be fed to the assembler.
The effect of recasting GCC into the above form is a compiler which is more
modular and easier to work with.
Future plans were touched on briefly. There is currently a great deal of
interest in static analysis tools. The GCC folks would like to support
that work, but they do not want to weigh down the compiler with a large
pile of static analysis tools. So they will likely implement a set of
hooks which allow third-party tools to get the information they need from
the compiler. Inevitably, it was asked what sort of license those tools
would need to have to be able to use the GCC hooks; evidently no answer to
that question exists yet, however.
Another area of interest is link-time optimization and the ability to deal
with multiple program units as a whole. There is also work happening on
dynamic compilation - compiling to byte codes which are then translated to
native code by a just-in-time compiler at run time. Much more information on current GCC
development can be found on the GCC wiki.
This session was highly informative. Unfortunately, its positioning on
the schedule (in the first Saturday morning slot, when many of those who
participated in the previous evening's whiskey tasting event were notably
absent) may have reduced attendance somewhat. This was, however, a talk
worth getting up for.