
Dispatches from the compiler front

By Jonathan Corbet
April 14, 2010
Your editor has recently noticed a string of interesting announcements and discussions in the GCC and LLVM compiler communities. Here is an attempt to pull together a look at a few of these discussions, including resistance to cooperation between the two projects, building an assembler into LLVM, and more.

The up-and-coming LLVM compiler has been an irritation to some GCC developers for some time; LLVM apparently comes off as an upstart trying to muscle into territory which GCC has owned for a long time. So it's not surprising that occasionally the relationship between the two projects gets a little frosty.

Consider the case of DragonEgg, a GCC plugin which replaces the bulk of GCC's optimization and code-generation system with the LLVM implementation. DragonEgg is clearly a useful tool for LLVM developers, who can focus on improving the backend code while making use of GCC's well-developed front ends. Jack Howarth recently proposed the addition of DragonEgg as an official part of the GCC code base. Some developers welcomed the idea; Basile Starynkevitch, for example, thought it would make a good plugin example. But from others came complaints like this:

So, no offense, but the suggestion here is to make this subversive (for FSF GCC) plugin part of FSF GCC? What is the benefit of this for GCC? I don't see any. I just see a plugin trying to piggy-back on the hard work of GCC front-end developers and negating the efforts of those working on the middle ends and back ends.

It's not clear that this is a majority opinion; some GCC developers see DragonEgg as an easy way to try out LLVM code and compare it against their own. If LLVM comes out on top, GCC developers can then figure out why or, possibly, just adopt the relevant LLVM code. Those developers see only benefit in some cooperative competition between the projects.

Others, though, see the situation as more of a zero-sum game; when viewed through that lens, cooperation with LLVM would appear to make little sense. But free software is not a zero-sum game; the more we can learn from each other, the better off we all are. GCC need not worry about being displaced by LLVM (or anything else) any time in the near future. Barring technical issues with the merging of DragonEgg (and none have been mentioned), accepting the code seems like it should be ultimately beneficial to the project.

In a side discussion, GCC developers wondered why LLVM seems to be more successful in attracting developers and mindshare in general. One suggestion was that LLVM has a clear leader who is able to set the direction of the project, while GCC is more scattered. Others have a different view; in this context, Ian Lance Taylor's notes are worth a look:

What I do see is that relatively few gcc developers take the time to reach out to new people and help them become part of the community. I also see a lot of external patches not reviewed, and I see a lot of back-and-forth about patches which is simply confusing and offputting to those trying to contribute. Joining the gcc community requires a lot of self-motivation, or it takes being paid enough to get over the obstacles.

There is also the matter of the old code base, the lack of a clean separation between passes, and, most important, weak internal documentation.

Some of these issues are being fixed; others will take longer. It seems clear that attending to these problems is important for the long-term future of the project.

Lest things look too grim, though, it's worth perusing this posting from Taras Glek on his success with the GCC "profile-guided optimization" (PGO) feature. PGO works by building an instrumented binary, running it to gather profile data, then rebuilding the program with optimization driven by that profile. With Firefox, Taras was able to cut the startup time by one third and to reduce initial memory use considerably as well. Taras says:

I think the numbers speak for themselves. Isn't it scary how wasteful binaries are by default? It amazes me that Firefox can shrug off a significant amount of resource bloat without changing a single line of code.

There's no shortage of interesting, development-oriented tools being integrated into GCC, and the addition of the plugin architecture can only result in an acceleration of this process. Things have reached a point where more projects should probably be looking into the use of these tools to improve the experience for their users.

Meanwhile, on the LLVM side, the developers have recently unveiled the LLVM MC project. "MC" stands for "machine code" in this context; in short, the LLVM developers are trying to integrate the assembler directly into the compiler. There are a number of reasons for doing this, including performance (formatting text for a separate assembler and running that assembler are expensive operations), portability (not all target systems have an assembler out of the box), and the ability to easily add support for new processor instructions. Much of this functionality is required anyway for LLVM's just-in-time compiler features, so it makes sense to just finish the job.

This work appears to be fairly well advanced, with much of the basic functionality in place. Chris Lattner says:

If you're interested in this level of the tool chain, this is a great area to get involved in, because there are lots of small to mid-sized projects just waiting to be tackled. I believe that the long term impact of this work is huge: it allows building new interesting CPU-level tools and it means that we can add new instructions to one .td file instead of having to add them to the compiler, the assembler, the disassembler, etc.
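The ".td file" Lattner mentions is a TableGen target description. A hypothetical record might look like the sketch below; the class name MyInst and the operand details are invented for illustration, and real target files (X86InstrInfo.td, for instance) are considerably more involved:

```
// Hypothetical TableGen sketch: one record describes an instruction's
// operands, assembly syntax, and instruction-selection pattern, and the
// assembler, disassembler, and codegen tables are all generated from it.
def ADDrr : MyInst<(outs GR32:$dst), (ins GR32:$a, GR32:$b),
                   "add $dst, $a, $b",
                   [(set GR32:$dst, (add GR32:$a, GR32:$b))]>;
```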

In summary: there is currently a lot going on in the area of development toolchains. Given that all of us - including those who do no development - depend on those toolchains, this can only be a good thing. Computers can do a lot to make the task of programming them easier and more robust; despite the occasional glitch, developers for both GCC and LLVM appear to be working hard to realize that potential.



Assembler considered cheap

Posted Apr 15, 2010 1:52 UTC (Thu) by ncm (guest, #165) [Link] (2 responses)

formatting text for a separate assembler and running that assembler are expensive operations

That gets asserted frequently, but every time it's measured, assembly turns out to cost practically nothing. There might be other valid reasons to skip the assembly stage, but that ain't one of them.

This misperception is an excellent example of how poor even very smart people are at guessing where a computer spends its time, and what will help it do better. Most programs, these days, spend most of their time stalled waiting for cache lines to be copied from main memory. A fast program is one that gets more done between stalls. Keeping useful lines from being kicked out of your caches is among the most productive ways to speed up a program, these days. Another is to get another processor involved in the job. Piping to an assembler in a different process does a bit of both.

Assembler considered cheap

Posted Apr 15, 2010 5:45 UTC (Thu) by sbishop (guest, #33061) [Link]

It doesn't look like they guessed in this case.

Assembler considered cheap

Posted Apr 15, 2010 6:04 UTC (Thu) by corbet (editor, #1) [Link]

They've measured the assembly phase at about 20% of the total. It's not clear to me how much of that they can avoid with the MC scheme, but their understanding of the costs is not based on just guessing.

Dispatches from the compiler front

Posted Apr 15, 2010 7:24 UTC (Thu) by jdv (guest, #712) [Link] (10 responses)

I just see a plugin trying to piggy-back on the hard work of GCC front-end developers and negating the efforts of those working on the middle ends and back ends.

What's wrong with piggy-backing on the work of others? It's free software. One of the reasons software is free is so that others can change it and build on it without hassles -- in other words, piggy-backing on the hard work put into the original.

Dispatches from the compiler front

Posted Apr 15, 2010 13:49 UTC (Thu) by rriggs (guest, #11598) [Link]

I completely agree. In the words of Isaac Newton, "If I have seen further it is only by standing on the shoulders of giants." That, to me, is the Tao of open source. The LLVM people seem to have their hearts in the right place.

Dispatches from the compiler front

Posted Apr 15, 2010 15:36 UTC (Thu) by ejr (subscriber, #51652) [Link] (7 responses)

Some notable companies are jumping on LLVM explicitly to avoid the GNU GPL. Apple's move to emphasize LLVM pulled developers off of gcc and echoed problems from long ago. So some people see an LLVM plug-in as *exactly* the reason why plug-ins were banned for so long.

Dispatches from the compiler front

Posted Apr 15, 2010 18:31 UTC (Thu) by Trelane (subscriber, #56877) [Link] (4 responses)

Is the LLVM license GPL-compatible?

Dispatches from the compiler front

Posted Apr 15, 2010 18:33 UTC (Thu) by Trelane (subscriber, #56877) [Link] (2 responses)

(particularly, could you grab the code and re-license it under the GPL)

Dispatches from the compiler front

Posted Apr 15, 2010 18:51 UTC (Thu) by Trelane (subscriber, #56877) [Link]

Interesting reading from the SFLC on the topic: http://www.softwarefreedom.org/resources/2007/gpl-non-gpl...

Dispatches from the compiler front

Posted Apr 27, 2010 17:27 UTC (Tue) by vonbrand (subscriber, #4458) [Link]

Obviously no. You can't grab code belonging to somebody else and slap your own license on it. What you can do is include some code into a larger work under another license if the licenses are compatible (e.g., some BSD code into a GPLed whole).

Dispatches from the compiler front

Posted Apr 16, 2010 8:13 UTC (Fri) by baldrick (subscriber, #4123) [Link]

LLVM uses the University of Illinois/NCSA Open Source License, which is GPL compatible.

Dispatches from the compiler front

Posted Apr 16, 2010 4:44 UTC (Fri) by magnus (subscriber, #34778) [Link] (1 responses)

It will be interesting to see if LLVM can hold together as one project and support as many architectures as GCC, or if it will splinter into proprietary forks.

Without the GPL enforcing it, I think many CPU manufacturers would have made their own (possibly binary only) GCC forks instead of contributing back.

Dispatches from the compiler front

Posted Apr 19, 2010 0:10 UTC (Mon) by elanthis (guest, #6227) [Link]

There are many variations of GCC that have not been contributed back. The code is available for others to pull in, sure, but (a) the community doesn't care about the forks and so has no desire to maintain them, and (b) those forks are hairy and gross and otherwise not something you're likely to want in the first place.

It's much the same story as the Linux kernel. Being GPL only guarantees that some kind of non-binary, arguably human-readable code representation of a modification exists. It does not guarantee that those code representations are actually worth crap to anyone in the larger community.

The argument also fails to note MANY examples of BSD and MIT licensed software that has thriving involvement from the proprietary sectors.

Until the GPL states, "all modifications must be accepted by and committed into the original authors' tree before being released as part of a product, unless the author explicitly states that he does not want the modifications due to lack of interest in their nature (and not solely due to correctable implementation flaws)," the GPL is really quite ineffective at enforcing any kind of community involvement or useful code contributions on the part of a company. It's really no harder to be a poor sport with the GPL than it is with the BSD license.

Dispatches from the compiler front

Posted Apr 22, 2010 10:57 UTC (Thu) by steven97 (guest, #2702) [Link]

There is nothing wrong per se with piggy-backing. What is wrong here is the hypocrisy of it all. On the one hand, certain LLVM developers leave no opportunity unused to trash-talk GCC, and try to lure developers away from GCC on GCC mailing lists. On the other hand, those same developers have no problem taking the GCC front ends and expecting the GCC community to cooperate. It's IMVHO just opportunism of the worst kind.

Dispatches from the compiler front

Posted Apr 15, 2010 16:17 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

Another reason: GCC is a mess.

Its source code is a maze of twisty little passages, all alike. Its build system is an abomination from the deepest circles of Hell.

Also, pure C for compilers is just a dumb idea. They even had to write a garbage collector because it's impossible to write anything complex in C reliably.

In comparison, LLVM development is just a breeze. First, I can use CMake to generate a Visual Studio project (I still haven't figured out how to build GCC using MSVC). Then I can easily write front ends with LLVM. A simple but non-trivial code generator can be written in about a day.

Yeah, LLVM is sweet

Posted Apr 15, 2010 16:53 UTC (Thu) by emk (subscriber, #1128) [Link] (1 responses)

Having poked around in GCC a bit, and having contributed some minor patches to LLVM, I have to agree: GCC is a seriously hairy codebase, and LLVM is about the nicest compiler I've ever hacked on.

LLVM's decision to use a strongly-typed assembly language with just a few opcodes is pure genius. It has a bunch of neat consequences:

1) The retention of basic type information throughout the entire compilation process makes it much easier to answer all sorts of questions, and to verify that the generated intermediate code makes some sort of sense.

2) The decision to use an assembly language (as opposed to an AST or a more exotic representation) makes it easy to dump the output from any optimization stage and examine it by hand.

3) The decision to use a _single_ assembly language (instead of the huge number of intermediate languages which seem to be used by GCC) makes it a lot easier for novices to find their way around the code base, and it means that you can build up large libraries of helper functions.

4) The decision to use a _small_ assembly language means that any given optimizer only needs to know about a small, fixed set of instructions.

Of course, a single intermediate representation isn't sufficient for every possible optimization. So LLVM can optionally annotate the typed assembly with further information, and individual passes can specify whether or not they (a) need a given set of annotations, and (b) preserve a given set of annotations if they exist.
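The points above are easy to see in a tiny fragment of LLVM's textual IR (a sketch; the exact textual syntax has shifted somewhat across LLVM versions): every operand carries an explicit type, and the whole function needs only a handful of opcodes.

```llvm
define i32 @sum_sq(i32 %a, i32 %b) {
entry:
  %a2 = mul i32 %a, %a        ; every operand is explicitly typed
  %b2 = mul i32 %b, %b
  %r  = add i32 %a2, %b2
  ret i32 %r                  ; return type is checked against i32
}
```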

LLVM is a really sweet compiler, and there's some friendly and super-productive hackers working on it. I think it has a great future, with or without GCC.

Yeah, LLVM is sweet

Posted Apr 17, 2010 0:32 UTC (Sat) by daglwn (guest, #65432) [Link]

I hack on LLVM for my job and on my own time. Just a few comments.

> LLVM is about the nicest compiler I've ever hacked on.

Agreed.

> LLVM's decision to use a strongly-typed assembly language with just a few
> opcodes is pure genius. It has a bunch of neat consequences:

It does, but to be fair, LLVM wasn't the first compiler to provide this.

> 1) The retention of basic type information throughout the entire
> compilation process makes it much easier to answer all sorts of
> questions, and to verify that the generated intermediate code makes some
> sort of sense.

Yes. LLVM's Verifier pass helps a ton. It's saved us many times.

> 2) The decision to use an assembly language (as opposed to an AST or a
> more exotic representation) makes it easy to dump the output from any
> optimization stage and examine it by hand.

Debatable. I've worked on compilers that use very high-level IRs, and it was usually easier to understand what the compiler was doing; one could grasp larger programs much more easily. There are tradeoffs. LLVM uses a low-level representation because the community wants to expose all kinds of machine-level micro-optimizations. IME, the debugging tools surrounding the compiler are as important as, if not more important than, the IR itself for fixing bugs.

> 3) The decision to use a _single_ assembly language (instead of the huge
> number of intermediate languages which seem to be used by GCC) makes it a
> lot easier for novices to find their way around the code base, and it
> means that you can build up large libraries of helper functions.

This statement simply isn't true. LLVM does not have a single IR. It has at least five now: the Instruction IR, the ScalarEvolution IR, the SelectionDAG/ScheduleDAG IR, the MachineInstr IR and the MCInst IR (from the MC project).

This isn't necessarily a bad thing. Different representations allow easier manipulations for certain phases; one can't represent machine instructions with the higher-level Instruction IR. There is some duplication, however. SCEV passes in particular duplicate a lot of logic that passes using the LLVM IR already have.

> 4) The decision to use a _small_ assembly language means that any given
> optimizer only needs to know about a small, fixed set of instructions.

Again, there are tradeoffs. One is that to do anything machine-specific requires intrinsics and the optimizer doesn't understand those. There are certainly instructions I would like to see added to the IR (a robust vector representation, for example) but it's not critical right now. Instructions have been added over the course of the project. I predict we will see quite a few new ones over the next several years.

> LLVM is a really sweet compiler, and there's some friendly and super-
> productive hackers working on it. I think it has a great future, with or
> without GCC.

100% agreed. Not only is LLVM used in lots of projects, it's been able to spark a renewed interest in compiler technology among students. This is going to be critical as we keep packing more cores onto chips. The era of "free" speedup via higher clocks is over. The compiler is more important than ever.

Dispatches from the compiler front

Posted Apr 15, 2010 17:09 UTC (Thu) by da4089 (subscriber, #1195) [Link] (5 responses)

Doesn't the Linux kernel disprove your assertion that "it's impossible to write anything complex in C reliably"?

Dispatches from the compiler front

Posted Apr 15, 2010 20:10 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

That was an exaggeration on my part, certainly.

However, it has a kernel of truth. Compilers require quite complex algorithms which have to traverse trees, annotate their nodes with complex structures, etc. It all looks quite clumsy in C (the C++ in LLVM is much better).

In comparison, the Linux kernel doesn't really have that kind of complex algorithm. The closest thing in complexity is the scheduler, which is still a frequent source of problems in Linux.

Personally, I prefer OCaml for that kind of thing. Pattern matching is the killer feature for compiler writers :)

For example: http://llvm.org/docs/tutorial/index.html
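The appeal of pattern matching for compiler writers is easy to sketch. The commenter's choice is OCaml; the same ML-style algebraic data types and match expressions are shown here in Rust, with a toy expression AST (all names invented for illustration):

```rust
// A tiny expression AST and a recursive evaluator: each node shape
// gets exactly one match arm, and the compiler checks exhaustiveness.
#[derive(Debug)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Structural recursion over the tree; no visitor boilerplate needed.
fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Add(a, b) => eval(a) + eval(b),
        Expr::Mul(a, b) => eval(a) * eval(b),
    }
}

fn main() {
    // (2 + 3) * 4
    let e = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)))),
        Box::new(Expr::Num(4)),
    );
    println!("{}", eval(&e)); // prints 20
}
```

Adding a new node type makes every non-exhaustive match a compile error, which is exactly the property that makes this style attractive for compiler passes.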

Dispatches from the compiler front

Posted Apr 18, 2010 20:08 UTC (Sun) by eparis123 (guest, #59739) [Link] (1 responses)

The closest thing in complexity is the scheduler, which still is a frequent source of problems in Linux.

I like your argument, and I understand it for algorithm-centric userspace code like compilers. But, sorry, citing trouble in the Linux scheduler "'cause it's written in C" is pure unsupported hallucination.

Dispatches from the compiler front

Posted Apr 19, 2010 6:47 UTC (Mon) by nix (subscriber, #2304) [Link]

Quite. The scheduler keeps changing because it has a nearly impossible job to do (one which could be done perfectly only in the presence of perfect knowledge of the future).

Dispatches from the compiler front

Posted Apr 19, 2010 15:03 UTC (Mon) by daglwn (guest, #65432) [Link] (1 responses)

> However, it has a kernel of truth. You need to write quite complex
> algorithms in compilers which have to traverse trees, annotating its
> nodes with complex structures, etc. It all looks quite clumsy in C (C++
> in LLVM is much better).

Well. The folks at Edison Design Group would disagree with you. Their C++ frontend (a non-trivial project by anyone's definition) is all pure C. It is organized beautifully, commented well and doesn't resort to the "object-oriented C" tricks that end up making C code complex and obfuscated. The printed documentation is the best I've ever seen for any piece of any compiler.

I'm a C++ nut, no question. But I appreciate good design in any language, and the Edison frontend comes about as close to perfect for a C project as I've ever seen. It works within the spirit of the language and presents a very clean API.

Dispatches from the compiler front

Posted Apr 27, 2010 20:47 UTC (Tue) by wahern (subscriber, #37304) [Link]

I'm not surprised.

Good design means good encapsulation**. Object-oriented _syntax_ is most useful when, at the outset, you're unsure how to encapsulate the data and segregate the logic--both strongly related. In that case, you use a set of generic patterns which will, presumably, get you close to the mark until the solution makes itself clear. Once it's clear, you probably won't bother rewriting it.

But where something is heavily centered on an abstract algorithm, how to encapsulate the data and segregate the logic is usually self-evident. In this case object-oriented syntax doesn't buy you much, if anything. Data structures don't need to be generic and protected with getters and setters. You don't need dynamic methods. One could even argue that C is preferable, being a much simpler language; things like inheritance become more obfuscation than explication.

So, let's look at LLVM: not only is it a compiler with an easily discernible purpose and suggestive algorithms and data structures, but it's centered on an even larger, more comprehensive meta-abstraction: the manipulation and transformation of bytecode. This suggests that the choice of C++ has far less to do with the code quality than other choices do.

Add to the mix the fact that GCC has been gutted and rebuilt several times over, and that it's 15+ years older than LLVM, and I don't think it's reasonable to draw any conclusions whatsoever about C vs. C++ from this comparison.

** "object-oriented tricks" sounds a little ambiguous. Any well-encapsulated design is bound to be "object-oriented", at least if it juggles lots of data and performs lots of complex transformations internally. Contrast that with the well-encapsulated design of unix shell utilities. You wouldn't say grep or sed is object-oriented internally; nor if you put them together. But that sort of segregation of work doesn't suit compiler design, so inevitably a well-written compiler will be object-oriented, no matter the language. The issue is whether C++ makes it easier to accomplish object-orientation. And I'd argue it does only to the degree that the overall design is indeterminate. This is similar to choosing a scripting language over a compiled one. Scripting languages are more attractive the more ill-defined the problem.


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds