Dispatches from the compiler front
The up-and-coming LLVM compiler has been an irritation to some GCC developers for some time; LLVM apparently comes off as an upstart trying to muscle into territory which GCC has owned for a long time. So it's not surprising that occasionally the relationship between the two projects gets a little frosty.
Consider the case of DragonEgg, a GCC plugin which replaces the bulk of GCC's optimization and code-generation system with the LLVM implementation. DragonEgg is clearly a useful tool for LLVM developers, who can focus on improving the backend code while making use of GCC's well-developed front ends. Jack Howarth recently proposed the addition of DragonEgg as an official part of the GCC code base. Some developers welcomed the idea; Basile Starynkevitch, for example, thought it would make a good plugin example. But from others came complaints like this:
It's not clear that this is a majority opinion; some GCC developers see DragonEgg as an easy way to try out LLVM code and compare it against their own. If LLVM comes out on top, GCC developers can then figure out why or, possibly, just adopt the relevant LLVM code. Those developers see only benefit in some cooperative competition between the projects.
Others, though, see the situation as more of a zero-sum game; when viewed through that lens, cooperation with LLVM would appear to make little sense. But free software is not a zero-sum game; the more we can learn from each other, the better off we all are. GCC need not worry about being displaced by LLVM (or anything else) any time in the near future. Barring technical issues with the merging of DragonEgg (and none have been mentioned), accepting the code seems like it should be ultimately beneficial to the project.
In a side discussion, GCC developers wondered why LLVM seems to be more successful in attracting developers and mindshare in general. One suggestion was that LLVM has a clear leader who is able to set the direction of the project, while GCC is more scattered. Others have a different view; in this context, Ian Lance Taylor's notes are worth a look:
There is also the matter of the old code base, the lack of a clean separation between passes, and, most important, weak internal documentation.
Some of these issues are being fixed; others will take longer. It seems clear that attending to these problems is important for the long-term future of the project.
Lest things look too grim, though, it's worth perusing this posting from Taras Glek on his success with the GCC "profile-guided optimization" (PGO) feature. PGO works by building an instrumented binary, running it to collect profile data, then rebuilding the program with optimization driven by that profile information. With Firefox, Taras was able to cut the startup time by one third and to reduce initial memory use considerably as well. Taras says:
There's no shortage of interesting, development-oriented tools being integrated into GCC, and the addition of the plugin architecture can only result in an acceleration of this process. Things have reached a point where more projects should probably be looking into the use of these tools to improve the experience for their users.
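The profile-guided optimization cycle mentioned above can be sketched as three commands. The following is only a minimal illustration: the helper function and the file names are hypothetical, and only `-O2`, `-fprofile-generate`, and `-fprofile-use` are actual GCC options. It is not a description of how Firefox is built.

```python
# Hypothetical helper: builds the three command lines of a GCC PGO cycle.
# Only -O2, -fprofile-generate and -fprofile-use are real GCC options here;
# the function and the file names are made up for illustration.

def pgo_commands(source, exe, training_args=()):
    """Return the instrument / train / rebuild commands as argument lists."""
    return [
        # 1. Build an instrumented binary that records execution counts.
        ["gcc", "-O2", "-fprofile-generate", source, "-o", exe],
        # 2. Run it on a representative workload so profile data gets written.
        ["./" + exe, *training_args],
        # 3. Rebuild, letting the optimizer consult the recorded profile.
        ["gcc", "-O2", "-fprofile-use", source, "-o", exe],
    ]


if __name__ == "__main__":
    for command in pgo_commands("app.c", "app", ["--benchmark"]):
        print(" ".join(command))
```

Each list could be handed to `subprocess.run` in order. The quality of the result depends on how representative the training run in step 2 is of real use.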
Meanwhile, on the LLVM side, the developers have recently unveiled the LLVM MC project. "MC" stands for "machine code" in this context; in short, the LLVM developers are trying to integrate the assembler directly into the compiler. There are a number of reasons for doing this, including performance (formatting text for a separate assembler and running that assembler are expensive operations), portability (not all target systems have an assembler out of the box), and the ability to easily add support for new processor instructions. Much of this functionality is required anyway for LLVM's just-in-time compiler features, so it makes sense to just finish the job.
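To make the difference concrete, here is a toy sketch of the two routes for a two-instruction x86 program. The tuple-based instruction format and both function names are assumptions made for this illustration; none of this reflects the actual LLVM MC interfaces.

```python
import struct

# Toy instruction set: ("mov_eax", imm32) and ("ret",). The tuple format and
# the function names are illustrative assumptions, not LLVM MC's API.


def emit_text(program):
    """Traditional route: format assembly text for a separate assembler."""
    lines = []
    for instr in program:
        if instr[0] == "mov_eax":
            lines.append("\tmovl\t$%d, %%eax" % instr[1])
        elif instr[0] == "ret":
            lines.append("\tret")
        else:
            raise ValueError("unknown instruction: %r" % (instr,))
    return "\n".join(lines) + "\n"


def encode(program):
    """Integrated route: write x86 machine-code bytes directly."""
    code = bytearray()
    for instr in program:
        if instr[0] == "mov_eax":
            # Opcode B8 followed by a 32-bit little-endian immediate.
            code += b"\xb8" + struct.pack("<i", instr[1])
        elif instr[0] == "ret":
            # Opcode C3: near return.
            code += b"\xc3"
        else:
            raise ValueError("unknown instruction: %r" % (instr,))
    return bytes(code)


if __name__ == "__main__":
    program = [("mov_eax", 42), ("ret",)]
    print(emit_text(program), end="")
    print(encode(program).hex())
```

In the first route the text still has to be parsed and encoded by another program; in the second, the compiler's own instruction tables produce the bytes, and a JIT can reuse the same encoder.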
This work appears to be fairly well advanced, with much of the basic functionality in place. Chris Lattner says:
In summary: there is currently a lot going on in the area of development toolchains. Given that all of us, including those who do no development, depend on those toolchains, this can only be a good thing. Computers can do a lot to make the task of programming them easier and more robust; despite the occasional glitch, developers for both GCC and LLVM appear to be working hard to realize that potential.
Posted Apr 15, 2010 1:52 UTC (Thu)
by ncm (guest, #165)
[Link] (2 responses)
The claim that formatting text for a separate assembler and running that assembler are expensive operations is frequently asserted, but every time it's measured, assembly turns out to cost practically nothing. There might be other valid reasons to skip the assembly stage, but that ain't one of them.
This misperception is an excellent example of how poor even very smart people are at guessing where a computer spends its time, and what will help it do better. Most programs, these days, spend most of their time stalled waiting for cache lines to be copied from main memory. A fast program is one that gets more done between stalls. Keeping useful lines from being kicked out of your caches is among the most productive ways to speed up a program, these days. Another is to get another processor involved in the job. Piping to an assembler in a different process does a bit of both.
Posted Apr 15, 2010 5:45 UTC (Thu)
by sbishop (guest, #33061)
[Link]
It doesn't look like they guessed in this case.
Posted Apr 15, 2010 6:04 UTC (Thu)
by corbet (editor, #1)
[Link]
Posted Apr 15, 2010 7:24 UTC (Thu)
by jdv (guest, #712)
[Link] (10 responses)
Posted Apr 15, 2010 13:49 UTC (Thu)
by rriggs (guest, #11598)
[Link]
Posted Apr 15, 2010 15:36 UTC (Thu)
by ejr (subscriber, #51652)
[Link] (7 responses)
Posted Apr 15, 2010 18:31 UTC (Thu)
by Trelane (subscriber, #56877)
[Link] (4 responses)
Posted Apr 15, 2010 18:33 UTC (Thu)
by Trelane (subscriber, #56877)
[Link] (2 responses)
Posted Apr 15, 2010 18:51 UTC (Thu)
by Trelane (subscriber, #56877)
[Link]
Posted Apr 27, 2010 17:27 UTC (Tue)
by vonbrand (subscriber, #4458)
[Link]
Obviously no. You can't grab code belonging to somebody else and slap your own license on it. What you can do is to include some code into a larger work under another license if the licenses are compatible (i.e., some BSD code into a GPLed whole).
Posted Apr 16, 2010 8:13 UTC (Fri)
by baldrick (subscriber, #4123)
[Link]
Posted Apr 16, 2010 4:44 UTC (Fri)
by magnus (subscriber, #34778)
[Link] (1 responses)
Without the GPL enforcing it, I think many CPU manufacturers would have made their own (possibly binary only) GCC forks instead of contributing back.
Posted Apr 19, 2010 0:10 UTC (Mon)
by elanthis (guest, #6227)
[Link]
It's much the same story as the Linux kernel. Being GPL only guarantees that some kind of non-binary and arguably human-readable code representation of a modification exists. It does not guarantee that those code representations are actually worth crap to anyone in the larger community.
The argument also fails to note MANY examples of BSD and MIT licensed software that has thriving involvement from the proprietary sectors.
Until the GPL states, "all modifications must be accepted by and committed into the original authors' tree before released as part of a product, unless he explicitly states he does not want the modifications due to lack of interest in the nature of the modifications made (and not solely due to correctable implementation flaws)" the GPL is really quite ineffective at enforcing any kind of community involvement or useful code contributions on the part of a company. It's really no harder to be a poor sport with the GPL than it is with the BSD license.
Posted Apr 22, 2010 10:57 UTC (Thu)
by steven97 (guest, #2702)
[Link]
Posted Apr 15, 2010 16:17 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
GCC's source code is a maze of twisty little passages, all alike. Its build system is an abomination from the deepest circles of Hell.
Also, pure C for compilers is just a dumb idea. They even had to write a garbage collector because it's impossible to write anything complex in C reliably.
In comparison, LLVM development is just a breeze. First, I can use CMake to generate a Visual Studio project (I still haven't figured out how to build GCC using MSVC). Then I can easily write front ends with LLVM. A simple but non-trivial code generator can be written in about one day.
Posted Apr 15, 2010 16:53 UTC (Thu)
by emk (subscriber, #1128)
[Link] (1 responses)
LLVM's decision to use a strongly-typed assembly language with just a few opcodes is pure genius. It has a bunch of neat consequences:
1) The retention of basic type information throughout the entire compilation process makes it much easier to answer all sorts of questions, and to verify that the generated intermediate code makes some sort of sense.
2) The decision to use an assembly language (as opposed to an AST or a more exotic representation) makes it easy to dump the output from any optimization stage and examine it by hand.
3) The decision to use a _single_ assembly language (instead of the huge number of intermediate languages which seem to be used by GCC) makes it a lot easier for novices to find their way around the code base, and it means that you can build up large libraries of helper functions.
4) The decision to use a _small_ assembly language means that any given optimizer only needs to know about a small, fixed set of instructions.
Of course, a single intermediate representation isn't sufficient for every possible optimization. So LLVM can optionally annotate the typed assembly with further information, and individual passes can specify whether or not they (a) need a given set of annotations, and (b) preserve a given set of annotations if they exist.
LLVM is a really sweet compiler, and there's some friendly and super-productive hackers working on it. I think it has a great future, with or without GCC.
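Points 1 and 4 above can be illustrated with a small sketch. The three-address format, the opcode set, and the `verify` function below are invented for this example; they are not LLVM's instruction set or its verifier, only a demonstration of how retained types and a small opcode set keep such checks short.

```python
# A made-up three-address IR: every value carries a type, and there are only
# a handful of opcodes. This is not LLVM's instruction set or its verifier.
BINARY_OPS = {"add", "sub", "mul"}


def verify(instructions):
    """Return a list of error strings; an empty list means well-typed code.

    Each instruction is a tuple (dest, opcode, type, operands).
    """
    types = {}   # value name -> declared type
    errors = []
    for dest, opcode, ty, operands in instructions:
        if opcode == "const":
            pass  # a constant simply introduces a value of the stated type
        elif opcode in BINARY_OPS:
            if len(operands) != 2:
                errors.append("%s: %s needs two operands" % (dest, opcode))
            for name in operands:
                if name not in types:
                    errors.append("%s: use of undefined value %s" % (dest, name))
                elif types[name] != ty:
                    errors.append("%s: %s has type %s, expected %s"
                                  % (dest, name, types[name], ty))
        else:
            errors.append("%s: unknown opcode %s" % (dest, opcode))
        types[dest] = ty
    return errors


if __name__ == "__main__":
    program = [
        ("a", "const", "i32", [1]),
        ("x", "const", "f64", [2.5]),
        ("c", "add", "i32", ["a", "x"]),   # mixes i32 and f64
    ]
    for error in verify(program):
        print(error)
```

Because the checker only has to know about one constant form and three binary operators, it can be run after every transformation to catch a pass that produces ill-typed code.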
Posted Apr 17, 2010 0:32 UTC (Sat)
by daglwn (guest, #65432)
[Link]
> LLVM is about the nicest compiler I've ever hacked on.
> LLVM's decision to use a strongly-typed assembly language with just a few opcodes is pure genius. It has a bunch of neat consequences:
It does, but to be fair, LLVM wasn't the first compiler to provide this.
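> 1) The retention of basic type information throughout the entire compilation process makes it much easier to answer all sorts of questions, and to verify that the generated intermediate code makes some sort of sense.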
Yes. LLVM's Verifier pass helps a ton. It's saved us many times.
> 2) The decision to use an assembly language (as opposed to an AST or a more exotic representation) makes it easy to dump the output from any optimization stage and examine it by hand.
Debatable. I've worked on compilers that use very high-level IRs and it was usually easier to understand what the compiler was doing. One could grasp larger programs much more easily. There are tradeoffs. LLVM uses a low-level representation because the community wants to expose all kinds of machine-level micro-optimizations. IME, the debugging tools surrounding the compiler are as important as, if not more important than, the IR itself for fixing bugs.
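> 3) The decision to use a _single_ assembly language (instead of the huge number of intermediate languages which seem to be used by GCC) makes it a lot easier for novices to find their way around the code base, and it means that you can build up large libraries of helper functions.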
This statement simply isn't true. LLVM does not have a single IR. It has at least five now: the Instruction IR, the ScalarEvolution IR, the SelectionDAG/ScheduleDAG IR, the MachineInstr IR and the MCInst IR (from the MC project).
This isn't necessarily a bad thing. Different representations allow easier manipulations for certain phases. One can't represent machine instructions with the higher-level Instruction IR. There is some duplication, however. SCEV passes in particular duplicate a lot of logic other passes that use the LLVM IR already have.
> 4) The decision to use a _small_ assembly language means that any given optimizer only needs to know about a small, fixed set of instructions.
Again, there are tradeoffs. One is that to do anything machine-specific requires intrinsics and the optimizer doesn't understand those. There are certainly instructions I would like to see added to the IR (a robust vector representation, for example) but it's not critical right now. Instructions have been added over the course of the project. I predict we will see quite a few new ones over the next several years.
> LLVM is a really sweet compiler, and there's some friendly and super-productive hackers working on it. I think it has a great future, with or without GCC.
100% agreed. Not only is LLVM used in lots of projects, it's been able to spark a renewed interest in compiler technology among students. This is going to be critical as we keep packing more cores onto chips. The era of "free" speedup via higher clocks is over. The compiler is more important than ever.
Posted Apr 15, 2010 17:09 UTC (Thu)
by da4089 (subscriber, #1195)
[Link] (5 responses)
Posted Apr 15, 2010 20:10 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
However, it has a kernel of truth. You need to write quite complex algorithms in compilers which have to traverse trees, annotating its nodes with complex structures, etc. It all looks quite clumsy in C (C++ in LLVM is much better).
In comparison, the Linux kernel doesn't really have that kind of complex algorithm. The closest thing in complexity is the scheduler, which still is a frequent source of problems in Linux.
Personally I prefer OCaml for that kind of thing. Pattern matching is the killer feature for compiler writers :)
For example: http://llvm.org/docs/tutorial/index.html
Posted Apr 18, 2010 20:08 UTC (Sun)
by eparis123 (guest, #59739)
[Link] (1 responses)
> The closest thing in complexity is the scheduler, which still is a frequent source of problems in Linux.
I like your argument, and I understand it for algorithm-centric userspace code like compilers. But, sorry, blaming trouble in the Linux scheduler on the fact that "it's written in C" is pure, unsupported hallucination.
Posted Apr 19, 2010 6:47 UTC (Mon)
by nix (subscriber, #2304)
[Link]
Posted Apr 19, 2010 15:03 UTC (Mon)
by daglwn (guest, #65432)
[Link] (1 responses)
Well. The folks at Edison Design Group would disagree with you. Their C++ frontend (a non-trivial project by anyone's definition) is all pure C. It is organized beautifully, commented well and doesn't resort to the "object-oriented C" tricks that end up making C code complex and obfuscated. The printed documentation is the best I've ever seen for any piece of any compiler.
I'm a C++ nut, no question. But I appreciate good design in any language and the Edison frontend comes about as close to perfect for a C project as I've ever seen. It works within the spirit of the language and presents a very clean API.
Posted Apr 27, 2010 20:47 UTC (Tue)
by wahern (subscriber, #37304)
[Link]
Good design means good encapsulation**. Object-oriented _syntax_ is most useful when at the outset you're unsure how to encapsulate the data and segregate the logic--both strongly related. In that case, you use a set of generic patterns which will, presumably, get you close to the mark until the solution makes itself clear. Once it's clear you probably won't bother rewriting it.
But where something is heavily centered on an abstract algorithm, how to encapsulate the data and segregate the logic is usually self-evident. In this case object-oriented syntax doesn't buy you much, if anything. Data structures don't need to be generic and protected with getters and setters. You don't need dynamic methods. One could even argue that C is preferable, being a much simpler language; things like inheritance become more obfuscation than explication.
So, let's look at LLVM: not only is it a compiler with an easily discernible purpose with suggestive algorithms and data structures, but it's centered on an even larger, more comprehensive meta-abstraction: manipulation and transformation of bytecodes. This suggests that the choice of C++ has far less to do with the code quality than other choices.
Add to the mix the fact that GCC has been gutted and rebuilt several times over, and that it's 15+ years older than LLVM, and I don't think it's reasonable to draw any conclusions whatsoever about C v. C++ in this comparison.
** "object-oriented tricks" sounds a little ambiguous. Any well-encapsulated design is bound to be "object-oriented", at least if juggling lots of data and performing lots of complex transformations internally. Contrast that with the well-encapsulated design of Unix shell utilities. You wouldn't say grep or sed is object-oriented internally, nor would you if you put them together. But that sort of segregation of work doesn't suit compiler design, so inevitably a well-written compiler will be object-oriented, no matter the language. The issue is whether C++ makes it easier to accomplish object-orientation. And I'll argue it does so only to the degree the overall design is indeterminate. This is similar to choosing a scripting language over a compiled one. Scripting languages are more attractive the more ill-defined the problem.
Assembler considered cheap
They've measured the assembly phase at about 20% of the total. It's not clear to me how much of that they can avoid with the MC scheme, but their understanding of the costs is not based on just guessing.
I just see a plugin trying to piggy-back on the hard work of GCC front-end developers and negating the efforts of those working on the middle ends and back ends.
What's wrong with piggy-backing on the work of others? It's free software. One of the reasons software is free is so that others can change it and build on it without hassles -- in other words, piggy-backing on the hard work put into the original.
LLVM uses the University of Illinois/NCSA Open Source License, which is GPL compatible.
Yeah, LLVM is sweet
Agreed.