Dispatches from the compiler front

Posted Apr 15, 2010 16:17 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
Parent article: Dispatches from the compiler front

Another reason: GCC is a mess.

Its source code is a maze of twisty little passages, all alike. Its build system is an abomination from the deepest circles of Hell.

Also, pure C for compilers is just a dumb idea. They even had to write a garbage collector because it's impossible to write anything complex in C reliably.

In comparison, LLVM development is just a breeze. First, I can use CMake to generate a Visual Studio project (I still haven't figured out how to build GCC using MSVC). Then I can easily write front-ends with LLVM. A simple but non-trivial code generator can be written in about one day.

Yeah, LLVM is sweet

Posted Apr 15, 2010 16:53 UTC (Thu) by emk (subscriber, #1128) [Link] (1 responses)

Having poked around in GCC a bit, and having contributed some minor patches to LLVM, I have to agree: GCC is a seriously hairy codebase, and LLVM is about the nicest compiler I've ever hacked on.

LLVM's decision to use a strongly-typed assembly language with just a few opcodes is pure genius. It has a bunch of neat consequences:

1) The retention of basic type information throughout the entire compilation process makes it much easier to answer all sorts of questions, and to verify that the generated intermediate code makes some sort of sense.

2) The decision to use an assembly language (as opposed to an AST or a more exotic representation) makes it easy to dump the output from any optimization stage and examine it by hand.

3) The decision to use a _single_ assembly language (instead of the huge number of intermediate languages which seem to be used by GCC) makes it a lot easier for novices to find their way around the code base, and it means that you can build up large libraries of helper functions.

4) The decision to use a _small_ assembly language means that any given optimizer only needs to know about a small, fixed set of instructions.

Of course, a single intermediate representation isn't sufficient for every possible optimization. So LLVM can optionally annotate the typed assembly with further information, and individual passes can specify whether or not they (a) need a given set of annotations, and (b) preserve a given set of annotations if they exist.
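The required/preserved bookkeeping described above can be sketched in a few lines. This is only a toy model under stated assumptions: the `Pass` struct, the `schedule` function, and the annotation name used in the usage note are all hypothetical and are not LLVM's actual API. It shows the idea that an annotation is computed when a pass needs it and is thrown away as soon as a pass runs that does not promise to preserve it.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pass description; none of these names come from LLVM.
struct Pass {
    std::string name;
    std::set<std::string> required;   // annotations that must be valid before the pass runs
    std::set<std::string> preserved;  // annotations that are still valid after it runs
};

// Walks the passes in order and returns a log of the work performed.
// Missing annotations are computed on demand; annotations a pass does
// not list as preserved are invalidated after that pass.
std::vector<std::string> schedule(const std::vector<Pass>& passes) {
    std::set<std::string> valid;  // annotations currently up to date
    std::vector<std::string> log;
    for (const Pass& p : passes) {
        assert(!p.name.empty() && "every pass needs a name for the log");
        for (const std::string& a : p.required) {
            if (valid.insert(a).second)  // true only if it was not valid yet
                log.push_back("compute " + a);
        }
        log.push_back("run " + p.name);
        std::set<std::string> kept;
        for (const std::string& a : valid) {
            if (p.preserved.count(a))
                kept.insert(a);
        }
        valid = std::move(kept);
    }
    return log;
}
```

With three made-up passes, where "hoist" requires and preserves "loop-info", "cleanup" preserves nothing, and "unroll" requires "loop-info" again, the annotation gets computed twice: once before "hoist" and once more before "unroll", because "cleanup" invalidated it in between.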

LLVM is a really sweet compiler, and there's some friendly and super-productive hackers working on it. I think it has a great future, with or without GCC.

Yeah, LLVM is sweet

Posted Apr 17, 2010 0:32 UTC (Sat) by daglwn (guest, #65432) [Link]

I hack on LLVM for my job and on my own time. Just a few comments.

> LLVM is about the nicest compiler I've ever hacked on.

Agreed.

> LLVM's decision to use a strongly-typed assembly language with just a few
> opcodes is pure genius. It has a bunch of neat consequences:

It does, but to be fair, LLVM wasn't the first compiler to provide this.

> 1) The retention of basic type information throughout the entire
> compilation process makes it much easier to answer all sorts of
> questions, and to verify that the generated intermediate code makes some
> sort of sense.

Yes. LLVM's Verifier pass helps a ton. It's saved us many times.
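To illustrate why retained type information makes this kind of checking cheap, here is a minimal sketch of a verifier over a made-up typed IR. Everything in it (`Type`, `Opcode`, `Instr`, `verify`, `verifyOrDie`) is a hypothetical stand-in and is far simpler than LLVM's real Verifier; it only checks that operands are defined earlier and that their types match the result type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical miniature IR: every value carries its type.
enum class Type { I32, F64 };
enum class Opcode { Const, Add };

struct Instr {
    Opcode op;
    Type type;                  // type of the value this instruction produces
    std::vector<int> operands;  // indices of earlier instructions
};

// Returns an empty string if the body is well formed, otherwise a
// description of the first problem found.
std::string verify(const std::vector<Instr>& body) {
    for (std::size_t i = 0; i < body.size(); ++i) {
        const Instr& in = body[i];
        const std::string where = "instr " + std::to_string(i) + ": ";
        if (in.op == Opcode::Const) {
            if (!in.operands.empty())
                return where + "const takes no operands";
            continue;
        }
        // Opcode::Add: two operands, both defined earlier, both of the result type.
        if (in.operands.size() != 2)
            return where + "add needs two operands";
        for (int idx : in.operands) {
            if (idx < 0 || static_cast<std::size_t>(idx) >= i)
                return where + "operand not defined earlier";
            if (body[idx].type != in.type)
                return where + "operand type mismatch";
        }
    }
    return "";
}

// Debug-build helper that could be called between optimization passes.
void verifyOrDie(const std::vector<Instr>& body) {
    assert(verify(body).empty() && "IR failed verification");
}
```

A buggy transformation that turned one operand of an i32 add into an f64 value would be caught immediately after the offending pass rather than much later in code generation.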

> 2) The decision to use an assembly language (as opposed to an AST or a
> more exotic representation) makes it easy to dump the output from any
> optimization stage and examine it by hand.

Debatable. I've worked on compilers that use very high-level IRs, and it was usually easier to understand what the compiler was doing. One could grasp larger programs much more easily. There are tradeoffs. LLVM uses a low-level representation because the community wants to expose all kinds of machine-level micro-optimizations. IME, the debugging tools surrounding the compiler are as important as the IR itself, if not more so, for fixing bugs.

> 3) The decision to use a _single_ assembly language (instead of the huge
> number of intermediate languages which seem to be used by GCC) makes it a
> lot easier for novices to find their way around the code base, and it
> means that you can build up large libraries of helper functions.

This statement simply isn't true. LLVM does not have a single IR. It has at least five now: the Instruction IR, the ScalarEvolution IR, the SelectionDAG/ScheduleDAG IR, the MachineInstr IR and the MCInst IR (from the MC project).

This isn't necessarily a bad thing. Different representations allow easier manipulations for certain phases. One can't represent machine instructions with the higher-level Instruction IR. There is some duplication, however. SCEV passes in particular duplicate a lot of logic that other passes using the LLVM IR already have.

> 4) The decision to use a _small_ assembly language means that any given
> optimizer only needs to know about a small, fixed set of instructions.

Again, there are tradeoffs. One is that doing anything machine-specific requires intrinsics, and the optimizer doesn't understand those. There are certainly instructions I would like to see added to the IR (a robust vector representation, for example) but it's not critical right now. Instructions have been added over the course of the project. I predict we will see quite a few new ones over the next several years.

> LLVM is a really sweet compiler, and there's some friendly and super-
> productive hackers working on it. I think it has a great future, with or
> without GCC.

100% agreed. Not only is LLVM used in lots of projects, it's been able to spark a renewed interest in compiler technology among students. This is going to be critical as we keep packing more cores onto chips. The era of "free" speedup via higher clocks is over. The compiler is more important than ever.

Dispatches from the compiler front

Posted Apr 15, 2010 17:09 UTC (Thu) by da4089 (subscriber, #1195) [Link] (5 responses)

Doesn't the Linux kernel disprove your assertion that "it's impossible to write anything complex in C reliably"?

Dispatches from the compiler front

Posted Apr 15, 2010 20:10 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

That was an exaggeration on my part, certainly.

However, it has a kernel of truth. You need to write quite complex algorithms in compilers which have to traverse trees, annotating its nodes with complex structures, etc. It all looks quite clumsy in C (C++ in LLVM is much better).

In comparison, the Linux kernel doesn't really have that kind of complex algorithm. The closest thing in complexity is the scheduler, which still is a frequent source of problems in Linux.

Personally I prefer OCaml for that kind of thing. Pattern matching is the killer feature for compiler writers :)

For example: http://llvm.org/docs/tutorial/index.html
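For a concrete picture of the "traverse trees, annotating nodes" work mentioned above, here is a small C++ sketch. It is a hypothetical toy (the `Expr` type and the `num`, `var`, `add`, and `annotate` helpers are invented for illustration and come from no real compiler): a post-order walk that marks each node of an expression tree with its constant value when one is known.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical expression tree; all names here are illustrative.
struct Expr {
    enum class Kind { Num, Var, Add } kind;
    int value = 0;                   // payload for Num nodes
    std::unique_ptr<Expr> lhs, rhs;  // children for Add nodes
    // Annotation filled in by annotate().
    bool isConstant = false;
    int constant = 0;

    explicit Expr(Kind k) : kind(k) {}
};
using ExprPtr = std::unique_ptr<Expr>;

ExprPtr num(int v) {
    auto e = std::make_unique<Expr>(Expr::Kind::Num);
    e->value = v;
    return e;
}

ExprPtr var() { return std::make_unique<Expr>(Expr::Kind::Var); }

ExprPtr add(ExprPtr l, ExprPtr r) {
    auto e = std::make_unique<Expr>(Expr::Kind::Add);
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Post-order traversal: annotate the children first, then derive this
// node's annotation from theirs.
void annotate(Expr& e) {
    switch (e.kind) {
    case Expr::Kind::Num:
        e.isConstant = true;
        e.constant = e.value;
        break;
    case Expr::Kind::Var:
        e.isConstant = false;
        break;
    case Expr::Kind::Add:
        assert(e.lhs && e.rhs && "Add nodes must have two children");
        annotate(*e.lhs);
        annotate(*e.rhs);
        e.isConstant = e.lhs->isConstant && e.rhs->isConstant;
        if (e.isConstant)
            e.constant = e.lhs->constant + e.rhs->constant;
        break;
    }
}
```

The switch on `kind` is the part that a language with pattern matching would express more directly; in plain C the same code would also need manual allocation and cleanup for the children.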

Dispatches from the compiler front

Posted Apr 18, 2010 20:08 UTC (Sun) by eparis123 (guest, #59739) [Link] (1 responses)

> The closest thing in complexity is the scheduler, which still is a frequent source of problems in Linux.

I like your argument, and I understand it for algorithm-centric userspace code like compilers. But, sorry, citing trouble in the Linux scheduler "'cause it's written in C" is pure unsupported hallucination.

Dispatches from the compiler front

Posted Apr 19, 2010 6:47 UTC (Mon) by nix (subscriber, #2304) [Link]

Quite. The scheduler keeps changing because it has a nearly impossible job to do (one which could be done perfectly only in the presence of perfect knowledge of the future).

Dispatches from the compiler front

Posted Apr 19, 2010 15:03 UTC (Mon) by daglwn (guest, #65432) [Link] (1 responses)

> However, it has a kernel of truth. You need to write quite complex
> algorithms in compilers which have to traverse trees, annotating its
> nodes with complex structures, etc. It all looks quite clumsy in C (C++
> in LLVM is much better).

Well. The folks at Edison Design Group would disagree with you. Their C++ frontend (a non-trivial project by anyone's definition) is all pure C. It is organized beautifully, commented well and doesn't resort to the "object-oriented C" tricks that end up making C code complex and obfuscated. The printed documentation is the best I've ever seen for any piece of any compiler.

I'm a C++ nut, no question. But I appreciate good design in any language, and the Edison frontend comes about as close to perfect for a C project as I've ever seen. It works within the spirit of the language and presents a very clean API.

Dispatches from the compiler front

Posted Apr 27, 2010 20:47 UTC (Tue) by wahern (subscriber, #37304) [Link]

I'm not surprised.

Good design means good encapsulation**. Object-oriented _syntax_ is most useful when at the outset you're unsure how to encapsulate the data and segregate the logic--both strongly related. In that case, you use a set of generic patterns which will, presumably, get you close to the mark until the solution makes itself clear. Once it's clear you probably won't bother rewriting it.

But where something is heavily centered on an abstract algorithm, how to encapsulate the data and segregate the logic is usually self-evident. In this case object-oriented syntax doesn't buy you much, if anything. Data structures don't need to be generic and protected with getters and setters. You don't need dynamic methods. One could even argue that C is preferable, being a much simpler language; things like inheritance become more obfuscation than explication.

So, let's look at LLVM: not only is it a compiler with an easily discernible purpose with suggestive algorithms and data structures, but it's centered on an even larger, more comprehensive meta-abstraction: manipulation and transformation of bytecodes. This suggests that the choice of C++ has far less to do with the code quality than other choices.

Add to the mix the fact that GCC has been gutted and rebuilt several times over, and that it's 15+ years older than LLVM, and I don't think it's reasonable to draw any conclusions whatsoever about C v. C++ in this comparison.

** "object-oriented tricks" sounds a little ambiguous. Any well-encapsulated design is bound to be "object-oriented", at least if juggling lots of data and performing lots of complex transformations internally. Contrast that with the well-encapsulated design of unix shell utilities. You wouldn't say grep or sed is object-oriented internally; nor if you put them together. But that sort of segregation of work doesn't suit compiler design, so inevitably a well-written compiler will be object-oriented, no matter the language. The issue is whether C++ makes it easier to accomplish object-orientation. And I'll argue only to the degree the overall design is indeterminate. This is similar to choosing a scripting language over a compiled one. Scripting languages are more attractive the more ill-defined the problem.