parallel linking still wanted
Posted Feb 4, 2025 3:13 UTC (Tue) by oldnpastit (subscriber, #95303)
In reply to: parallel linking still wanted by epa
Parent article: GNU Binutils 2.44 Released
It makes linking slower (since you are now compiling as well as linking), but the resulting code is smaller and faster, and the compiler can find bugs that span modules, such as inconsistent definitions.
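As a minimal sketch of the kind of cross-module bug this can catch (hypothetical file and function names; with GCC, something like "gcc -flto -O2 a.c b.c" will typically warn about the mismatch):

    /* a.c -- defines the function with one signature */
    int get_limit(int scale)
    {
        return scale * 100;
    }

    /* b.c -- declares the same function with a different signature.
     * A traditional compile of b.c alone never sees the real
     * definition, so the mismatch goes unnoticed; an LTO build
     * merges the IR from both files and can report the
     * inconsistent declaration. */
    long get_limit(long scale);

    long use_limit(void)
    {
        return get_limit(7L);
    }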
Posted Feb 4, 2025 10:09 UTC (Tue) by epa (subscriber, #39769)
Posted Feb 4, 2025 11:46 UTC (Tue) by farnz (subscriber, #17727)

It's not just to fit in with the "ancient model"; if you're running LTO the way it's meant to be used, you take source code (codegen units in Rust land, for example, which isn't following the ancient model), generate IR, and optimize that IR in isolation. Then, at link time, you merge all the IR into one program and optimize again, since you now have extra knowledge about the whole program.
If you have enough "codegen units" (source files in C), then this gets you a lot of parallelism without costing significant performance; each codegen unit can be compiled to IR in parallel, and then you enter the optimize/combine/optimize phase (which can be serial, or can be done as a parallel tree reduction). The downside is that, because you repeat the optimize step, you spend more total compute on compilation even though you save wall-clock time; burning 20% more compute power but spreading it across 16 cores is still a wall-clock win.
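As a minimal sketch (hypothetical files) of what that final combine-and-optimize pass buys you: a helper defined in one codegen unit can only be inlined into a caller in another unit once their IR has been merged, so the cross-unit call below stays an opaque call in a traditional build but can collapse to a constant under LTO:

    /* helper.c -- one codegen unit */
    int clamp_to_byte(int v)
    {
        if (v < 0)
            return 0;
        if (v > 255)
            return 255;
        return v;
    }

    /* main.c -- another codegen unit; without LTO the compiler only
     * sees the declaration here, so the call cannot be inlined.
     * With LTO, the merged IR lets the optimizer inline
     * clamp_to_byte() and fold the body down to "return 42". */
    int clamp_to_byte(int v);

    int always_valid(void)
    {
        return clamp_to_byte(42);
    }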
Posted Feb 4, 2025 12:02 UTC (Tue) by intelfx (subscriber, #130118)
Speaking of parallel LTO, it might be interesting to note that with Clang the final phase of LTO is serial by default (parallel LTO is called ThinLTO, is incompatible with regular LTO, and usually produces larger binaries, so I suppose there are other missed optimizations as well due to how the final call graph is partitioned). GCC, though, uses parallel LTO by default; emulating "fat" LTO requires quite a command-line contortion, and I don't recall there being a similarly wide binary-size gap as with Clang.
I'd be curious to hear from someone who actually worked on GCC and/or Clang: what's the reason for this, and is there something fundamentally different about how GCC and Clang do (parallel) LTO that might have led to different tradeoffs when parallel LTO was developed for each?
Posted Feb 7, 2025 5:00 UTC (Fri) by dberlin (subscriber, #24694)
Let me start by saying: the team responsible for ThinLTO in LLVM had also done a lot of LTO work on GCC.
When Google transitioned from GCC to LLVM, a lot of time was spent trying to figure out if LLVM's "fat LTO" would work at the scale needed.
GCC went with parallel LTO in part because it didn't support any useful form of multithreaded compilation (at the time; I haven't looked recently), and memory usage at the scale of projects we were compiling was huge. So if you wanted meaningful parallelism, it required multiple jobs, and even if you were willing to eat the time cost, the memory cost was intractable in plenty of cases. Parallel LTO was also the dominant form of LTO in a lot of other compilers at the time, mainly due to time and memory usage.
LLVM was *massively* more memory efficient than GCC (easily 10-100x in plenty of real-world cases; there were plenty of examples of GCC taking gigabytes where LLVM would take 10-50 megabytes of memory. I'm sure this sounds funny now), so it was actually possible that this might work. There were, however, significant challenges in compilation speed and the like, because at the time multithreaded compilation was not yet in use (it was being worked on), lots of algorithms still needed to be optimized, and so on.
This led to significant disagreements over the technical path to follow.
In the end, I made the call to go with ThinLTO over trying to do fat LTO at scale. Not because I necessarily believed it would be impossible, but because getting it working was another risk in an already risky project, and nothing stopped us from going back later and trying to make "fat LTO" better.
However, because LLVM did not have the memory usage and speed issues that GCC did with "fat" LTO, most of the community was (and is) happy with it, and ThinLTO gets used on larger things. As such, "fat" LTO has remained the default in LLVM, because there has not been a huge need or desire to change the default.
Note, however, that the use of "serial" and "parallel" here is confusing: LLVM supports multithreaded compilation in non-ThinLTO modes.
So it's "serial" in the sense that it is a single job rather than multiple jobs, but it is not serial in the sense of single-threaded: it can still use all the cores if you want. Obviously, because it is not partitioned, you are still limited by the slowest piece, and I haven't kept up on whether codegen now supports multiple threads. ThinLTO buys you multi-job parallelism on top of that.
ThinLTO in Rust
Posted Feb 7, 2025 11:45 UTC (Fri) by farnz (subscriber, #17727)

Note that Cargo exploits ThinLTO in its default profiles to allow it to partition a crate into 16 or 256 codegen units, and then have ThinLTO optimize the entire crate, getting you parallelism that would hurt performance without LTO.
You can still turn on whole-program ThinLTO or "fat" LTO if desired, but this is, IMO, a neat use of the compiler technology to get more threads crunching at once.
Posted Feb 4, 2025 12:16 UTC (Tue) by mathstuf (subscriber, #69389)
Maybe for CI or package builds, but for development, I definitely don't want to compile the entire project while I iterate on a single source file.
Also note that these "glob builds" are impossible with C++20 modules anyway, unless the compiler becomes a build system (which compiler developers definitely do not want).