parallel linking still wanted
Posted Feb 4, 2025 3:13 UTC (Tue) by oldnpastit (subscriber, #95303)
In reply to: parallel linking still wanted by epa
Parent article: GNU Binutils 2.44 Released
It makes linking slower (since you are now compiling as well as linking), but the resulting code is smaller and faster, and the compiler can find bugs that span modules, such as inconsistent definitions.
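As a minimal sketch of the kind of cross-module bug this can catch (hypothetical file and function names; with GCC, something like "gcc -flto -O2 a.c b.c" will typically warn about the mismatch):

    /* a.c -- defines the function with one signature */
    int get_limit(int scale)
    {
        return scale * 100;
    }

    /* b.c -- declares the same function with a different signature.
     * A traditional compile of b.c alone never sees the real
     * definition, so the mismatch goes unnoticed; an LTO build
     * merges the IR from both files and can report the
     * inconsistent declaration. */
    long get_limit(long scale);

    long use_limit(void)
    {
        return get_limit(7L);
    }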
Posted Feb 4, 2025 10:09 UTC (Tue) by epa (subscriber, #39769)
Posted Feb 4, 2025 11:46 UTC (Tue) by farnz (subscriber, #17727)

It's not just to fit in with the "ancient model"; if you're running LTO the way it's meant to be used, you take source code (codegen units in Rust land, for example, which isn't following the ancient model), generate IR, and optimize that IR in isolation. Then, at link time, you merge all the IR into one program and optimize again, since you now have extra knowledge about the whole program.
If you have enough "codegen units" (source files in C), then this gets you a lot of parallelism without costing significant performance; each codegen unit can be compiled to IR in parallel, and then you enter the optimize/combine/optimize phase (which can be serial, or can be done as a parallel tree reduction). The downside is that, because you repeat the optimize step, you spend more total compute on compilation even though you save wall-clock time; burning 20% more compute power but spreading it across 16 cores is still a wall-clock win.
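As a minimal sketch (hypothetical files) of what that final combine-and-optimize pass buys you: a helper defined in one codegen unit can only be inlined into a caller in another unit once their IR has been merged, so the cross-unit call below stays an opaque call in a traditional build but can collapse to a constant under LTO:

    /* helper.c -- one codegen unit */
    int clamp_to_byte(int v)
    {
        if (v < 0)
            return 0;
        if (v > 255)
            return 255;
        return v;
    }

    /* main.c -- another codegen unit; without LTO the compiler only
     * sees the declaration here, so the call cannot be inlined.
     * With LTO, the merged IR lets the optimizer inline
     * clamp_to_byte() and fold the body down to "return 42". */
    int clamp_to_byte(int v);

    int always_valid(void)
    {
        return clamp_to_byte(42);
    }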
Posted Feb 4, 2025 12:02 UTC (Tue) by intelfx (subscriber, #130118)
Speaking of parallel LTO, it might be interesting to note that with Clang the final phase of LTO is serial by default (parallel LTO is called ThinLTO, is incompatible with regular LTO, and usually produces larger binaries, so I suppose there are other missed optimizations as well due to how the final call graph is partitioned). GCC, though, uses parallel LTO by default; emulating "fat" LTO requires quite a command-line contortion, and I don't recall there being a similarly wide binary-size gap as with Clang.
I'd be curious to hear from someone who actually worked on GCC and/or Clang: what's the reason for this, and is there something fundamentally different about how GCC and Clang do (parallel) LTO that might have led to different tradeoffs when parallel LTO was developed for each?
Posted Feb 7, 2025 5:00 UTC (Fri) by dberlin (subscriber, #24694)
Let me start by saying: the team responsible for ThinLTO in LLVM had also done a lot of LTO work on GCC.
When Google transitioned from GCC to LLVM, a lot of time was spent trying to figure out if LLVM's "fat LTO" would work at the scale needed.
GCC went with parallel LTO in part because it didn't support any useful form of multithreaded compilation (at the time; I haven't looked recently), and memory usage at the scale of projects we were compiling was huge. So if you wanted meaningful parallelism, it required multiple jobs, and even if you were willing to eat the time cost, the memory cost was intractable in plenty of cases. Parallel LTO was also the dominant form of LTO in a lot of other compilers at the time, mainly due to time and memory usage.
LLVM was *massively* more memory efficient than GCC (easily 10-100x in plenty of real-world cases; there were plenty of examples of GCC taking gigabytes where LLVM would take 10-50 megabytes of memory. I'm sure this sounds funny now), so it was actually possible that this might work. There were, however, significant challenges in compilation speed and the like, because at the time multithreaded compilation was not yet in use (it was being worked on), lots of algorithms still needed to be optimized, and so on.
This led to significant disagreements over the technical path to follow.
In the end, I made the call to go with ThinLTO over trying to do fat LTO at scale. Not because I necessarily believed it would be impossible, but because getting it working was another risk in an already risky project, and nothing stopped us from going back later and trying to make "fat LTO" better.
However, because LLVM did not have the memory usage and speed issues that GCC did with "fat" LTO, most of the community was (and is) happy with it, and ThinLTO gets used on larger things. As such, "fat" LTO has remained the default in LLVM, because there has not been a huge need or desire to change the default.
Note, however, that the use of "serial" and "parallel" here is confusing: LLVM supports multithreaded compilation in non-ThinLTO modes.
So it's "serial" in the sense that it is a single job rather than multiple jobs, but it is not serial in the sense of single-threaded: it can still use all the cores if you want. Obviously, because it is not partitioned, you are still limited by the slowest piece, and I haven't kept up on whether codegen now supports multiple threads. ThinLTO buys you multi-job parallelism on top of that.
ThinLTO in Rust
Posted Feb 7, 2025 11:45 UTC (Fri) by farnz (subscriber, #17727)

Note that Cargo exploits ThinLTO in its default profiles to allow it to partition a crate into 16 or 256 codegen units, and then have ThinLTO optimize the entire crate, getting you parallelism that would hurt performance without LTO.
You can still turn on whole-program ThinLTO or "fat" LTO if desired, but this is, IMO, a neat use of the compiler technology to get more threads crunching at once.
Posted Feb 4, 2025 12:16 UTC (Tue) by mathstuf (subscriber, #69389)
Maybe for CI or package builds, but for development, I definitely don't want to compile the entire project while I iterate on a single source file.
Also note that these "glob builds" are impossible with C++20 modules anyway, unless the compiler becomes a build system (which compiler developers definitely do not want).