parallel linking still wanted
Posted Feb 3, 2025 9:58 UTC (Mon) by tchernobog (guest, #73595)
Parent article: GNU Binutils 2.44 Released
But linking with bfd remains the biggest bottleneck for me when developing and linking C, C++, Rust... programs. Gold provided an observable speedup, albeit not an earth-shattering one.
Compiling is fine nowadays, since you produce at least one object file per process and those can be built in parallel. Plus, programs like ccache/sccache hit often enough on repeated recompilations.
Linking... is still problematic, especially as the trend is toward more cores rather than faster individual ones.
I guess I will have to invest more time in having mold work correctly for the architectures I need (e.g. armv5 in my case).
Posted Feb 3, 2025 10:07 UTC (Mon) by epa (subscriber, #39769) (27 responses)
Posted Feb 3, 2025 11:02 UTC (Mon) by tchernobog (guest, #73595) (12 responses)
Try nodejs as a benchmark...
Posted Feb 3, 2025 13:27 UTC (Mon) by epa (subscriber, #39769) (8 responses)
It would be nice if you could pass multiple source files to the compiler on a single command line and it would make its own decision about whether to compile them separately or together, depending on available memory and various heuristics.
Posted Feb 4, 2025 20:19 UTC (Tue) by ringerc (subscriber, #3071) (7 responses)
It's a weird mix of a pleasure to use and an absolute horror when I do Windows work. The VS debugger is so good that I've ported C++ codebases to Windows largely so I can use the debugger.
Posted Feb 4, 2025 21:22 UTC (Tue) by mathstuf (subscriber, #69389) (6 responses)
Posted Feb 4, 2025 23:16 UTC (Tue) by ringerc (subscriber, #3071) (5 responses)
At least it isn't golang, where "debugging, who needs that?" seems to be the norm and things like external debuginfo that have been the norm for C and C++ projects for 15 years are just Not A Thing.
Posted Feb 4, 2025 23:49 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) (4 responses)
Erm... Whut? Golang unconditionally embeds the debugging info, so it's trivially easy to attach with a debugger.
The debugger itself is missing some niceties, like custom type rendering, but it's not too bad.
Posted Feb 5, 2025 2:52 UTC (Wed) by ringerc (subscriber, #3071) (3 responses)
What I was getting at is that many projects strip debug info from their builds with:
go build -ldflags "-s -w"
e.g. the Kubernetes project.
As far as I can tell from when I last checked, there is no well supported, portable means of extracting the stripped debug info into a separate archive.
So if a project wants to produce compact release builds, they have no sensible way to publish separate debuginfo - e.g. to a debuginfod and MS symbol server or a simple archive server.
Posted Feb 5, 2025 3:31 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) (2 responses)
There are workarounds: https://github.com/golang/go/issues/51692
Posted Feb 5, 2025 3:54 UTC (Wed) by ringerc (subscriber, #3071) (1 response)
I'm really surprised it's not seen as more of an issue. How is one supposed to analyze intermittent faults, heisenbugs etc without the ability to debug crashes and hangs in release builds?
Posted Feb 5, 2025 4:14 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
I think most people just leave the debug symbols in? They are not _that_ massive in Go.
Posted Feb 3, 2025 13:30 UTC (Mon) by insi-eb (subscriber, #161562) (1 response)
I used unity-style builds with self-created tooling years before anyone called them unity builds, and I always had to split the build into smaller chunks than "full binary" to be able to build on $(large_build_machine).
Posted Feb 3, 2025 13:39 UTC (Mon) by insi-eb (subscriber, #161562)
Posted Feb 4, 2025 17:22 UTC (Tue) by jengelh (guest, #33263)
Throw fewer symbols into the mix. Shared libraries with limited symbol visibility (despite being a potential source of ODR violations) may help.
Posted Feb 3, 2025 16:58 UTC (Mon) by jreiser (subscriber, #11027)
Posted Feb 3, 2025 18:05 UTC (Mon) by iabervon (subscriber, #722) (5 responses)
Posted Feb 3, 2025 21:54 UTC (Mon) by dezgeg (subscriber, #92243)
Posted Feb 4, 2025 0:19 UTC (Tue) by khim (subscriber, #9252) (3 responses)
It's the result of ossification… of everything. Compilers don't do what you describe today essentially because that separation between compiler and linker is embedded in a bazillion scripts, a bazillion tools, and a bazillion tools for those tools… The problem is not technical, that's for sure.
Turbo Pascal 4.0 did everything you describe almost 40 years ago, on a PC with 256KiB of RAM (you needed 512KiB for the IDE, but the command-line compiler only needed 256KiB)! But back then Borland could just say: our new compiler is 90% compatible with what you had before; if you want to use old programs with it, you need to do x, y, and z.
Today the incentive is just not there: if you offered that, people would find a bazillion reasons to stay with the old version of the language. Just look at the attempts to bring modules to C++!
Posted Feb 4, 2025 17:17 UTC (Tue) by iabervon (subscriber, #722) (2 responses)
Posted Feb 4, 2025 19:13 UTC (Tue) by khim (subscriber, #9252) (1 response)
It would, most definitely, be both. Sure, but that use case sidesteps the critical question that cripples the whole thing: where and how should intermediate files live? If you specify a bunch of C/C++ files, then the answer to that question is "nowhere". That wouldn't work for incremental compilation, and compiling everything from scratch, all the time, is very inefficient: worse than what we have today.
Yes, but then it stops being a "C/C++ language" and turns into a "GCC language", a "clang language", an "MSVC language". Turbo Pascal and (later) Java solved that problem by fiat: all compiled files go into the directory specified by a compiler option, and their names match the names of the source files. Bam. Done. People may like it, people may hate it… but they have to accept it.
C++… they spent a decade or so deciding what to do about that, and in the end excluded it from the standard. And that made "standard C/C++ modules" completely useless: they can't be used without reading extra documentation for each compiler, and, of course, the compilers couldn't agree on what they want (or else it would have been easy to put it into the standard)… so nothing works, even five years after the standard was approved. I'll believe it when I see it. So far we have two classes of languages:
C++ promises to do both… we'll see whether it manages to pull that off, or becomes a legacy nobody cares about before that happens.
Posted Feb 4, 2025 21:19 UTC (Tue) by mathstuf (subscriber, #69389)
Yeah, the ISO C++ standard is allergic to saying things like "source code lives in files" (or that code lives in libraries for that matter: the standard only speaks of programs), so it blocks itself from specifying such things.
> And that made “standard C/C++ modules” completely useless. They couldn't be used without reading extra documentation from the compiler… and, of course, compilers couldn't agree on what they want (or else there would have been easy to include that into the standard)… nothing works. Still, even five years standard was approved.
Eh, it certainly made it hard for build systems, but that's been mostly solved now. The next step is to use the support the compilers now provide to be able to compile modules correctly to flush out bugs in the compiler module implementations and to gather numbers on actual module usage to discover things like whether "large" or "small" modules are "better".
But if you're using something like autotools (which is unlikely to ever support them), Bazel (which has an open PR), or Meson (which I'm sure will…someday), yes, modules are "nowhere" for all practical purposes.
Posted Feb 4, 2025 3:13 UTC (Tue) by oldnpastit (subscriber, #95303) (6 responses)
It makes linking slower (since you are now compiling and linking) but the resulting code is smaller and faster (and can find bugs that span modules such as inconsistent definitions).
Posted Feb 4, 2025 10:09 UTC (Tue) by epa (subscriber, #39769) (5 responses)
Posted Feb 4, 2025 11:46 UTC (Tue) by farnz (subscriber, #17727) (3 responses)
If you have enough "codegen units" (source files in C), then this gets you a lot of parallelism without costing significant performance; each codegen unit can be compiled in parallel to IR, and then you enter the optimize/combine/optimize phase (which can be serial, or can be done as a parallel tree reduction). The downside is that because you repeat optimize, you end up spending more total compute on compilation, even though you save wall clock time; burning 20% more compute power, but using 16 cores in parallel is a wall-clock saving.
Posted Feb 4, 2025 12:02 UTC (Tue) by intelfx (subscriber, #130118) (2 responses)
Speaking of parallel LTO, it might be interesting to note that with Clang, by default, the final phase of LTO is serial (parallel LTO is called ThinLTO, is incompatible with regular LTO, and usually produces larger binaries, so I suppose there are other missed optimizations as well due to how the final call graph is partitioned). GCC, though, uses parallel LTO by default (emulating "fat" LTO requires quite a command-line contortion, and I don't recall a similarly wide binary-size gap as with Clang).
I'd be curious to hear from someone who actually worked on GCC and/or Clang: what's the reason for this, and is there something fundamentally different between how GCC and Clang do (parallel) LTO that might have resulted in different tradeoffs when parallel LTO was developed for each?
Posted Feb 7, 2025 5:00 UTC (Fri) by dberlin (subscriber, #24694) (1 response)
Let me start by saying - the team responsible for ThinLTO in LLVM also had done a lot of LTO work on GCC.
When Google transitioned from GCC to LLVM, a lot of time was spent trying to figure out if LLVM's "fat LTO" would work at the scale needed.
However, because LLVM did not have the memory usage/speed issues that GCC did with "fat" LTO, most of the community was/is happy with it, and ThinLTO gets used on larger things. As such, "fat" LTO has remained the default in LLVM, because there has not been a huge need or desire to change the default.
Note, however, that the use of serial and parallel here are confusing. LLVM supports multithreaded compilation in non-ThinLTO modes.
Posted Feb 7, 2025 11:45 UTC (Fri) by farnz (subscriber, #17727)
You can still turn on whole-program ThinLTO or "fat" LTO if desired, but this is, IMO, a neat use of the compiler technology to get more threads crunching at once.
Posted Feb 4, 2025 12:16 UTC (Tue) by mathstuf (subscriber, #69389)
Maybe for CI or package builds, but for development, I definitely don't want to compile the entire project while I iterate on a single source file.
Also note that these "glob builds" are impossible with C++20 modules anyways unless the compiler becomes a build system (which they definitely do not want to do).
Posted Feb 3, 2025 16:21 UTC (Mon) by quotemstr (subscriber, #45331) (5 responses)
Posted Feb 3, 2025 18:39 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) (4 responses)
Posted Feb 3, 2025 21:00 UTC (Mon) by ballombe (subscriber, #9523)
(yes it does not answer your question!)
Posted Feb 4, 2025 3:42 UTC (Tue) by pizza (subscriber, #46)
Anything targeting bare-metal hardware, for one.
Posted Feb 4, 2025 10:08 UTC (Tue) by laarmen (subscriber, #63948)
Posted Feb 4, 2025 14:35 UTC (Tue) by bjackman (subscriber, #109548)
:D
Maybe it's possible but for any sort of low-level stuff I think I've always had a linker script.
Still, you could probably identify a fairly small subset of the linker script features that would be able to build most projects? But I dunno.
Posted Feb 26, 2025 8:18 UTC (Wed) by daenzer (subscriber, #7050)
The final bottleneck, the one that could never be optimized away, was the serial linking.
Why not go all the way: get rid of linking entirely by compiling directly to memory. More than 50 years ago, PUFFT, the Purdue University Fast Fortran Translator, could compile Fortran source into memory and initiate execution faster than the usual IBJOB loader could load the corresponding compiled binary object file into memory. For student and "one-shot" programs, PUFFT made an ancient IBM 7094 (32768 words of 36 bits each) run rings around most System/360 machines. Some flavors of interpreted systems (SNOBOL4, LISP, Java, Perl, Python, ...) offer analogous schemes with "just-in-time" compilation for execution within the environment of an existing process.
Compile direct to memory
The desire to finally be able to build programs larger than 64KiB in a single executable was acute enough that people did all the required dances.
> It wouldn't have to be backwards-incompatible or a change to the language.
It's not just to fit in with the "ancient model"; if you're running LTO the way it's supposed to be used, you take source code (codegen units in Rust land, for example, which isn't doing the ancient model), generate IR, optimize the IR in isolation, and then at link time, you merge all the IR into a program, optimize again since you now have extra knowledge about the whole program.
GCC went with parallel LTO in part because it didn't support any useful form of multithreaded compilation (at the time; I haven't looked now), and memory usage at the scale of projects we were compiling was huge. So if you wanted meaningful parallelism, it required multiple jobs, and even if you were willing to eat the time cost, the memory cost was intractable in plenty of cases. Parallel LTO was also the dominant form of LTO in a lot of other compilers at the time, mainly due to time and memory usage.
LLVM was *massively* more memory efficient than GCC (easily 10-100x in plenty of real world cases. Plenty of examples of gcc taking gigabytes where LLVM would take 10-50 megabytes of memory. I'm sure this sounds funny now), so it was actually possible that this might work. There were, however, significant challenges in speed of compilation, etc, because at the time, multithreaded compilation was not being used (it was being worked on), lots of algorithms needed to be optimized, etc.
This led to significant disagreements over the technical path to follow.
In the end, I made the call to go with ThinLTO over trying to do fat LTO at scale. Not even because I necessarily believed it would be impossible, but because getting it working was another risk in an already risky project, and nothing stopped us from going back later and trying to make "fat" LTO better.
So it's serial in the sense that it is a single job vs multiple jobs, but not serial in the sense that it can still use all the cores if you want. Obviously, because it is not partitioned, you are still limited by the slowest piece, and I haven't kept up on whether codegen now supports multiple threads. ThinLTO buys you the multi-job parallelism over that.
Note that Cargo exploits ThinLTO in its default profiles to allow it to partition a crate into 16 or 256 codegen units, and then have ThinLTO optimize the entire crate, getting you parallelism that would hurt performance without LTO.
ThinLTO in Rust
/usr/lib/x86_64-linux-gnu/libbsd.so: ASCII text
/usr/lib/x86_64-linux-gnu/libc.so: ASCII text
/usr/lib/x86_64-linux-gnu/libm.so: ASCII text
/usr/lib/x86_64-linux-gnu/libncurses.so: ASCII text
/usr/lib/x86_64-linux-gnu/libncursesw.so: ASCII text
/usr/lib/x86_64-linux-gnu/libtermcap.so: ASCII text
I haven't tried it, but maybe if you want to embed a payload directly in an ELF section for $reasons?