parallel linking still wanted
Posted Feb 3, 2025 9:58 UTC (Mon) by tchernobog (guest, #73595)
Parent article: GNU Binutils 2.44 Released
But linking with bfd remains the biggest bottleneck for me when developing and linking C, C++, Rust... programs. Gold provided an observable speedup, albeit not an earth-shattering one.
Compiling is fine nowadays, since you produce at least one object file per process and those can be built in parallel. Plus, programs like ccache/sccache hit often enough on repeated recompilations.
Linking... is still problematic, especially as the trend is toward more cores rather than faster individual ones.
I guess I will have to invest more time in having mold work correctly for the architectures I need (e.g. armv5 in my case).
Posted Feb 3, 2025 10:07 UTC (Mon) by epa (subscriber, #39769) (27 responses)
Posted Feb 3, 2025 11:02 UTC (Mon) by tchernobog (guest, #73595) (12 responses)
Try nodejs as a benchmark...
Posted Feb 3, 2025 13:27 UTC (Mon) by epa (subscriber, #39769) (8 responses)
It would be nice if you could pass multiple source files to the compiler on a single command line and it would make its own decision about whether to compile them separately or together, depending on available memory and various heuristics.
Posted Feb 4, 2025 20:19 UTC (Tue) by ringerc (subscriber, #3071) (7 responses)
It's a weird mix of a pleasure to use and an absolute horror when I do Windows work. The VS debugger is so good that I've ported C++ codebases to Windows largely so I can use the debugger.
Posted Feb 4, 2025 21:22 UTC (Tue) by mathstuf (subscriber, #69389) (6 responses)
Posted Feb 4, 2025 23:16 UTC (Tue) by ringerc (subscriber, #3071) (5 responses)
At least it isn't golang, where "debugging, who needs that?" seems to be the norm and things like external debuginfo that have been the norm for C and C++ projects for 15 years are just Not A Thing.
Posted Feb 4, 2025 23:49 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) (4 responses)
Erm... Whut? Golang unconditionally embeds the debugging info, so it's trivially easy to attach with a debugger.
The debugger itself is missing some niceties, like custom type rendering, but it's not too bad.
Posted Feb 5, 2025 2:52 UTC (Wed) by ringerc (subscriber, #3071) (3 responses)
What I was getting at is that many projects strip debug info from their builds with:
go build -ldflags "-s -w"
e.g. the Kubernetes project.
As far as I can tell from when I last checked, there is no well supported, portable means of extracting the stripped debug info into a separate archive.
So if a project wants to produce compact release builds, they have no sensible way to publish separate debuginfo - e.g. to a debuginfod and MS symbol server or a simple archive server.
Posted Feb 5, 2025 3:31 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) (2 responses)
There are workarounds: https://github.com/golang/go/issues/51692
Posted Feb 5, 2025 3:54 UTC (Wed) by ringerc (subscriber, #3071) (1 response)
I'm really surprised it's not seen as more of an issue. How is one supposed to analyze intermittent faults, heisenbugs etc without the ability to debug crashes and hangs in release builds?
Posted Feb 5, 2025 4:14 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
I think most people just leave the debug symbols in? They are not _that_ massive in Go.
Posted Feb 3, 2025 13:30 UTC (Mon) by insi-eb (subscriber, #161562) (1 response)
I used unity-style builds with self-created tooling years before anyone called them unity builds, and I always had to split the build into smaller chunks than "full binary" to be able to build on $(large_build_machine).
Posted Feb 3, 2025 13:39 UTC (Mon) by insi-eb (subscriber, #161562)
Posted Feb 4, 2025 17:22 UTC (Tue) by jengelh (guest, #33263)
Throw fewer symbols into the mix. Shared libraries with limited symbol visibility (despite being a potential source of ODR violations) may help.
Posted Feb 3, 2025 16:58 UTC (Mon) by jreiser (subscriber, #11027)
Posted Feb 3, 2025 18:05 UTC (Mon) by iabervon (subscriber, #722) (5 responses)
Posted Feb 3, 2025 21:54 UTC (Mon) by dezgeg (subscriber, #92243)
Posted Feb 4, 2025 0:19 UTC (Tue) by khim (subscriber, #9252) (3 responses)
It's the result of ossification… of everything. Compilers don't do what you describe today essentially because that separation between compiler and linker is embedded in a bazillion scripts, a bazillion tools, and a bazillion tools for those tools… The problem is not technical, that's for sure.
Turbo Pascal 4.0 did everything you describe almost 40 years ago, on a PC with 256KiB of RAM (you needed 512KiB for the IDE, but the command-line compiler only needed 256KiB)! But back then Borland could just say: our new compiler is 90% compatible with what you had before; if you want to use old programs with it, you need to do x, y, and z.
Today the incentive is just not there: if you offered that, people would find a bazillion reasons to stay with the old version of the language. Just look at the attempts to bring modules to C++!
Posted Feb 4, 2025 17:17 UTC (Tue) by iabervon (subscriber, #722) (2 responses)
Posted Feb 4, 2025 19:13 UTC (Tue) by khim (subscriber, #9252) (1 response)
It would, most definitely, be both. Sure, but that use case sidesteps the critical question that cripples the whole thing: where and how should intermediate files live? If you specify a bunch of C/C++ files, then the answer to that question is "nowhere". That wouldn't work for incremental compilation, and compiling everything from scratch, all the time, is very inefficient: worse than what we have today.
Yes, but then it stops being a "C/C++ language" and turns into a "GCC language", a "clang language", an "MSVC language". Turbo Pascal and (later) Java solved that problem by fiat: all compiled files go into the directory specified by a compiler option, and their names match the names of the source files. Bam. Done. People may like it, people may hate it… but they have to accept it.
C++… they spent a decade or so deciding what to do about that, and in the end excluded it from the standard. And that made "standard C/C++ modules" completely useless: they can't be used without reading extra documentation for each compiler, and, of course, the compilers couldn't agree on what they want (or else it would have been easy to put it into the standard)… so nothing works, even five years after the standard was approved. I'll believe it when I see it. So far we have two classes of languages:
C++ promises to do both… we'll see whether it manages to pull that off, or becomes a legacy nobody cares about before that happens.
Posted Feb 4, 2025 21:19 UTC (Tue) by mathstuf (subscriber, #69389)
Yeah, the ISO C++ standard is allergic to saying things like "source code lives in files" (or that code lives in libraries for that matter: the standard only speaks of programs), so it blocks itself from specifying such things.
> And that made “standard C/C++ modules” completely useless. They couldn't be used without reading extra documentation from the compiler… and, of course, compilers couldn't agree on what they want (or else there would have been easy to include that into the standard)… nothing works. Still, even five years standard was approved.
Eh, it certainly made it hard for build systems, but that's been mostly solved now. The next step is to use the support the compilers now provide to be able to compile modules correctly to flush out bugs in the compiler module implementations and to gather numbers on actual module usage to discover things like whether "large" or "small" modules are "better".
But if you're using something like autotools (which is unlikely to ever support them), Bazel (which has an open PR), or Meson (which I'm sure will…someday), yes, modules are "nowhere" for all practical purposes.
Posted Feb 4, 2025 3:13 UTC (Tue) by oldnpastit (subscriber, #95303) (6 responses)
It makes linking slower (since you are now compiling and linking) but the resulting code is smaller and faster (and can find bugs that span modules such as inconsistent definitions).
Posted Feb 4, 2025 10:09 UTC (Tue) by epa (subscriber, #39769) (5 responses)
Posted Feb 4, 2025 11:46 UTC (Tue) by farnz (subscriber, #17727) (3 responses)
If you have enough "codegen units" (source files in C), then this gets you a lot of parallelism without costing significant performance; each codegen unit can be compiled in parallel to IR, and then you enter the optimize/combine/optimize phase (which can be serial, or can be done as a parallel tree reduction). The downside is that because you repeat optimize, you end up spending more total compute on compilation, even though you save wall clock time; burning 20% more compute power, but using 16 cores in parallel is a wall-clock saving.
Posted Feb 4, 2025 12:02 UTC (Tue) by intelfx (subscriber, #130118) (2 responses)
Speaking of parallel LTO, it might be interesting to note that with Clang, by default, the final phase of LTO is serial (parallel LTO is called ThinLTO, is incompatible with regular LTO, and usually produces larger binaries, so I suppose there are other missed optimizations as well due to how the final call graph is partitioned). GCC, though, uses parallel LTO by default (emulating "fat" LTO requires quite a command-line contortion, and I don't recall a similarly wide binary-size gap as with Clang).
I'd be curious to hear from someone who actually worked on GCC and/or Clang: what's the reason for this, and is there something fundamentally different between how GCC and Clang do (parallel) LTO that might have resulted in different tradeoffs when parallel LTO was developed for each?
Posted Feb 7, 2025 5:00 UTC (Fri) by dberlin (subscriber, #24694) (1 response)
Let me start by saying - the team responsible for ThinLTO in LLVM also had done a lot of LTO work on GCC.
When Google transitioned from GCC to LLVM, a lot of time was spent trying to figure out if LLVM's "fat LTO" would work at the scale needed.
However, because LLVM did not have the memory usage/speed issues that GCC did with "fat" LTO, most of the community was/is happy with it, and ThinLTO gets used on larger things. As such, "fat" LTO has remained the default in LLVM, because there has not been a huge need or desire to change the default.
Note, however, that the use of serial and parallel here are confusing. LLVM supports multithreaded compilation in non-ThinLTO modes.
Posted Feb 7, 2025 11:45 UTC (Fri) by farnz (subscriber, #17727)
You can still turn on whole-program ThinLTO or "fat" LTO if desired, but this is, IMO, a neat use of the compiler technology to get more threads crunching at once.
Posted Feb 4, 2025 12:16 UTC (Tue) by mathstuf (subscriber, #69389)
Maybe for CI or package builds, but for development, I definitely don't want to compile the entire project while I iterate on a single source file.
Also note that these "glob builds" are impossible with C++20 modules anyways unless the compiler becomes a build system (which they definitely do not want to do).
Posted Feb 3, 2025 16:21 UTC (Mon) by quotemstr (subscriber, #45331) (5 responses)
Posted Feb 3, 2025 18:39 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) (4 responses)
Posted Feb 3, 2025 21:00 UTC (Mon) by ballombe (subscriber, #9523)
(yes it does not answer your question!)
Posted Feb 4, 2025 3:42 UTC (Tue) by pizza (subscriber, #46)
Anything targeting bare-metal hardware, for one.
Posted Feb 4, 2025 10:08 UTC (Tue) by laarmen (subscriber, #63948)
Posted Feb 4, 2025 14:35 UTC (Tue) by bjackman (subscriber, #109548)
:D
Maybe it's possible but for any sort of low-level stuff I think I've always had a linker script.
Still, you could probably identify a fairly small subset of the linker script features that would be able to build most projects? But I dunno.
Posted Feb 26, 2025 8:18 UTC (Wed) by daenzer (subscriber, #7050)
The final bottleneck, the one that could never be optimized away, was the serial linking.
Why not go all the way: get rid of linking entirely by compiling directly to memory. More than 50 years ago, PUFFT, the Purdue University Fast Fortran Translator, could compile Fortran source into memory and initiate execution faster than the usual IBJOB loader could load the corresponding compiled binary object file into memory. For student and "one-shot" programs, PUFFT made an ancient IBM 7094 (32768 words of 36 bits each) run rings around most System/360 machines. Some flavors of interpreted systems (SNOBOL4, LISP, Java, Perl, Python, ...) offer analogous schemes with "just-in-time" compilation for execution within the environment of an existing process.
Compile direct to memory
The desire to finally be able to build programs larger than 64KiB in a single executable was acute enough that people did all the required dances.
> It wouldn't have to be backwards-incompatible or a change to the language.
It's not just to fit in with the "ancient model"; if you're running LTO the way it's supposed to be used, you take source code (codegen units in Rust land, for example, which isn't doing the ancient model), generate IR, optimize the IR in isolation, and then at link time, you merge all the IR into a program, optimize again since you now have extra knowledge about the whole program.
GCC went with parallel LTO in part because it didn't support any useful form of multithreaded compilation (at the time; I haven't looked now), and memory usage at the scale of projects we were compiling was huge. So if you wanted meaningful parallelism, it required multiple jobs, and even if you were willing to eat the time cost, the memory cost was intractable in plenty of cases. Parallel LTO was also the dominant form of LTO in a lot of other compilers at the time, mainly due to time and memory usage.
LLVM was *massively* more memory efficient than GCC (easily 10-100x in plenty of real world cases. Plenty of examples of gcc taking gigabytes where LLVM would take 10-50 megabytes of memory. I'm sure this sounds funny now), so it was actually possible that this might work. There were, however, significant challenges in speed of compilation, etc, because at the time, multithreaded compilation was not being used (it was being worked on), lots of algorithms needed to be optimized, etc.
This led to significant disagreements over the technical path to follow.
In the end, I made the call to go with ThinLTO over trying to do fat LTO at scale. Not even because I necessarily believed it would be impossible, but because getting it working was another risk in an already risky project, and nothing stopped us from going back later and trying to make "fat" LTO better.
So it's serial in the sense that it is a single job vs multiple jobs, but not serial in the sense that it can still use all the cores if you want. Obviously, because it is not partitioned, you are still limited by the slowest piece, and I haven't kept up on whether codegen now supports multiple threads. ThinLTO buys you the multi-job parallelism over that.
Note that Cargo exploits ThinLTO in its default profiles to allow it to partition a crate into 16 or 256 codegen units, and then have ThinLTO optimize the entire crate, getting you parallelism that would hurt performance without LTO.
ThinLTO in Rust
/usr/lib/x86_64-linux-gnu/libbsd.so: ASCII text
/usr/lib/x86_64-linux-gnu/libc.so: ASCII text
/usr/lib/x86_64-linux-gnu/libm.so: ASCII text
/usr/lib/x86_64-linux-gnu/libncurses.so: ASCII text
/usr/lib/x86_64-linux-gnu/libncursesw.so: ASCII text
/usr/lib/x86_64-linux-gnu/libtermcap.so: ASCII text
I haven't tried it, but maybe if you want to embed a payload directly in an ELF section for $reasons?