The performance of the Rust compiler
Sparrow Li presented virtually at RustConf 2024 on the current state of, and future plans for, the Rust compiler's performance. The compiler is relatively slow when compiling large programs, although it has been getting better over time. The next big performance improvement will come from parallelizing the compiler's parsing, type-checking, and related operations, but even after that, the project has several avenues left to explore.
As projects written in Rust get larger, compilation latency has become increasingly important. The Rust compiler's design makes compilation slow, Li said, when compared with languages whose implementations are suited to fast prototyping, such as Python or Go. This slowness does have a cause: the compiler performs many sophisticated analyses and transformations. But it would be easier to develop with Rust if the language could be compiled faster.
To give some idea of how Rust's compile times compare with other languages', Li presented a graph of compile time versus number of functions for several languages. Rust's line was noticeably steeper: for small numbers of functions it was no worse than the others, but compile times grew much faster as the programs got larger. Improving compile times has become an area of focus for the whole community, Li said.
This year, the speed of the compiler has more than doubled. Most of that is due to small, incremental changes from a number of developers. Part of the improvement also comes from work on tools like rustc-perf, which shows developers how their pull requests to the Rust compiler affect compile times. The efforts of the parallel-rustc working group (in which Li participates) are now available in nightly builds of the compiler, providing roughly a 30% performance improvement in aggregate. Those improvements should land in stable Rust by next year.
There are a handful of other big improvements that will be available soon. Profile-guided optimization provides up to a 20% speedup on Linux. BOLT, a post-link optimizer (which was the topic of a talk that LWN covered recently), can speed things up by 5-10%, and link-time optimization can add another 5%, Li said. The biggest improvement at this point actually comes from switching to the lld linker, however, which can improve end-to-end compilation time by 40%.
The majority of the speedups have not come from big-ticket items, however. "It is the accumulation of small things" that most improves performance, Li said. These include introducing laziness in some areas, improved hashing and hash tables, changes to the layout of the types used by the compiler, and more. Unfortunately, optimizations like these require a good understanding of the compiler, and they become rarer over time as people pick off the lower-hanging fruit.
So, to continue improving performance, at some point the community will need to address the overall design of the compiler, Li said. Parallelization is one promising avenue. Cargo, Rust's build tool, already builds different crates in parallel, but it is often the case that a build will get hung up waiting for one large, commonly-used crate to build, so parallelism inside the compiler is required as well. There is already some limited parallelism in the compiler back-end (the part that performs code generation), but the front-end (which handles parsing and type-checking) is still serial. The close connection between different front-end components makes parallelizing it difficult — but the community would still like to try.
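To make that bottleneck concrete, here is a toy model (all timings invented for illustration, not taken from the talk) of the effect Li described: when crates build in parallel, wall-clock time is set by the longest dependency chain rather than by the total amount of work, so one big crate that everything depends on dominates the build.

```rust
// Toy crate graph: four small independent crates, one big
// commonly-used crate, and two crates that depend on the big one.
fn main() {
    let big_common = 60.0_f64; // seconds to build the big crate
    let leaves = [5.0_f64, 7.0, 4.0, 6.0]; // independent small crates
    let dependents = [8.0_f64, 9.0]; // these must wait for big_common

    // With unlimited parallel jobs, wall-clock time is the longest
    // dependency chain, not the sum of all build times.
    let longest_dependent = dependents.iter().cloned().fold(0.0, f64::max);
    let longest_leaf = leaves.iter().cloned().fold(0.0, f64::max);
    let critical_path = (big_common + longest_dependent).max(longest_leaf);

    let total_work: f64 = big_common
        + leaves.iter().sum::<f64>()
        + dependents.iter().sum::<f64>();
    println!("total work: {total_work:.0}s, wall time: {critical_path:.0}s");
    // Prints "total work: 99s, wall time: 69s" -- and only parallelism
    // *inside* the compiler can shorten the 60s spent on big_common.
}
```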
Proposed in 2018 and established in 2019, the parallel-rustc working group has been planning how to tackle the problem of parallelizing the front-end of the compiler for some years. The group finally resolved some of the technical difficulties in 2022, and landed an implementation in the nightly version of the compiler in 2023. The main focus of its work has been the Rust compiler's query system, which is now capable of taking advantage of multiple threads. The main technical challenge to implementing that was a performance loss when running in a single-threaded environment. The solution the working group came up with was to create data structures that can switch between single-threaded and thread-safe implementations at run time. This reduced the single-threaded overhead to only 1-3%, which was judged to be worth it in exchange for the improvement to the multithreaded case.
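The working group's actual data structures live in the compiler's internals, but the general technique can be sketched in a few lines. The following is a simplified, hypothetical illustration (the names are invented; rustc's own version additionally needs unsafe code so that the value can be shared across threads at all): a lock whose implementation is chosen at run time, either skipping synchronization entirely or using a real mutex.

```rust
use std::cell::RefCell;
use std::sync::Mutex;

// A lock picked at run time: cheap and synchronization-free when the
// compiler is single-threaded, a real mutex when it is parallel.
// (A real version needs `unsafe impl Sync` plus a guarantee that the
// single-threaded variant is never actually shared across threads.)
enum DynLock<T> {
    Single(RefCell<T>), // no atomic operations at all
    Parallel(Mutex<T>), // thread-safe, but pays for locking
}

impl<T> DynLock<T> {
    fn new(value: T, parallel: bool) -> Self {
        if parallel {
            DynLock::Parallel(Mutex::new(value))
        } else {
            DynLock::Single(RefCell::new(value))
        }
    }

    // Run `f` with exclusive access to the contents.
    fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        match self {
            DynLock::Single(cell) => f(&mut cell.borrow_mut()),
            DynLock::Parallel(mutex) => f(&mut mutex.lock().unwrap()),
        }
    }
}

fn main() {
    let counter = DynLock::new(0u32, false); // single-threaded mode
    counter.with(|c| *c += 1);
    counter.with(|c| println!("count = {c}"));
}
```

With `parallel` set to false, the only cost over a plain value is the enum tag and a non-atomic borrow check, which is consistent with the 1-3% single-threaded overhead the working group reported.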
The query system was not the only component that needed to adapt, however. The working group also needed to change the memory allocator: while the default allocator was thread-safe, having too many threads caused contention. The group solved this by using separate allocation pools for each thread, but Li warned that this wasn't the only source of cross-thread contention. When the compiler caches the result of a query, that result can be accessed from any thread, so there is still a good deal of cross-thread memory traffic there. The working group was able to partially address that by sharding the cache, but there is still room for improvement.
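Sharding is a standard trick, and a minimal, self-contained sketch of the idea looks something like the following (hypothetical types, not the compiler's actual query cache): the key's hash picks one of N independently locked maps, so threads touching different keys usually take different locks.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const SHARDS: usize = 32;

// A cache split into independently locked shards: threads caching
// results for different keys rarely contend on the same mutex.
struct ShardedCache<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedCache<K, V> {
    fn new() -> Self {
        Self {
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Pick a shard from the key's hash.
    fn shard(&self, key: &K) -> &Mutex<HashMap<K, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        &self.shards[hasher.finish() as usize % SHARDS]
    }

    // Return the cached value, computing and storing it on a miss.
    // (Holding the lock while `compute` runs is a simplification; a
    // real query cache must avoid that to allow recursive queries.)
    fn get_or_insert_with(&self, key: K, compute: impl FnOnce() -> V) -> V {
        let mut map = self.shard(&key).lock().unwrap();
        map.entry(key).or_insert_with(compute).clone()
    }
}

fn main() {
    let cache: ShardedCache<String, usize> = ShardedCache::new();
    let len = cache.get_or_insert_with("hello".to_string(), || "hello".len());
    println!("{len}");
}
```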
While performance improves with up to eight threads in testing, Li said, it actually goes down between eight and sixteen threads; there are too many inefficiencies from synchronization. The project has considered switching to a work-stealing implementation so that threads do not spend as much time waiting for each other, but that is hard to do without introducing deadlocks. Finally, some front-end components, particularly macro expansion, are still single-threaded. At some point, those will need to be parallelized as well.
In all, though, Li thought that Rust's compilation performance has "a shiny future". There are a handful of approaches that still need to be explored, such as improving the linker and using thin link-time optimization, but the community is working hard on implementing these. Rust's incremental compilation could also use some improvement, Li said. C++ does a better job there, because Rust's lexing, parsing, and macro expansion are not yet incremental.
Li finished the talk with a few areas that are not currently being looked into but that might be fruitful to investigate: better inter-crate sharing of compilation tasks, cached binary crates, and refactoring the Rust compiler to reduce the amount of necessary global context. These "deserve a lot of deep thought", Li said. Ultimately, while the Rust compiler's performance is worse than that of many other languages, it is something that the project is aware of and actively working on. Performance is improving and, even as the easy changes become harder to find, it is likely to keep improving.
Index entries for this article:
Conference: RustConf/2024
Posted Oct 28, 2024 21:40 UTC (Mon)
by Karellen (subscriber, #67644)
[Link] (9 responses)
> it is often the case that a build will get hung up waiting for one large, commonly-used crate to build ... Li finished the talk with a few areas [...] that might be fruitful to investigate: [...] cached binary crates

I find it very surprising that this is not already the case.
Posted Oct 29, 2024 10:03 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (8 responses)
Something along the lines of "pull in all include files, run md5 over the resulting artifact, and if there's a cache with that name, that's the processed output. If there's no cache, run the 'source to IR' pass, plus any local optimisations, and dump that to cache".
Dunno whether it's practical, or how much it would save, but it would certainly save reprocessing the same source time and again, and if there's a decent bunch of optimisations you can run before you start linking with other files, it would save that too ...
Cheers,
Wol
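The scheme Wol describes is essentially content-addressed caching of front-end output. Here is a minimal sketch of it, with the standard library's hasher standing in for md5 and every path and function name invented for illustration (this is not how cargo or sccache actually work):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

// Stand-in for md5 so the sketch has no dependencies; a real tool
// would use a proper cryptographic hash of the preprocessed source.
fn digest(expanded_source: &str) -> String {
    let mut hasher = DefaultHasher::new();
    expanded_source.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

fn cache_path(key: &str) -> PathBuf {
    PathBuf::from(format!("/tmp/ir-cache/{key}.ir"))
}

// Placeholder for the expensive "source to IR" pass plus any local
// optimisations mentioned above.
fn compile_to_ir(expanded_source: &str) -> Vec<u8> {
    expanded_source.bytes().rev().collect()
}

fn cached_compile(expanded_source: &str) -> std::io::Result<Vec<u8>> {
    let path = cache_path(&digest(expanded_source));
    if let Ok(ir) = fs::read(&path) {
        return Ok(ir); // cache hit: skip the front end entirely
    }
    let ir = compile_to_ir(expanded_source);
    fs::create_dir_all(path.parent().unwrap())?;
    fs::write(&path, &ir)?;
    Ok(ir)
}

fn main() -> std::io::Result<()> {
    // The second call is served from the cache.
    let first = cached_compile("fn main() {}")?;
    let second = cached_compile("fn main() {}")?;
    assert_eq!(first, second);
    Ok(())
}
```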
Posted Oct 29, 2024 11:46 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Oct 29, 2024 13:10 UTC (Tue)
by intelfx (guest, #130118)
[Link]
sccache works at the crate level, not the source-file level. That's exactly the opposite of what GP wanted (and it's far from perfect, at that).
Posted Oct 29, 2024 14:31 UTC (Tue)
by josh (subscriber, #17465)
[Link] (5 responses)
Posted Oct 29, 2024 14:59 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (4 responses)
Sounds like that's almost what people are looking for when iterating through the development process.
Cheers,
Wol
Posted Oct 30, 2024 16:09 UTC (Wed)
by josh (subscriber, #17465)
[Link] (3 responses)
Posted Oct 30, 2024 16:31 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
So small changes while coding should compile pretty quickly.
Personally, I'd be a lot happier with "incremental=off" for release builds, simply because it gives an aura (maybe false?) of reproducibility. Or rather, it addresses the assumption that dev builds can't be reproducible "because".
Cheers,
Wol
Posted Oct 30, 2024 16:39 UTC (Wed)
by intelfx (guest, #130118)
[Link] (1 responses)
Is there something that elaborates on these reasons?
Posted Oct 31, 2024 9:26 UTC (Thu)
by taladar (subscriber, #68407)
[Link]
Posted Oct 29, 2024 8:51 UTC (Tue)
by taladar (subscriber, #68407)
[Link] (8 responses)
I question this commonly made statement.
When fast prototyping you usually do not have huge programs.
You also do not know the structure of your program very well yet, so a compiler that checks that you are not using your data structures or functions incorrectly saves a lot of time, compared to only finding out when execution reaches that point in the code at run time, even if the compiler run is slow compared to other languages.
When prototyping you also want to be able to read the resulting code later, to turn it into the actual production implementation; spamming copy&paste code all over the place, à la Go error handling or similarly verbose, less expressive languages, seems counter-productive to that goal.
And last but not least, you wouldn't want to prototype in a language that is fundamentally very different from one suitable for the final implementation, so the whole idea to prototype in one language and then write the actual implementation in another is questionable at best.
Posted Oct 29, 2024 10:07 UTC (Tue)
by aragilar (subscriber, #122569)
[Link] (2 responses)
As for prototyping in one (slow, interpreted) language and then implementing the core algorithms (or all of it) in a fast language: this is pretty common in the scientific computing space, as it helps with testing and validation of the fast code.
Posted Oct 30, 2024 9:15 UTC (Wed)
by taladar (subscriber, #68407)
[Link] (1 responses)
There are a few projects for that kind of thing but I haven't tried them, just came across them the other day when looking for something unrelated.
Posted Oct 31, 2024 9:50 UTC (Thu)
by moltonel (guest, #45207)
[Link]
Posted Oct 29, 2024 10:09 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (2 responses)
It matches my experience of working in Rust and Python; I prefer to write in Rust, because I'm rarely genuinely writing a prototype, but rather writing "version 0" of a program, and I will not normally be given time to rewrite in another language later. But if I was genuinely allowed to prototype things most of the time, I'd be writing the prototype to throw away in Python, and writing "version 0" in Rust.
Yesterday, I wrote a short Rust program (10 lines of my code, plus dependencies) to create a PNG representing squared paper. It took about 15 minutes computer time (and about 30 minutes elapsed time), of which about 5 was spent waiting on the compiler to compile my code as the person I was doing this for changed their mind about what the PNG should look like (colour/greyscale, line thickness, square size).
Today, after reading your comment, I redid the same work in Python, including making the same changes; it took under 10 minutes, because changes to my code were near-instant.
Neither program did a significant amount of compute (after all, when prototyping, you don't work on large data sets if you can avoid it), both came up with the same result. The Rust program uses significantly less CPU time when compiled in the debug profile, but neither of them used enough CPU time to make a difference to me as the person running the code. And the Rust program is more robust against errors if I come back round to it - e.g. instead of using a naming convention, as in Python, Rust's type system will prevent me from modifying constants, or assigning to the wrong thing - but I'm very unlikely to ever run this code ever again.
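To illustrate the point about constants (a trivial, hypothetical snippet, not farnz's actual code): in Python an UPPER_CASE constant is immutable only by convention, while in Rust reassigning a `const` is rejected at compile time, so this class of mistake cannot survive to a run.

```rust
// SQUARE_SIZE is genuinely immutable, not immutable-by-convention
// as an UPPER_CASE name would be in Python.
const SQUARE_SIZE: u32 = 10;

fn main() {
    // SQUARE_SIZE = 12; // compile error: cannot assign to a const
    let mut size = SQUARE_SIZE; // copy into a local if it must vary
    size += 2;
    println!("square size: {size}");
}
```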
I still used Rust first, because I'm habituated to working in environments where "never run this code again" is a dangerous statement, but if Rust could build faster, it'd also have been the faster language to use. Recompiling my code, there are two things that would speed it up considerably:
If Rust could get the wall-clock time for a small change down below 200 ms, it'd be faster than Python for prototyping here on this laptop. And that's not impossible with parallelism; for most of the build time, I have only one CPU thread of the 16 on this laptop saturated; if Rust in dev profile could saturate 5 CPU threads on average, then Rust in dev profile (1 CPU-second) would have a faster edit/compile/run loop than Python does.
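Spelling out the arithmetic behind that last claim, using the figures given above:

$$
t_{\text{wall}} \approx \frac{t_{\text{CPU}}}{\text{threads saturated}} = \frac{1\ \text{CPU-second}}{5} = 200\ \text{ms}
$$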
And note that I'm mostly ignoring the time it takes to build dependencies here; at least in theory, Rust could eliminate that 20 seconds of highly parallel work by downloading pre-compiled artefacts instead of building from scratch.
Posted Oct 30, 2024 9:19 UTC (Wed)
by taladar (subscriber, #68407)
[Link] (1 responses)
If you have some sort of GUI or web app where you need to navigate to your functionality for each test, those few seconds of compilation time are quickly dwarfed by the repeated wall-time overhead during the test run.
Posted Oct 30, 2024 13:41 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
But by the time I'm putting together a GUI that needs testing, and navigation to functionality, Python is still faster because I run the prototype inside pdb or similar, and tweak it "live", instead of having to recompile and navigate again. Once I've finished prototyping, I might have to restart and re-navigate a few times, but that's when I'm going from the prototype to version 0.
Posted Oct 29, 2024 18:11 UTC (Tue)
by Paf (subscriber, #91811)
[Link] (1 responses)
Posted Oct 30, 2024 9:21 UTC (Wed)
by taladar (subscriber, #68407)
[Link]
Posted Oct 29, 2024 11:54 UTC (Tue)
by mmechri (subscriber, #95694)
[Link]
I didn’t understand the comparison with C++. What is it that C++ does better than Rust when it comes to incremental compilation?
Posted Nov 1, 2024 13:58 UTC (Fri)
by anton (subscriber, #25547)
[Link] (1 responses)
Does Rust generate huge output from text-macro processing? Or what makes the parsing such a time-consuming operation that parallel processing of that part appears to be a promising way towards smaller elapsed compile times? My expectation is that the compiler spends nearly all of its time in the checking parts and in the LLVM back end.
Posted Nov 1, 2024 15:49 UTC (Fri)
by atnot (subscriber, #124910)
[Link]