Python interpreter adds tail calls
The Faster CPython project has been working to speed up the Python interpreter for the past several years. Now, Ken Jin, a member of the project, has merged a new set of changes that have been benchmarked as improving performance by 10% for some architectures. The only change is switching from using computed goto statements to using tail calls as part of the implementation of Python's bytecode interpreter — but that change allows modern compilers to generate significantly better code.
Python's many interpreters
When running a Python program, the interpreter compiles the program to custom Python bytecode — instructions for a high-level stack-based virtual machine. This is a big speed advantage because the bytecode is a denser, more linear representation of the Python program. Instead of needing to process some data structure representing the program, the inner loop of the interpreter can just fetch each instruction sequentially and interpret it. The bytecode is still fairly high-level, however, and Python's new just-in-time compiler (JIT) cannot turn it directly into machine code without first collecting type information by actually running the code.
Prior to Jin's changes, the CPython code base boasted three different interpreters: a switch-based bytecode interpreter, a bytecode interpreter using computed goto statements, and a micro-op interpreter. The micro-op interpreter is a relatively recent addition that is also the work of the Faster CPython project. It is used for validating the implementation of Python's JIT compiler, and doesn't need to run outside of that context, so its performance is not a central concern. On the other hand, every instruction in a Python program is executed by one of the two bytecode interpreters at least once in order to gather the information that the JIT compiler needs to produce machine code.
The bytecode interpreters have the job of running the bytecode until the JIT compiler can take over — and, since it takes time for the JIT compiler to warm up, their performance is important to the overall performance of the program. Whether that interpreter uses switch statements or computed goto statements, its operation is conceptually the same: the interpreter fetches the next instruction, looks it up, jumps to the code to handle that instruction, and then loops.
The difference between the two interpreters comes down to how they jump between the implementation of bytecode instructions. The switch-based interpreter is conceptually laid out like this, although the actual implementation is a bit messier:
while (true) {
    instruction i = get_next_instruction();
    switch (i) {
    case POP:
        // Code for a POP instruction
        ...
        break;
    case ADD:
        // Code for an ADD instruction
        ...
        break;
    ...
    }
}
The computed-goto-based interpreter uses a C extension (computed gotos, also known as "labels as values", supported by GCC and Clang) to look up a label in a jump table and then jump to it, resulting in code that instead looks something like this:
static void* dispatch_table[] = { &&POP, &&ADD, ... };
#define DISPATCH() goto *dispatch_table[get_next_instruction()]

    DISPATCH();
POP:
    // Code for POP
    ...
    DISPATCH();
ADD:
    // Code for ADD
    ...
    DISPATCH();
In this version of the interpreter, the loop is replaced by the sequence of jumps from each bytecode instruction to the next one.
The need for different interpreters is purely performance-motivated. They implement the same loop over the bytecode, but Clang and GCC generate better code for the computed-goto-based version. The switch-based interpreter is an older design that exists in order to support the Microsoft Visual C++ (MSVC) compiler, which doesn't implement computed-goto statements. Most Linux users running the Python build from their distribution will use the computed-goto interpreter.
The main reason the computed-goto-based version is faster is due to the limitations of the CPU's branch predictor. In the switch-based design, a single native instruction (corresponding to the switch) makes an indirect jump to the code for every bytecode instruction. This quickly exhausts the limited prediction capabilities of the branch predictor, making every indirect jump into a slow misprediction. The computed-goto-based design spreads those indirect jumps out over multiple locations (everywhere DISPATCH() is called) at the end of each bytecode instruction's implementation. This lets the branch predictor perform better because it can save more context, and also because it can learn which bytecode instructions are more likely to follow each other.
But there are still problems with the version using computed gotos. Combining the implementations for all of the bytecode instructions into a single large function is hard on the compiler. Most compilers have a limit for how large a function they'll attempt to optimize, but even without that, optimizing a single gigantic function is hard. Many operations, such as register allocation, scale poorly to large functions. Clang and GCC produce valid, working code, but it's not the best code that they could produce with some help.
Tail calls
This is where tail calls come in. Tail calls are an old idea in computer science — when the last thing that one function does is call another function, the first function can simply jump into the second function without using a call instruction, reusing the stack frame. This saves memory and time by not pushing an extra return address to the stack, and lets the second function return directly to the first function's caller. In Python's case, changing the interpreter to use tail calls to jump between the implementation of different bytecode instructions lets the implementation be broken up into many smaller functions, which is easier for the compiler to optimize and therefore produces better code.
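To make the idea concrete, here is a minimal sketch of a tail-call-dispatched interpreter in C. It is not CPython's actual code; the vm_state structure, opcode numbering, and handler names are invented for illustration. Each bytecode instruction gets its own small handler function, and instead of looping, every handler finishes by calling the next instruction's handler in tail position:

typedef struct {
    int *stack_top;                    /* top of the value stack */
    const unsigned char *next_instr;   /* next bytecode to execute */
} vm_state;

typedef int (*handler_fn)(vm_state *vm);

enum { OP_POP, OP_ADD, OP_HALT };      /* hypothetical opcode numbering */

static int op_pop(vm_state *vm);
static int op_add(vm_state *vm);
static int op_halt(vm_state *vm);

static handler_fn dispatch_table[] = { op_pop, op_add, op_halt };

static int op_pop(vm_state *vm)
{
    vm->stack_top--;                               /* code for POP */
    return dispatch_table[*vm->next_instr++](vm);  /* call the next handler in tail position */
}

static int op_add(vm_state *vm)
{
    vm->stack_top[-2] += vm->stack_top[-1];        /* code for ADD */
    vm->stack_top--;
    return dispatch_table[*vm->next_instr++](vm);  /* call the next handler in tail position */
}

static int op_halt(vm_state *vm)
{
    return vm->stack_top[-1];                      /* result is on top of the stack */
}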
Tail calling is usually presented as an optimization, and both Clang and GCC do perform tail-call optimization opportunistically, but until relatively recently that optimization could not be guaranteed, making it impossible to rely on. Some languages, such as Scheme, make it part of the language specification that tail calls must use a jump instruction. This lets programmers using those languages implement loops with recursion without ever overflowing the stack. With C, since the optimization wasn't guaranteed, any program written in that style would have a chance of producing stack-overflow errors if there were a long enough sequence of un-optimized tail calls. In Python's case, that could affect any non-trivial Python program.
Since 2021, Clang has supported a [[clang::musttail]] attribute on function calls, which instructs the compiler that the call must be an optimized tail call. If the compiler can't comply for some reason, it will raise an error. GCC added support for the same attribute in 2024.
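As a small illustration, building on the invented names from the sketch above and assuming a recent Clang, the attribute is written on the return statement that performs the call, and compilation fails if that call cannot be turned into a jump:

static int op_pop(vm_state *vm)
{
    vm->stack_top--;                               /* code for POP */
    /* The compiler must emit this as a jump, or reject the program. */
    [[clang::musttail]] return dispatch_table[*vm->next_instr++](vm);
}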
With the [[musttail]] attribute ensuring that switching to using tail calls would not cause stack overflow errors, Python could consider changing implementations. Unfortunately, just switching to tail calls with no other changes actually introduced a performance regression, Jin said.
The problem was function-call overhead. The prologue of a C function must do certain things in order to uphold the ABI requirements of the architecture. In particular, many architectures require functions to save the values in certain registers ("callee-saved" registers) to the stack. Then the epilogue of the function restores those values to the registers. When tail calling from one function to another, this means that the first function will restore a set of registers from the stack, jump to the second function, and then the second function will immediately put those registers back on the stack, where they just were. This is completely unnecessary, but because the C ABI requires it, every function would repeat this wasted work.
The solution comes from another attribute, preserve_none. Functions with this attribute don't preserve any registers from the caller. The calling convention was originally created for the Glasgow Haskell Compiler (GHC), which is why old documents sometimes refer to it as "ghccc", but C++11 and C23 standardized the attribute for C++ and C, respectively [A reader correctly pointed out that preserve_none has not actually been standardized yet].
Making random functions in a program preserve_none will likely not have much of an impact on performance, since it just puts the burden of preserving any registers on their callers. But for Python's bytecode interpreter, once the program starts executing bytecode instructions, every subsequent call will be a tail call to another preserve_none function. (Or, at the very end of the program, a call to Python's cleanup code and then exit()). So the compiler can completely remove the function prologue and epilogue, eliminating the overhead from using function calls.
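Putting the two attributes together, a plausible sketch of the shape of such handlers looks something like this; again the names are invented for illustration, and this is not CPython's actual generated code. The calling convention has to apply both to the handler definitions and to the function-pointer type used for dispatch, so that neither side saves any registers and each dispatch can compile down to a bare jump:

typedef struct {
    int *stack_top;                    /* top of the value stack */
    const unsigned char *next_instr;   /* next bytecode to execute */
} vm_state;

/* The calling convention is part of the function-pointer type as well. */
typedef int (__attribute__((preserve_none)) *handler_fn)(vm_state *vm);

extern handler_fn dispatch_table[256];             /* one entry per opcode */

__attribute__((preserve_none))
int op_add(vm_state *vm)
{
    vm->stack_top[-2] += vm->stack_top[-1];        /* code for ADD */
    vm->stack_top--;
    /* Guaranteed tail call: no prologue or epilogue, no saved registers,
       just a computed jump to the next instruction's handler. */
    [[clang::musttail]] return dispatch_table[*vm->next_instr++](vm);
}

The hope, as described above, is that with no registers to preserve and no real call or return, the compiler can keep the interpreter's hot state (such as the stack and instruction pointers) in registers across handlers rather than spilling it around every dispatch.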
When using both attributes, Jin's new tail-call-based interpreter inherits the nice performance benefits of the computed-goto-based version, while also making it easier for the compiler to figure out optimal register allocations and other local optimizations.
Impacts on Python
Normally, implementing a fourth interpreter for a language would seem like a fairly big maintenance burden. In Python's case, however, the build process uses an interpreter generator to create interpreters from a shared description of the bytecode instructions. So Jin's change is only about 200 lines of Python code that extends the interpreter generator to produce the tail-calling version.
The new interpreter does have some drawbacks, though. The main one is that it's only recently that compilers have started supporting both required attributes, and so this won't benefit builds of Python made with old compilers. The full benefit of the changes also depends on profile-guided optimization and link-time optimization; without those, the compilers can't take full advantage of the optimization opportunities provided by the smaller functions. The official Python releases are built with profile-guided optimization, link-time optimization, and a recent compiler; so are many Python packages built by different distributions, although a few distributions such as NixOS disable profile-guided optimization for reasons of build reproducibility. In any case, these limits shouldn't cause problems for most Python users.
Jin measured the performance impact of the change as 10% on average (using the geometric mean), but up to 40% on some benchmarks that spend most of their time executing bytecode. He said that the "speedup is roughly equal to 2 minor CPython releases worth of improvements. For example, CPython 3.12 roughly sped up by 5%." The change generated relatively little discussion, considering what a large speedup it delivers for such a small change. Jin's work has been merged, and should be available to end users in Python 3.14, expected in October 2025.
Index entries for this article
Python: CPython
Python: Performance
Posted Feb 26, 2025 14:55 UTC (Wed)
by mechanicker (subscriber, #166579)
Reminds me of the `__declspec(naked)` attribute, used to implement a function-call-interception-based profiler for code built with the MSVC compiler. Thanks to an article by John Panzer in C/C++ Users Journal back in the day (2003/04).
https://bitbucket.org/dhruva/cramp/src/7edad92f9efd7b5495...
Posted Feb 26, 2025 15:17 UTC (Wed)
by Bigos (subscriber, #96807)
The only thing standardized in these language versions is the *syntax* for custom attributes: [[$VENDOR::$ATTRIBUTE]]. So for preserve_none it is [[clang::preserve_none]]. It won't be recognized by GCC.
Posted Feb 26, 2025 15:45 UTC (Wed)
by daroc (editor, #160859)
... I thought I had a source for that, but going back through my notes suggests I may have gotten it mixed up with another attribute. GCC is working on implementing preserve_none, but you're right that it does not seem to be actually standardized. I'll add a correction to the article.
Posted Feb 27, 2025 7:02 UTC (Thu)
by mokki (subscriber, #33200)
But that profile-guided optimization might work better with smaller functions and should improve tail-call interpreter performance to be better than jump tables.
Posted Feb 27, 2025 0:39 UTC (Thu)
by riking (guest, #95706)
In particular, several of these important algorithms are O(n^3) over things like the number of locals, the number of basic blocks, etc.
Hopefully this helps you understand why function fission is such a great idea :)
Posted Feb 27, 2025 20:30 UTC (Thu)
by bjlucier (subscriber, #5281)
In response to various issues that I've reported over the years with sample output from Gambit's compiler, the GCC community has removed most N^3 (or even N^2) algorithms from the compiler. It's also true that sometimes the GCC folks decide to just turn off an optimization pass for functions that are "too large" in a technical sense (see, e.g., -Wdisabled-optimization and the recent report https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922).
GCC's emphasis on inter-procedural analysis is another reason that the developers have worked on removing high-complexity algorithms.
Brad
Posted Feb 27, 2025 2:18 UTC (Thu)
by nickodell (subscriber, #125165)
Posted Feb 27, 2025 9:37 UTC (Thu)
by anton (subscriber, #25547)
There is no copying, no patching, and no native-code (JIT) compiler. The CPython bytecode interpreter with this change is still something that everyone would call an interpreter.
The paper on copy-and-patch compilation also works with code snippets that, in the C source code, are functions with tail-calls, but in many cases the tail-calls are not among the copied code. IIRC it also works with clang and the GHC calling convention.
Posted Feb 27, 2025 13:40 UTC (Thu)
by daroc (editor, #160859)
The new JIT uses the same musttail + preserve_none trick to avoid generating function prologues and epilogues, yes. The stencil files it makes leave out the actual jumps, though, and instead just concatenate the different bytecode implementations directly, to make a big run of straight-line code. (Which is more friendly to the CPU than making indirect jumps)
That's actually why the new JIT had clang as a build requirement — at the time, it didn't work on GCC yet because they were still implementing the needed support. Now that GCC has, though, the build-time dependency on clang might go away again.
Posted Feb 27, 2025 15:33 UTC (Thu)
by nickodell (subscriber, #125165)
Posted Feb 27, 2025 10:29 UTC (Thu)
by anton (subscriber, #25547)
I wonder where the speedup comes from. The patch notes do not give an explanation, and the claims in the article do not look plausible to me [a]. My guess is that more of the interpreter state is kept in registers in the tail-calling interpreter.
The paper on copy-and-patch compilation uses tail calls and has inspired me to look into tail calls for my own work (Gforth) again. Preliminary experiments show that it now works, unlike in the 90s (when I tried it last time). I have not seen better code produced (and I had not expected it).
But there is one benefit I saw in my experiments: For gcc we can use more registers through parameters and explicit global register variables than we currently can use in our single-function labels-as-values approach. With the preserve_none calling convention, clang supports more parameters and therefore more registers, making it an attractive option with clang, too.
The main benefit I expect is 1) faster build times, and 2) that we can introduce additional VM instruction variants: With the single-function labels-as-values approach we are currently limited to about 2000 VM instruction variants in gcc to keep compile times usable; clang compiles Gforth unbearably slowly (hours) even though we reduce the number of VM instruction variants for it (which costs run-time performance).
The main challenge of tail calls for our code-copying (without patching) approach is to find a portable way to copy the code of the function without the tail call. If you just want to use it for an interpreter, that's not a problem.
[a] I find the claims that a single function is too big while at the same time claiming benefits from link-time optimization to be absurd. In particular, I don't expect any benefits from LTO not just because it would take even more time to analyse the whole program than the bytecode-interpreter function, but also because all these small tail-calling functions can tail-call any other of these functions, and can be called from any other of these functions, so realistically no code can be moved between these functions, and I doubt that any useful data-flow information survives these calls; the information about the liveness of interpreter state is already encoded in the use of parameters (at least I would expect the patch to do that, and the preserve_none attribute allows that). Another theoretical possibility would be to have some interpreter state in thread-local (or global) variables, and rely on LTO to register-allocate them, but I doubt that this would work.
Posted Feb 27, 2025 11:24 UTC (Thu)
by mathstuf (subscriber, #69389)
Did you miss this paragraph from the article?
> The solution comes from another attribute, preserve_none. Functions with this attribute don't preserve any registers from the caller. The calling convention was originally created for the Glasgow Haskell Compiler (GHC), which is why old documents sometimes refer to it as "ghccc", but C++11 and C23 standardized the attribute for C++ and C, respectively [A reader correctly pointed out that preserve_none has not actually been standardized yet].
Posted Feb 27, 2025 13:48 UTC (Thu)
by daroc (editor, #160859)
As for whether I've attributed the performance improvements to the right things in the article — modern performance is complicated. The above description is synthesized from the discussion between the Python developers about the rationale and effects of the change; it makes sense to me, based on my understanding of compiler optimization pipelines, but it is entirely possible that both they and I are mistaken about the deeper causes here.
In particular, I think that the LTO comment can make sense — as mentioned, GCC and Clang have limits on how large a function they're willing to create by inlining things. By breaking the interpreter up into smaller functions, you have more opportunities to inline the other functions in the Python codebase that each instruction calls, and then optimize those in context.
Perhaps as the Faster CPython project keeps working on these things we'll see a more detailed performance breakdown that could shed some light on the matter.
Posted Feb 27, 2025 15:13 UTC (Thu)
by qwertyface (subscriber, #84167)
In other words, it could be viewed as structuring the same code differently to make the compiler happy.
Posted Feb 27, 2025 15:56 UTC (Thu)
by qwertyface (subscriber, #84167)
https://github.com/python/cpython/blob/main/Python/byteco...
(It's written in a DSL, with C snippets embedded in it, but you'll get the idea)
As you'll see, there are quite a lot of calls out to other functions, in other compilation units, and these are normal C calls, not tail calls. It's these that I expect are being inlined and optimised at link time. You're right that data flow analysis on the interpreter would be prevented by the restructuring, but I don't suppose that was possible anyway (beyond keeping important interpreter state in registers, which I presume must happen due to the calling convention).
Posted Feb 28, 2025 4:09 UTC (Fri)
by NYKevin (subscriber, #129325)
That sounds a bit pessimistic to me. All of this C code is ultimately generated by what looks to be very convoluted Python code. Logically, there are really only two reasons to write that sort of thing:
* The C code is even more convoluted, to the point that it is easier to maintain the Python code than the C code.
* The C code is not particularly convoluted, but it is very long, boring, and repetitive, so that the Python code can be justified as reducing code duplication and boilerplate.
If we suppose the latter, i.e. that the C code is highly repetitive but otherwise straightforward, then you're not just tail calling between arbitrary C functions. You're tail calling between *highly similar* C functions. That opens up optimization opportunities that might not arise in the general case of "lots of C functions tail calling each other."
Regardless, I find it hard to believe that a huge function with lots of computed indirect jumps can possibly be any *faster* than a bunch of tiny functions tail-calling each other (the compiler knows what a function is, the compiler does not know what your computed jumps are), so at the very least, it can't be significantly worse than the goto implementation.
Posted Feb 28, 2025 19:42 UTC (Fri)
by kenjin (guest, #176271)
I'm the patch author.
First of all, if you're the Anton working on GForth, I'm a huge fan of your work. Thank you for your pioneering work on interpreters and making them faster!
The "why" of the speedup is not 100% clear to us either. Most of what we derived are just speculation from the performance results.
The first thing is that out of the 10-15% geomean speedup, about 2-3% of the speedup actually is just the threaded code itself. You might ask how is that possible, shouldn't the computed goto version already do that? Well it turns out LLVM 19 has a bug where it deduplicates computed gotos. so it undoes the effect of the computed gotos [1]!
Leaving that aside, that's still a 7-12% unaccounted geomean speedup. We suspect it's the register pinning (the calling convention) that achieves the speedups here. So we're guessing the same thing as you. The LTO/PGO thing is not an "optimization" but rather an anti-pessimization. We've recently encountered problems where the CPython interpreter is getting so large that LTO/PGO does not want to work properly for it anymore on some compilers [2].
Lastly, the speedup might be due to suboptimal interprocedural register allocation in Clang. I suspect GCC's speedup with its better interprocedural regalloc will be in the range of 3-5% thereabouts. This is purely a guess.
So the entire thing was mostly fixing inefficiencies in modern compilers. In theory if modern compilers get better, this will become useless (and perhaps even suboptimal). As you pointed out, this does not preserve interprocedural analysis information, so it should perform worse all things being equal (less information to the compiler and such).
[1]: https://github.com/llvm/llvm-project/issues/106846
[2]: https://github.com/python/cpython/issues/129244
Posted Mar 10, 2025 17:47 UTC (Mon)
by nelhage (subscriber, #59579)
Good instincts! It turns out the speedup is mostly due to a regression in the baseline. Compared to clang-18 or GCC the speedup is a much more believable 1-5%. I had the same reaction as you when I saw the initial announcements. I also happened to be on paternity leave, and it turns out childcare is very compatible with alternately doing brief bursts of focused engineering, and then running lots of benchmarks asynchronously in the background. The story also had a lot more plot twists than I expected, which made it tricky to run them down and get confidence I had a full and accurate story.
Posted Feb 28, 2025 0:17 UTC (Fri)
by MrWim (subscriber, #47432)
Nice to see low level performance improvements