Python interpreter adds tail calls
The Faster CPython project has been working to speed up the Python interpreter for the past several years. Now, Ken Jin, a member of the project, has merged a new set of changes that have been benchmarked as improving performance by 10% for some architectures. The only change is switching from using computed goto statements to using tail calls as part of the implementation of Python's bytecode interpreter — but that change allows modern compilers to generate significantly better code.
Python's many interpreters
When running a Python program, the interpreter compiles the program to custom Python bytecode — instructions for a high-level stack-based virtual machine. This is a big speed advantage because the bytecode is a denser, more linear representation of the Python program. Instead of needing to process some data structure representing the program, the inner loop of the interpreter can just fetch each instruction sequentially and interpret it. The bytecode is still fairly high-level, however, and Python's new just-in-time compiler (JIT) cannot turn it directly into machine code without first collecting type information by actually running the code.
Prior to Jin's changes, the CPython code base boasted three different interpreters: a switch-based bytecode interpreter, a bytecode interpreter using computed goto statements, and a micro-op interpreter. The micro-op interpreter is a relatively recent addition that is also the work of the Faster CPython project. It is used for validating the implementation of Python's JIT compiler, and doesn't need to run outside of that context, so its performance is not a central concern. On the other hand, every instruction in a Python program is executed by one of the two bytecode interpreters at least once in order to gather the information that the JIT compiler needs to produce machine code.
The bytecode interpreters have the job of running the bytecode until the JIT compiler can take over — and, since it takes time for the JIT compiler to warm up, their performance is important to the overall performance of the program. Whether that interpreter uses switch statements or computed goto statements, its operation is conceptually the same: the interpreter fetches the next instruction, looks it up, jumps to the code to handle that instruction, and then loops.
The difference between the two interpreters comes down to how they jump between the implementation of bytecode instructions. The switch-based interpreter is conceptually laid out like this, although the actual implementation is a bit messier:
while (true) {
    instruction i = get_next_instruction();
    switch (i) {
    case POP:
        // Code for a POP instruction
        ...
        break;
    case ADD:
        // Code for an ADD instruction
        ...
        break;
    ...
    }
}
The computed-goto-based interpreter uses a GNU C extension ("labels as values") to look up a label in a jump table and then jump to it, resulting in code that instead looks something like this:
static void *dispatch_table[] = { &&POP, &&ADD, ... };
#define DISPATCH() goto *dispatch_table[get_next_instruction()]

    DISPATCH();
POP:
    // Code for POP
    ...
    DISPATCH();
ADD:
    // Code for ADD
    ...
    DISPATCH();
In this version of the interpreter, the loop is replaced by the sequence of jumps from each bytecode instruction to the next one.
The need for different interpreters is purely performance-motivated. They implement the same loop over the bytecode, but Clang and GCC generate better code for the computed-goto-based version. The switch-based interpreter is an older design that exists in order to support the Microsoft Visual C++ (MSVC) compiler, which doesn't implement computed-goto statements. Most Linux users running the Python build from their distribution will use the computed-goto interpreter.
The main reason the computed-goto-based version is faster is due to the limitations of the CPU's branch predictor. In the switch-based design, a single native instruction (corresponding to the switch) makes an indirect jump to the code for every bytecode instruction. This quickly exhausts the limited prediction capabilities of the branch predictor, making every indirect jump into a slow misprediction. The computed-goto-based design spreads those indirect jumps out over multiple locations (everywhere DISPATCH() is called) at the end of each bytecode instruction's implementation. This lets the branch predictor perform better because it can save more context, and also because it can learn which bytecode instructions are more likely to follow each other.
But there are still problems with the version using computed gotos. Combining the implementations for all of the bytecode instructions into a single large function is hard on the compiler. Most compilers have a limit for how large a function they'll attempt to optimize, but even without that, optimizing a single gigantic function is hard. Many operations, such as register allocation, scale poorly to large functions. Clang and GCC produce valid, working code, but it's not the best code that they could produce with some help.
Tail calls
This is where tail calls come in. Tail calls are an old idea in computer science — when the last thing that one function does is call another function, the first function can simply jump into the second function without using a call instruction, reusing the stack frame. This saves memory and time by not pushing an extra return address to the stack, and lets the second function return directly to the first function's caller. In Python's case, changing the interpreter to use tail calls to jump between the implementation of different bytecode instructions lets the implementation be broken up into many smaller functions, which is easier for the compiler to optimize and therefore produces better code.
Tail calling is usually presented as an optimization, and both Clang and GCC perform tail-call optimization opportunistically, but until relatively recently that optimization could not be guaranteed, making it impossible to rely on. Some languages, such as Scheme, require in their specifications that tail calls be compiled to a jump; that guarantee lets programmers in those languages implement loops with recursion without ever overflowing the stack. Since C made no such guarantee, any program written in that style had a chance of producing stack-overflow errors if a long enough sequence of tail calls went un-optimized. In Python's case, that could affect any non-trivial Python program.
Since 2021, Clang has supported a [[clang::musttail]] attribute on function calls, which instructs the compiler that the call must be an optimized tail call. If the compiler can't comply for some reason, it will raise an error. GCC added support for the same attribute in 2024.
With the [[musttail]] attribute ensuring that switching to using tail calls would not cause stack overflow errors, Python could consider changing implementations. Unfortunately, just switching to tail calls with no other changes actually introduced a performance regression, Jin said.
The problem was function-call overhead. The prologue of a C function must do certain things in order to uphold the ABI requirements of the architecture. In particular, many architectures require functions to save the values in certain registers ("callee-saved" registers) to the stack. Then the epilogue of the function restores those values to the registers. When tail calling from one function to another, this means that the first function will restore a set of registers from the stack, jump to the second function, and then the second function will immediately put those registers back on the stack, where they just were. This is completely unnecessary, but because the C ABI requires it, every function would repeat this wasted work.
The solution comes from another attribute, preserve_none. Functions with this attribute don't preserve any registers for their caller; the burden of saving and restoring registers falls entirely on the calling function. The calling convention was originally created for the Glasgow Haskell Compiler (GHC), which is why old documents sometimes refer to it as "ghccc". The attribute has been proposed for standardization in C and C++, but has not yet been adopted into either standard.
Making arbitrary functions in a program preserve_none will likely not have much of an impact on performance, since it just shifts the burden of preserving registers to their callers. But for Python's bytecode interpreter, once the program starts executing bytecode instructions, every subsequent call will be a tail call to another preserve_none function (or, at the very end of the program, a call to Python's cleanup code and then exit()). So the compiler can completely remove the function prologue and epilogue, eliminating the overhead from using function calls.
When using both attributes, Jin's new tail-call-based interpreter inherits the nice performance benefits of the computed-goto-based version, while also making it easier for the compiler to figure out optimal register allocations and other local optimizations.
Impacts on Python
Normally, implementing a fourth interpreter for a language would seem like a fairly big maintenance burden. In Python's case, however, the build process uses an interpreter generator to create interpreters from a shared description of the bytecode instructions. So Jin's change is only about 200 lines of Python code that extends the interpreter generator to produce the tail-calling version.
The new interpreter does have some drawbacks, though. The main one is that it's only recently that compilers have started supporting both required attributes, and so this won't benefit builds of Python made with old compilers. The full benefit of the changes also depends on profile-guided optimization and link-time optimization; without those, the compilers can't take full advantage of the optimization opportunities provided by the smaller functions. The official Python releases are built with profile-guided optimization, link-time optimization, and a recent compiler; so are many Python packages built by different distributions, although a few distributions such as NixOS disable profile-guided optimization for reasons of build reproducibility. In any case, these limits shouldn't cause problems for most Python users.
Jin measured the performance impact of the change as 10% on average (using the geometric mean), but up to 40% on some benchmarks that spend most of their time executing bytecode. He said that the "speedup is roughly equal to 2 minor CPython releases worth of improvements. For example, CPython 3.12 roughly sped up by 5%." The change generated relatively little discussion, in the face of such a large speedup from a small change. Jin's work has been merged, and should be available to end users in Python 3.14, expected in October 2025.
