Following up on the Python JIT
Performance of Python programs has been a major focus of development for the language over the last five years or so; the Faster CPython project has been a big part of that effort. One of its subprojects is to add an experimental just-in-time (JIT) compiler to the language; at last year's PyCon US, project member Brandt Bucher gave an introduction to the copy-and-patch JIT compiler. At PyCon US 2025, he followed that up with a talk on "What they don't tell you about building a JIT compiler for CPython" to describe some of the things he wishes he had known when he set out to work on that project. There was something of an elephant in the room, however, in that Microsoft dropped support for the project and laid off most of its Faster CPython team a few days before the talk.
Bucher only alluded to that event in the talk, and elsewhere has made
it clear that he intends to continue working on the JIT compiler
whatever the fallout. When he gave the talk back in May, he said that he
had been working with Python for around eight years, as a core developer
for six, as part of the Microsoft CPython performance engineering team for
four, and on the JIT compiler for the last two years.
While the team at Microsoft is often equated with the Faster CPython
project, it is really just a part of it; "our team collaborates with
lots of people outside of Microsoft".
Faster CPython results
The project has seen some great results over the last few Python releases.
Its work first appeared in 2022 as part of Python 3.11, which averaged 25%
faster than 3.10, depending on the workload; "no need to change your
code, you just upgrade Python and everything works". In the years
since, there have been further improvements: Python 3.12 was 4% faster than
3.11, and 3.13 improved by 7% over 3.12. Python 3.14, which is due in
October, will be around 8% faster than its predecessor.
In aggregate, that means Python has gotten nearly 50% faster in less than four years, he said. Around 93% of the benchmarks that the project uses have improved their performance over that time; nearly half (46%) are more than 50% faster, and 20% are more than 100% faster. Those are not simply micro-benchmarks; they represent real workloads. Pylint has gotten 100% faster, for example.
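As a rough check on the aggregate figure, the per-release numbers can be compounded. This sketch assumes that each percentage is a multiplicative speedup over the previous release, which is an interpretation rather than something spelled out in the talk:

```python
# Per-release speedups quoted above, read as multiplicative factors
# (an assumption; the talk does not say exactly how they combine).
speedups = {"3.11": 1.25, "3.12": 1.04, "3.13": 1.07, "3.14": 1.08}

def compound(factors):
    """Multiply a sequence of speedup factors together."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total

total = compound(speedups.values())
print(f"3.14 vs. 3.10: about {100 * (total - 1):.0f}% faster")
```

Under that reading, the product comes out at roughly 1.5, which is in line with the "nearly 50%" figure.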
All of those increases have come without the JIT; they come from all of the
other changes that the team has been
working on, while "taking a kind of holistic approach to improving
Python performance". Those changes have a meaningful impact on
performance and were done in such a way that the community can maintain
them. "This is what happens when companies fund Python core
development", he said, "it's a really special thing". On his
slides, that was followed by the crying emoji 😢 accompanied by an
uncomfortable laugh.
Moving on, he gave a "duck typing" example that he would refer to throughout the talk. It revolved around a duck simulator that would take an iterator of ducks and "quack" each one, then print the sound. As an additional feature, if a duck has an "echo" attribute that evaluates to true, it would double the sound:
    def simulate_ducks(ducks):
        for duck in ducks:
            sound = duck.quack()
            if duck.echo:
                sound += sound
            print(sound)
That was coupled with two classes that produced different sounds:
    class Duck:
        echo = False
        def quack(self):
            return "Quack!"

    class RubberDuck:
        echo = True
        def __init__(self, loud):
            self.loud = loud
        def quack(self):
            if self.loud:
                return "SQUEAK!"
            return "Squeak!"
He stepped through an example execution of the loop in
simulate_ducks(). He showed the bytecode for the stack-based
Python virtual machine that was generated by the interpreter and stepped
through one iteration of the loop describing the changes to the stack and
to the duck and sound local variables. That process is
largely unchanged "since Python was first created".
Specialization
The 3.11 interpreter added specialized bytecode into the mix, where some of
the bytecode operations are changed to assume they are using a
specific type, chosen based on observing the execution of the code
a few times. Python is a dynamic language, so the interpreter always needs
to be able to fall back to, say, looking up the proper binary operator for the
types. But after running the loop a few times, it can assume that
"sound += sound" will be
operating on strings so it can switch to a bytecode with a fast path for
that explicit operation. "You actually have bytecode that can still
handle anything, but has inlined fast paths for the shape of your actual
objects and data structures and memory layout."
All of that underlies the JIT compiler, which uses the specialized bytecode
interpreter, and can be viewed as being part of the same pipeline, Bucher
said. The JIT compiler is not enabled by default in any build of Python,
however. As he described in last year's talk, the specialized bytecode
instructions get further broken down into micro-ops, which are "smaller
units of work within an individual bytecode instruction". The
translation to micro-ops is completely automatic because the bytecodes are
defined in terms of them, "so this translation step is machine-generated
and very very fast", he said.
The micro-ops can be optimized; that is basically the whole point of
generating them, he said. Observing the different types and values that
are being encountered when executing through the micro-ops will show
optimizations that can be applied. Some micro-ops can be replaced with
more efficient versions, others can be eliminated because they "are
doing work that is entirely redundant and that we can prove we can remove
without changing the semantics". He showed a slide full of micro-ops
that corresponded to the duck loop and slowly replaced and eliminated
something approaching 25% of them, which corresponds to what the 3.14
version of the JIT does.
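One flavor of that redundancy elimination can be illustrated with a toy pass over a list of micro-ops. The micro-op names and the trace format here are hypothetical, invented for the sketch; CPython's optimizer works on its own internal representation:

```python
def remove_redundant_guards(trace):
    """Drop type guards that an earlier guard in the trace already proved.

    Each micro-op is a (name, operand) pair. GUARD_TYPE takes a
    (variable, type name) operand; STORE_LOCAL takes a variable name and
    invalidates anything known about that variable.
    """
    known = {}       # variable name -> type name proven by an earlier guard
    optimized = []
    for name, operand in trace:
        if name == "GUARD_TYPE":
            variable, type_name = operand
            if known.get(variable) == type_name:
                continue                  # already proven: redundant
            known[variable] = type_name
        elif name == "STORE_LOCAL":
            known.pop(operand, None)      # the variable may have changed
        optimized.append((name, operand))
    return optimized

trace = [
    ("GUARD_TYPE", ("duck", "Duck")),
    ("LOAD_ATTR", ("duck", "quack")),
    ("GUARD_TYPE", ("duck", "Duck")),     # redundant: nothing changed
    ("LOAD_ATTR", ("duck", "echo")),
    ("STORE_LOCAL", "duck"),
    ("GUARD_TYPE", ("duck", "Duck")),     # needed again after the store
]
optimized = remove_redundant_guards(trace)
```

Only the second guard is removed; the one after the store has to stay because the variable may now refer to a different kind of object.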
The JIT will then translate the micro-ops into machine code one-by-one, but it does so using the copy-and-patch mechanism. The machine-code templates for each of the micro-ops are generated at CPython compile time; it is somewhat analogous to the way the micro-ops themselves are generated in a table-driven fashion. Since the templates are not hand-written, fixing bugs in the micro-ops for the rest of the interpreter also fixes them for the JIT; that helps with the maintainability of the JIT, but also helps lower the barrier to entry for working on it, Bucher said.
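In outline, copy-and-patch means copying a pre-built template for each micro-op and filling in its "holes" with run-time values. A toy sketch follows; the template bytes are made up and are not real machine code, and the TEMPLATES table and copy_and_patch() function are assumptions for illustration only:

```python
import struct

# Hypothetical templates: (bytes to copy, offset of a 4-byte hole or None).
# In the real JIT the templates are generated when CPython is built.
TEMPLATES = {
    "LOAD_CONST": (b"\xAA\x00\x00\x00\x00", 1),
    "ADD": (b"\xBB", None),
}

def copy_and_patch(trace):
    """Emit "code" for a list of (micro-op name, operand) pairs."""
    code = bytearray()
    for name, operand in trace:
        body, hole = TEMPLATES[name]
        start = len(code)
        code += body                                  # copy the template
        if hole is not None:
            # Patch the operand into the hole, little-endian.
            struct.pack_into("<I", code, start + hole, operand)
    return bytes(code)

emitted = copy_and_patch([("LOAD_CONST", 7), ("LOAD_CONST", 1), ("ADD", None)])
```

The point of the design is that emitting code is little more than a memory copy plus a few stores, which keeps compilation itself cheap.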
Region selection
With that background out of the way, he moved on to some "interesting
parts of working on a JIT compiler" that are often overlooked, starting
with region selection. Earlier, he had shown a sequence of micro-ops that
needed to be turned into machine code, but he did not describe how that
list was generated; "how did we get there in the first place?"
The JIT compiler does not start off with such a sequence; it starts with
code like in his duck simulation. There are several questions that need to
be answered about that code based on its run-time activity. The first is:
"what do we want to compile?" If something
is running only a few times, it is not a good candidate for JIT
compilation, but something that is running a lot is. Another question is
where should it be compiled? A function can be compiled in isolation or it
can be inlined into its callers and those can be compiled instead.
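A common way to answer the "what" question is to count executions and only compile code that crosses a threshold. The sketch below assumes that approach; the HotSpotDetector class and the threshold value are hypothetical and do not reflect CPython's actual counters:

```python
from collections import Counter

class HotSpotDetector:
    """Count executions per code location and flag the ones that get hot."""

    def __init__(self, threshold=16):   # hypothetical threshold
        self.threshold = threshold
        self.counts = Counter()
        self.compiled = set()

    def record(self, location):
        """Count one execution; return True once, when the location turns hot."""
        if location in self.compiled:
            return False
        self.counts[location] += 1
        if self.counts[location] >= self.threshold:
            self.compiled.add(location)
            return True
        return False

detector = HotSpotDetector(threshold=3)
decisions = [detector.record("simulate_ducks:loop") for _ in range(5)]
```

A low threshold risks compiling code that never runs again; a high one delays the payoff, which is the balance discussed next.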
When should the code be compiled? There is a balance to be struck between
compiling things too early, wasting that effort because the code is not
actually running all that much, and too late, which may not actually make
the program any faster. The final question is "why?", he said; it only
makes sense to compile code if it is clear that compiling will make the
code more efficient. "If they are using really dynamic code patterns or
doing weird things that we don't actually compile well, then it's probably
not worth it."
One approach that can be taken is to compile entire functions, which is
known as "method at a time" or "method JIT". It "maps naturally to the
way we think about compilers" because it is the way that many
ahead-of-time compilers work. So, when the JIT looks at
simulate_ducks(), it can just compile the entire function (the
for loop) wholesale, but there are some other opportunities for
optimization. If it recognizes that most of the time the loop operates on
Duck objects, it can inline the quack() function from it:
    for duck in ducks:
        if duck.__class__ is Duck:
            sound = "Quack!"
        else:
            sound = duck.quack()
        ...
If there are lots of RubberDuck objects too, that class's
quack() method could be inlined as well. Likewise, the attribute
lookup for duck.echo could be inlined for one or both cases, but
that all starts to get somewhat complicated, he said; "it's not always super-easy to reason about, especially for something that is running while you are compiling it".
Meanwhile, what if ducks is not a list, but is instead a generator? In simple cases, with a single yield expression, it is not that much different from the list case, but with multiple yield expressions and loops in the generator, it also becomes hard to reason about. That creates a kind of optimization barrier and that kind of code is not uncommon, especially in asynchronous programming contexts.
Another technique, and the one that is currently used in the CPython JIT, is to use a "tracing JIT" instead of a method JIT. The technique takes linear traces of the program's execution, so it can use that information to make optimization decisions. If the first duck is a Duck, the code can be optimized as it was earlier, with a guard based on the class and inlining the sound assignment. Next up is a lookup for duck.echo, but the code in the guarded branch has perfect type information; it already knows that it is processing a Duck, so it knows echo is false, and the if statement can be removed, leaving:
    for duck in ducks:
        if duck.__class__ is Duck:
            sound = "Quack!"
            print(sound)
"This is pretty efficient. If you have just a list of Ducks, you're going to be doing kind of the bare minimum amount of work to actually quack all those ducks."
The code still needs to handle the case where the duck is not a Duck, but it does not need to compile that piece; it can, instead, just send it back to the interpreter if the class guard is false. If the code is also handling RubberDuck objects, though, eventually that else branch will get "hot" because it is being taken frequently.
At that point, the tracing can be turned back on to see what the code is doing. If we assume that it mostly has non-loud RubberDuck objects, the resulting code might look like:
        elif duck.__class__ is RubberDuck:
            if duck.loud: ...
            sound = "Squeak!Squeak!"
            print(sound)
        else: ...
The two branches that are not specified would simply return to the regular
interpreter when they are executed. Since the tracing has perfect type
information, it knows that echo is true, so the sound should be
doubled, but there is no need to actually use "+=" to get the
result. So, now the function has the minimum necessary code to quack
either a Duck or a non-loud RubberDuck. If
those other branches start getting hot at some point, tracing can once
again be used to optimize it further.
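The two traces can be written out as ordinary Python to check that they behave like the original loop body. This is only a model: quack_generic() stands in for falling back to the interpreter, the sounds are returned instead of printed, and the guards ignore the possibility of someone modifying the classes or instances at run time.

```python
class Duck:
    echo = False
    def quack(self):
        return "Quack!"

class RubberDuck:
    echo = True
    def __init__(self, loud):
        self.loud = loud
    def quack(self):
        if self.loud:
            return "SQUEAK!"
        return "Squeak!"

def quack_generic(duck):
    """Stand-in for the interpreter: the unoptimized loop body."""
    sound = duck.quack()
    if duck.echo:
        sound += sound
    return sound

def quack_traced(duck):
    """The loop body after tracing Duck and non-loud RubberDuck objects."""
    if duck.__class__ is Duck:
        return "Quack!"                  # quack() inlined, echo known false
    elif duck.__class__ is RubberDuck:
        if duck.loud:
            return quack_generic(duck)   # cold branch: back to the interpreter
        return "Squeak!Squeak!"          # echo known true, doubling folded in
    return quack_generic(duck)           # unknown class: back to the interpreter
```

Any object that fails the guards still gets the right answer, just without the speedup.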
One downside of the tracing JIT approach is that it can compile duplicates
of the same code, as with "print(sound)". In "very branchy
code", Bucher said, "some things near the tail of those traces can be
duplicated quite a bit". There are ways to reduce that duplication,
but it is a downside to the technique.
Another technique for selecting regions is called "meta tracing", but he
did not have time to go into it. He suggested that attendees ask their LLM
of choice "about the 'first Futamura projection' and don't misspell it
like me, it's not 'Futurama'", Bucher said to some chuckles around the
room.
Memory management
JIT compilers "do really weird things with memory". C programmers
are familiar with readable (or read-only) data, such as a const
array, and data that is both readable and writable is the normal case.
Memory can be dynamically allocated using malloc(), but that kind
of memory cannot be executed; since a JIT compiler needs memory that it can
read, write, and execute, it requires "the big guns": mmap().
"If you know the right magic incantation, you can whisper to this thing
with all these secret flags and numbers" to get memory that is
readable, writable, and executable:
    char *data = mmap(NULL, 4096,
                      PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
One caveat is that memory from mmap() comes in page-sized chunks;
a page is 4KB on most systems but can be larger. If the JIT code is, say,
four bytes in length, that can be wasteful, so it needs to be managed
carefully. Once you have that memory, he asked, how do you actually
execute it? It turns out that "C lets us do crazy things":
    typedef int (*function)(int);
    ((function)data)(42);
That first line creates a type definition named "function", which
is a pointer to a function that takes an integer argument and returns an
integer. The second line casts the data pointer to that type and
then calls the function with an argument of 42 (and ignores the return
value). "It's weird, but it works."
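Returning to the page-size caveat, the waste is easy to quantify. The sketch below uses Python's mmap module rather than C; the helper names are invented for illustration, and the mapping it creates is only readable and writable, not executable:

```python
import mmap

def pages_needed(nbytes, page_size=mmap.PAGESIZE):
    """Round a byte count up to a whole number of pages."""
    return -(-nbytes // page_size)

def wasted_bytes(nbytes, page_size=mmap.PAGESIZE):
    """Bytes left unused if an allocation gets pages all to itself."""
    return pages_needed(nbytes, page_size) * page_size - nbytes

# An anonymous mapping (file descriptor -1) is readable and writable by
# default; this sketch deliberately does not ask for executable memory.
region = mmap.mmap(-1, mmap.PAGESIZE)
region[:4] = b"\x00\x01\x02\x03"    # stand-in bytes, not real machine code
size = len(region)
region.close()
```

With 4KB pages, a four-byte piece of code given its own page leaves 4092 bytes unused, which is why a JIT would want to pack many pieces of code into each mapping.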
He noted that the term "executable data" should be setting off alarm bells
in people's heads; "if you're a Rust programmer, this is what we call
'unsafe code'", he said to laughter. Being able to write to memory that
can be executed is "a scary thing; at best you shoot yourself in the
foot, at worst it is a major security vulnerability". For this reason,
operating systems often require that memory not be in that state. He said
that the memory should be mapped readable and writable, then filled in, and
switched to readable and executable using mprotect();
if there is a need to modify the data later, it can be switched back and
forth between the two states.
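The discipline of never having memory writable and executable at the same time can be modeled as a small state machine. This is a toy in pure Python that enforces nothing at the operating-system level; the JitRegion class is hypothetical, and the comments mark where a real JIT would change the page protections:

```python
class JitRegion:
    """Toy model of a code region that is writable or executable, never both."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.state = "rw"       # freshly mapped: readable and writable

    def write(self, offset, code):
        if self.state != "rw":
            raise PermissionError("region is executable; make it writable first")
        self.data[offset:offset + len(code)] = code

    def make_executable(self):
        self.state = "rx"       # a real JIT would call mprotect() here

    def make_writable(self):
        self.state = "rw"       # and here, before patching the code again

    def can_execute(self):
        return self.state == "rx"
```

The useful property is that a stray write to code that is currently executable fails loudly instead of silently changing what runs.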
Debugging and profiling
When code is being profiled using one of the Python profilers, code that has been compiled should call all of the same profiling hooks. The easiest way to do that, at least for now, is to not JIT code that has profiler hooks installed. In recent versions of Python, profiling is implemented by using the specializing adaptive interpreter to change certain bytecodes to other, instrumented versions of them, which will call the profiler hooks. If the tracing encounters one of these instrumented bytecodes, it can shut the JIT down for that part of the code, but it can still run in other, non-profiled parts of the code.
A related problem occurs when someone enables profiling for code that has
already been JIT-compiled. In that case, Python needs to get out of the
JIT code as quickly as possible. That is handled by placing special
_CHECK_VALIDITY micro-ops just before "known safe points"
where it can jump out of the JIT code and back to the interpreter. That
micro-op checks a one-bit flag; if it is set, the execution bails out of
the JIT code. That bit gets set when profiling is enabled, but it is also
used when code executes that could change the JIT optimizations (e.g. a
change of class attributes).
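The bail-out mechanism can be sketched in plain Python. Everything here is a stand-in: the Executor class models a piece of compiled code with its validity flag, the check at the top of the loop plays the role of the validity micro-op, and the callback simulates user code that turns on profiling midway through.

```python
class Executor:
    """Hypothetical stand-in for a piece of JIT-compiled code."""

    def __init__(self):
        self.valid = True    # cleared when the code's assumptions may not hold

def run_compiled(executor, items, after_each=None):
    """Double each item, checking validity at the top of every iteration.

    Returns the results so far and the index where the slow path should resume.
    """
    results = []
    for index, item in enumerate(items):
        if not executor.valid:         # the check at a safe point
            return results, index      # bail out of the "compiled" code
        results.append(item * 2)
        if after_each is not None:
            after_each(index)          # e.g. user code that enables profiling
    return results, len(items)

executor = Executor()

def enable_profiling(index):
    if index == 1:
        executor.valid = False         # invalidate the compiled code

items = [1, 2, 3, 4]
results, resume_at = run_compiled(executor, items, enable_profiling)
# Finish the remaining items on the slow path, as the interpreter would.
results += [item * 2 for item in items[resume_at:]]
```

The compiled loop handles the first two items, notices the cleared flag at the next safe point, and hands the rest back with nothing lost or repeated.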
Something that just kind of falls out of that is the ability to support "the
weirder features of Python debuggers". The JIT code is created based
on what the tracing has seen, but someone running pdb could
completely upend that state in various ways (e.g. "duck =
Goose()"). The validity bit can be used to avoid problems of that
sort as well.
For native profilers and debuggers, such as perf and GDB, there is a need
to unwind the stack through JIT frames, and interact with JIT frames, but
"the short answer is that it's really really complicated". There
are lots of tools of this sort, for various platforms, that all work
differently and each has its own APIs for registering debug information in
different formats. The project members are aware of the problem, but are
trying to determine which tools need to be supported and what level of
support they actually need.
Looking ahead
The current Python release is 3.13; the JIT can be built into it by using the --enable-experimental-jit flag. For Python 3.14, which is out in beta form and will be released in October, the Windows and macOS builds have the JIT built-in, but it must be enabled by setting PYTHON_JIT=1 in the environment. He does not recommend enabling it for production code, but the team would love to hear about any results from using it: dramatic improvements or slowdowns, bugs, crashes, and so on. Other platforms, or people creating their own binaries, can enable the JIT with the same flag as for 3.13.
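For those who do try it, a running interpreter can be asked whether its JIT is present and turned on. The sketch below relies on the sys._jit introspection namespace, which was not mentioned in the talk and is assumed here to exist only on Python 3.14 and later; on older versions the function simply returns None:

```python
import os
import sys

def jit_status():
    """Report what this interpreter says about its JIT, if it says anything."""
    jit = getattr(sys, "_jit", None)    # assumed absent before Python 3.14
    if jit is None:
        return None
    return {
        "available": jit.is_available(),
        "enabled": jit.is_enabled(),
    }

status = jit_status()
print("PYTHON_JIT =", os.environ.get("PYTHON_JIT"))
print("JIT status:", status)
```

Running it once with and once without PYTHON_JIT=1 in the environment is a quick way to confirm that the setting took effect before doing any benchmarking.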
For 3.15, which is in a pre-alpha stage at this point, there are two GitHub
issues they are focusing on: "Supporting stack
unwinding in the JIT compiler" and "Make the JIT
thread-safe". The first he had mentioned earlier with regard to
support for native debuggers and profilers. The second is important since
the free-threaded build of CPython seems to be working out well and is
moving toward becoming the default; see PEP 779 ("Criteria for
supported status for free-threaded Python"), which was recently accepted
by the steering council. The Faster CPython developers think that
making the JIT thread-safe can be done without too much trouble; "it's
going to take a little bit of work and there's kind of a long tail of
figuring out what optimizations are actually still safe to do in a
free-threaded environment". Both of those issues are outside of his
domain of expertise, however, so he hoped that others who have those skills
would be willing to help out.
In addition, there is a lot of ongoing performance work that is going into the 3.15 branch, of course. He noted, pointedly, that fast progress, especially on larger projects, will depend on the availability of resources. The words on his slide saying that changed to bold and he gave a significant cough to further emphasize the point.
As he wrapped up, he suggested PEP 659 ("Specializing Adaptive Interpreter") and PEP 744 ("JIT Compilation") for further information. For those who would rather watch something, instead of reading about it, he recommended videos of his talks (covered by LWN and linked above) from 2023 on the specializing adaptive interpreter and from 2024 on adding a JIT compiler. The YouTube video of this year's talk is available as well.
[Thanks to the Linux Foundation for its travel sponsorship that allowed me to travel to Pittsburgh for PyCon US.]
