Adding a JIT compiler to CPython
One of the big-ticket items for the upcoming Python 3.13 release is an experimental just-in-time (JIT) compiler for the language; the other is, of course, the removal of the global interpreter lock (GIL), which is also an experiment. Brandt Bucher is a member of the Faster CPython project, which is working on making the reference implementation of the language faster via a variety of techniques. Last year at PyCon, he gave a talk about the specializing adaptive interpreter; at PyCon 2024 in Pittsburgh, he described the work he and others have been doing to add a copy-and-patch JIT compiler to CPython.
Background and history
He began with a little bit about himself; "I started using Python about seven years ago and everything since then has been kind of a whirlwind". He started contributing code to Python, then joined the bug-triage team, and became a core developer after he worked on the structural-pattern-matching feature. He currently works on the Microsoft Faster CPython team ("it's my literal 'dream job'"); he helped to make CPython 3.11 25% faster for most workloads. More recently, he implemented the JIT compiler in order to continue improving the performance of the language.
![Brandt Bucher](https://static.lwn.net/images/2024/pycon-bucher-sm.png)
A bit of recent history was next up, he said, to explain why the team thought there was a need for a JIT compiler now, along with describing some of the steps taken to get to this point. The specializing adaptive interpreter (from PEP 659) was added for Python 3.11; it "sort of listens to your program as it runs" in order to type-specialize the code and to cache important values. That meant that the interpreter was gathering valuable data about the program as it ran; "the exact sort of data that we would care about if we wanted to [...] compile it later".
In Python 3.12, there was a "quality of life improvement" for maintainers. The specializing adaptive interpreter had added complexity to what was otherwise a fairly simple interpreter, so the team took some time to add infrastructure that allowed automatically generating the interpreter from a specification in a domain-specific language (DSL). This simplified the code by generating much of the boilerplate for reference-count maintenance, error handling, and more. It also meant that the specification could be used to generate multiple interpreters or, even, things that are not interpreters, Bucher said.
For Python 3.13, which had been feature-frozen just before PyCon and is due in October, a new micro-op interpreter was added. It is an "entirely separate execution pipeline" that analyzes the running code to determine where the hot paths are; those paths are then lowered (i.e. translated into a lower-level format) into new bytecode operations, called micro-ops, which are optimized and executed in another dedicated interpreter. All of this is only possible, Bucher said, because of the steps taken in 3.11 and 3.12.
An example
He put up an iterative Fibonacci function (from his slides) that he would be using to demonstrate how the micro-op interpreter works:
```python
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```
Since the interpreter would be spending most of its time in the for loop, that is where he wanted to focus. For those who might not be fully up to speed on Python, the _ is often used as a placeholder for a value that is not needed, and the range() function returns an iterator that produces n successive values starting at zero (in this case). The body of the loop uses tuple syntax to do two assignments: a=b and b=a+b. The bytecode operations for the loop look like the following (from his slides):
```
for _ in range(n):        FOR_ITER
                          STORE_FAST
    a, b = b, a + b       LOAD_FAST_LOAD_FAST
                          LOAD_FAST
                          BINARY_OP
                          STORE_FAST_STORE_FAST
                          JUMP_BACKWARD
```

The Python bytecode instructions are described as part of the documentation for the dis module, which can be used to disassemble compiled bytecode. Some of the newer specialized bytecode instructions are not yet documented, however.
Python has a stack-based virtual machine, Bucher reminded attendees; when the bytecode above starts executing, there is already an iterator (for range()) on the stack. The FOR_ITER bytecode is responsible for getting the next item from the iterator and pushing it onto the stack. But getting the next item for an iterator in Python is a surprisingly complex operation (looking up the object type, calling its __next__() member function, and so on) due to Python's object model. If it is known that the iterator is using range(), much of that can be simplified. So the interpreter will recognize that it is and will specialize the bytecode after a few iterations to FOR_ITER_RANGE, which has a fast path to generate the needed numbers directly. The STORE_FAST instruction will store the number generated, which it pops from the stack, into the _ variable. The LOAD_FAST_LOAD_FAST instruction just does two LOAD_FAST instructions (which push the value of a local variable onto the stack), pushing b, then a; the next LOAD_FAST then pushes b again.
The BINARY_OP instruction pops two values from the stack, performs a binary operation ("+" in this case), and pushes the result back onto the stack. Performing a binary operation in Python is complicated, though, because the types could be anything that the operator is defined for, so there are lots of steps the virtual machine has to take to execute a generic binary operation. With a chuckle, Bucher said that you literally could define an object where the addition operator "formats your hard drive"; "it is a very overpowered mechanism, especially when you are just adding two integers together". So the BINARY_OP will be specialized to a BINARY_OP_ADD_INT instruction that dispenses with all of the rest of the generic handling and just adds the integers.
Finally, the STORE_FAST_STORE_FAST instruction does two STORE_FAST instructions, storing a+b to b and (the original value of) b to a. JUMP_BACKWARD then simply jumps back to the beginning of the loop, as might be expected.
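Readers can reproduce both the generic bytecode and its specialized form with the dis module on a recent CPython (3.12 or later); the exact instruction names vary by version:

```python
import dis

def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

dis.dis(fibonacci)                  # generic bytecode: FOR_ITER, BINARY_OP, ...

fibonacci(10_000)                   # run the loop enough times to trigger specialization
dis.dis(fibonacci, adaptive=True)   # specialized forms such as FOR_ITER_RANGE and
                                    # BINARY_OP_ADD_INT should now appear
```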
So after the specialization phase, "we actually have some decent profiling information about your code", he said; the original bytecode did not say much about the data types or the operations being performed, but the specialized instructions do. The next step is to do "micro-op traces", which is a phase that breaks up the bytecode instructions into smaller, more easily optimized, parts, called micro-ops.
Micro-op traces
Once the Python code gets warmed up ("for some definition of 'warm'"), the interpreter starts to translate bytecodes into micro-ops. That translation step is automatic, he said, because of the DSL; the larger instructions can be defined in terms of their constituent micro-ops in the DSL description. So the translation just becomes a table lookup, rather than hand-written code.
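As a rough illustration of that lookup (the table below is hand-abbreviated; the real tables are generated from the DSL at build time and list every micro-op for each instruction):

```python
# Illustrative sketch only: CPython's real expansion tables are generated from
# the DSL at build time.  Expanding a trace is then just dictionary lookups.
EXPANSIONS = {
    "FOR_ITER_RANGE":    ["_CHECK_VALIDITY_AND_SET_IP", "_ITER_CHECK_RANGE"],  # abbreviated
    "BINARY_OP_ADD_INT": ["_CHECK_VALIDITY_AND_SET_IP", "_GUARD_BOTH_INT"],    # abbreviated
}

def expand(bytecode_trace):
    return [uop for instr in bytecode_trace for uop in EXPANSIONS[instr]]
```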
He showed the micro-ops that make up each of the bytecode instructions in the example loop. Each bytecode instruction starts with the same micro-op: _CHECK_VALIDITY_AND_SET_IP, which is important to ensure the correctness of the generated code; it is a mechanism to check that "the world hasn't changed too drastically". User code "can do all sorts of crazy things": it can raise exceptions, change the values of local variables in a debugger, monkey-patch the builtins, and more. So that micro-op is important in the general case, but is not needed here, he said, because of the information that has been gathered; it is known that it is a simple range iteration that adds two integers. That means that seven micro-ops are not needed (one per bytecode instruction of the loop).
Next, the FOR_ITER_RANGE translation has an _ITER_CHECK_RANGE as its first micro-op (after _CHECK_VALIDITY_AND_SET_IP was dropped). It ensures that the object being iterated over is still a range, but in this specific case, the code in the loop makes it clear that it will always be a range, so that guard micro-op can be dropped as well.
Similarly, the micro-ops for BINARY_OP_ADD_INT (now) start with a _GUARD_BOTH_INT micro-op, which guards against the types of the operands changing and chooses the regular, slow path if that happens. Here, though, the operand types cannot change, so the guard can be removed. In the end, the 19 micro-ops generated for the seven bytecode instructions were reduced to ten.
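That pass can be thought of as a filter over the trace; the sketch below only illustrates the idea and is not CPython's actual optimizer:

```python
# Illustration only: drop guard micro-ops whose conditions earlier analysis of
# the trace has already proven (e.g. _GUARD_BOTH_INT when both operands must
# be ints, or _ITER_CHECK_RANGE when the iterator is known to be a range).
def strip_redundant_guards(trace, proven_guards):
    return [uop for uop in trace if uop not in proven_guards]
```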
That list of ten micro-ops is, effectively, the minimal amount of work needed to "faithfully execute the Python semantics for this for loop". It is all statically typed at this point with no unneeded overhead; all of the dynamic-type handling has been removed, "which is perfect if we want to further lower it". It turns out that further lowering is needed, because the micro-op instruction format, despite its name, is more complex than bytecode instructions; the micro-op format has a lot of information about cached values, de-optimization options, and things of that nature.
Because of that complexity, the decoder in the micro-op interpreter is a lot more expensive than the bytecode interpreter, he said. Even though each micro-op is doing less work, there are more of them than there are bytecode instructions (ten versus seven), so the dispatch overhead is higher. Turning on the micro-op traces, doing the optimization as he described, and running the result in the micro-op interpreter is actually around 20% slower than the bytecode interpreter. That is "super disappointing", because all of this extra work resulted in slower execution, but that is where JIT compilation comes into play.
JIT
The hope is that the JIT compiler can overcome the additional interpreter overhead and gain even more performance by compiling the optimized traces directly to machine code. Three mechanisms are being used to reduce the indirection that causes some performance loss.
The first is to "burn in" constants, caches, and arguments, so they are encoded directly in the machine code without having to be decoded from individual instructions in the interpreter. The second technique is to move data out of stack frames and into registers. The Python interpreter is stack-based, but that means each push and pop operation goes out to memory, so redirecting them to registers provides a large savings. The third mechanism is to move the hot code paths to inline code; "rather than having a generic interpreter that needs to handle every possible micro-op", Python will "create a straight-line sequence of the exact code that we need to execute to reduce that dispatch overhead".
All of this work needs to be weighed against Python's deployment goals, Bucher said. For example, these changes need to have broad platform support; "Python runs a lot of places" and he would not consider the project to be a success if the JIT only runs on one or two operating systems, or only on one architecture. Similarly, he wants to ensure that there are few runtime dependencies for the JIT; some deployments cannot also install a full compiler toolchain—or even just a C compiler. It should remain true that users can just download a binary from python.org and "everything just works".
The most important goal to him is that the JIT have a low implementation complexity. Python is maintained almost exclusively by volunteer effort. The project would not be a success if "our new JIT compiler was so big and so complex that nobody could reason about it, no one could maintain it, no one could fix bugs in it, or improve it, or extend on it". The team is willing to sacrifice some performance in order to ensure that the Python project remains healthy.
Historically, JIT compilers are "very very complex things", which means that the technique is in opposition to the deployment goals. But, it turns out that copy-and-patch compilation "solves a lot of these problems in a very satisfying way". It is a technique for "generating template JIT compilers", which is "really really cool". The paper at the link is quite good, Bucher said, but there is also a blog post by its main author where he implements a Lua JIT from scratch using the technique. "If you don't know how JIT compilers work and you read that blog post, you'll know how JIT compilers work."
The paper shows good performance increases, he said. Compared to the Liftoff WebAssembly baseline compiler, the paper reported code generation that was five times faster and generated code that ran 50% faster. Compared to a traditional JIT toolchain, such as LLVM with a low optimization level (-O0), the code was generated 100 times more quickly and it ran 15% faster using the copy-and-patch technique. As sort of an upper bound on what can be expected, comparing it to an optimizing JIT with hand-written assembly (LuaJIT) showed that copy-and-patch executed more quickly on around one-third of the LuaJIT benchmark tests and was only 35% slower overall.
Thought experiment
He led attendees through a thought experiment on how you might go about implementing a program to compile a sequence of "bytecode instructions to machine code as quickly as possible". The program could step through the bytecode; for each one it could copy some static, pre-compiled machine code into executable memory, then the static code could be patched for anything that needed changing in the template. Patches would include things like jump targets, constants, and cached values.
If you look at those steps, he said, they look like something that we have already had for a long time: relocatable object files. The patching in that case is for the information from relocation records for external symbols and the like. The way copy-and-patch works is analogous to the functioning of the linker and loader for object files.
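A toy version of those two steps might look like the following; the template bytes here are a single x86-64 "mov eax, imm32" instruction chosen for illustration, not anything taken from CPython:

```python
import struct

# Toy illustration of copy-and-patch: copy a pre-compiled template into the
# code buffer, then patch its 4-byte "hole" with the value needed at run time.
TEMPLATE = bytes.fromhex("b800000000")        # x86-64: mov eax, imm32 (hole zeroed)

def emit(code, offset, value):
    code[offset:offset + len(TEMPLATE)] = TEMPLATE              # copy
    code[offset + 1:offset + 5] = struct.pack("<I", value)      # patch the hole
    return offset + len(TEMPLATE)

buf = bytearray(64)                           # stand-in for executable memory
end = emit(buf, 0, 42)
```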
He used the code for a single micro-op (_LOAD_FAST) as an example of how the JIT works. The code for the micro-op is fairly straightforward:
```c
case _LOAD_FAST: {
    PyObject *value = frame->localsplus[oparg];
    Py_INCREF(value);
    *stack_pointer++ = value;
    break;
}
```
It simply gets an object (value) from the local-variables array (localsplus) of the frame, increments its reference count, and pushes it onto the stack (pointed to by stack_pointer).
At CPython build time, the body of the micro-op is extracted into its own function so it can be compiled separately. Of course, the code will not compile because it is missing frame and stack_pointer, but those can be passed as parameters to the function. It is also missing the value of oparg, which is treated somewhat differently.
When the function is compiled to machine code, the value of oparg will not be known, but when the JITted code is placed into memory for a given usage of _LOAD_FAST, that value will be known. It will be the same each time, so there needs to be a way to patch that value into the compiled code when the final output for the micro-op is constructed. Since "we don't really know how to do that yet", he suggested that people think of oparg being replaced with MAGICALLY_INSERT_OPARG, but he would come back to that.
Similarly, there will be more than just a single micro-op to be JITted, so there needs to be some way to transfer control to the next instruction. He would just "magically hand-wave it away" by inserting a line at the end of the function:
```c
return MAGICALLY_RUN_NEXT_MICRO_OP(frame, stack_pointer);
```
The copy-and-patch technique has a "really elegant solution" for handling the two MAGICALLY_* placeholders by using extern declarations. This will cause the C compiler to leave holes in the generated code along with instructions on where and how to patch those holes; those instructions are intended for use by the linker, but CPython can use them for its purposes. In the end, the code for the _LOAD_FAST micro-op looks like the following:
```c
extern int MAGICALLY_INSERT_OPARG;
extern int MAGICALLY_RUN_NEXT_MICRO_OP(_PyInterpreterFrame *, PyObject **);

int _LOAD_FAST(_PyInterpreterFrame *frame, PyObject **stack_pointer)
{
    PyObject *value = frame->localsplus[MAGICALLY_INSERT_OPARG];
    Py_INCREF(value);
    *stack_pointer++ = value;
    return MAGICALLY_RUN_NEXT_MICRO_OP(frame, stack_pointer);
}
```
At CPython build time, that code gets compiled at a high-optimization level; there is no need to worry about the amount of time that takes since it is not done at run time. That generates an ELF object with the machine code and the relocation information for the two magical symbols. That code can actually be further optimized by CPython because it knows more than the C compiler about the domain of oparg, which always fits in 32 bits, so it can use a different instruction with a 32-bit immediate value. CPython also knows how the code will be used; since the JITted code for the micro-ops will be placed one right after another in memory, the jump emitted for MAGICALLY_RUN_NEXT_MICRO_OP would effectively be a jump of zero bytes, so it can be eliminated.
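As a hedged sketch of that last cleanup (the real post-processing is architecture-specific and more involved), dropping a trailing jump whose patched displacement would be zero might look like:

```python
# Illustration only: if a template ends with a 5-byte x86-64 relative jump
# (0xe9) and the next micro-op's code is emitted immediately after it, the
# displacement would be zero, so the jump can simply be removed.
JMP_REL32 = 0xe9

def strip_trailing_jump(template: bytes) -> bytes:
    if len(template) >= 5 and template[-5] == JMP_REL32:
        return template[:-5]
    return template
```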
The byte values for the instructions can be extracted from the ELF, then put into a function that looks like the following:
```c
static void
emit__LOAD_FAST(unsigned char *code, _PyUOpInstruction *uop)
{
    const unsigned char code_body[] = {
        0xb8, 0x00, 0x00, 0x00, 0x00, 0x48, 0x8b, 0x44,
        0xc7, 0x48, 0x8b, 0x08, 0xff, 0xc1, 0x74, 0x02,
        0x89, 0x08, 0x48, 0x89, 0x06, 0x48, 0x83, 0xc6,
        0x08,
    };
    memcpy(code, code_body, sizeof(code_body));
    memcpy(code + 1, &uop->oparg, 4);
}
```
The code_body is the sequence of bytes emitted by the C compiler, with the four zero values as the hole to be filled with oparg, which is what the second memcpy() call does. This technique makes the JIT run extremely quickly. "It actually takes almost twice as long to allocate the memory as it does to JIT the memory; that's not because allocation is slow, that's because copy-and-patch is so fast."
Fitting into CPython
Currently, the JIT build of Python 3.13 uses around 1000 lines of complex Python and 100 lines of complex C code for the copy-and-patch implementation. That is the code that builds the C templates, extracts the generated code from the ELF object, fixes up some individual micro-ops, and then writes out the generated C code. There is a dependency on LLVM, but that is only at build time; users of CPython do not need LLVM at run time.
The run-time code for the JIT compiler is around 400 lines of simple-ish hand-written C. The core of it is a loop over a micro-op trace calling emit_*() functions for each. There are also around 9000 lines of simple generated C code, which is "truly just as simple as what I showed you on the previous slide" (emit__LOAD_FAST() above). There are no run-time dependencies for users.
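A sketch of that core loop, written in Python here for brevity (the real driver is hand-written C and the names are illustrative):

```python
# Illustration only: the JIT's driver walks the optimized micro-op trace and
# calls the generated emitter for each micro-op, appending its machine code.
def compile_trace(trace, emitters):
    code = bytearray()
    for uop in trace:
        code += emitters[uop.name](uop)   # each emitter copies its template and patches it
    return bytes(code)
```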
Because the CPython JIT uses LLVM, "we have incredible platform support right out of the box". It runs on all of the most popular operating systems (Linux, macOS, and Windows) for x86_64, as well as 32-bit Windows on x86; it also supports the three OSes on Arm 64-bit processors. It is mind-blowing to him that the project was able to support so many different platforms, so quickly, and with little additional effort.
As he noted earlier, turning on the micro-op interpreter (without enabling the JIT) is about 20% slower. That mode is mostly useful for developers on the project, because it is much easier to debug interpreted code than JITted code. It also uses around 1% more memory to store the micro-op traces.
The goal for 3.13 was to get a basic, working JIT compiler into CPython so that people could try it out, test deploying it, and allow the project to shake out more bugs—"something that doesn't crash and doesn't perform horribly". The JIT mode is experimental, but if it is configured into the binary, it is basically 0% slower; it has regained all of what was lost to the dispatch overhead in the micro-op interpreter. Meanwhile, most of the optimizations that he described in the talk have not been implemented yet, so the plan is to use the 3.14 cycle to improve on that; "I am confident that we will be able to".
The JIT compiler uses around 10% more memory, which is kind of an upper bound, Bucher said, because it is based on benchmarks that have a high ratio of code to data. Typically, if you are worried about memory usage it is because you are processing a lot of data, so the amount of JITted code will be a small portion of the memory usage. In addition, the JIT developers have not spent any time on reducing the memory used, so that figure is likely to be a high-water mark.
For further information, he recommended his previous talk and PEP 659, both linked above, to get the background on the specializing adaptive interpreter. He recently wrote PEP 744 ("JIT Compilation"), which covers many of the same things as were in the talk, but also gives more background, describes some of the design decisions, and looks to the future.
Those who want to build the JIT into their Python can simply pass a flag to either configure (--enable-experimental-jit) or build.bat (--experimental-jit). That should work for cross-compilation as well. Since the JIT is part of CPython, Bucher asked that any bugs found be filed as GitHub issues in the main CPython repository. With that, his half-hour session ran out of time.
[I would like to thank the Linux Foundation, LWN's travel sponsor, for travel assistance to Pittsburgh for PyCon.]