
Improved code generation in the CPython JIT

By Daroc Alden
January 18, 2024

As previously reported, Python 3.13 is set to include a copy-and-patch just-in-time (JIT) compiler that transforms Python bytecode into machine code for execution. Ken Jin from the Faster CPython project has been working on taking the JIT further by adding support for a peephole optimizer that rewrites the JIT's intermediate representation to introduce constant folding, type specialization, and other optimizations. Those techniques should provide significant benefits for the performance of many different types of code running on CPython.

Background

[Figure: Code transformations]

Currently, Python code is transformed in a few ways before execution. After being parsed, the code is transformed into a high-level bytecode. Since PEP 659 (Specializing Adaptive Interpreter) was adopted in version 3.11, Python has used a specializing bytecode interpreter. This work introduces "adaptive" instructions which, when run, replace themselves with instructions specialized to the types of their arguments. Having inline type information can help eliminate overhead from Python's dynamic dispatch by caching the necessary functions in the bytecode directly.
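The effect of the specializing interpreter can be observed from Python itself. The dis module in CPython 3.11 and later accepts an adaptive flag that shows the quickened bytecode; in the sketch below (output will vary by CPython version), a warmed-up integer addition is typically rewritten to a specialized instruction such as BINARY_OP_ADD_INT:

```python
import dis
import sys

def add(a, b):
    return a + b

# Warm the function up so the adaptive instructions can specialize.
for _ in range(100):
    add(1, 2)

if sys.version_info >= (3, 11):
    # adaptive=True shows the specialized (quickened) bytecode.
    ops = [ins.opname for ins in dis.Bytecode(add, adaptive=True)]
    print(ops)
```
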

The JIT introduced in version 3.13 takes Python's existing bytecode and transforms it into machine code. It does this in two steps, first breaking each instruction into micro-operations, and then translating each micro-operation into machine code. This two-pass approach is intended to allow Python to continue working in an environment where JIT compilation is not supported by permitting the micro-operations to be interpreted directly by a micro-operation interpreter (called the Tier 2 interpreter in the design documentation). This also serves as a specification against which to test the JIT to make sure it remains correct. The new optimizer operates on the micro-operations before they are interpreted or transformed into machine code.

In summary, Python code now passes through several phases as the interpreter warms up. It is parsed and compiled to bytecode, and then adaptive instructions specialize the bytecode. Next, the interpreter identifies hot spots and converts them to micro-operations. These pass through any available optimizers before being compiled to machine code by the JIT or executed directly by a specialized interpreter.

Optimizations

The micro-operation optimizer works using abstract interpretation. Unlike normal interpretation, where every value is known at the time the code is run, in abstract interpretation some of the information is known and some is represented by an opaque token standing in for information that will only be learned later, at the actual runtime of the program. The trace of the abstract interpreter's execution is used to generate the sequence of micro-operations to be compiled or executed. The optimizer takes the known information in the bytecode (provided by specialization) and simulates the operation of each instruction. When all of the inputs to an instruction are fully known, such as when adding constant values together, the result is also fully known and can be carried into further instructions without recording an instruction to be emitted in the output of the optimizer. When only some of the inputs are known, the optimizer can still sometimes emit an instruction which is specialized to that input, allowing bytecode specialization to propagate. When none of the inputs are known, the instruction is emitted to be executed at runtime.
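The scheme described above can be sketched in miniature. The following toy constant folder abstractly interprets a made-up stack machine (the micro-operation names are invented for illustration, not CPython's): known values stay on an abstract stack, an opaque UNKNOWN token stands in for runtime-only values, and an instruction is emitted only when its result cannot be computed ahead of time:

```python
UNKNOWN = object()  # opaque token for a value only known at runtime

def fold(trace):
    """Abstractly interpret a toy stack machine, emitting only the
    operations whose results cannot be computed ahead of time."""
    stack, out = [], []
    for op, arg in trace:
        if op == "LOAD_CONST":
            stack.append(arg)                # fully known: emit nothing yet
        elif op == "LOAD_NAME":
            stack.append(UNKNOWN)
            out.append((op, arg))
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            if a is not UNKNOWN and b is not UNKNOWN:
                stack.append(a + b)          # fold the addition away
            else:
                # Materialize any still-known operand, then emit the ADD.
                # (This simple version relies on ADD being commutative.)
                for v in (a, b):
                    if v is not UNKNOWN:
                        out.append(("LOAD_CONST", v))
                stack.append(UNKNOWN)
                out.append((op, arg))
    # Anything still known on the abstract stack becomes a constant load.
    out.extend(("LOAD_CONST", v) for v in stack if v is not UNKNOWN)
    return out
```

For example, folding `2 + 3 + x` leaves only the load of `x`, a load of the precomputed constant 5, and a single ADD for the runtime to perform.
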

The core operation of the optimizer — taking known values and combining them to reduce the number of operations needed at runtime — is known as constant folding, which is a common optimization in other languages. Python actually already performs constant folding on constants present in the source code. The benefit of doing constant propagation again for code about to be JIT compiled is that the optimizer can take advantage of information discovered by adaptive instructions that was not available to the bytecode compiler.
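The existing compile-time folding is easy to see: constant expressions in the source are reduced before the bytecode is ever run, so the folded result appears directly in the code object's constant table and no arithmetic instruction is emitted:

```python
import dis

# The bytecode compiler folds constant expressions from the source,
# so "2 * 3 + 4" is stored as the single constant 10 and no
# addition or multiplication instruction remains.
code = compile("x = 2 * 3 + 4", "<demo>", "exec")
print(code.co_consts)
```
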

Because Python is a dynamically typed language, the type information that adaptive instructions capture can be important to propagate via constant folding because of the impact on constant guards. Python bytecode frequently contains guard conditions that check whether an assumption, such as whether two operands have compatible types, is true — raising an exception or bailing to specialized handling code when it isn't. Guards are also used to ensure that specialized instructions remain applicable, and de-specialize the instruction if not. However, the bytecode compiler cannot always tell ahead of time whether two guards are redundant, such as when the type of one variable is the same as another variable, and therefore only one of them requires a check. The micro-operation optimizer is capable of noticing that checks depend on already-known information and eliding them.
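Redundant-guard elimination follows the same abstract-interpretation pattern. In this toy version (again with invented micro-operation names), the optimizer records which facts each guard has already established and simply drops any later guard that re-checks a known fact:

```python
def elide_guards(trace):
    """Drop guards that re-check facts already established earlier in
    the trace (toy version; the op names are invented)."""
    known = set()   # facts already established, e.g. ("int", "a")
    out = []
    for op, arg in trace:
        if op == "GUARD_INT":
            if ("int", arg) in known:
                continue            # fact already checked: elide the guard
            known.add(("int", arg))
        out.append((op, arg))
    return out

trace = [("GUARD_INT", "a"), ("GUARD_INT", "b"), ("ADD_INT", None),
         ("GUARD_INT", "a"), ("INC_INT", None)]
print(elide_guards(trace))
```

The second check of `a` disappears, since nothing between the two guards could have changed its type.
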

Another especially useful optimization is "true" inlining of Python functions. Prior to Python 3.11, every call of a Python function required calling a recursive C function inside the interpreter, meaning that no inlining could occur. Python 3.11 introduced an inlining-like optimization to remove the overhead from the C function call by using "fake frames" to record function-call information on the Python stack, converting Python-to-Python function calls into a simple jump inside the bytecode interpreter. This eliminates some but not all of the unnecessary overhead in making function calls.

The micro-operation optimizer is intended to eliminate the overhead from creating and destroying fake frames on the Python stack, allowing small functions to actually be inlined with no call overhead. This allows the locals from a called function to be placed in the stack frame of their parent. This optimization is expected to have a particularly dramatic effect on the overhead of creating new objects. Currently, creating an object in Python requires pushing and popping at least two frames to call the __init__() function of the object in question. By applying true inlining, the optimizer can ensure that simple object initializers that only fill slots in the object can be reduced to a series of stores directly following the memory allocation.
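Because a trace already contains the callee's micro-operations laid out between the frame push and pop, true inlining amounts to removing the frame-management operations and giving the callee's locals a home in the caller's frame. A toy rewrite (invented op names, not CPython's actual micro-operations) might look like:

```python
def inline_frames(trace):
    """Drop PUSH_FRAME/POP_FRAME pairs, hoisting callee locals into the
    caller's frame under a prefixed name (toy version; invented ops)."""
    out, depth = [], 0
    for op, arg in trace:
        if op == "PUSH_FRAME":
            depth += 1              # enter the callee: no frame is created
        elif op == "POP_FRAME":
            depth -= 1              # leave the callee: no frame is destroyed
        elif op in ("LOAD_LOCAL", "STORE_LOCAL") and depth > 0:
            # give the callee's locals slots in the parent frame
            out.append((op, f"inlined{depth}.{arg}"))
        else:
            out.append((op, arg))
    return out

trace = [("LOAD_LOCAL", "x"),
         ("PUSH_FRAME", "init"),
         ("STORE_LOCAL", "self"),
         ("POP_FRAME", "init"),
         ("LOAD_LOCAL", "x")]
print(inline_frames(trace))
```
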

Other projects

This new work all integrates into the existing CPython implementation using the high-level optimizer API (introduced in version 3.12) that is intended to allow new optimizers to be developed as plugins and called by the interpreter on hot spots. These optimizers operate on "superblocks" of micro-operations. Like the more usual basic block representation, a superblock has exactly one entry point. Unlike a basic block, however, a superblock may have several exits, usually guards or places where an exception may be thrown.
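A minimal sketch of the superblock shape (not CPython's actual data structure) makes the contrast with basic blocks concrete: the block is a straight-line run of micro-operations with a single entry at the top, but every guard in the run is a potential side exit back to the interpreter:

```python
from dataclasses import dataclass, field

@dataclass
class Superblock:
    """Toy superblock: one entry point, a linear run of micro-ops, and
    potentially many side exits, one per guard (invented op names)."""
    ops: list = field(default_factory=list)

    @property
    def exits(self):
        # every guard is a potential side exit back to the interpreter
        return [i for i, (op, _) in enumerate(self.ops)
                if op.startswith("GUARD_")]

sb = Superblock([("GUARD_INT", "a"), ("ADD_INT", None),
                 ("GUARD_INT", "b"), ("STORE_LOCAL", "c")])
print(sb.exits)   # indexes of the side exits
```
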

While the high-level optimizer API may eventually be adopted by other projects, developers of Pyston and Cinder — two performance-oriented Python forks — said that the API was not sufficient for their use case, because it does not provide the JIT with enough control over what performance information is collected, and when functions are optimized. Currently, the new CPython JIT is the only client of the API.

Future work

While the new optimizer has not landed in Python's main branch yet, it is mostly working, and is expected to land soon once the tests and documentation have been updated. The new optimizer is one of several ongoing improvements to the new JIT. And improvements to the JIT are not the end of the Faster CPython project's plans. The availability of the JIT interpreter unblocks additional work, including improvements to the calling convention for Python code. Mark Shannon and Michael Droettboom said in one of the project's planning documents that the recent improvements in the core interpreter code have highlighted garbage collection as the next major performance blocker. Discussion on how to improve Python's garbage collector and many other pending performance improvements is ongoing on the Faster CPython ideas issue tracker.


Index entries for this article
Python/JIT
Python/Performance



Improved code generation in the CPython JIT

Posted Jan 19, 2024 2:48 UTC (Fri) by salimma (subscriber, #34460)

Interesting - is CPython slowly becoming more like PyPy?

Improved code generation in the CPython JIT

Posted Jan 19, 2024 4:14 UTC (Fri) by qys (guest, #156455)

I wonder how would JIT affect debuggability. Python does not have Tail Call Optimization because of this: http://neopythonic.blogspot.com/2009/04/final-words-on-ta...

Improved code generation in the CPython JIT

Posted Jan 19, 2024 12:14 UTC (Fri) by vstinner (subscriber, #42675)

> I wonder how would JIT affect debuggability.

It depends on what you mean by "debuggability". PEP 669 "Low Impact Monitoring for CPython" is a new API added to Python 3.12 to debug Python code with a low overhead. The idea is to use a different bytecode if code is being run in a debugger. The advantage is to avoid having to check if a debugger/profiler is being used before each instruction. For example, code coverage is faster with this new API.

PEP 669: https://peps.python.org/pep-0669/

By the way, the implementation of the copy-and-patch "JIT compiler" uses machine code generated by LLVM with tail call optimization :-) It just makes the implementation easier (to generate machine code ahead with LLVM.)

Improved code generation in the CPython JIT

Posted Jan 20, 2024 17:01 UTC (Sat) by gray_-_wolf (subscriber, #131074)

> The idea is to use a different bytecode if code is being run in a debugger.

I wonder if this will lead to fun bugs only manifesting when not debugging.

Improved code generation in the CPython JIT

Posted Jan 21, 2024 1:02 UTC (Sun) by khim (subscriber, #9252)

100% guaranteed. But then these bugs wouldn't be bugs in your program, but bugs in JIT-compiler.

That means that Joe Average would maybe see them once in a lifetime, but there would be some people who would deal with them routinely, because it's their job.

Not any different from Java JIT, or C# JIT or any other advanced JIT.

Improved code generation in the CPython JIT

Posted Jan 19, 2024 14:34 UTC (Fri) by danscox (subscriber, #4125)

I'd like to thank Daroc for an excellent article! Easy to read but remaining technical. Good job, and thanks!

Improved code generation in the CPython JIT

Posted Jan 28, 2024 2:59 UTC (Sun) by smitty_one_each (subscriber, #28989)

Seconded, and stuff.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds