Shaw: Python 3.13 gets a JIT
Copy-and-patch was selected because compilation from bytecode to machine code is done with a set of "templates" that are stitched together and patched at runtime with the correct values. This means that the average Python user isn't running a complex JIT-compiler architecture inside their Python runtime. It would also be unreasonable for Python to write its own IL and JIT when so many are available off the shelf, such as LLVM's and RyuJIT. But a full JIT would require those to be bundled with Python, with all the added overhead. A copy-and-patch JIT only requires the LLVM JIT tools to be installed on the machine where CPython is compiled from source, and for most people that means the CI machines that build and package CPython for python.org.
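As a toy illustration of the stitching idea (purely a sketch, not CPython's actual mechanism: real stencils are machine code with relocation records, and the opcode names and stencil table here are invented):

```python
# Toy model of copy-and-patch: each "stencil" is a byte template with a
# fixed-size hole; emitting code means copying the template and patching
# the hole with a runtime value.
import struct

HOLE = b"\x00" * 8  # 8-byte placeholder for a 64-bit immediate

# Hypothetical stencil table: opcode -> (template bytes, hole offset).
STENCILS = {
    "LOAD_CONST": (b"\x48\xb8" + HOLE, 2),   # movabs rax, <imm64>
    "CALL_HELPER": (b"\x48\xbf" + HOLE, 2),  # movabs rdi, <imm64>
}

def emit(ops):
    """Stitch stencils together, patching each hole with its argument."""
    out = bytearray()
    for opcode, value in ops:
        template, hole_at = STENCILS[opcode]
        start = len(out)
        out += template                                      # copy the template...
        struct.pack_into("<Q", out, start + hole_at, value)  # ...patch the hole
    return bytes(out)

code = emit([("LOAD_CONST", 42), ("CALL_HELPER", 0xDEADBEEF)])
assert len(code) == 20
assert struct.unpack_from("<Q", code, 2)[0] == 42
```

The point is that no instruction encoder runs at JIT time: everything except the hole values was fixed when the templates were built.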
Posted Jan 10, 2024 3:21 UTC (Wed)
by lwnuser573 (subscriber, #134940)
[Link] (7 responses)
Posted Jan 11, 2024 9:03 UTC (Thu)
by gabrielesvelto (guest, #83884)
[Link] (6 responses)
And here's SableVM's implementation for a practical application (2003): https://www.sable.mcgill.ca/publications/papers/2003-2/sa...
Posted Jan 11, 2024 10:46 UTC (Thu)
by atnot (subscriber, #124910)
[Link] (1 responses)
> The Baseline JIT. Each bytecode instruction is compiled directly to a small piece of machine code.
https://hacks.mozilla.org/2019/08/the-baseline-interprete...
Apple WebKit's JavaScriptCore seems to use the same concept and terminology:
> In a nutshell, the Baseline JIT is a template JIT and what that means is that it generates specific machine code for the bytecode operation. There are two key factors that allow the Baseline JIT a speed up over the LLInt3:
https://zon8.re/posts/jsc-internals-part2-the-llint-and-b...
Posted Jan 11, 2024 12:00 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Compile a high-level language to p-code, and you're going to be spending nearly all your time inside internal function calls, and your interpreter is going to be doing very little.
Okay, it's not a JIT, it hasn't patched your code into machine code, but the effect is pretty much the same ...
Cheers,
Wol
Posted Jan 11, 2024 12:29 UTC (Thu)
by qwertyface (subscriber, #84167)
[Link] (2 responses)
Basically, copy-and-patch is a cheap and (relatively) easy approach to generating executable machine code at runtime. Instead of having to know how to write machine code for your platform (e.g. the instruction encodings), you write short template functions in C, compile them with Clang at build time (LLVM apparently provides a lot of control over calling conventions compared to GCC), and store the resulting code as data embedded in the executable (with some preprocessing). At runtime, dynamic code generation is then just memory-to-memory copies followed by some fixups.
I guess a lot of code generation is template driven, the question really is how you get the templates.
Posted Jan 11, 2024 13:26 UTC (Thu)
by gabrielesvelto (guest, #83884)
[Link] (1 responses)
Posted Jan 11, 2024 15:05 UTC (Thu)
by qwertyface (subscriber, #84167)
[Link]
Posted Jan 11, 2024 13:53 UTC (Thu)
by anton (subscriber, #25547)
[Link]
In "Retargeting JIT compilers by using C-compiler generated executable code" we used a copy-and-patch approach to get rid of the VM instruction pointer. We derived the patch information by comparing two variants of each code fragment with different immediate arguments (and some architecture-specific knowledge about possible encodings). I'll have to read "Copy-and-Patch Compilation" to find out what they did differently.
Earlier work in that vein was "Automatic, template-based run-time specialization", but that's in the context of a specializer rather than a language implementation, and they used a different way to get the patch information.
Posted Jan 11, 2024 14:08 UTC (Thu)
by DarkFlameMaster (guest, #113360)
[Link] (7 responses)
It is interesting that, after so many years, there are still people who believe the performance of Python can be improved in any significant way using template compilation (JIT or AOT). There was UnladenSwallow: it was LLVM-based and used a template JIT. Then there was Grumpy, which used template compilation to transpile Python to Go. And now someone is trying a template (copy-and-patch) JIT again. Unsurprisingly, this approach yielded only a 2-9% performance improvement over CPython. Given that CPython can be 10x to 100x (note: 100 times, not 100%) as slow as the equivalent C code, a 2-9% improvement is really not significant.
The real bottleneck of CPython is that even the simplest operations (such as the "+" operator) take a very long code path to execute, due to the interpretation loop, type checking, object allocation, boxing/unboxing, etc. Castanos et al. explained this in greater detail in https://dl.acm.org/doi/10.1145/2398857.2384631 and I explained in an old post (https://lwn.net/Articles/710727/) that if we want Python to be significantly faster, we must be able to optimize a program like the my_mul function below into the specialized form that follows it.
Now let's come back to the new copy-and-patch JIT compiler we are talking about. The generated stencil for _BINARY_OP_ADD_INT (shown below) is about 70 instructions in a total of 206 bytes; that is not a small instruction sequence for an ADD operation. The JIT compiler simply copies these bytes into the generated machine-code sequence and patches a few holes. The code sequence basically implements the corresponding switch case in executor_cases.c.h. UnladenSwallow did roughly the same, except that it translated each Python bytecode into a sequence of LLVM IR; when it encountered an ADD operation, it would call PyNumber_Add, too, so it was in principle the same as this copy-and-patch JIT compiler.
To run dynamic languages fast, a VM needs a specializing JIT compiler capable of reducing type checks and lowering high-level operations (such as arbitrary-precision addition) to low-level operations similar to what the CPU provides (such as adding two 64-bit integers). The VM also needs an efficient garbage collector, but that's another topic. There are many high-performance VMs for dynamic languages that do those things. PyPy has a tracing JIT compiler capable of specialization, and a garbage-collection framework inspired by MMTk. SpiderMonkey, V8, and JavaScriptCore do specializing JIT compilation at the method level, and have relatively efficient GC (compared to naive reference counting), too. And there are LuaJIT, HHVM, etc., to name a few. Seriously, so many open-source developers have devoted themselves to the efficient execution of dynamic languages, and academia has worked on this topic, too. We should learn something from them before we redo what UnladenSwallow did all over again.
Posted Jan 11, 2024 14:36 UTC (Thu)
by anton (subscriber, #25547)
[Link] (3 responses)
So yes, if you want a fast Python implementation, you also have to reduce the weight of each VM instruction implementation (at least the frequently-executed ones). And the more you do that, the bigger the benefit of a technique like copying-and-patching will be.
Posted Jan 11, 2024 15:21 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (2 responses)
Or increase the weight (as in the amount of real work each individual instruction does), which again reduces the relative impact of the interpreter.
Cheers,
Wol
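One way to picture "increasing the weight" is superinstructions; here is a toy Python sketch (the opcode names and the fusing pass are invented for illustration) where a common PUSH k; ADD pair is folded into one handler, so the dispatch cost is paid once for two operations' worth of work:

```python
# Toy superinstruction demo: fold the sequence PUSH k; ADD into PUSH_ADD k,
# halving the number of dispatches for that pattern.
def run(program):
    stack, pc = [], 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b = stack.pop()
            stack.append(stack.pop() + b)
        elif op == "PUSH_ADD":               # superinstruction: one dispatch,
            stack.append(stack.pop() + arg)  # two operations' worth of work
        pc += 1
    return stack[-1]

def fuse(program):
    """Peephole pass replacing PUSH k; ADD with PUSH_ADD k."""
    out, i = [], 0
    while i < len(program):
        if (i + 1 < len(program) and program[i][0] == "PUSH"
                and program[i + 1][0] == "ADD"):
            out.append(("PUSH_ADD", program[i][1]))
            i += 2
        else:
            out.append(program[i])
            i += 1
    return out

prog = [("PUSH", 2), ("PUSH", 3), ("ADD", None)]
assert run(prog) == run(fuse(prog)) == 5
assert len(fuse(prog)) == 2
```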
Posted Jan 11, 2024 15:44 UTC (Thu)
by qwertyface (subscriber, #84167)
[Link]
Posted Jan 11, 2024 17:35 UTC (Thu)
by anton (subscriber, #25547)
[Link]
E.g., I translated my_mul into Forth code (shown below). However, Forth has neither arbitrary-length integers nor run-time type checking. Python is by necessity more heavy-weight, but it should be possible to check at the start that the types of s and x are small integers, and then compile the addition accordingly.
Posted Jan 11, 2024 15:40 UTC (Thu)
by qwertyface (subscriber, #84167)
[Link]
Posted Jan 11, 2024 18:00 UTC (Thu)
by mb (subscriber, #50428)
[Link]
Why? There will only be significant gains if the JIT at some point becomes able to avoid Python runtime-library calls in common cases.
Posted Jan 15, 2024 6:43 UTC (Mon)
by amarao (guest, #87073)
[Link]
Posted Mar 7, 2024 4:38 UTC (Thu)
by Rudd-O (guest, #61155)
[Link]
> Removal of interpreter dispatch. Interpreter dispatch is the costliest part of interpretation, since the indirect branches used for selecting the implementation of an opcode are hard for the CPU to predict.
Inline threading is copying without patching. There is still a VM instruction pointer around, and it is used for getting at the immediate arguments that are patched in with copy-and-patch, including, for control flow, the VM-level branch targets; so inline threading performs control flow by loading the new VM instruction pointer, then looking up the machine code there, and jumping to it.
def my_mul(x, y):
s = x
i = 1
while i < y:
s = s + x
i = i + 1
return s
PyObject my_mul(PyObject x, PyObject y) {
if (x.is_not_int_or_is_too_big_for_int64()) { recompile(); }
if (y.is_not_int_or_is_too_big_for_int64()) { recompile(); }
int64_t xx = x.to_int64();
int64_t yy = y.to_int64();
int64_t s = xx;
int64_t i = 1;
while (i < yy) {
if (__builtin_add_overflow(s, xx, &s)) { recompile(); }
if (__builtin_add_overflow(i, 1, &i)) { recompile(); }
}
return PyObject::from_int64(s);
}
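The guard-and-fallback idea in that pseudocode can be mimicked in plain Python. This is only an illustration of the specialization pattern, not a real compiler: the range check stands in for the is_too_big_for_int64 test (Python ints never overflow), and the function names are invented:

```python
INT64_MAX = (1 << 63) - 1

def my_mul_generic(x, y):
    # The unspecialized path: ordinary Python semantics.
    s, i = x, 1
    while i < y:
        s = s + x
        i = i + 1
    return s

def my_mul_specialized(x, y):
    # Guards: bail out to the generic path unless both operands are
    # machine-word-sized ints (a stand-in for recompile()/deopt).
    if not (isinstance(x, int) and isinstance(y, int)):
        return my_mul_generic(x, y)
    if abs(x) > INT64_MAX or abs(y) > INT64_MAX:
        return my_mul_generic(x, y)
    # Fast path: in a real specializing JIT, each iteration here would be
    # a native add plus an overflow check, with no boxing or type checks.
    s, i = x, 1
    while i < y:
        s += x
        i += 1
    return s

assert my_mul_specialized(3, 4) == my_mul_generic(3, 4) == 12
```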
That is, specializing the program for int64 and eliminating all the type checking and boxing/unboxing in the hot paths.
In the generated jit_stencils.h file, we can find the code sequence for _BINARY_OP_ADD_INT (on x86_64):
// _BINARY_OP_ADD_INT
//
// /tmp/tmp5h6_cn60/_BINARY_OP_ADD_INT.o: file format elf64-x86-64
//
// Disassembly of section .text:
//
// 0000000000000000 <_JIT_ENTRY>:
// 0: 55 pushq %rbp
// 1: 41 57 pushq %r15
// 3: 41 56 pushq %r14
// 5: 41 55 pushq %r13
// 7: 41 54 pushq %r12
// 9: 53 pushq %rbx
// a: 50 pushq %rax
// b: 49 89 d6 movq %rdx, %r14
// e: 49 89 f7 movq %rsi, %r15
// 11: 48 89 fb movq %rdi, %rbx
// 14: 4c 8b 6e f0 movq -0x10(%rsi), %r13
// 18: 48 8b 6e f8 movq -0x8(%rsi), %rbp
// 1c: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
// 000000000000001e: R_X86_64_64 _PyLong_Add
// 26: 4c 89 ef movq %r13, %rdi
// 29: 48 89 ee movq %rbp, %rsi
// 2c: ff d0 callq *%rax
// 2e: 49 89 c4 movq %rax, %r12
// 31: 48 8b 45 00 movq (%rbp), %rax
// 35: 85 c0 testl %eax, %eax
// 37: 78 09 js 0x42 <_JIT_ENTRY+0x42>
// 39: 48 ff c8 decq %rax
// 3c: 48 89 45 00 movq %rax, (%rbp)
// 40: 74 22 je 0x64 <_JIT_ENTRY+0x64>
// 42: 49 8b 45 00 movq (%r13), %rax
// 46: 85 c0 testl %eax, %eax
// 48: 78 31 js 0x7b <_JIT_ENTRY+0x7b>
// 4a: 48 ff c8 decq %rax
// 4d: 49 89 45 00 movq %rax, (%r13)
// 51: 75 28 jne 0x7b <_JIT_ENTRY+0x7b>
// 53: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
// 0000000000000055: R_X86_64_64 PyObject_Free
// 5d: 4c 89 ef movq %r13, %rdi
// 60: ff d0 callq *%rax
// 62: eb 17 jmp 0x7b <_JIT_ENTRY+0x7b>
// 64: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
// 0000000000000066: R_X86_64_64 PyObject_Free
// 6e: 48 89 ef movq %rbp, %rdi
// 71: ff d0 callq *%rax
// 73: 49 8b 45 00 movq (%r13), %rax
// 77: 85 c0 testl %eax, %eax
// 79: 79 cf jns 0x4a <_JIT_ENTRY+0x4a>
// 7b: 49 8d 47 f0 leaq -0x10(%r15), %rax
// 7f: 4d 85 e4 testq %r12, %r12
// 82: 74 2a je 0xae <_JIT_ENTRY+0xae>
// 84: 49 83 c7 f8 addq $-0x8, %r15
// 88: 4c 89 20 movq %r12, (%rax)
// 8b: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
// 000000000000008d: R_X86_64_64 _JIT_CONTINUE
// 95: 48 89 df movq %rbx, %rdi
// 98: 4c 89 fe movq %r15, %rsi
// 9b: 4c 89 f2 movq %r14, %rdx
// 9e: 48 83 c4 08 addq $0x8, %rsp
// a2: 5b popq %rbx
// a3: 41 5c popq %r12
// a5: 41 5d popq %r13
// a7: 41 5e popq %r14
// a9: 41 5f popq %r15
// ab: 5d popq %rbp
// ac: ff e0 jmpq *%rax
// ae: 48 29 d8 subq %rbx, %rax
// b1: 48 83 c0 b8 addq $-0x48, %rax
// b5: 48 c1 e8 03 shrq $0x3, %rax
// b9: 89 43 40 movl %eax, 0x40(%rbx)
// bc: 31 c0 xorl %eax, %eax
// be: 48 83 c4 08 addq $0x8, %rsp
// c2: 5b popq %rbx
// c3: 41 5c popq %r12
// c5: 41 5d popq %r13
// c7: 41 5e popq %r14
// c9: 41 5f popq %r15
// cb: 5d popq %rbp
// cc: c3 retq
// cd:
static const unsigned char _BINARY_OP_ADD_INT_code_body[206] = {0x55, 0x41, 0x57, 0x41, /* 200 bytes omitted */ 0xc3};
static const Hole _BINARY_OP_ADD_INT_code_holes[5] = {
{0x1e, HoleKind_R_X86_64_64, HoleValue_ZERO, &_PyLong_Add, 0x0},
{0x55, HoleKind_R_X86_64_64, HoleValue_ZERO, &PyObject_Free, 0x0},
{0x66, HoleKind_R_X86_64_64, HoleValue_ZERO, &PyObject_Free, 0x0},
{0x8d, HoleKind_R_X86_64_64, HoleValue_CONTINUE, NULL, 0x0},
};
// 0:
static const unsigned char _BINARY_OP_ADD_INT_data_body[1];
static const Hole _BINARY_OP_ADD_INT_data_holes[1];
The code sequence implements this case from executor_cases.c.h:
case _BINARY_OP_ADD_INT: {
PyObject *right;
PyObject *left;
PyObject *res;
right = stack_pointer[-1];
left = stack_pointer[-2];
STAT_INC(BINARY_OP, hit);
res = _PyLong_Add((PyLongObject *)left, (PyLongObject *)right);
_Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
_Py_DECREF_SPECIALIZED(left, (destructor)PyObject_Free);
if (res == NULL) goto pop_2_error_tier_two;
stack_pointer[-2] = res;
stack_pointer += -1;
break;
}
It calls PyLong_Add regardless of whether the operands are big or small; after that, it also performs reference-count decrements and may free the operands. If every + operator is translated into 70 instructions, among which there is one function call to PyLong_Add and two reference-counting operations, there is no way it can be fast, even if it is JIT-compiled into machine code.
Such template JIT compilers can eliminate the interpretation loop, i.e. the loop that loads the bytecode and uses a switch to dispatch each instruction to its case. IIRC, UnladenSwallow could run 2x as fast as CPython (that is, making Python 5x-50x slower than C instead of 10x-100x slower), and unsurprisingly that's basically as much as template JIT compilers can do. Template JIT can eliminate the interpretation loop, but the other execution overheads remain, namely the pervasive type checks and the very naive form of reference-counting garbage collection.
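That "the dispatch loop goes away but the per-operation work stays" point can be sketched with a toy model in Python (invented handler names; this models the structure only, not machine code): the per-opcode handlers are identical in both versions, but the template-style version resolves each opcode to its handler once, ahead of time, instead of dispatching on every iteration.

```python
# Toy: the same bytecode run through (a) a dispatch loop and (b) a
# pre-resolved handler sequence, which is what template compilation
# gives you; the handlers (the heavy part) are unchanged.
def op_push(stack, arg):
    stack.append(arg)

def op_add(stack, arg):
    b = stack.pop()
    stack.append(stack.pop() + b)

HANDLERS = {"PUSH": op_push, "ADD": op_add}

def run_interpreted(program):
    stack = []
    for opcode, arg in program:     # dispatch happens on every iteration
        HANDLERS[opcode](stack, arg)
    return stack[-1]

def translate(program):
    # "Compile": look each handler up once; no dispatch remains at run time.
    resolved = [(HANDLERS[opcode], arg) for opcode, arg in program]
    def run():
        stack = []
        for handler, arg in resolved:
            handler(stack, arg)
        return stack[-1]
    return run

prog = [("PUSH", 2), ("PUSH", 3), ("ADD", None)]
assert run_interpreted(prog) == translate(prog)() == 5
```

If op_add itself does 70 instructions' worth of work, eliminating the dispatch saves only a small fraction, which is the point being made above.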
Yes, the 2%-9% number reminded me of "Catenation and Specialization for Tcl Virtual Machine Performance", where they found that the large size of the machine code for their VM instructions reduced how much benefit resulted from eliminating the interpretive overhead, and they also found problems with I-cache misses in some cases.
Python has followed the path of heavy-weight stuff since the start, and where it works, it's fine. E.g., when there is just a little work in Python that just calls library functions in C, and the lion's share of work is done in that library. But when it does not work fine, it results in programs that run much slower than C code and also quite a bit slower than code from a code-copying interpreter with light-weight operations.
Here is my_mul translated into Forth code that's close to the Python version (using locals and a while loop) rather than idiomatic:
: my_mul {: x y -- s :}
x 1 begin {: s i :}
i y < while
s x +
i 1 +
repeat
s ;
gforth-fast, a code-copying interpreter (without patching), produces the following code:
$7F2156294F80 >l 1->1
0x00007f2155f3c002: mov %rbp,%rax
0x00007f2155f3c005: add $0x8,%r13
0x00007f2155f3c009: lea -0x8(%rbp),%rbp
0x00007f2155f3c00d: mov %r8,-0x8(%rax)
0x00007f2155f3c011: mov 0x0(%r13),%r8
$7F2156294F88 >l 1->0
0x00007f2155f3c015: mov %rbp,%rax
0x00007f2155f3c018: lea -0x8(%rbp),%rbp
0x00007f2155f3c01c: mov %r8,-0x8(%rax)
$7F2156294F90 @local0 0->1
0x00007f2155f3c020: mov 0x0(%rbp),%r8
$7F2156294F98 lit 1->1
$7F2156294FA0 #1
0x00007f2155f3c024: mov %r8,0x0(%r13)
0x00007f2155f3c028: sub $0x8,%r13
0x00007f2155f3c02c: mov 0x20(%rbx),%r8
0x00007f2155f3c030: add $0x28,%rbx
$7F2156294FA8 >l 1->1
0x00007f2155f3c034: mov %rbp,%rax
0x00007f2155f3c037: add $0x8,%r13
0x00007f2155f3c03b: lea -0x8(%rbp),%rbp
0x00007f2155f3c03f: mov %r8,-0x8(%rax)
0x00007f2155f3c043: mov 0x0(%r13),%r8
$7F2156294FB0 >l 1->0
0x00007f2155f3c047: mov %rbp,%rax
0x00007f2155f3c04a: lea -0x8(%rbp),%rbp
0x00007f2155f3c04e: mov %r8,-0x8(%rax)
$7F2156294FB8 @local1 0->1
0x00007f2155f3c052: mov 0x8(%rbp),%r8
$7F2156294FC0 @local3 1->1
0x00007f2155f3c056: mov %r8,0x0(%r13)
0x00007f2155f3c05a: mov 0x18(%rbp),%r8
0x00007f2155f3c05e: sub $0x8,%r13
$7F2156294FC8 < ?branch 1->1
$7F2156294FD0 ?branch
$7F2156294FD8 <my_mul+$A8>
0x00007f2155f3c062: add $0x38,%rbx
0x00007f2155f3c066: mov 0x8(%r13),%rax
0x00007f2155f3c06a: add $0x10,%r13
0x00007f2155f3c06e: mov -0x8(%rbx),%rsi
0x00007f2155f3c072: cmp %r8,%rax
0x00007f2155f3c075: mov 0x0(%r13),%r8
0x00007f2155f3c079: jl 0x7f2155f3c083
0x00007f2155f3c07b: mov (%rsi),%rax
0x00007f2155f3c07e: mov %rsi,%rbx
0x00007f2155f3c081: jmp *%rax
$7F2156294FE0 @local0 1->2
0x00007f2155f3c083: mov 0x0(%rbp),%r15
$7F2156294FE8 @local2 2->3
0x00007f2155f3c087: mov 0x10(%rbp),%r9
$7F2156294FF0 + 3->2
0x00007f2155f3c08b: add %r9,%r15
$7F2156294FF8 @local1 2->1
0x00007f2155f3c08e: mov %r15,-0x8(%r13)
0x00007f2155f3c092: sub $0x10,%r13
0x00007f2155f3c096: mov %r8,0x10(%r13)
0x00007f2155f3c09a: mov 0x8(%rbp),%r8
$7F2156295000 lit+ 1->1
$7F2156295008 #1
0x00007f2155f3c09e: add 0x28(%rbx),%r8
$7F2156295010 lp+2 1->1
0x00007f2155f3c0a2: add $0x10,%rbp
$7F2156295018 branch 1->1
$7F2156295020 <my_mul+$28>
0x00007f2155f3c0a6: mov 0x40(%rbx),%rbx
0x00007f2155f3c0aa: mov (%rbx),%rax
0x00007f2155f3c0ad: jmp *%rax
0x00007f2155f3c0af: nop
$7F2156295028 @local0 1->1
0x00007f2155f3c0b0: mov %r8,0x0(%r13)
0x00007f2155f3c0b4: sub $0x8,%r13
0x00007f2155f3c0b8: mov 0x0(%rbp),%r8
$7F2156295030 lp+!# 1->1
$7F2156295038 #32
0x00007f2155f3c0bc: add $0x18,%rbx
0x00007f2155f3c0c0: add -0x8(%rbx),%rbp
$7F2156295040 ;s 1->1
0x00007f2155f3c0c4: mov (%r14),%rbx
0x00007f2155f3c0c7: add $0x8,%r14
0x00007f2155f3c0cb: mov (%rbx),%rax
0x00007f2155f3c0ce: jmp *%rax
Note that the + (integer addition) at $7F2156294FF0 is implemented with one instruction. There is also a "heavy-weight" VM instruction lit+ that results from combining the sequence 1 + ; it also results in one instruction (on AMD64).
The + of s+x could then be compiled into
add %r9,%r15
jo slow_path
and that could be done in a copy-and-patch system.
Because that is basically the performance gain (up to 10%) you get if you compile a Python program with Cython. Cython without type annotations does exactly that: it replaces the interpreter loop with optimized C code. But that improvement is insignificant, because most of the program's time is spent in the Python runtime library.