Shaw: Python 3.13 gets a JIT
Copy-and-patch was selected because compilation from bytecode to machine code is done with a set of "templates" that are stitched together and patched at runtime with the correct values. This means that the average Python user isn't running a complex JIT-compiler architecture inside their Python runtime. It would also be unreasonable for Python to write its own IL and JIT when so many are available off the shelf, such as LLVM's and RyuJIT. But a full JIT would require those to be bundled with Python, with all the added overhead. A copy-and-patch JIT only requires the LLVM JIT tools to be installed on the machine where CPython is compiled from source, and for most people that means the CI machines that build and package CPython for python.org.
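As a toy illustration of the stitching idea (purely a sketch, not CPython's actual mechanism: real stencils are machine code with relocation records, and the opcode names and stencil table here are invented):

```python
# Toy model of copy-and-patch: each "stencil" is a byte template with a
# fixed-size hole; emitting code means copying the template and patching
# the hole with a runtime value.
import struct

HOLE = b"\x00" * 8  # 8-byte placeholder for a 64-bit immediate

# Hypothetical stencil table: opcode -> (template bytes, hole offset).
STENCILS = {
    "LOAD_CONST": (b"\x48\xb8" + HOLE, 2),   # movabs rax, <imm64>
    "CALL_HELPER": (b"\x48\xbf" + HOLE, 2),  # movabs rdi, <imm64>
}

def emit(ops):
    """Stitch stencils together, patching each hole with its argument."""
    out = bytearray()
    for opcode, value in ops:
        template, hole_at = STENCILS[opcode]
        start = len(out)
        out += template                                      # copy the template...
        struct.pack_into("<Q", out, start + hole_at, value)  # ...patch the hole
    return bytes(out)

code = emit([("LOAD_CONST", 42), ("CALL_HELPER", 0xDEADBEEF)])
assert len(code) == 20
assert struct.unpack_from("<Q", code, 2)[0] == 42
```

The point is that no instruction encoder runs at JIT time: everything except the hole values was fixed when the templates were built.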
Posted Jan 10, 2024 3:21 UTC (Wed)
by lwnuser573 (subscriber, #134940)
[Link] (7 responses)
Posted Jan 11, 2024 9:03 UTC (Thu)
by gabrielesvelto (guest, #83884)
[Link] (6 responses)
And here's SableVM's implementation for a practical application (2003): https://www.sable.mcgill.ca/publications/papers/2003-2/sa...
Posted Jan 11, 2024 10:46 UTC (Thu)
by atnot (subscriber, #124910)
[Link] (1 responses)
> The Baseline JIT. Each bytecode instruction is compiled directly to a small piece of machine code.
https://hacks.mozilla.org/2019/08/the-baseline-interprete...
Apple WebKit's JavaScriptCore seems to use the same concept and terminology:
> In a nutshell, the Baseline JIT is a template JIT and what that means is that it generates specific machine code for the bytecode operation. There are two key factors that allow the Baseline JIT a speed up over the LLInt3:
https://zon8.re/posts/jsc-internals-part2-the-llint-and-b...
Posted Jan 11, 2024 12:00 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Compile a high-level language to p-code, and you're going to be spending nearly all your time inside internal function calls, and your interpreter is going to be doing very little.
Okay, it's not a JIT, it hasn't patched your code into machine code, but the effect is pretty much the same ...
Cheers,
Wol
Posted Jan 11, 2024 12:29 UTC (Thu)
by qwertyface (subscriber, #84167)
[Link] (2 responses)
Basically, copy-and-patch is a cheap and (relatively) easy approach to generating executable machine code at runtime. Instead of having to know how to write machine code for your platform (e.g. the instruction encodings), you write short template functions in C, compile them with Clang at build time (LLVM apparently provides a lot of control over calling conventions compared to GCC), and store the resulting code as data embedded in the executable (with some preprocessing). At runtime, dynamic code generation is then just memory-to-memory copies followed by some fixups.
I guess a lot of code generation is template driven, the question really is how you get the templates.
Posted Jan 11, 2024 13:26 UTC (Thu)
by gabrielesvelto (guest, #83884)
[Link] (1 responses)
Posted Jan 11, 2024 15:05 UTC (Thu)
by qwertyface (subscriber, #84167)
[Link]
Posted Jan 11, 2024 13:53 UTC (Thu)
by anton (subscriber, #25547)
[Link]
In "Retargeting JIT compilers by using C-compiler generated executable code" we used a copy-and-patch approach to get rid of the VM instruction pointer. We derived the patch information by comparing two variants of each code fragment with different immediate arguments (and some architecture-specific knowledge about possible encodings). I'll have to read "Copy-and-Patch Compilation" to find out what they did differently.
Earlier work in that vein was "Automatic, template-based run-time specialization", but that's in the context of a specializer rather than a language implementation, and they used a different way to get the patch information.
Posted Jan 11, 2024 14:08 UTC (Thu)
by DarkFlameMaster (guest, #113360)
[Link] (7 responses)
It is interesting that, after so many years, there are still people who believe the performance of Python can be improved in any significant way using template compilation (JIT or AOT). There was UnladenSwallow: it was LLVM-based and used a template JIT. Then there was Grumpy, which used template compilation to transpile Python to Go. And now someone is trying a template (copy-and-patch) JIT again. Unsurprisingly, this approach yielded only a 2-9% performance improvement over CPython. Given that CPython can be 10x to 100x (note: 100 times, not 100%) as slow as the equivalent C code, a 2-9% improvement is really not significant.
The real bottleneck of CPython is that even the simplest operations (such as the "+" operator) take a very long code path to execute, due to the interpretation loop, type checking, object allocation, boxing/unboxing, etc. Castanos et al. explained this in greater detail in https://dl.acm.org/doi/10.1145/2398857.2384631 and I explained in an old post (https://lwn.net/Articles/710727/) that if we want Python to be significantly faster, we must be able to optimize a program like the my_mul function below into the specialized form that follows it.
Now let's come back to the new copy-and-patch JIT compiler we are talking about. The generated stencil for _BINARY_OP_ADD_INT (shown below) is about 70 instructions in a total of 206 bytes; that is not a small instruction sequence for an ADD operation. The JIT compiler simply copies these bytes into the generated machine-code sequence and patches a few holes. The code sequence basically implements the corresponding switch case in executor_cases.c.h. UnladenSwallow did roughly the same, except that it translated each Python bytecode into a sequence of LLVM IR; when it encountered an ADD operation, it would call PyNumber_Add, too, so it was in principle the same as this copy-and-patch JIT compiler.
To run dynamic languages fast, a VM needs a specializing JIT compiler capable of reducing type checks and lowering high-level operations (such as arbitrary-precision addition) to low-level operations similar to what the CPU provides (such as adding two 64-bit integers). The VM also needs an efficient garbage collector, but that's another topic. There are many high-performance VMs for dynamic languages that do those things. PyPy has a tracing JIT compiler capable of specialization, and a garbage-collection framework inspired by MMTk. SpiderMonkey, V8, and JavaScriptCore do specializing JIT compilation at the method level, and have relatively efficient GC (compared to naive reference counting), too. And there are LuaJIT, HHVM, etc., to name a few. Seriously, so many open-source developers have devoted themselves to the efficient execution of dynamic languages, and academia has worked on this topic, too. We should learn something from them before we redo what UnladenSwallow did all over again.
Posted Jan 11, 2024 14:36 UTC (Thu)
by anton (subscriber, #25547)
[Link] (3 responses)
So yes, if you want a fast Python implementation, you also have to reduce the weight of each VM instruction implementation (at least the frequently-executed ones). And the more you do that, the bigger the benefit of a technique like copying-and-patching will be.
Posted Jan 11, 2024 15:21 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (2 responses)
Or increase the weight (as in the amount of real work each individual instruction does), which again reduces the relative impact of the interpreter.
Cheers,
Wol
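One way to picture "increasing the weight" is superinstructions; here is a toy Python sketch (the opcode names and the fusing pass are invented for illustration) where a common PUSH k; ADD pair is folded into one handler, so the dispatch cost is paid once for two operations' worth of work:

```python
# Toy superinstruction demo: fold the sequence PUSH k; ADD into PUSH_ADD k,
# halving the number of dispatches for that pattern.
def run(program):
    stack, pc = [], 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b = stack.pop()
            stack.append(stack.pop() + b)
        elif op == "PUSH_ADD":               # superinstruction: one dispatch,
            stack.append(stack.pop() + arg)  # two operations' worth of work
        pc += 1
    return stack[-1]

def fuse(program):
    """Peephole pass replacing PUSH k; ADD with PUSH_ADD k."""
    out, i = [], 0
    while i < len(program):
        if (i + 1 < len(program) and program[i][0] == "PUSH"
                and program[i + 1][0] == "ADD"):
            out.append(("PUSH_ADD", program[i][1]))
            i += 2
        else:
            out.append(program[i])
            i += 1
    return out

prog = [("PUSH", 2), ("PUSH", 3), ("ADD", None)]
assert run(prog) == run(fuse(prog)) == 5
assert len(fuse(prog)) == 2
```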
Posted Jan 11, 2024 15:44 UTC (Thu)
by qwertyface (subscriber, #84167)
[Link]
Posted Jan 11, 2024 17:35 UTC (Thu)
by anton (subscriber, #25547)
[Link]
E.g., I translated my_mul into Forth code (shown below). However, Forth has neither arbitrary-length integers nor run-time type checking. Python is by necessity more heavy-weight, but it should be possible to check at the start that the types of s and x are small integers, and then compile the addition accordingly.
Posted Jan 11, 2024 15:40 UTC (Thu)
by qwertyface (subscriber, #84167)
[Link]
Posted Jan 11, 2024 18:00 UTC (Thu)
by mb (subscriber, #50428)
[Link]
Why? There will only be significant gains if the JIT at some point becomes able to avoid Python runtime-library calls in common cases.
Posted Jan 15, 2024 6:43 UTC (Mon)
by amarao (guest, #87073)
[Link]
Posted Mar 7, 2024 4:38 UTC (Thu)
by Rudd-O (guest, #61155)
[Link]
> Removal of interpreter dispatch. Interpreter dispatch is the costliest part of interpretation, since the indirect branches used for selecting the implementation of an opcode are hard for the CPU to predict.
Inline threading is copying without patching. There is still a VM instruction pointer around, and it is used for getting at the immediate arguments that are patched in with copy-and-patch, including, for control flow, the VM-level branch targets; so inline threading performs control flow by loading the new VM instruction pointer, then looking up the machine code there, and jumping to it.
def my_mul(x, y):
s = x
i = 1
while i < y:
s = s + x
i = i + 1
return s
PyObject my_mul(PyObject x, PyObject y) {
if (x.is_not_int_or_is_too_big_for_int64()) { recompile(); }
if (y.is_not_int_or_is_too_big_for_int64()) { recompile(); }
int64_t xx = x.to_int64();
int64_t yy = y.to_int64();
int64_t s = xx;
int64_t i = 1;
while (i < yy) {
if (__builtin_add_overflow(s, xx, &s)) { recompile(); }
if (__builtin_add_overflow(i, 1, &i)) { recompile(); }
}
return PyObject::from_int64(s);
}
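The guard-and-fallback idea in that pseudocode can be mimicked in plain Python. This is only an illustration of the specialization pattern, not a real compiler: the range check stands in for the is_too_big_for_int64 test (Python ints never overflow), and the function names are invented:

```python
INT64_MAX = (1 << 63) - 1

def my_mul_generic(x, y):
    # The unspecialized path: ordinary Python semantics.
    s, i = x, 1
    while i < y:
        s = s + x
        i = i + 1
    return s

def my_mul_specialized(x, y):
    # Guards: bail out to the generic path unless both operands are
    # machine-word-sized ints (a stand-in for recompile()/deopt).
    if not (isinstance(x, int) and isinstance(y, int)):
        return my_mul_generic(x, y)
    if abs(x) > INT64_MAX or abs(y) > INT64_MAX:
        return my_mul_generic(x, y)
    # Fast path: in a real specializing JIT, each iteration here would be
    # a native add plus an overflow check, with no boxing or type checks.
    s, i = x, 1
    while i < y:
        s += x
        i += 1
    return s

assert my_mul_specialized(3, 4) == my_mul_generic(3, 4) == 12
```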
That is, specializing the program for int64 and eliminating all the type checking and boxing/unboxing in the hot paths.
In the generated jit_stencils.h file, we can find the code sequence for _BINARY_OP_ADD_INT (on x86_64):
// _BINARY_OP_ADD_INT
//
// /tmp/tmp5h6_cn60/_BINARY_OP_ADD_INT.o: file format elf64-x86-64
//
// Disassembly of section .text:
//
// 0000000000000000 <_JIT_ENTRY>:
// 0: 55 pushq %rbp
// 1: 41 57 pushq %r15
// 3: 41 56 pushq %r14
// 5: 41 55 pushq %r13
// 7: 41 54 pushq %r12
// 9: 53 pushq %rbx
// a: 50 pushq %rax
// b: 49 89 d6 movq %rdx, %r14
// e: 49 89 f7 movq %rsi, %r15
// 11: 48 89 fb movq %rdi, %rbx
// 14: 4c 8b 6e f0 movq -0x10(%rsi), %r13
// 18: 48 8b 6e f8 movq -0x8(%rsi), %rbp
// 1c: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
// 000000000000001e: R_X86_64_64 _PyLong_Add
// 26: 4c 89 ef movq %r13, %rdi
// 29: 48 89 ee movq %rbp, %rsi
// 2c: ff d0 callq *%rax
// 2e: 49 89 c4 movq %rax, %r12
// 31: 48 8b 45 00 movq (%rbp), %rax
// 35: 85 c0 testl %eax, %eax
// 37: 78 09 js 0x42 <_JIT_ENTRY+0x42>
// 39: 48 ff c8 decq %rax
// 3c: 48 89 45 00 movq %rax, (%rbp)
// 40: 74 22 je 0x64 <_JIT_ENTRY+0x64>
// 42: 49 8b 45 00 movq (%r13), %rax
// 46: 85 c0 testl %eax, %eax
// 48: 78 31 js 0x7b <_JIT_ENTRY+0x7b>
// 4a: 48 ff c8 decq %rax
// 4d: 49 89 45 00 movq %rax, (%r13)
// 51: 75 28 jne 0x7b <_JIT_ENTRY+0x7b>
// 53: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
// 0000000000000055: R_X86_64_64 PyObject_Free
// 5d: 4c 89 ef movq %r13, %rdi
// 60: ff d0 callq *%rax
// 62: eb 17 jmp 0x7b <_JIT_ENTRY+0x7b>
// 64: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
// 0000000000000066: R_X86_64_64 PyObject_Free
// 6e: 48 89 ef movq %rbp, %rdi
// 71: ff d0 callq *%rax
// 73: 49 8b 45 00 movq (%r13), %rax
// 77: 85 c0 testl %eax, %eax
// 79: 79 cf jns 0x4a <_JIT_ENTRY+0x4a>
// 7b: 49 8d 47 f0 leaq -0x10(%r15), %rax
// 7f: 4d 85 e4 testq %r12, %r12
// 82: 74 2a je 0xae <_JIT_ENTRY+0xae>
// 84: 49 83 c7 f8 addq $-0x8, %r15
// 88: 4c 89 20 movq %r12, (%rax)
// 8b: 48 b8 00 00 00 00 00 00 00 00 movabsq $0x0, %rax
// 000000000000008d: R_X86_64_64 _JIT_CONTINUE
// 95: 48 89 df movq %rbx, %rdi
// 98: 4c 89 fe movq %r15, %rsi
// 9b: 4c 89 f2 movq %r14, %rdx
// 9e: 48 83 c4 08 addq $0x8, %rsp
// a2: 5b popq %rbx
// a3: 41 5c popq %r12
// a5: 41 5d popq %r13
// a7: 41 5e popq %r14
// a9: 41 5f popq %r15
// ab: 5d popq %rbp
// ac: ff e0 jmpq *%rax
// ae: 48 29 d8 subq %rbx, %rax
// b1: 48 83 c0 b8 addq $-0x48, %rax
// b5: 48 c1 e8 03 shrq $0x3, %rax
// b9: 89 43 40 movl %eax, 0x40(%rbx)
// bc: 31 c0 xorl %eax, %eax
// be: 48 83 c4 08 addq $0x8, %rsp
// c2: 5b popq %rbx
// c3: 41 5c popq %r12
// c5: 41 5d popq %r13
// c7: 41 5e popq %r14
// c9: 41 5f popq %r15
// cb: 5d popq %rbp
// cc: c3 retq
// cd:
static const unsigned char _BINARY_OP_ADD_INT_code_body[206] = {0x55, 0x41, 0x57, 0x41, /* 200 bytes omitted */ 0xc3};
static const Hole _BINARY_OP_ADD_INT_code_holes[5] = {
{0x1e, HoleKind_R_X86_64_64, HoleValue_ZERO, &_PyLong_Add, 0x0},
{0x55, HoleKind_R_X86_64_64, HoleValue_ZERO, &PyObject_Free, 0x0},
{0x66, HoleKind_R_X86_64_64, HoleValue_ZERO, &PyObject_Free, 0x0},
{0x8d, HoleKind_R_X86_64_64, HoleValue_CONTINUE, NULL, 0x0},
};
// 0:
static const unsigned char _BINARY_OP_ADD_INT_data_body[1];
static const Hole _BINARY_OP_ADD_INT_data_holes[1];
The code sequence implements this case from executor_cases.c.h:
case _BINARY_OP_ADD_INT: {
PyObject *right;
PyObject *left;
PyObject *res;
right = stack_pointer[-1];
left = stack_pointer[-2];
STAT_INC(BINARY_OP, hit);
res = _PyLong_Add((PyLongObject *)left, (PyLongObject *)right);
_Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
_Py_DECREF_SPECIALIZED(left, (destructor)PyObject_Free);
if (res == NULL) goto pop_2_error_tier_two;
stack_pointer[-2] = res;
stack_pointer += -1;
break;
}
It calls PyLong_Add regardless of whether the operands are big or small; after that, it also performs reference-count decrements and may free the operands. If every + operator is translated into 70 instructions, among which there is one function call to PyLong_Add and two reference-counting operations, there is no way it can be fast, even if it is JIT-compiled into machine code.
Such template JIT compilers can eliminate the interpretation loop, i.e. the loop that loads the bytecode and uses a switch to dispatch each instruction to its case. IIRC, UnladenSwallow could run 2x as fast as CPython (that is, making Python 5x-50x slower than C instead of 10x-100x slower), and unsurprisingly that's basically as much as template JIT compilers can do. Template JIT can eliminate the interpretation loop, but the other execution overheads remain, namely the pervasive type checks and the very naive form of reference-counting garbage collection.
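That "the dispatch loop goes away but the per-operation work stays" point can be sketched with a toy model in Python (invented handler names; this models the structure only, not machine code): the per-opcode handlers are identical in both versions, but the template-style version resolves each opcode to its handler once, ahead of time, instead of dispatching on every iteration.

```python
# Toy: the same bytecode run through (a) a dispatch loop and (b) a
# pre-resolved handler sequence, which is what template compilation
# gives you; the handlers (the heavy part) are unchanged.
def op_push(stack, arg):
    stack.append(arg)

def op_add(stack, arg):
    b = stack.pop()
    stack.append(stack.pop() + b)

HANDLERS = {"PUSH": op_push, "ADD": op_add}

def run_interpreted(program):
    stack = []
    for opcode, arg in program:     # dispatch happens on every iteration
        HANDLERS[opcode](stack, arg)
    return stack[-1]

def translate(program):
    # "Compile": look each handler up once; no dispatch remains at run time.
    resolved = [(HANDLERS[opcode], arg) for opcode, arg in program]
    def run():
        stack = []
        for handler, arg in resolved:
            handler(stack, arg)
        return stack[-1]
    return run

prog = [("PUSH", 2), ("PUSH", 3), ("ADD", None)]
assert run_interpreted(prog) == translate(prog)() == 5
```

If op_add itself does 70 instructions' worth of work, eliminating the dispatch saves only a small fraction, which is the point being made above.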
Yes, the 2%-9% number reminded me of "Catenation and Specialization for Tcl Virtual Machine Performance", where they found that the large size of the machine code for their VM instructions reduced how much benefit resulted from eliminating the interpretive overhead, and they also found problems with I-cache misses in some cases.
Python has followed the path of heavy-weight stuff since the start, and where it works, it's fine. E.g., when there is just a little work in Python that just calls library functions in C, and the lion's share of work is done in that library. But when it does not work fine, it results in programs that run much slower than C code and also quite a bit slower than code from a code-copying interpreter with light-weight operations.
Here is my_mul translated into Forth code that's close to the Python version (using locals and a while loop) rather than idiomatic:
: my_mul {: x y -- s :}
x 1 begin {: s i :}
i y < while
s x +
i 1 +
repeat
s ;
gforth-fast, a code-copying interpreter (without patching), produces the following code:
$7F2156294F80 >l 1->1
0x00007f2155f3c002: mov %rbp,%rax
0x00007f2155f3c005: add $0x8,%r13
0x00007f2155f3c009: lea -0x8(%rbp),%rbp
0x00007f2155f3c00d: mov %r8,-0x8(%rax)
0x00007f2155f3c011: mov 0x0(%r13),%r8
$7F2156294F88 >l 1->0
0x00007f2155f3c015: mov %rbp,%rax
0x00007f2155f3c018: lea -0x8(%rbp),%rbp
0x00007f2155f3c01c: mov %r8,-0x8(%rax)
$7F2156294F90 @local0 0->1
0x00007f2155f3c020: mov 0x0(%rbp),%r8
$7F2156294F98 lit 1->1
$7F2156294FA0 #1
0x00007f2155f3c024: mov %r8,0x0(%r13)
0x00007f2155f3c028: sub $0x8,%r13
0x00007f2155f3c02c: mov 0x20(%rbx),%r8
0x00007f2155f3c030: add $0x28,%rbx
$7F2156294FA8 >l 1->1
0x00007f2155f3c034: mov %rbp,%rax
0x00007f2155f3c037: add $0x8,%r13
0x00007f2155f3c03b: lea -0x8(%rbp),%rbp
0x00007f2155f3c03f: mov %r8,-0x8(%rax)
0x00007f2155f3c043: mov 0x0(%r13),%r8
$7F2156294FB0 >l 1->0
0x00007f2155f3c047: mov %rbp,%rax
0x00007f2155f3c04a: lea -0x8(%rbp),%rbp
0x00007f2155f3c04e: mov %r8,-0x8(%rax)
$7F2156294FB8 @local1 0->1
0x00007f2155f3c052: mov 0x8(%rbp),%r8
$7F2156294FC0 @local3 1->1
0x00007f2155f3c056: mov %r8,0x0(%r13)
0x00007f2155f3c05a: mov 0x18(%rbp),%r8
0x00007f2155f3c05e: sub $0x8,%r13
$7F2156294FC8 < ?branch 1->1
$7F2156294FD0 ?branch
$7F2156294FD8 <my_mul+$A8>
0x00007f2155f3c062: add $0x38,%rbx
0x00007f2155f3c066: mov 0x8(%r13),%rax
0x00007f2155f3c06a: add $0x10,%r13
0x00007f2155f3c06e: mov -0x8(%rbx),%rsi
0x00007f2155f3c072: cmp %r8,%rax
0x00007f2155f3c075: mov 0x0(%r13),%r8
0x00007f2155f3c079: jl 0x7f2155f3c083
0x00007f2155f3c07b: mov (%rsi),%rax
0x00007f2155f3c07e: mov %rsi,%rbx
0x00007f2155f3c081: jmp *%rax
$7F2156294FE0 @local0 1->2
0x00007f2155f3c083: mov 0x0(%rbp),%r15
$7F2156294FE8 @local2 2->3
0x00007f2155f3c087: mov 0x10(%rbp),%r9
$7F2156294FF0 + 3->2
0x00007f2155f3c08b: add %r9,%r15
$7F2156294FF8 @local1 2->1
0x00007f2155f3c08e: mov %r15,-0x8(%r13)
0x00007f2155f3c092: sub $0x10,%r13
0x00007f2155f3c096: mov %r8,0x10(%r13)
0x00007f2155f3c09a: mov 0x8(%rbp),%r8
$7F2156295000 lit+ 1->1
$7F2156295008 #1
0x00007f2155f3c09e: add 0x28(%rbx),%r8
$7F2156295010 lp+2 1->1
0x00007f2155f3c0a2: add $0x10,%rbp
$7F2156295018 branch 1->1
$7F2156295020 <my_mul+$28>
0x00007f2155f3c0a6: mov 0x40(%rbx),%rbx
0x00007f2155f3c0aa: mov (%rbx),%rax
0x00007f2155f3c0ad: jmp *%rax
0x00007f2155f3c0af: nop
$7F2156295028 @local0 1->1
0x00007f2155f3c0b0: mov %r8,0x0(%r13)
0x00007f2155f3c0b4: sub $0x8,%r13
0x00007f2155f3c0b8: mov 0x0(%rbp),%r8
$7F2156295030 lp+!# 1->1
$7F2156295038 #32
0x00007f2155f3c0bc: add $0x18,%rbx
0x00007f2155f3c0c0: add -0x8(%rbx),%rbp
$7F2156295040 ;s 1->1
0x00007f2155f3c0c4: mov (%r14),%rbx
0x00007f2155f3c0c7: add $0x8,%r14
0x00007f2155f3c0cb: mov (%rbx),%rax
0x00007f2155f3c0ce: jmp *%rax
Note that the + (integer addition) at $7F2156294FF0 is implemented with one instruction. There is also a "heavy-weight" VM instruction lit+ that results from combining the sequence 1 + ; it also results in one instruction (on AMD64).
The + of s+x could then be compiled into
add %r9,%r15
jo slow_path
and that could be done in a copy-and-patch system.
Because that is basically the performance gain (up to 10%) you get if you compile a Python program with Cython. Cython without type annotations does exactly that: it replaces the interpreter loop with optimized C code. But that improvement is insignificant, because most of the program's time is spent in the Python runtime library.