Shaw: Python 3.13 gets a JIT

Posted Jan 11, 2024 15:21 UTC (Thu) by Wol (subscriber, #4433)
In reply to: Shaw: Python 3.13 gets a JIT by anton
Parent article: Shaw: Python 3.13 gets a JIT

> So yes, if you want a fast Python implementation, you also have to reduce the weight of each VM instruction implementation (at least the frequently-executed ones)

Or increase the weight (as in the amount of real work each individual instruction does), which again reduces the relative impact of the interpreter.

Cheers,
Wol

Shaw: Python 3.13 gets a JIT

Posted Jan 11, 2024 15:44 UTC (Thu) by qwertyface (subscriber, #84167)

Traditionally CPython has used pretty chunky operations, but those are not very convenient for any sort of optimisation, so recent work has been on lowering them to what the developers call "tier-2" bytecode, which is more amenable to the sort of optimisations that the parent comment calls for. Naturally, this makes interpretation (much) slower, and this JIT seems to reclaim some of that performance.

Shaw: Python 3.13 gets a JIT

Posted Jan 11, 2024 17:35 UTC (Thu) by anton (subscriber, #25547)

Python has followed the path of heavy-weight operations since the start, and where that works, it's fine, e.g., when there is just a little work in Python that calls library functions written in C, and the lion's share of the work is done in that library. But where it does not work, it results in programs that run much slower than C code and also quite a bit slower than code run by a code-copying interpreter with light-weight operations.

E.g., I translated my_mul above into Forth code that's close to the Python version (using locals and a while loop) rather than idiomatic:

: my_mul {: x y -- s :}
    x 1 begin {: s i :}
        i y < while
            s x +
            i 1 +
    repeat
    s ;

gforth-fast, a code-copying interpreter (without patching), produces the following code:

$7F2156294F80 >l    1->1 
   0x00007f2155f3c002:  mov    %rbp,%rax
   0x00007f2155f3c005:  add    $0x8,%r13
   0x00007f2155f3c009:  lea    -0x8(%rbp),%rbp
   0x00007f2155f3c00d:  mov    %r8,-0x8(%rax)
   0x00007f2155f3c011:  mov    0x0(%r13),%r8
$7F2156294F88 >l    1->0 
   0x00007f2155f3c015:  mov    %rbp,%rax
   0x00007f2155f3c018:  lea    -0x8(%rbp),%rbp
   0x00007f2155f3c01c:  mov    %r8,-0x8(%rax)
$7F2156294F90 @local0    0->1 
   0x00007f2155f3c020:  mov    0x0(%rbp),%r8
$7F2156294F98 lit    1->1 
$7F2156294FA0 #1 
   0x00007f2155f3c024:  mov    %r8,0x0(%r13)
   0x00007f2155f3c028:  sub    $0x8,%r13
   0x00007f2155f3c02c:  mov    0x20(%rbx),%r8
   0x00007f2155f3c030:  add    $0x28,%rbx
$7F2156294FA8 >l    1->1 
   0x00007f2155f3c034:  mov    %rbp,%rax
   0x00007f2155f3c037:  add    $0x8,%r13
   0x00007f2155f3c03b:  lea    -0x8(%rbp),%rbp
   0x00007f2155f3c03f:  mov    %r8,-0x8(%rax)
   0x00007f2155f3c043:  mov    0x0(%r13),%r8
$7F2156294FB0 >l    1->0 
   0x00007f2155f3c047:  mov    %rbp,%rax
   0x00007f2155f3c04a:  lea    -0x8(%rbp),%rbp
   0x00007f2155f3c04e:  mov    %r8,-0x8(%rax)
$7F2156294FB8 @local1    0->1 
   0x00007f2155f3c052:  mov    0x8(%rbp),%r8
$7F2156294FC0 @local3    1->1 
   0x00007f2155f3c056:  mov    %r8,0x0(%r13)
   0x00007f2155f3c05a:  mov    0x18(%rbp),%r8
   0x00007f2155f3c05e:  sub    $0x8,%r13
$7F2156294FC8 < ?branch     1->1 
$7F2156294FD0 ?branch
$7F2156294FD8 <my_mul+$A8> 
   0x00007f2155f3c062:  add    $0x38,%rbx
   0x00007f2155f3c066:  mov    0x8(%r13),%rax
   0x00007f2155f3c06a:  add    $0x10,%r13
   0x00007f2155f3c06e:  mov    -0x8(%rbx),%rsi
   0x00007f2155f3c072:  cmp    %r8,%rax
   0x00007f2155f3c075:  mov    0x0(%r13),%r8
   0x00007f2155f3c079:  jl     0x7f2155f3c083
   0x00007f2155f3c07b:  mov    (%rsi),%rax
   0x00007f2155f3c07e:  mov    %rsi,%rbx
   0x00007f2155f3c081:  jmp    *%rax
$7F2156294FE0 @local0    1->2 
   0x00007f2155f3c083:  mov    0x0(%rbp),%r15
$7F2156294FE8 @local2    2->3 
   0x00007f2155f3c087:  mov    0x10(%rbp),%r9
$7F2156294FF0 +    3->2 
   0x00007f2155f3c08b:  add    %r9,%r15
$7F2156294FF8 @local1    2->1 
   0x00007f2155f3c08e:  mov    %r15,-0x8(%r13)
   0x00007f2155f3c092:  sub    $0x10,%r13
   0x00007f2155f3c096:  mov    %r8,0x10(%r13)
   0x00007f2155f3c09a:  mov    0x8(%rbp),%r8
$7F2156295000 lit+    1->1 
$7F2156295008 #1 
   0x00007f2155f3c09e:  add    0x28(%rbx),%r8
$7F2156295010 lp+2    1->1 
   0x00007f2155f3c0a2:  add    $0x10,%rbp
$7F2156295018 branch    1->1 
$7F2156295020 <my_mul+$28> 
   0x00007f2155f3c0a6:  mov    0x40(%rbx),%rbx
   0x00007f2155f3c0aa:  mov    (%rbx),%rax
   0x00007f2155f3c0ad:  jmp    *%rax
   0x00007f2155f3c0af:  nop
$7F2156295028 @local0    1->1 
   0x00007f2155f3c0b0:  mov    %r8,0x0(%r13)
   0x00007f2155f3c0b4:  sub    $0x8,%r13
   0x00007f2155f3c0b8:  mov    0x0(%rbp),%r8
$7F2156295030 lp+!#    1->1 
$7F2156295038 #32 
   0x00007f2155f3c0bc:  add    $0x18,%rbx
   0x00007f2155f3c0c0:  add    -0x8(%rbx),%rbp
$7F2156295040 ;s    1->1 
   0x00007f2155f3c0c4:  mov    (%r14),%rbx
   0x00007f2155f3c0c7:  add    $0x8,%r14
   0x00007f2155f3c0cb:  mov    (%rbx),%rax
   0x00007f2155f3c0ce:  jmp    *%rax

Note that the + (integer addition) at $7F2156294FF0 is implemented with one machine instruction. There is also a "heavy-weight" VM instruction lit+ that results from combining the sequence 1 +. It, too, results in one machine instruction (on AMD64).
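To make the combining of 1 + into lit+ concrete, here is a toy sketch in Python. It is not how Gforth does it; the tuple encoding of VM instructions and the names combine_lit_plus and run are made up for illustration, and the pass ignores control flow (a real implementation must not combine across a branch target).

```python
def combine_lit_plus(code):
    """Fuse each ('lit', n), ('+',) pair into one ('lit+', n) instruction."""
    out = []
    i = 0
    while i < len(code):
        if (i + 1 < len(code)
                and code[i][0] == "lit"
                and code[i + 1] == ("+",)):
            out.append(("lit+", code[i][1]))
            i += 2  # consumed two VM instructions
        else:
            out.append(code[i])
            i += 1
    return out


def run(code, stack):
    """Tiny stack-VM interpreter; returns (stack, number of dispatches)."""
    dispatches = 0
    for op in code:
        dispatches += 1  # one dispatch per VM instruction executed
        if op[0] == "lit":
            stack.append(op[1])
        elif op[0] == "+":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op[0] == "lit+":
            stack[-1] += op[1]  # add the literal to the top of stack
        else:
            raise ValueError(f"unknown VM instruction: {op!r}")
    return stack, dispatches


original = [("lit", 1), ("+",)]
combined = combine_lit_plus(original)
```

Both versions compute the same result, but the combined one needs a single dispatch where the original needs two. In an interpreter with dispatch overhead that is the gain; in a code-copying system the gain comes from the shorter machine code.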

However, Forth has neither arbitrary-length integers nor run-time type checking. Python is by necessity more heavy-weight, but it should be possible to check at the start that s and x are small integers, and then compile the + of s+x into

add    %r9,%r15
jo     slow_path

and do that in a copy-and-patch system.
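As a rough model of that idea, the following Python sketch checks the types once before the loop, then uses a range test on each addition to stand in for the jo, and falls back to generic arithmetic on overflow. It only simulates what the generated machine code would do; the names is_small_int, generic_loop and my_mul_guarded are hypothetical, "small integer" is assumed to mean a signed 64-bit value, and the loop is a reconstruction of my_mul from the Forth code above.

```python
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def is_small_int(v):
    # Hypothetical guard: exactly an int (not bool/float) that fits in 64 bits.
    return type(v) is int and INT64_MIN <= v <= INT64_MAX


def generic_loop(s, i, x, y):
    # Slow path: ordinary Python arithmetic (arbitrary precision, any type).
    while i < y:
        s = s + x
        i = i + 1
    return s


def my_mul_guarded(x, y):
    """Return (result, path taken) for the repeated-addition loop."""
    # Type check hoisted out of the loop: done once at the start.
    if not (is_small_int(x) and is_small_int(y)):
        return generic_loop(x, 1, x, y), "generic"
    s, i = x, 1
    while i < y:
        t = s + x                            # models: add %r9,%r15
        if t < INT64_MIN or t > INT64_MAX:   # models: jo slow_path
            # s is not yet updated, so the slow path redoes this addition.
            return generic_loop(s, i, x, y), "slow_path"
        s = t
        i += 1  # cannot overflow, because i < y <= INT64_MAX
    return s, "fast"
```

As long as no addition overflows, s stays a small integer, so the guard at the start is enough and the loop body needs only the overflow test.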