Improving performance in Python 2.7
Backporting a major performance improvement from Python 3 to Python 2 might seem to be in the "no-brainer" category, but things are not quite that simple. Python 2 (in the form of Python 2.7.x) is in a "no new features" mode that would normally preclude large changes, but 2.7 will be around for a lot longer than was previously envisioned. That makes the core development team more willing to consider this kind of patch, especially since it seems to come with a promise of more contributions. As with any topic that touches on the 2 vs. 3 question, though, it led to a long discussion and some dissent.
Vamsi Parasa of the Server Scripting Languages Optimization team at Intel posted a patch to change the switch statement that executes Python bytecode in the CPython interpreter to use computed gotos instead. GCC has support for the feature, which has already been used as an optimization in Python 3. As Eli Bendersky explained in a 2012 blog post, the enormous (2000+ line) switch statement for interpreting Python bytecodes can be made 15-20% faster by changing it to use computed gotos. There are two reasons for that: computed gotos avoid a bounds check that is required by the C99 standard for switch and, perhaps more significantly, the CPU can do better branch prediction, which reduces expensive pipeline flushes.
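To make the comparison concrete, here is a minimal sketch of the two dispatch styles, in the spirit of Bendersky's post rather than the actual CPython interpreter loop; the toy opcodes and handlers are invented for illustration, and the computed-goto version relies on GCC's labels-as-values extension (&&label and goto *):

#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* Classic dispatch: one switch, and therefore one shared indirect branch
   (plus the bounds check the standard requires) for every opcode. */
static int run_switch(const unsigned char *code)
{
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
    }
}

/* Computed-goto dispatch: each handler ends with its own indirect branch,
   giving the branch predictor per-opcode history, and no bounds check is
   emitted (the code simply assumes every opcode is valid). */
static int run_goto(const unsigned char *code)
{
    static void *table[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0;

    goto *table[*code++];
op_inc:  acc++; goto *table[*code++];
op_dec:  acc--; goto *table[*code++];
op_halt: return acc;
}

int main(void)
{
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_INC, OP_HALT };
    printf("%d %d\n", run_switch(prog), run_goto(prog));   /* prints "2 2" */
    return 0;
}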
Several core developers spoke up in favor of the patch, but Berker Peksağ was not so sure. He complained that "performance improvements are not bug fixes" and that it would be better to spend time making Python 3 better. But there is more to this patch than meets the eye, Nick Coghlan explained:
So when folks ask "What changed?" in relation to this request, what changed is the fact that it isn't expected to be a one off contribution, but rather part of a broader effort focused on improving the performance of both Python 2 and Python 3, including contributions to ongoing maintenance activities.
As was noted in an email exchange tacked onto Parasa's post, Intel has hired Python core developer David Murray's company to help it navigate the Python development process for the contributions it wants to make. But those contributions "came with one string attached: that the Python 2.7 branch be opened up for performance improvements in addition to bug fixes", Coghlan said. Because the proposal is "backed by a credible offer of ongoing contributions to CPython maintenance and support", it is different from other performance enhancements (or features) offered up for Python 2.7 along the way, he continued.
But accepting the offer does signal something of a shift in the way Python 2.7 will be maintained going forward. Encouraging more developers who are being paid to do the "boring" parts of Python maintenance (bug fixes and performance enhancements for 2.7, mostly) will allow volunteers to concentrate on the more interesting bits. Coghlan again:
Python's benevolent dictator for life (BDFL) Guido van Rossum was "strongly in favor" of the patch, noting that it could save companies like his employer, Dropbox, "a lot of money". Dropbox has been slow to move to Python 3 but regularly updates to the latest Python 2.7 release, he said. But Victor Stinner wondered if it made more sense to put that effort into Python 3 in the hopes of getting more migration to that version of Python.
Van Rossum was clearly not happy with the idea of crippling Python 2 to somehow promote Python 3:
As Larry Hastings pointed out, though, the question had already become moot. Python 2.7 release manager Benjamin Peterson had merged the patch for 2.7.11, which is slated for release in December.
The rate of Python 3 adoption and the best way to encourage it are often somewhat contentious subjects in Python circles. Though many have bemoaned the lack of new features in Python 2.7, a large part of that decision has been driven by a lack of core developers interested in continuing to work on that branch. Adding more paid developers should help alleviate some of those concerns. It seems unlikely to lead to any wholesale changes to 2.7, but there may be enhancements that can be made in the next few years. Meanwhile, new features like async/await, type hints, and others will continue to be added to Python 3, hopefully providing a carrot to draw more and more users to that version of the language. Or at least that is the plan ...
Posted Jun 4, 2015 21:09 UTC (Thu) by arjan (subscriber, #36785)
Well, for various cases I try to use/write p3 code anyway, but various Linux distros are not helping by making it harder than needed to get p3 (and the various add-on components for it) installed.
Posted Jun 5, 2015 9:50 UTC (Fri) by eru (subscriber, #2753)
"If the programmer can simulate a construct faster than the compiler can implement the construct itself, then the compiler writer has blown it badly"
- Guy L. Steele Jr. (quoted in "Bumper-sticker Computer Science", in Jon Bentley's "Programming Pearls" column, CACM September 1985).

If computed gotos are faster than a switch in GCC, then it seems GCC is missing an important optimization.
Posted Jun 5, 2015 20:49 UTC (Fri) by eru (subscriber, #2753)
If the computed goto interpreter is as shown in http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables, it will crash the interpreter uncontrollably if an invalid opcode is seen, because the table access has no range check. I think this is not robust programming; skipping the range check is cheating. One way around this would be to have 256 entries in the table, which makes the range check unnecessary when the table is indexed by a byte and the unused entries are set to jump to an error handler. But a smart compiler could perform the equivalent optimization for switches when the value is a byte or has some other guaranteed small range, like the result of code[pc++] & SUITABLE_MASK.
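As a rough sketch of that workaround (invented opcodes, not code from CPython or from the blog post), a table with one entry per possible byte value, whose unused slots jump to an error handler, could look like this; it uses GCC's labels-as-values and [first ... last] range-designator extensions:

#include <stdio.h>

enum { OP_INC = 1, OP_DEC, OP_HALT };   /* 0 and everything above 3 are invalid */

static int run(const unsigned char *code)
{
    /* 256 entries, so indexing with an unsigned char can never go out of
       bounds; the range designator defaults every slot to the error handler
       and the named opcodes then override their own slots. */
    static void *table[256] = {
        [0 ... 255] = &&bad_opcode,
        [OP_INC]    = &&op_inc,
        [OP_DEC]    = &&op_dec,
        [OP_HALT]   = &&op_halt,
    };
    int acc = 0;

    goto *table[*code++];
op_inc:     acc++; goto *table[*code++];
op_dec:     acc--; goto *table[*code++];
op_halt:    return acc;
bad_opcode: fprintf(stderr, "invalid opcode\n");
            return -1;
}

int main(void)
{
    const unsigned char good[] = { OP_INC, OP_INC, OP_HALT };
    const unsigned char bad[]  = { OP_INC, 0x7f, OP_HALT };
    printf("%d\n", run(good));   /* prints "2" */
    printf("%d\n", run(bad));    /* prints "-1" after the error message */
    return 0;
}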
Posted Jun 5, 2015 21:21 UTC (Fri) by madscientist (subscriber, #16861)
Yes, exactly, and since it's your code, that's up to you. But the compiler can't do that.
> I think this is not robust programming, skipping the range check is cheating.
It's only not robust and cheating if you're not sure that the value will always be within the appropriate range. If you write your code such that it will always be the case that the opcode is valid (say, because your interpreter generated it so you know that the opcode can only be one of a preset list) then it's perfectly robust and "within the rules". But the compiler can't assume that (except in the most trivial situations, perhaps).
> But a smart compiler could perform the equivalent optimization for switches when the value is [various restrictive assumptions]
Yes, that's true. But, the question is how worthwhile is it for someone to magick up the compiler to do these optimizations for the very limited situations in which those restrictive assumptions hold true? For example, I would say that the type of the switch expression is very often int, not char, and a C compiler very rarely can know what the "guaranteed small range" might be (maybe in C++ with the enum class more could be done here). Even when you know the type of the expression is char, is it worthwhile to generate 256 * sizeof(void*) arrays for every switch statement just to get this level of speed improvement?
Posted Jun 6, 2015 5:25 UTC (Sat) by eru (subscriber, #2753)
I personally would put in a range check (or other method of ensuring an invalid opcode would not crash) even if the opcode came from other parts of the same program. Just because I know I and other programmers are fallible. Checks like this have often helped me.
About the switch optimization, of course putting in a 256-entry jump table for every switch(somechar) does not make sense, but if there are around 50 cases or more, it would be OK, and that is pretty common (branching for each printable character, for example).
Posted Jun 8, 2015 23:59 UTC (Mon) by cesarb (subscriber, #6266)
Well, clang is considered a modern C compiler, right?
int my_switch(int number)
{
switch (number & 0x7) {
case 0: return fn1(); break;
case 1: return fn2(); break;
case 2: return fn3(); break;
case 3: return fn4(); break;
case 4: return fn5(); break;
case 5: return fn6(); break;
case 6: return fn7(); break;
case 7: return fn8(); break;
}
crash();
return 0;
}
$ clang -O3 -save-temps -c foo.c
$ cat foo.s
[...]
.globl my_switch
.align 16, 0x90
.type my_switch,@function
my_switch: # @my_switch
.cfi_startproc
# BB#0:
pushq %rax
.Ltmp0:
.cfi_def_cfa_offset 16
# kill: EDI<def> EDI<kill> RDI<def>
andl $7, %edi
cmpl $7, %edi
jbe .LBB0_1
# BB#10:
xorl %eax, %eax
callq crash
xorl %eax, %eax
popq %rdx
retq
.LBB0_1:
jmpq *.LJTI0_0(,%rdi,8)
.LBB0_2:
xorl %eax, %eax
popq %rdx
jmp fn1 # TAILCALL
[...]
.section .rodata,"a",@progbits
.align 8
.LJTI0_0:
.quad .LBB0_2
.quad .LBB0_3
[...]
Posted Jun 9, 2015 4:36 UTC (Tue) by eru (subscriber, #2753)

...
andl $7, %edi
cmpl $7, %edi
...

"gcc -O3" (version 4.8.3) manages to avoid that redundant comparison:
my_switch:
.LFB0:
.cfi_startproc
andl $7, %edi
xorl %eax, %eax
jmp *.L4(,%rdi,8)
.section .rodata
.align 8
.align 4
.L4:
.quad .L2
....
Posted Jun 16, 2015 14:04 UTC (Tue) by anton (subscriber, #25547)

> If computed gotos are faster than a switch in GCC, then it seems GCC is missing an important optimization.

Labels-as-values are a more basic feature than switch; the "goto *" has to do less than a switch, and cannot be simulated efficiently with a switch, while you can use labels-as-values to implement a switch. If you did that, and the labels-as-values version was faster than the switch, then yes, the implementation of switch by the compiler writer would be suboptimal. But being faster than switch for interpreter dispatch is not a sign that the compiler writers made a mistake, because interpreter dispatch does not need all the stuff that switch does (in particular range checking).

However, the main benefit comes from better branch prediction. One can get that with labels-as-values, and one can also get it with switch (see next paragraph); there are some gcc versions where gcc "optimizes" the code by combining all the "goto *" into one (e.g. PR15242), resulting in branch prediction as bad as with the classic single switch. In this case, the compiler writers have really blown it badly.

Interestingly, if you replicate the switch (one per VM instruction), the same gcc versions do not perform the same "optimization" (data). The switch-replication technique is interesting, but (with a typical C compiler) it will have significant space consumption for dispatch tables (and cache misses from that), because the dispatch tables are replicated as well (a sketch of the technique follows this comment).

Ideally, a compiler that sees a loop containing a switch should replicate the indirect branches itself (without replicating the dispatch tables). A student of mine implemented that in gcc, and found that it is pretty hard, because this kind of thing goes against the grain of gcc (that's probably also why combining all "goto *" into one keeps coming back).

BTW, labels-as-values and "goto *" are like Fortran's "assigned goto", while Fortran's "computed goto" is more like switch; still, the gcc maintainers usually call the feature "computed goto".
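For illustration, here is a minimal sketch of the switch-replication technique described above (toy opcodes, not code from any of the interpreters discussed): each handler ends with its own copy of the dispatch switch, so each VM instruction gets its own indirect branch, and with a typical compiler each copy may also get its own jump table:

#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* Every use of DISPATCH() expands to a separate switch, i.e. a separate
   indirect branch with its own branch-prediction history. */
#define DISPATCH()                 \
    switch (*ip++) {               \
    case OP_INC:  goto do_inc;     \
    case OP_DEC:  goto do_dec;     \
    case OP_HALT: goto do_halt;    \
    default:      goto do_bad;     \
    }

static int run(const unsigned char *ip)
{
    int acc = 0;

    DISPATCH();                    /* initial dispatch */
do_inc:  acc++; DISPATCH();        /* each handler re-dispatches on its own */
do_dec:  acc--; DISPATCH();
do_halt: return acc;
do_bad:  fprintf(stderr, "invalid opcode\n");
         return -1;
}

int main(void)
{
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("%d\n", run(prog));     /* prints "1" */
    return 0;
}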