Improving performance in Python 2.7
Backporting a major performance improvement from Python 3 to Python 2 might seem to be in the "no-brainer" category, but things are not quite that simple. Python 2 (in the form of Python 2.7.x) is in a "no new features" mode that would normally preclude large changes, but 2.7 will be around for a lot longer than was previously envisioned. That makes the core development team more willing to consider this kind of patch, especially since it seems to come with a promise of more contributions. As with any topic that touches on the 2 vs. 3 question, though, it led to a long discussion and some dissent.
Vamsi Parasa of the Server Scripting Languages Optimization team at Intel posted a patch to change the switch statement that executes Python bytecode in the CPython interpreter to use computed gotos instead. GCC has support for the feature, which has already been used as an optimization in Python 3. As Eli Bendersky explained in a 2012 blog post, the enormous (2000+ line) switch statement for interpreting Python bytecodes can be made 15-20% faster by changing it to use computed gotos. There are two reasons for that: computed gotos avoid a bounds check that is required by the C99 standard for switch and, perhaps more significantly, the CPU can do better branch prediction, which reduces expensive pipeline flushes.
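To make the comparison concrete, here is a minimal sketch of the two dispatch styles, in the spirit of Bendersky's post rather than the actual CPython interpreter loop; the toy opcodes and handlers are invented for illustration, and the computed-goto version relies on GCC's labels-as-values extension (&&label and goto *):

#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* Classic dispatch: one switch, and therefore one shared indirect branch
   (plus the bounds check the standard requires) for every opcode. */
static int run_switch(const unsigned char *code)
{
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
    }
}

/* Computed-goto dispatch: each handler ends with its own indirect branch,
   giving the branch predictor per-opcode history, and no bounds check is
   emitted (the code simply assumes every opcode is valid). */
static int run_goto(const unsigned char *code)
{
    static void *table[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0;

    goto *table[*code++];
op_inc:  acc++; goto *table[*code++];
op_dec:  acc--; goto *table[*code++];
op_halt: return acc;
}

int main(void)
{
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_INC, OP_HALT };
    printf("%d %d\n", run_switch(prog), run_goto(prog));   /* prints "2 2" */
    return 0;
}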
Several core developers spoke up in favor of the patch, but Berker Peksağ was not so sure. He complained that "performance improvements are not bug fixes" and that it would be better to spend time making Python 3 better. But there is more to this patch than meets the eye, Nick Coghlan explained:
So when folks ask "What changed?" in relation to this request, what changed is the fact that it isn't expected to be a one off contribution, but rather part of a broader effort focused on improving the performance of both Python 2 and Python 3, including contributions to ongoing maintenance activities.
As was noted in an email exchange tacked onto Parasa's post, Intel has hired Python core developer David Murray's company to help it navigate the Python development process for the contributions it wants to make. But those contributions "came with one string attached: that the Python 2.7 branch be opened up for performance improvements in addition to bug fixes", Coghlan said. Because the proposal is "backed by a credible offer of ongoing contributions to CPython maintenance and support", it is different from other performance enhancements (or features) offered up for Python 2.7 along the way, he continued.
But accepting the offer does signal something of a shift in the way Python 2.7 will be maintained going forward. Encouraging more developers who are being paid to do the "boring" parts of Python maintenance (bug fixes and performance enhancements for 2.7, mostly) will allow volunteers to concentrate on the more interesting bits. Coghlan again:
Python's benevolent dictator for life (BDFL) Guido van Rossum was "strongly in favor" of the patch, noting that it could save companies like his employer, Dropbox, "a lot of money". Dropbox has been slow to move to Python 3 but regularly updates to the latest Python 2.7 release, he said. But Victor Stinner wondered if it made more sense to put that effort into Python 3 in the hopes of getting more migration to that version of Python.
Van Rossum was clearly not happy with the idea of crippling Python 2 to somehow promote Python 3:
As Larry Hastings pointed out, though, the question had already become moot. Python 2.7 release manager Benjamin Peterson had merged the patch for 2.7.11, which is slated for release in December.
The rate of Python 3 adoption and the best way to encourage it are often somewhat contentious subjects in Python circles. Though many have bemoaned the lack of new features in Python 2.7, a large part of that decision has been driven by a lack of core developers interested in continuing to work on that branch. Adding more paid developers should help alleviate some of those concerns. It seems unlikely to lead to any wholesale changes to 2.7, but there may be enhancements that can be made in the next few years. Meanwhile, new features like async/await, type hints, and others will continue to be added to Python 3, hopefully providing a carrot to draw more and more users to that version of the language. Or at least that is the plan ...
Posted Jun 4, 2015 21:09 UTC (Thu) by arjan (subscriber, #36785)
Well, for various cases I try to use/write p3 code anyway, but various Linux distros are not helping by making it harder than needed to get p3 (and the various add-on components for it) installed.
Posted Jun 5, 2015 9:50 UTC (Fri) by eru (subscriber, #2753)
"If the programmer can simulate a construct faster than the compiler can implement the construct itself, then the compiler writer has blown it badly"
- Guy L. Steele Jr. (quoted in "Bumper-sticker Computer Science", in Jon Bentley's "Programming Pearls" column, CACM September 1985).

If computed gotos are faster than a switch in GCC, then it seems GCC is missing an important optimization.
Posted Jun 5, 2015 20:49 UTC (Fri) by eru (subscriber, #2753)
If the computed goto interpreter is as shown in http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables, it will crash the interpreter uncontrollably if an invalid opcode is seen, because the table access has no range check. I think this is not robust programming; skipping the range check is cheating. One way around this would be to have 256 entries in the table, which makes the range check unnecessary when the table is indexed by a byte and the unused entries are set to jump to an error handler. But a smart compiler could perform the equivalent optimization for switches when the value is a byte or has some other guaranteed small range, like the result of code[pc++] & SUITABLE_MASK.
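As a rough sketch of that workaround (invented opcodes, not code from CPython or from the blog post), a table with one entry per possible byte value, whose unused slots jump to an error handler, could look like this; it uses GCC's labels-as-values and [first ... last] range-designator extensions:

#include <stdio.h>

enum { OP_INC = 1, OP_DEC, OP_HALT };   /* 0 and everything above 3 are invalid */

static int run(const unsigned char *code)
{
    /* 256 entries, so indexing with an unsigned char can never go out of
       bounds; the range designator defaults every slot to the error handler
       and the named opcodes then override their own slots. */
    static void *table[256] = {
        [0 ... 255] = &&bad_opcode,
        [OP_INC]    = &&op_inc,
        [OP_DEC]    = &&op_dec,
        [OP_HALT]   = &&op_halt,
    };
    int acc = 0;

    goto *table[*code++];
op_inc:     acc++; goto *table[*code++];
op_dec:     acc--; goto *table[*code++];
op_halt:    return acc;
bad_opcode: fprintf(stderr, "invalid opcode\n");
            return -1;
}

int main(void)
{
    const unsigned char good[] = { OP_INC, OP_INC, OP_HALT };
    const unsigned char bad[]  = { OP_INC, 0x7f, OP_HALT };
    printf("%d\n", run(good));   /* prints "2" */
    printf("%d\n", run(bad));    /* prints "-1" after the error message */
    return 0;
}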
Posted Jun 5, 2015 21:21 UTC (Fri) by madscientist (subscriber, #16861)
Yes, exactly, and since it's your code, that's up to you. But the compiler can't do that.
> I think this is not robust programming, skipping the range check is cheating.
It's only not robust and cheating if you're not sure that the value will always be within the appropriate range. If you write your code such that it will always be the case that the opcode is valid (say, because your interpreter generated it so you know that the opcode can only be one of a preset list) then it's perfectly robust and "within the rules". But the compiler can't assume that (except in the most trivial situations, perhaps).
> But a smart compiler could perform the equivalent optimization for switches when the value is [various restrictive assumptions]
Yes, that's true. But, the question is how worthwhile is it for someone to magick up the compiler to do these optimizations for the very limited situations in which those restrictive assumptions hold true? For example, I would say that the type of the switch expression is very often int, not char, and a C compiler very rarely can know what the "guaranteed small range" might be (maybe in C++ with the enum class more could be done here). Even when you know the type of the expression is char, is it worthwhile to generate 256 * sizeof(void*) arrays for every switch statement just to get this level of speed improvement?
Posted Jun 6, 2015 5:25 UTC (Sat) by eru (subscriber, #2753)
I personally would put in a range check (or other method of ensuring an invalid opcode would not crash) even if the opcode came from other parts of the same program. Just because I know I and other programmers are fallible. Checks like this have often helped me.
About the switch optimization, of course putting in a 256-entry jump table for every switch(somechar) does not make sense, but if there are around 50 cases or more, it would be OK, and that is pretty common (branching for each printable character, for example).
Posted Jun 8, 2015 23:59 UTC (Mon) by cesarb (subscriber, #6266)
Well, clang is considered a modern C compiler, right?
int my_switch(int number)
{
switch (number & 0x7) {
case 0: return fn1(); break;
case 1: return fn2(); break;
case 2: return fn3(); break;
case 3: return fn4(); break;
case 4: return fn5(); break;
case 5: return fn6(); break;
case 6: return fn7(); break;
case 7: return fn8(); break;
}
crash();
return 0;
}
$ clang -O3 -save-temps -c foo.c
$ cat foo.s
[...]
.globl my_switch
.align 16, 0x90
.type my_switch,@function
my_switch: # @my_switch
.cfi_startproc
# BB#0:
pushq %rax
.Ltmp0:
.cfi_def_cfa_offset 16
# kill: EDI<def> EDI<kill> RDI<def>
andl $7, %edi
cmpl $7, %edi
jbe .LBB0_1
# BB#10:
xorl %eax, %eax
callq crash
xorl %eax, %eax
popq %rdx
retq
.LBB0_1:
jmpq *.LJTI0_0(,%rdi,8)
.LBB0_2:
xorl %eax, %eax
popq %rdx
jmp fn1 # TAILCALL
[...]
.section .rodata,"a",@progbits
.align 8
.LJTI0_0:
.quad .LBB0_2
.quad .LBB0_3
[...]
Posted Jun 9, 2015 4:36 UTC (Tue) by eru (subscriber, #2753)

...
andl $7, %edi
cmpl $7, %edi
...

"gcc -O3" (version 4.8.3) manages to avoid that redundant comparison:
my_switch:
.LFB0:
.cfi_startproc
andl $7, %edi
xorl %eax, %eax
jmp *.L4(,%rdi,8)
.section .rodata
.align 8
.align 4
.L4:
.quad .L2
....
Posted Jun 16, 2015 14:04 UTC (Tue) by anton (subscriber, #25547)

> If computed gotos are faster than a switch in GCC, then it seems GCC is missing an important optimization.

Labels-as-values are a more basic feature than switch; the "goto *" has to do less than a switch, and cannot be simulated efficiently with a switch, while you can use labels-as-values to implement a switch. If you did that, and the labels-as-values version was faster than the switch, then yes, the implementation of switch by the compiler writer would be suboptimal. But being faster than switch for interpreter dispatch is not a sign that the compiler writers made a mistake, because interpreter dispatch does not need all the stuff that switch does (in particular range checking).

However, the main benefit comes from better branch prediction. One can get that with labels-as-values, and one can also get it with switch (see next paragraph); there are some gcc versions where gcc "optimizes" the code by combining all the "goto *" into one (e.g. PR15242), resulting in branch prediction as bad as with the classic single switch. In this case, the compiler writers have really blown it badly.

Interestingly, if you replicate the switch (one per VM instruction), the same gcc versions do not perform the same "optimization" (data). The switch-replication technique is interesting, but (with a typical C compiler) it will have significant space consumption for dispatch tables (and cache misses from that), because the dispatch tables are replicated as well (a sketch of the technique follows this comment).

Ideally, a compiler that sees a loop containing a switch should replicate the indirect branches itself (without replicating the dispatch tables). A student of mine implemented that in gcc, and found that it is pretty hard, because this kind of thing goes against the grain of gcc (that's probably also why combining all "goto *" into one keeps coming back).

BTW, labels-as-values and "goto *" are like Fortran's "assigned goto", while Fortran's "computed goto" is more like switch; still, the gcc maintainers usually call the feature "computed goto".
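For illustration, here is a minimal sketch of the switch-replication technique described above (toy opcodes, not code from any of the interpreters discussed): each handler ends with its own copy of the dispatch switch, so each VM instruction gets its own indirect branch, and with a typical compiler each copy may also get its own jump table:

#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* Every use of DISPATCH() expands to a separate switch, i.e. a separate
   indirect branch with its own branch-prediction history. */
#define DISPATCH()                 \
    switch (*ip++) {               \
    case OP_INC:  goto do_inc;     \
    case OP_DEC:  goto do_dec;     \
    case OP_HALT: goto do_halt;    \
    default:      goto do_bad;     \
    }

static int run(const unsigned char *ip)
{
    int acc = 0;

    DISPATCH();                    /* initial dispatch */
do_inc:  acc++; DISPATCH();        /* each handler re-dispatches on its own */
do_dec:  acc--; DISPATCH();
do_halt: return acc;
do_bad:  fprintf(stderr, "invalid opcode\n");
         return -1;
}

int main(void)
{
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("%d\n", run(prog));     /* prints "1" */
    return 0;
}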