LWN: Comments on "Improving performance in Python 2.7"

Improving performance in Python 2.7

anton — Tue, 16 Jun 2015 14:04:24 +0000

> If computed gotos are faster than a switch in GCC, then it seems GCC is missing an important optimization.

Labels-as-values are a more basic feature than switch: "goto *" has to do less than a switch and cannot be simulated efficiently with one, whereas labels-as-values can be used to implement a switch. If you did that and the labels-as-values version was faster than the compiler's switch, then yes, the compiler writers' implementation of switch would be suboptimal. But being faster than switch for interpreter dispatch is not a sign that the compiler writers made a mistake, because interpreter dispatch does not need everything that switch does (in particular, range checking).
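To make the "implement a switch with labels-as-values" point concrete, here is a minimal sketch of a three-case switch written by hand. The function name, the case values and the return values are invented for illustration, and it needs the GNU C labels-as-values extension (gcc, clang), so it is not ISO C. The explicit range check is the part that a general switch has to carry and that interpreter dispatch can leave out.

```c
#include <assert.h>

/* Hand-written equivalent of:
 *   switch (x) { case 0: return 10; case 1: return 20;
 *                case 2: return 30; default: return -1; }
 * All names and values here are hypothetical. */
int switch_by_hand(int x)
{
    /* Jump table of label addresses (GNU C extension). */
    static void *const targets[] = { &&case0, &&case1, &&case2 };

    /* Sanity check: the table must match the range test below. */
    assert(sizeof targets / sizeof targets[0] == 3);

    /* The range check a general switch needs; without it an
     * out-of-range x would index past the end of the table. */
    if (x < 0 || x > 2)
        goto dflt;

    goto *targets[x];

case0: return 10;
case1: return 20;
case2: return 30;
dflt:  return -1;
}
```

Calling switch_by_hand(1) takes the table jump, while switch_by_hand(7) is caught by the range check and lands on the default label.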

However, the main benefit comes from better branch prediction. One can get that with labels-as-values, and one can also get it with switch (see next paragraph). There are some gcc versions where gcc "optimizes" the code by combining all the "goto *" into one (e.g. PR15242), resulting in branch prediction as bad as with the classic single switch. In this case, the compiler writers have really blown it badly.
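A minimal sketch of the dispatch style in question, assuming a made-up three-instruction VM (the opcodes, the accumulator and the function name are all hypothetical). Every handler ends with its own "goto *", so each VM instruction gets a separate indirect branch for the predictor to learn, provided the compiler does not merge them. It relies on GNU C labels-as-values and deliberately has no range check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_INC, OP_DEC, OP_HALT };

/* Runs `code` until OP_HALT and returns the accumulator.
 * Needs GNU C labels-as-values (gcc, clang); not ISO C. */
int run_threaded(const unsigned char *code)
{
    /* Table of label addresses, indexed by opcode. */
    static void *const dispatch[] = { &&do_inc, &&do_dec, &&do_halt };
    size_t pc = 0;
    int acc = 0;

    assert(code != NULL);

/* No range check: an opcode outside the table is undefined behaviour. */
#define NEXT() goto *dispatch[code[pc++]]

    NEXT();
do_inc:
    acc++;
    NEXT();   /* this handler's own indirect branch */
do_dec:
    acc--;
    NEXT();   /* a different indirect branch from the one above */
do_halt:
    return acc;
#undef NEXT
}
```

Whether the generated code really keeps one indirect jump per handler depends on the compiler version and flags, so it is worth checking the assembly.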

Interestingly, if you replicate the switch (one per VM instruction), the same gcc versions do not perform the same "optimization" (data). The switch-replication technique is interesting, but (with a typical C compiler) it will consume significant space for dispatch tables (and incur cache misses as a result), because the dispatch tables are replicated as well.
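A rough sketch of switch replication in plain ISO C, again for a made-up three-instruction VM (opcodes, function name and the 0/-1 return convention are assumptions for illustration). The DISPATCH() macro pastes a full copy of the switch after every handler, which is also why a typical compiler ends up emitting one jump table per copy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_INC, OP_DEC, OP_HALT };

/* One full copy of the switch at every use site. */
#define DISPATCH()                 \
    switch (code[pc++]) {          \
    case OP_INC:  goto do_inc;     \
    case OP_DEC:  goto do_dec;     \
    case OP_HALT: goto do_halt;    \
    default:      goto do_bad;     \
    }

/* Stores the accumulator in *result and returns 0 on OP_HALT;
 * returns -1 if an unknown opcode is seen. */
int run_replicated(const unsigned char *code, int *result)
{
    size_t pc = 0;
    int acc = 0;

    assert(code != NULL && result != NULL);

    DISPATCH();
do_inc:
    acc++;
    DISPATCH();   /* second copy of the switch */
do_dec:
    acc--;
    DISPATCH();   /* third copy of the switch */
do_halt:
    *result = acc;
    return 0;
do_bad:
    return -1;
}
```

With three tiny handlers the extra tables are negligible; with a few hundred VM instructions the table count grows with the number of handlers, which is the space cost described above.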

Ideally, a compiler that sees a loop containing a switch should do the replication of the indirect branches itself (without replicating the dispatch tables). A student of mine implemented that in gcc, and found that it is pretty hard, because this kind of thing goes against the grain of gcc (that's probably also why combining all "goto *" into one is coming back regularly).

BTW, labels-as-values and "goto *" are like Fortran's "assigned goto", while Fortran's "computed goto" is more like switch; still, the gcc maintainers usually call that feature "computed goto".

njs — Thu, 11 Jun 2015 06:07:45 +0000

A lot of the motivation for the type hints work has been to consolidate the existing type inference stuff that's being done independently by PyCharm, Google's internal linting tools, Microsoft's Visual-Studio-for-Python, etc. So what's new in 3.5 isn't so much the existence of type inference, but that now there's a standard way for these different projects to collaborate with each other and with third-party libraries using a common language.

eru — Tue, 09 Jun 2015 04:36:49 +0000


...
andl $7, %edi
cmpl $7, %edi
...

"gcc -O3" (version 4.8.3) manages to avoid that redundant comparison:

my_switch:
.LFB0:
	.cfi_startproc
	andl	$7, %edi
	xorl	%eax, %eax
	jmp	*.L4(,%rdi,8)
	.section	.rodata
	.align 8
	.align 4
.L4:
	.quad	.L2
       ....

cesarb — Mon, 08 Jun 2015 23:59:12 +0000

> Modern C compilers already do optimise away range checks if the switch is fully populated and within a known range (such as a power of 2).

Well, clang is considered a modern C compiler, right?

int my_switch(int number)
{
    switch (number & 0x7) {
    case 0: return fn1(); break;
    case 1: return fn2(); break;
    case 2: return fn3(); break;
    case 3: return fn4(); break;
    case 4: return fn5(); break;
    case 5: return fn6(); break;
    case 6: return fn7(); break;
    case 7: return fn8(); break;
    }
    crash();
    return 0;
}

$ clang -O3 -save-temps -c foo.c
$ cat foo.s
[...]
.globl my_switch
.align 16, 0x90
.type my_switch,@function
my_switch: # @my_switch
.cfi_startproc
# BB#0:
pushq %rax
.Ltmp0:
.cfi_def_cfa_offset 16
# kill: EDI<def> EDI<kill> RDI<def>
andl $7, %edi
cmpl $7, %edi
jbe .LBB0_1
# BB#10:
xorl %eax, %eax
callq crash
xorl %eax, %eax
popq %rdx
retq
.LBB0_1:
jmpq *.LJTI0_0(,%rdi,8)
.LBB0_2:
xorl %eax, %eax
popq %rdx
jmp fn1 # TAILCALL
[...]
.section .rodata,"a",@progbits
.align 8
.LJTI0_0:
.quad .LBB0_2
.quad .LBB0_3
[...]

Yorick — Mon, 08 Jun 2015 20:12:56 +0000

Modern C compilers already do optimise away range checks if the switch is fully populated and within a known range (such as a power of 2). Don't take my word for it; write a test program and look at what GCC emits.

eru — Sat, 06 Jun 2015 05:25:27 +0000

I personally would put in a range check (or some other method of ensuring that an invalid opcode does not crash the interpreter) even if the opcode came from other parts of the same program, simply because I know that I and other programmers are fallible. Checks like this have often helped me.

About the switch optimization, of course putting in a 256-entry jump table for every switch(somechar) does not make sense, but if there are around 50 cases or more, it would be OK, and that is pretty common (branching for each printable character, for example).

madscientist — Fri, 05 Jun 2015 21:21:04 +0000

> it will crash the interpreter uncontrollably if an invalid opcode is seen

Yes, exactly, and since it's your code that's up to you. But the compiler can't do that.

> I think this is not robust programming, skipping the range check is cheating.

It's only not robust and cheating if you're not sure that the value will always be within the appropriate range. If you write your code such that it will always be the case that the opcode is valid (say, because your interpreter generated it so you know that the opcode can only be one of a preset list) then it's perfectly robust and "within the rules". But the compiler can't assume that (except in the most trivial situations, perhaps).

> But a smart compiler could perform the equivalent optimization for switches when the value is [various restrictive assumptions]

Yes, that's true. But, the question is how worthwhile is it for someone to magick up the compiler to do these optimizations for the very limited situations in which those restrictive assumptions hold true? For example, I would say that the type of the switch expression is very often int, not char, and a C compiler very rarely can know what the "guaranteed small range" might be (maybe in C++ with the enum class more could be done here). Even when you know the type of the expression is char, is it worthwhile to generate 256 * sizeof(void*) arrays for every switch statement just to get this level of speed improvement?

eru — Fri, 05 Jun 2015 20:49:24 +0000

If the computed goto interpreter is as shown in http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables, it will crash the interpreter uncontrollably if an invalid opcode is seen, because the table access has no range check. I think this is not robust programming, skipping the range check is cheating. One way around this would be to give the table 256 entries and index it with a byte, with the unused entries set to jump to an error handler; that makes the range check unnecessary. But a smart compiler could perform the equivalent optimization for switches when the value is a byte or has some other guaranteed small range, like the result of code[pc++] & SUITABLE_MASK.
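A minimal sketch of the 256-entry idea, assuming a made-up three-instruction VM (opcodes, function name and the 0/-1 return convention are hypothetical, and this is not the code from the linked article). Because the index is an unsigned char, every possible value has a table slot, and all unused slots point at an error handler. It uses GNU C labels-as-values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_INC, OP_DEC, OP_HALT };

/* Stores the accumulator in *result and returns 0 on OP_HALT;
 * returns -1 if an unassigned opcode is seen.
 * Needs GNU C labels-as-values (gcc, clang); not ISO C. */
int run_checked(const unsigned char *code, int *result)
{
    void *table[256];
    size_t pc = 0;
    int acc = 0;
    int i;

    assert(code != NULL && result != NULL);

    /* Default every slot to the error handler, then fill in the
     * real opcodes. The table is rebuilt on each call here to keep
     * the sketch short; a real interpreter would build it once. */
    for (i = 0; i < 256; i++)
        table[i] = &&do_bad;
    table[OP_INC]  = &&do_inc;
    table[OP_DEC]  = &&do_dec;
    table[OP_HALT] = &&do_halt;

/* code[pc++] is an unsigned char, so the index is always 0..255. */
#define NEXT() goto *table[code[pc++]]

    NEXT();
do_inc:
    acc++;
    NEXT();
do_dec:
    acc--;
    NEXT();
do_halt:
    *result = acc;
    return 0;
do_bad:
    return -1;
#undef NEXT
}
```

The cost is 256 * sizeof(void *) bytes for the table, in exchange for dispatch that cannot index out of bounds.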

madscientist — Fri, 05 Jun 2015 18:34:43 +0000

If you read the discussion you'll see that switch cannot be as fast as the computed goto solution, because of requirements on switch behavior mandated by the C standard that the computed goto solution doesn't need to follow.

eru — Fri, 05 Jun 2015 09:50:52 +0000

If the programmer can simulate a construct faster than the compiler can implement the construct itself, then the compiler writer has blown it badly.

- Guy L. Steele Jr. (quoted in "Bumper-sticker Computer Science", in Jon Bentley's "Programming Pearls" column, CACM, September 1985).

If computed gotos are faster than a switch in GCC, then it seems GCC is missing an important optimization.

voltagex — Fri, 05 Jun 2015 05:56:39 +0000

http://python-future.org/compatible_idioms.html

dlang — Thu, 04 Jun 2015 23:21:57 +0000

Avoiding improving Python 2 does encourage people to move off of it. But that's all it does. It doesn't encourage them to move to Python 3, just off of Python 2. Some subset of those users will move to Python 3, but some other subset will be so disgusted with this sort of coercion that they will not only move away from Python entirely, but will become vocal opponents of Python and work to encourage others to move away from Python as well.

arjan — Thu, 04 Jun 2015 21:09:57 +0000

As more of a casual Python programmer, I find the 2/3 split very painful. For me, the best way to encourage people to go from 2 to 3 is actually to make 2 MORE like 3, at least on a language level. Let p2 accept more of the p3 syntax, and I'll write p3 code!
(and I'm pretty sure I'm not the only one who'd do that)

Well, in various cases I try to use/write p3 code anyway, but various Linux distros are not helping: they make it harder than needed to get p3 (and the various add-on components for it) installed.

dashesy — Thu, 04 Jun 2015 17:01:35 +0000

Type hints will go into 3.5, but after upgrading my PyCharm I noticed it already uses them for type inference (on a version 2 code base), a very pleasant surprise.