LWN: Comments on "Improving performance in Python 2.7" https://lwn.net/Articles/646888/ This is a special feed containing comments posted to the individual LWN article titled "Improving performance in Python 2.7". en-us Mon, 22 Sep 2025 22:35:35 +0000 Mon, 22 Sep 2025 22:35:35 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Improving performance in Python 2.7 https://lwn.net/Articles/648249/ https://lwn.net/Articles/648249/ anton <blockquote> If computed gotos are faster than a switch in GCC, then it seems GCC is missing an important optimization. </blockquote> Labels-as-values are a more basic feature than switch; the "goto *" has to do less than a switch, and cannot be simulated efficiently with a switch, while you can use labels-as-values to implement a switch; if you did that, and the labels-as-values version was faster than the switch, then yes, the implementation of switch by the compiler writer would be suboptimal. But being faster than switch for interpreter dispatch is not a sign that the compiler writers did a mistake, because intrpreter dispatch does not need all the stuff that switch does (in particular range checking). <p>However, the main benefit is coming from better branch prediction. One can get that with labels-as-values, and one can also get this with switch (see next paragraph); there are some gcc versions where gcc "optimizes" the code by combining all the "goto *" into one (e.g. PR15242), resulting in a branch prediction as bad as if you do the classic single switch. In this case, the compiler writers have really blown it badly. <p>Interestingly, if you replicate the switch (one per VM instruction), the same gcc versions do not perform the same "optimization" (<a rel="nofollow" href="http://compilers.iecc.com/comparch/article/07-05-046">data</a>). The switch-replication technique is interesting, but (with a typical C compiler) will have a significant space consumption for dispatch tables (and cache misses from that), because the dispatch tables are replicated as well. <p>Ideally, a compiler that sees a loop containing a switch should do the replication of the indirect branches itself (without replicating the dispatch tables). A student of mine implemented that in gcc, and found that it is pretty hard, because this kind of thing goes against the grain of gcc (that's probably also why combining all "goto *" into one is coming back regularly). <p>BTW, labels-as-values and "goto *" are like Fortran's "assigned goto", while Fortran's "computed goto" is more like switch; still, the gcc maintainers usually call that feature "computed goto". Tue, 16 Jun 2015 14:04:24 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647850/ https://lwn.net/Articles/647850/ njs <div class="FormattedComment"> A lot of the motivation for the type hints work has been to consolidate the existing type inference stuff that's being done independently by PyCharm, Google's internal linting tools, Microsoft's Visual-Studio-for-Python, etc. So what's new in 3.5 isn't so much the existence of type inference, but that now there's a standard way for these different projects to collaborate with each other and with third-party libraries using a common language.<br> </div> Thu, 11 Jun 2015 06:07:45 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647555/ https://lwn.net/Articles/647555/ eru <pre><i> ... andl $7, %edi cmpl $7, %edi ... </i></pre> <p> "gcc -O3" (version 4.8.3) manages to avoid that redundant comparison: <pre> my_switch: .LFB0: .cfi_startproc andl $7, %edi xorl %eax, %eax jmp *.L4(,%rdi,8) .section .rodata .align 8 .align 4 .L4: .quad .L2 .... </pre> Tue, 09 Jun 2015 04:36:49 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647543/ https://lwn.net/Articles/647543/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; Modern C compilers already do optimise away range checks if the switch is fully populated and within a known range (such as a power of 2).</font><br> <p> Well, clang is considered a modern C compiler, right?<br> <p> int my_switch(int number)<br> {<br> switch (number &amp; 0x7) {<br> case 0: return fn1(); break;<br> case 1: return fn2(); break;<br> case 2: return fn3(); break;<br> case 3: return fn4(); break;<br> case 4: return fn5(); break;<br> case 5: return fn6(); break;<br> case 6: return fn7(); break;<br> case 7: return fn8(); break;<br> }<br> crash();<br> return 0;<br> }<br> <p> $ clang -O3 -save-temps -c foo.c<br> $ cat foo.s<br> [...]<br> .globl my_switch<br> .align 16, 0x90<br> .type my_switch,@function<br> my_switch: # @my_switch<br> .cfi_startproc<br> # BB#0:<br> pushq %rax<br> .Ltmp0:<br> .cfi_def_cfa_offset 16<br> # kill: EDI&lt;def&gt; EDI&lt;kill&gt; RDI&lt;def&gt;<br> andl $7, %edi<br> cmpl $7, %edi<br> jbe .LBB0_1<br> # BB#10:<br> xorl %eax, %eax<br> callq crash<br> xorl %eax, %eax<br> popq %rdx<br> retq<br> .LBB0_1:<br> jmpq *.LJTI0_0(,%rdi,8)<br> .LBB0_2:<br> xorl %eax, %eax<br> popq %rdx<br> jmp fn1 # TAILCALL<br> [...]<br> .section .rodata,"a",@progbits<br> .align 8<br> .LJTI0_0:<br> .quad .LBB0_2<br> .quad .LBB0_3<br> [...]<br> </div> Mon, 08 Jun 2015 23:59:12 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647526/ https://lwn.net/Articles/647526/ Yorick <div class="FormattedComment"> Modern C compilers already do optimise away range checks if the switch is fully populated and within a known range (such as a power of 2). Don't take my word for it; write a test program and look at what GCC emits.<br> <p> </div> Mon, 08 Jun 2015 20:12:56 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647333/ https://lwn.net/Articles/647333/ eru <p>I personally would put in a range check (or other method of ensuring an invalid opcode would not crash) even if the opcode came from other parts of the same program. Just because I know I and other programmers are fallible. Checks like this have often helped me. <p>About the switch optimization, of course putting in a 256-entry jump table for every switch(somechar) does not make sense, but if there are around 50 cases or more, it would be OK, and that is pretty common (branching for each printable character, for example). Sat, 06 Jun 2015 05:25:27 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647325/ https://lwn.net/Articles/647325/ madscientist <div class="FormattedComment"> <font class="QuotedText">&gt; it will crash the interpreter uncontrollably if an invalid opcode is seen</font><br> <p> Yes, exactly, and since it's your code that's up to you. But the compiler can't do that.<br> <p> <font class="QuotedText">&gt; I think this is not robust programming, skipping the range check is cheating.</font><br> <p> It's only not robust and cheating if you're not sure that the value will always be within the appropriate range. If you write your code such that it will always be the case that the opcode is valid (say, because your interpreter generated it so you know that the opcode can only be one of a preset list) then it's perfectly robust and "within the rules". But the compiler can't assume that (except in the most trivial situations, perhaps).<br> <p> <font class="QuotedText">&gt; But a smart compiler could perform the equivalent optimization for switches when the value is [various restrictive assumptions]</font><br> <p> Yes, that's true. But, the question is how worthwhile is it for someone to magick up the compiler to do these optimizations for the very limited situations in which those restrictive assumptions hold true? For example, I would say that the type of the switch expression is very often int, not char, and a C compiler very rarely can know what the "guaranteed small range" might be (maybe in C++ with the enum class more could be done here). Even when you know the type of the expression is char, is it worthwhile to generate 256 * sizeof(void*) arrays for every switch statement just to get this level of speed improvement?<br> </div> Fri, 05 Jun 2015 21:21:04 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647321/ https://lwn.net/Articles/647321/ eru <p>If the computed goto interpreter is as shown in http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables, it will crash the interpreter uncontrollably if an invalid opcode is seen, because the table access has no range check. I think this is not robust programming, skipping the range check is cheating. One way around this would be to have 256 entries in the table, which makes the range check unnecessary, when indexed by a byte and the unused entries are set to jump to an error handler. But a smart compiler could perform the equivalent optimization for switches when the value is a byte or other guaranteed small range, like the result of code[pc++] &amp; SUITABLE_MASK. Fri, 05 Jun 2015 20:49:24 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647314/ https://lwn.net/Articles/647314/ madscientist <div class="FormattedComment"> If you read the discussion you'll see that switch cannot be as fast as the computed goto solution, because of requirements on switch behavior mandated by the C standard that the computed goto solution doesn't need to follow.<br> </div> Fri, 05 Jun 2015 18:34:43 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647272/ https://lwn.net/Articles/647272/ eru <i>If the programmer can simulate a construct faster than the compiler can implement the construct itself, then the compiler writer has blown it badly</i><p> - Guy L. Steele Jr. (quoted in "Bumper-sticker Computer Science", in Jon Bentleys "Programming Pearls" column, CACM September 1985).<p> If computed gotos are faster than a switch in GCC, then it seems GCC is missing an important optimization. Fri, 05 Jun 2015 09:50:52 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647258/ https://lwn.net/Articles/647258/ voltagex <div class="FormattedComment"> <a href="http://python-future.org/compatible_idioms.html">http://python-future.org/compatible_idioms.html</a><br> </div> Fri, 05 Jun 2015 05:56:39 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647247/ https://lwn.net/Articles/647247/ dlang <div class="FormattedComment"> avoiding improving Python 2 does encourage people to move off of it. But that's all it does. It doesn't encourage them to move to Python 3, just off of Python 2. Some subset of those users will move to Python 3, but some other subset will be so disgusted with this sort of coercion that they will not only move away from Python entirely, but will become vocal opponents of Python and work to encourage others to move away from Python as well.<br> </div> Thu, 04 Jun 2015 23:21:57 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647235/ https://lwn.net/Articles/647235/ arjan <div class="FormattedComment"> Being more of a casual python programmer, the 2/3 split is very painful. For me, the best way to encourage people to go from 2 to 3 is actually to make 2 MORE like 3, at least on a language level. Let p2 accept more of the p3 syntax, and I'll write p3 code! <br> (and I'm pretty sure I'm not the only one who'd do that)<br> <p> Well for various cases I try to use/write p3 code anyway, but various linux distros are not helping by making it harder-than-needed to get p3 (and the various addon components) for it installed.<br> <p> </div> Thu, 04 Jun 2015 21:09:57 +0000 Improving performance in Python 2.7 https://lwn.net/Articles/647221/ https://lwn.net/Articles/647221/ dashesy <div class="FormattedComment"> Type hints will go to 3.5 but after upgrading my PyCharm I noticed it already uses it for type inference (on a version 2 code base), a very pleasant surprise.<br> </div> Thu, 04 Jun 2015 17:01:35 +0000