Making Python faster
The Python core developers, and Victor Stinner in particular, have been focusing on improving the performance of Python 3 over the last few years. At PyCon 2017, Stinner gave a talk on some of the optimizations that have been added recently and the effect they have had on various benchmarks. Along the way, he took a detour into some improvements that have been made for benchmarking Python.
He started his talk by noting that he has been working on porting OpenStack to Python 3 as part of his day job at Red Hat. So far, most of the unit tests are passing. That means that an enormous Python program (with some 3 million lines of code) has largely made the transition to the Python 3 world.
Benchmarks
Back in March 2016, developers did not really trust the Python benchmark suite, he said. The benchmark results were not stable, which made it impossible to tell if a particular optimization made CPython (the Python reference implementation) faster or slower. So he set out to improve that situation.
He created a new module, perf, as a framework for running benchmarks. It calibrates the number of loops to run the benchmark based on a time budget. Each benchmark run then consists of sequentially spawning twenty processes, each of which performs the appropriate number of loops three times. That generates 60 time values; the average and standard deviation are calculated from those. He noted that the standard deviation can be used to spot problems in the benchmark or the system; if it is large, meaning lots of variation, that could indicate a problem.
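The aggregation Stinner described — twenty worker processes, three timed loops each, then a mean and standard deviation over the resulting 60 values — can be sketched with the standard library. This is only a single-process simulation of that structure (the `run_benchmark()` helper is hypothetical, not the perf module's actual implementation, which spawns real child processes):

```python
import statistics
import time

def run_benchmark(func, processes=20, values_per_process=3):
    # perf spawns real child processes; this sketch just reproduces
    # the same 20 x 3 = 60 sample structure in a single process
    samples = []
    for _ in range(processes):
        for _ in range(values_per_process):
            start = time.perf_counter()
            func()
            samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    # a large standard deviation relative to the mean hints at an
    # unstable system or a badly calibrated benchmark
    return mean, stdev, len(samples)
```

Running `run_benchmark(lambda: sum(range(1000)))` yields 60 samples, mirroring the 20-process, three-value scheme the talk described.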
Using perf has provided more stable and predictable results, he said. That has led to a new Python performance benchmark suite. It is being used at the speed.python.org site to provide benchmark numbers for Python. Part of that work has resulted in CPython being compiled with link-time optimization and profile-guided optimization by default.
The perf module has a "system tune" command that can be used to tune a Linux system for benchmarking. That includes using a fixed CPU frequency, rather than allowing each core's frequency to change all the time, disabling the Intel Turbo Boost feature, using CPU pinning, and running the benchmarks on an isolated CPU if that feature is enabled in the kernel.
Having stable benchmarks makes it much easier to spot a performance regression, Stinner said. For a real example, he pointed to a graph in his slides [PDF] that showed the python_startup benchmark time increasing dramatically during the development of 3.6 (from 20ms to 27ms). The problem was a new import in the code; the fix dropped the benchmark to 17ms.
The speed.python.org site allows developers to look at a timeline of the performance of CPython since April 2014 on various benchmarks. Sometimes it makes sense to focus on micro-benchmarks, he said, but the timelines of the larger benchmarks can be even more useful for finding regressions.
Stinner put up a series of graphs showing that 3.6 is faster than 3.5 and 2.7 on multiple benchmarks. He chose the most significant changes to show in the graphs, and there are a few benchmarks that go against these trends. The differences between 3.6 and 2.7 are larger than those for 3.6 versus 3.5, which is probably not a huge surprise.
The SymPy benchmarks show some of the largest performance increases. They are 22-42% faster in 3.6 than they are in 2.7. The largest increase, though, was on the telco benchmark, which is 40x faster on 3.6 versus 2.7. That is because the decimal module was rewritten in C for Python 3.3.
Preliminary results indicate that the in-development Python 3.7 is faster than 3.6 as well. Some optimizations were merged just after the 3.6 release; they had been held back out of concern that they might introduce regressions, he said.
Optimizations
Stinner then turned to some of the optimizations that have made those benchmarks faster. For 3.5, several developers rewrote the functools.lru_cache() decorator in C. That made the SymPy benchmarks 20% faster. The cache is "quite complex" with many corner cases, which made it hard to get right. In fact, it took three and a half years to close the bug associated with it.
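The decorator itself is easy to demonstrate from Python; it is the C implementation behind it, with its locking and corner cases, that took years to get right. A classic memoization example:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache; maxsize=128 is the default
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

fib(100)                 # completes instantly thanks to memoization
print(fib.cache_info())  # hit/miss counters exposed by the cache
```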
Another 3.5 optimization was for ordered dictionaries (collections.OrderedDict). Rewriting it in C made the html5lib benchmark 20% faster, but it was also tricky code. It took two and a half years to close that bug, he said.
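collections.OrderedDict behaves like a dictionary that remembers insertion order and adds a couple of reordering operations; the C rewrite changed its speed, not its semantics:

```python
from collections import OrderedDict

d = OrderedDict()
d["first"] = 1
d["second"] = 2
d["third"] = 3
d.move_to_end("first")  # reordering operation that plain dicts lack
print(list(d))          # ['second', 'third', 'first']
```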
Moving on to optimizations for 3.6, he described the change he made for memory allocation in CPython. Instead of using PyMem_Malloc() for smaller allocations, he switched to the Python fast memory allocator that is used for Python objects. It only changed two lines of code, but resulted in many benchmarks getting 5-22% faster—and no benchmarks ran slower.
The xml.etree.ElementTree.iterparse() routine was optimized in response to a PyCon Canada 2015 keynote [YouTube video] by Brett Cannon. That resulted in the etree_parse and etree_iterparse benchmarks running twice as fast, which Stinner called "quite good". As noted in the bug report, though, it is still somewhat slower than 2.7.
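iterparse() delivers elements incrementally as the parser finishes them, which lets a program process a large document without holding the whole tree in memory, roughly like this:

```python
import io
import xml.etree.ElementTree as ET

doc = b"<root><item>1</item><item>2</item></root>"
values = []
# By default iterparse() reports "end" events, i.e. fully parsed elements
for _event, elem in ET.iterparse(io.BytesIO(doc)):
    if elem.tag == "item":
        values.append(elem.text)
        elem.clear()  # drop the element's contents to keep memory flat
print(values)  # ['1', '2']
```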
The profile-guided optimization for CPython was improved by using the Python test suite. Previously, CPython would be compiled twice using the pidigits module to guide the optimization. That only tested a few, math-oriented Python functions, so using the test suite instead covers more of the interpreter. That resulted in many benchmarks showing 5-27% improvement just by changing the build process.
In 3.6, Python's virtual machine moved from a bytecode to a "wordcode". Instead of instructions being either one or three bytes long, all instructions are now two bytes long. That removed an if statement from the hot path in ceval.c (the main execution loop).
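The two-byte encoding is visible from Python itself: a compiled code object's raw bytecode length is always a multiple of two (this check assumes a 3.6-or-later interpreter):

```python
import dis

code = compile("a + b * c", "<demo>", "eval")
# Since 3.6 every instruction is opcode + one-byte argument, two bytes total
assert len(code.co_code) % 2 == 0
dis.dis(code)  # each offset in the listing advances by two
```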
Stinner added a way to make C function calls faster using a new internal _PyObject_FastCall() routine. Creating and destroying the tuple that is used to call C functions would take around 20ns, which is expensive if the call itself is only, say, 100ns. So the new function dispenses with creating the tuple to pass the function arguments. It shows a 12-50% speedup for many micro-benchmarks.
He also optimized the ASCII and UTF-8 codecs when using the "ignore", "replace", "surrogateescape", and "surrogatepass" error handlers. Those codecs were full of "bad code", he said. His work resulted in UTF-8 decoding becoming 15x faster and encoding 75x faster. For ASCII, decoding is now 60x faster, while encoding is 3x faster.
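The error handlers determine what happens when a byte sequence is not valid for the codec. For example, decoding a stray Latin-1 byte as UTF-8:

```python
data = b"caf\xe9"  # '\xe9' alone is not valid UTF-8

print(data.decode("utf-8", "replace"))  # 'caf\ufffd' (U+FFFD)
decoded = data.decode("utf-8", "surrogateescape")
print(ascii(decoded))                   # 'caf\udce9'
# surrogateescape round-trips: the original bytes come back on encode
assert decoded.encode("utf-8", "surrogateescape") == data
```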
Python 3.5 added byte-string formatting back into the language as a result of PEP 461, but the code was inefficient. He reworked it to use the internal _PyBytesWriter interface, which resulted in 2-3x speedups for those types of operations.
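PEP 461 restored printf-style formatting for bytes objects, which early Python 3 releases had dropped:

```python
# %-formatting on bytes, back since 3.5 thanks to PEP 461
print(b"%s:%d" % (b"count", 42))  # b'count:42'
print(b"%04x" % 255)              # b'00ff'
```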
There were also improvements to the filename pattern matching or "globbing" operations (in the glob module and in the pathlib.Path.glob() routine). Those improved glob by 3-6x and pathlib globbing by 1.5-4x by using the new os.scandir() iterator that was added in Python 3.5.
The last 3.6 optimization that Stinner described was an improvement for the asyncio module that increased the performance of some asynchronous programs by 30%. The asyncio.Future and asyncio.Task classes were rewritten in C (for reference, here is the bug for Future and the bug for Task).
There are "lots of ideas" for optimizations for 3.7, Stinner said, but he is not sure which will be implemented or if they will be helpful. One that has been merged already is to add new opcodes (LOAD_METHOD and CALL_METHOD) to support making method calls as fast calls, which makes method calls 10-20% faster. It is an idea that has come to CPython from PyPy.
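The effect can be seen with the dis module: on 3.7, an attribute call compiles to LOAD_METHOD/CALL_METHOD, which avoids creating a temporary bound-method object for each call (later interpreter versions have reshuffled these opcodes again, so the exact names in the listing depend on the Python version):

```python
import dis

def shout(s):
    return s.upper()

# On 3.7 this listing shows LOAD_METHOD and CALL_METHOD for s.upper()
dis.dis(shout)
```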
He concluded his talk by pointing out that on some benchmarks, Python 3.7 is still slower than 2.7. Most of those are on the order of 10-20% slower, but the python_startup benchmarks are 2-3x slower. There is a need to find a way to optimize interpreter startup in Python 3. There are, of course, more opportunities to optimize the language and he encouraged those interested to check out speed.python.org, as well as his Faster CPython site (which he mentioned in his Python Language Summit session earlier in the week).
A YouTube video of Stinner's talk is also available.
[I would like to thank the Linux Foundation for travel assistance to
Portland for PyCon.]
| Index entries for this article | |
|---|---|
| Conference | PyCon/2017 |
