Making Python faster
The Python core developers, and Victor Stinner in particular, have been focusing on improving the performance of Python 3 over the last few years. At PyCon 2017, Stinner gave a talk on some of the optimizations that have been added recently and the effect they have had on various benchmarks. Along the way, he took a detour into some improvements that have been made for benchmarking Python.
He started his talk by noting that he has been working on porting OpenStack to Python 3 as part of his day job at Red Hat. So far, most of the unit tests are passing. That means that an enormous Python program (with some 3 million lines of code) has largely made the transition to the Python 3 world.
Benchmarks
Back in March 2016, developers did not really trust the Python benchmark suite, he said. The benchmark results were not stable, which made it impossible to tell if a particular optimization made CPython (the Python reference implementation) faster or slower. So he set out to improve that situation.
He created a new module, perf, as a framework for running benchmarks. It calibrates the number of loops to run the benchmark based on a time budget. Each benchmark run then consists of sequentially spawning twenty processes, each of which performs the appropriate number of loops three times. That generates 60 time values; the average and standard deviation are calculated from those. He noted that the standard deviation can be used to spot problems in the benchmark or the system; if it is large, meaning lots of variation, that could indicate a problem.
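As a rough illustration, a benchmark built on the perf module (since renamed pyperf) looks something like the sketch below; the Runner object handles the loop calibration and process spawning described above. The benchmark name and statement are made up for the example, and the exact API may differ between versions of the module.

```python
# Minimal sketch of a perf-based benchmark; Runner() calibrates the loop
# count, forks worker processes, and aggregates the timing values.
import perf

runner = perf.Runner()
runner.timeit("sort a small list",
              stmt="sorted(s)",
              setup="s = list(range(1000))")
```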
Using perf has provided more stable and predictable results, he said. That has led to a new Python performance benchmark suite. It is being used at the speed.python.org site to provide benchmark numbers for Python. Part of that work has resulted in CPython being compiled with link-time optimization and profile-guided optimization by default.
The perf module has a "system tune" command that can be used to tune a Linux system for benchmarking. That includes using a fixed CPU frequency, rather than allowing each core's frequency to change all the time, disabling the Intel Turbo Boost feature, using CPU pinning, and running the benchmarks on an isolated CPU if that feature is enabled in the kernel.
Having stable benchmarks makes it much easier to spot a performance regression, Stinner said. For a real example, he pointed to a graph in his slides [PDF] that showed the python_startup benchmark time increasing dramatically during the development of 3.6 (from 20ms to 27ms). The problem was a new import in the code; the fix dropped the benchmark to 17ms.
The speed.python.org site allows developers to look at a timeline of the performance of CPython since April 2014 on various benchmarks. Sometimes it makes sense to focus on micro-benchmarks, he said, but the timelines of the larger benchmarks can be even more useful for finding regressions.
Stinner put up a series of graphs showing that 3.6 is faster than 3.5 and 2.7 on multiple benchmarks. He chose the most significant changes to show in the graphs, and there are a few benchmarks that go against these trends. The differences between 3.6 and 2.7 are larger than those for 3.6 versus 3.5, which is probably not a huge surprise.
The SymPy benchmarks show some of the largest performance increases. They are 22-42% faster in 3.6 than they are in 2.7. The largest increase, though, was on the telco benchmark, which is 40x faster on 3.6 versus 2.7. That is because the decimal module was rewritten in C for Python 3.3.
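The telco benchmark is dominated by Decimal arithmetic of roughly the following shape (this is only an illustration, not the benchmark itself), which is why moving the decimal module to C paid off so dramatically:

```python
# Illustrative decimal-heavy loop of the kind the telco benchmark
# exercises: pricing calls and rounding with banker's rounding.
from decimal import Decimal, ROUND_HALF_EVEN

rate = Decimal("0.0013")
total = Decimal("0.00")
for seconds in (123, 456, 789):
    price = (rate * seconds).quantize(Decimal("0.01"),
                                      rounding=ROUND_HALF_EVEN)
    total += price
print(total)
```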
Preliminary results indicate that the in-development Python 3.7 is faster than 3.6, as well. There were some optimizations that were merged just after the 3.6 release; there were worries about regressions, which is why they were held back, he said.
Optimizations
Stinner then turned to some of the optimizations that have made those benchmarks faster. For 3.5, several developers rewrote the functools.lru_cache() decorator in C. That made the SymPy benchmarks 20% faster. The cache is "quite complex" with many corner cases, which made it hard to get right. In fact, it took three and a half years to close the bug associated with it.
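From the caller's point of view the decorator is unchanged; code like the following (a generic example, not SymPy's) simply got much cheaper per cache hit:

```python
# functools.lru_cache() usage; in 3.5 the cache bookkeeping moved to C,
# so repeated hits such as fib(98) and fib(99) below cost far less.
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))
print(fib.cache_info())   # hit/miss counters kept by the cache
```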
Another 3.5 optimization was for ordered dictionaries (collections.OrderedDict). Rewriting it in C made the html5lib benchmark 20% faster, but it was also tricky code. It took two and a half years to close that bug, he said.
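Again, the Python-level behavior is unchanged; code like this arbitrary example (not html5lib's own) now runs on the C implementation:

```python
# collections.OrderedDict preserves insertion order and supports O(1)
# reordering; html5lib relies on it heavily while parsing.
from collections import OrderedDict

attrs = OrderedDict()
attrs["class"] = "header"
attrs["id"] = "top"
attrs.move_to_end("class")
print(list(attrs.items()))   # [('id', 'top'), ('class', 'header')]
```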
Moving on to optimizations for 3.6, he described the change he made for memory allocation in CPython. Instead of using PyMem_Malloc() for smaller allocations, he switched to the Python fast memory allocator that is used for Python objects. It only changed two lines of code, but resulted in many benchmarks getting 5-22% faster—and no benchmarks ran slower.
The xml.etree.ElementTree.iterparse() routine was optimized in response to a PyCon Canada 2015 keynote [YouTube video] by Brett Cannon. That resulted in the etree_parse and etree_iterparse benchmarks running twice as fast, which Stinner called "quite good". As noted in the bug report, though, it is still somewhat slower than 2.7.
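The benchmark exercises the streaming style of parsing shown below (a toy document, just to show the API being measured):

```python
# xml.etree.ElementTree.iterparse() streams through a document without
# building the whole tree up front.
import io
import xml.etree.ElementTree as ET

doc = io.StringIO("<root>" + "<item>x</item>" * 3 + "</root>")
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "item":
        elem.clear()          # drop elements that are no longer needed
```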
The profile-guided optimization for CPython was improved by using the Python test suite. Previously, CPython would be compiled twice using the pidigits module to guide the optimization. That only tested a few, math-oriented Python functions, so using the test suite instead covers more of the interpreter. That resulted in many benchmarks showing 5-27% improvement just by changing the build process.
In 3.6, Python moved from using a bytecode for its virtual machine to a "wordcode". Instead of either having one or three bytes per instruction, now all instructions are two bytes long. That removed an if statement from the hot path in ceval.c (the main execution loop).
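The change is easy to see with the dis module: on 3.6 and later, the byte offsets in a disassembly advance in steps of two.

```python
# Disassembling a trivial function; on 3.6+ every instruction is two
# bytes (opcode plus argument), so offsets run 0, 2, 4, ...
import dis

def add(a, b):
    return a + b

dis.dis(add)
```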
Stinner added a way to make C function calls faster using a new internal _PyObject_FastCall() routine. Creating and destroying the tuple that is used to call C functions would take around 20ns, which is expensive if the call itself is only, say, 100ns. So the new function dispenses with creating the tuple to pass the function arguments. It shows a 12-50% speedup for many micro-benchmarks.
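The fast-call machinery is internal to CPython, but a quick timeit run (shown purely to illustrate the scale) makes clear why roughly 20ns of tuple handling mattered: a cheap C-level call is itself only on the order of 100ns.

```python
# Measuring the per-call cost of a cheap C function; tuple creation and
# destruction at ~20ns is a large fraction of a call this small.
import timeit

n = 10**6
per_call = timeit.timeit("len(s)", setup="s = 'abc'", number=n) / n
print("%.1f ns per len() call" % (per_call * 1e9))
```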
He also optimized the ASCII and UTF-8 codecs when using the "ignore", "replace", "surrogateescape", and "surrogatepass" error handlers. Those codecs were full of "bad code", he said. His work made UTF-8 decoding 15x faster and encoding 75x faster; for ASCII, decoding is now 60x faster, while encoding is 3x faster.
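The optimized paths are the ones exercised by code like this, which decodes and encodes with the non-strict error handlers:

```python
# ASCII/UTF-8 codecs combined with the error handlers Stinner optimized.
data = b"caf\xe9"                                   # not valid UTF-8
text = data.decode("utf-8", errors="surrogateescape")
assert text.encode("utf-8", errors="surrogateescape") == data
print(b"ab\xff".decode("ascii", errors="replace"))  # 'ab\ufffd'
print("caf\xe9".encode("ascii", errors="ignore"))   # b'caf'
```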
Python 3.5 added byte-string formatting back into the language as a result of PEP 461, but the code was inefficient. He reimplemented it on top of the internal _PyBytesWriter API, which resulted in 2-3x speedups for those types of operations.
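That covers operations of this form, reintroduced by PEP 461:

```python
# Byte-string %-formatting (PEP 461), the code path that was rebuilt on
# top of the internal _PyBytesWriter API.
header = b"%s %d %x" % (b"HTTP", 200, 255)
print(header)   # b'HTTP 200 ff'
```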
There were also improvements to the filename pattern matching or "globbing" operations (in the glob module and in the pathlib.Path.glob() routine). Those improved glob by 3-6x and pathlib globbing by 1.5-4x by using the new os.scandir() iterator that was added in Python 3.5.
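From Python code the modules are used exactly as before; os.scandir() is what they now call under the hood, and it can also be used directly:

```python
# glob and pathlib globbing, plus the os.scandir() iterator they now use
# internally; scandir returns cached file-type info for each entry.
import glob
import os
from pathlib import Path

print(glob.glob("*.py"))
print(list(Path(".").glob("**/*.py")))

with os.scandir(".") as entries:
    subdirs = [e.name for e in entries if e.is_dir(follow_symlinks=False)]
print(subdirs)
```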
The last 3.6 optimization that Stinner described was an improvement for the asyncio module that increased the performance of some asynchronous programs by 30%. The asyncio.Future and asyncio.Task classes were rewritten in C (for reference, here is the bug for Future and the bug for Task).
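Programs that create many tasks and futures, along the lines of this small and purely illustrative example, are the ones that benefit:

```python
# Each ensure_future() call creates a Task wrapping a Future; with the C
# implementations that per-task bookkeeping is much cheaper.
import asyncio

async def work(n):
    await asyncio.sleep(0)
    return n * 2

async def main():
    tasks = [asyncio.ensure_future(work(i)) for i in range(5)]
    print(await asyncio.gather(*tasks))

asyncio.get_event_loop().run_until_complete(main())
```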
There are "lots of ideas" for optimizations for 3.7, Stinner said, but he is not sure which will be implemented or if they will be helpful. One that has been merged already is to add new opcodes (LOAD_METHOD and CALL_METHOD) to support making method calls as fast calls, which makes method calls 10-20% faster. It is an idea that has come to CPython from PyPy.
He concluded his talk by pointing out that on some benchmarks, Python 3.7 is still slower than 2.7. Most of those are on the order of 10-20% slower, but the python_startup benchmarks are 2-3x slower. There is a need to find a way to optimize interpreter startup in Python 3. There are, of course, more opportunities to optimize the language and he encouraged those interested to check out speed.python.org, as well as his Faster CPython site (which he mentioned in his Python Language Summit session earlier in the week).
A YouTube video of Stinner's talk is also available.
[I would like to thank the Linux Foundation for travel assistance to Portland for PyCon.]
| Index entries for this article | |
|---|---|
| Conference | PyCon/2017 |
Posted Jun 18, 2017 17:18 UTC (Sun) by fujimotos (guest, #111905)

What is the implication of this trend for security? Rewriting modules in C can introduce a number of nasty memory-related bugs, especially if these modules involve tricky data structures.

Yes, C is definitely faster and will improve benchmark results. But a whole class of issues (overflow, memory corruption etc.) can be avoided if we stick to Python. So: will the trend make CPython more insecure? Or is this not the case?
Posted Jun 15, 2017 11:30 UTC (Thu) by mb (subscriber, #50428)

In my application the speed of 3.5 is about the same as that of 2.7. However I can't confirm that 3.6 got even faster. In fact it is quite a bit slower for me.

$ python2.7 ./awlsim-test examples/EXAMPLE.awlpro
Speed: 195.28 k stmt/s (= 5.121 us/stmt) 55.0 stmt/cycle
$ python3.5 ./awlsim-test examples/EXAMPLE.awlpro
Speed: 196.27 k stmt/s (= 5.095 us/stmt) 55.0 stmt/cycle
$ python3.6 ./awlsim-test examples/EXAMPLE.awlpro
Speed: 181.10 k stmt/s (= 5.522 us/stmt) 55.0 stmt/cycle

Is there anything wrong with these? Perhaps not using LTO or profile guided optimization?
I'm using the Debian/sid versions of Python.
Posted Jun 15, 2017 13:04 UTC (Thu) by vstinner (subscriber, #42675)
Posted Jun 15, 2017 17:41 UTC (Thu) by mb (subscriber, #50428)

Well, there are some articles and talks that claim Python 3.6 would be faster than Python 2.7. That's not true per se. There might be a lot of cases where 3.6 is faster than 2.7, but as long as there is a single thing where 3.6 is slower than 2.7 and my application hits it, that means 3.6 is slower.

I don't use fancy Python 3 features. So this is 2.7 stuff that got slower in 3.x, not new 3.x features that got faster during the 3.x development. Most of the time in my application is spent by doing method calls and attribute lookup. The performance critical part basically does not call into libraries.

> You should try to measure the difference, profile the code using cProfile

My code has cProfile support built in and it is a matter of setting an environment variable for the performance critical parts to be measured. So I think I have a pretty good idea of where most of the time is spent. However, for me it's hard to relate that to Python itself. And profiling does not help at all to get an idea of why I don't see 3.5/3.6 being faster than 2.7. It could possibly only give a clue about the regression between 3.5 and 3.6.

In the end I think this shows one thing: the Python test suite does not cover "my" use case very well. And I would be very surprised if I was the only one with that problem. So in the end Python developers should be interested in explaining performance regressions on "random applications", because "random applications" are the real world. Test cases that show an impressive performance improvement are not.
Posted Jun 15, 2017 21:35 UTC (Thu) by vstinner (subscriber, #42675)

And the last slide says that Python 3.6 is still 10-20% slower in a few specific benchmarks, and up to 3x slower for the startup time.

> Most of the time in my application is spent by doing method calls and attribute lookup.

I would be interested in a benchmark on Python 3.7, since we finally implemented a CALL_METHOD bytecode to optimize method calls ;-)

> It could possibly only give a clue about the regression between 3.5 and 3.6.

Yes, according to your benchmark there is a clear regression: 3.6 is slower than 3.5. It would help to know if your Python binary was built using PGO. Sadly, I don't know how to check that from Python :-( You have to look at how Python was built in your Linux distro.

> So in the end Python developers should be interested in explaining performance regressions on "random applications"

I am interested. I just wanted to say that I'm sorry, I have no clue since I don't know your application at all. Please contact me directly to explain me how to run your benchmark.
Posted Jun 16, 2017 9:22 UTC (Fri) by mb (subscriber, #50428)

I just ran a quick test using the latest git versions of 3.6 and 3.7 with and without PGO:

with pgo
$ python3.7 ./awlsim-test examples/EXAMPLE.awlpro
Speed: 215.80 k stmt/s (= 4.634 us/stmt) 55.0 stmt/cycle

without pgo
$ python3.7 ./awlsim-test examples/EXAMPLE.awlpro
Speed: 191.10 k stmt/s (= 5.233 us/stmt) 55.0 stmt/cycle

with pgo
$ python3.6 ./awlsim-test examples/EXAMPLE.awlpro
Speed: 201.55 k stmt/s (= 4.961 us/stmt) 55.0 stmt/cycle

without pgo
$ python3.6 ./awlsim-test examples/EXAMPLE.awlpro
Speed: 183.01 k stmt/s (= 5.464 us/stmt) 55.0 stmt/cycle

So I guess that Debian does not enable PGO. This might also make sense w.r.t. stable builds.

The improvement in 3.7 looks quite impressive. The non-PGO variant is still a bit slower than 2.7, but it is pretty damn close. I think the goal must be to beat 2.7 performance without PGO, because compiling two versions of a program with different compilers and then claiming one version is faster is cheating. :)

> I am interested. I just wanted to say that I'm sorry, I have no clue since I don't know your application at all. Please contact me directly to explain me how to run your benchmark.

Just clone the git repository and run the command as shown above. That's it. :)
Disclaimer: awlsim-test is not a proper benchmark. It's just a rough estimation.

Thanks a lot for your support.
Posted Jun 16, 2017 10:52 UTC (Fri) by vstinner (subscriber, #42675)

https://fosdem.org/2017/schedule/event/python_stable_benc...

And perf documentation:
http://perf.readthedocs.io/en/latest/run_benchmark.html#h...
and
http://perf.readthedocs.io/en/latest/system.html
Posted Jun 16, 2017 11:00 UTC (Fri) by mb (subscriber, #50428)

I did not check this. I only concluded it from my benchmark results. So I might be completely wrong.
Posted Jun 16, 2017 11:14 UTC (Fri) by vstinner (subscriber, #42675)

It's unclear to me how to run correctly your benchmark, since the speed changes depending on the time. So I chose to run the program and interrupt it (CTRL+c) when it displays "t: 10.2s" at the top.

I compiled manually the 3.5 and 3.6 Git branches using "./configure --with-lto && make" and I get a different trend than you:

* Python 3.5 @ t: 10.2s: 190.83 k stmt/s
* Python 3.6 @ t: 10.2s: 196.36 k stmt/s

In my case, Python 3.6 is 1.03x faster (3%) than Python 3.5. Well, basically it runs at the same speed.

My system Python (I'm running Fedora 25):

* python2 (2.7) @ t: 10.2s: 197.74 k stmt/s
* python3 (3.5) @ t: 10.2s: 198.65 k stmt/s

Again, no significant difference between python2 and python3 here.

Note: Fedora doesn't use PGO to build python2 and python3 packages. (But they are working on enabling it!)
Posted Jun 16, 2017 14:36 UTC (Fri) by mb (subscriber, #50428)

> It's unclear to me how to run correctly your benchmark, since the speed changes depending on the time.

Yes, it's a bit jerky and I'll have to work on this. It doesn't do CPU pinning and other things to get rock stable results.

Thanks a lot for digging into this. I very much appreciate your work on Python and it enables me to eventually drop Python 2 support. That would be great. :)
