Speeding up CPython
Python, at least in the CPython reference implementation, is not a particularly speedy language. That is not at all surprising to anyone who has used it—the language is optimized for understandability and development speed, instead. There have been lots of efforts over the years to speed up various parts of the interpreter, compiler, and virtual-machine bytecode execution, though no comprehensive overhaul has been merged into CPython. An interesting new proposal could perhaps change that, though it is unclear at this point if it will take off.
Mark Shannon posted a message about his ideas to the python-dev mailing list in late October. He noted that CPython is slow, "yet little is done to fix it"; he would like to change that. He has a four-phase plan to get there on the technical side, along with some ideas on how to fund the effort. Beyond that, he had some thoughts on what was different from earlier efforts:
Here are three reasons:
- I already have working code for the first stage.
- I'm not promising a silver bullet. I recognize that this is a substantial amount of work and needs funding.
- I have extensive experience in VM [virtual machine] implementation, not to mention a PhD in the subject.
He pointed those interested to a GitHub repository where he had placed a few documents describing the plan. The overall idea is for each phase to increase CPython performance by 50%, which compounds to roughly a 5x speedup at the end of all four phases. The intent is for these improvements to apply widely: "Performance should improve for all code, not just loop-heavy and numeric code." He described a tiered execution model that would separate code into tiers based on how frequently it runs; each tier would be handled differently:
It is, of course, impossible to know in general which parts of a program will be "hot", that is run very frequently, and which parts will be "cold", that is [never] run or run only once. Not only that, but there is not simple distinction between hot and cold, but a range. Much code is "warm", that is run reasonably often, but considerably less than the hottest part of the program.
To address this range of runtime characteristics, several tiers of execution should be employed. We will number these 0 to 3. A minimum viable implementation would consist of tiers 0 and 1. Tier 2 can then be added later for improved performance. Tier 3 can then be added for better performance of longer running programs.
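The plan's documents spell out what each tier would actually do; purely as an illustration of the counter-based promotion that adaptive interpreters of this kind typically use, here is a toy sketch in Python (the thresholds, class, and function names are invented for the example, not taken from the proposal):

    # Toy illustration of tiered execution: a code object starts "cold"
    # (tier 0) and is promoted to warmer tiers as its call count crosses
    # thresholds.  Real tiers would swap in specialized bytecode, gather
    # type information, or JIT-compile; here promotion just prints.
    TIER_THRESHOLDS = {1: 10, 2: 1_000, 3: 100_000}   # invented values

    class TieredCode:
        def __init__(self, func):
            self.func = func      # the tier-0 (plain interpretation) path
            self.calls = 0
            self.tier = 0

        def __call__(self, *args, **kwargs):
            self.calls += 1
            for tier, threshold in TIER_THRESHOLDS.items():
                if self.calls == threshold:
                    self.tier = tier
                    print(f"promoting {self.func.__name__} to tier {tier}")
            return self.func(*args, **kwargs)

    @TieredCode
    def hot_loop(n):
        return sum(i * i for i in range(n))

    for _ in range(20):       # crosses the tier-1 threshold on call 10
        hot_loop(1_000)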
The funding plan is rather creative and different, but also seems to be drawing some scrutiny—and perhaps controversy. Each phase would require roughly $500,000; half of that would be used to pay for the development and the other half would be held by the organization that is coordinating the funding for use in the maintenance of CPython. He suggests the Python Software Foundation (PSF) as the obvious choice for that role.
In order to reduce the risks of a contractor for a specific phase being unable to deliver the specified performance increase, payment would only be made on delivery. For the first phase, Shannon has working code, so he is essentially suggesting that the PSF raise the funds to engage him as the contractor for that phase. But that approach raised the eyebrows of Steven D'Aprano, who asked what happens to the code if the PSF decides not to pursue the idea. Shannon did not directly address that question, but he is trying to negotiate a business deal of sorts:
I have this thing, e.g an iPhone, if you want it you must pay me. I think that speeding CPython 50% is worth a few hundred iPhones.
Chris Angelico wondered about Shannon's plan for phase two, which is envisioned as a grab bag of fairly small improvements: "you're proposing a number of small tweaks in stage 2 and expecting that (combined) they can give a 50% improvement in overall performance?" But Shannon pointed out that compounding helps here: "20 tweaks each providing a 2% [speedup] is a 49% speedup."
Also, as D'Aprano noted, each phase stands on its own. No commitment would need to be made for phase two until phase one has been delivered (and paid for). Then, if Shannon were unable to achieve the 50% performance increase for phase two, as Angelico worries, he would not be paid for that work. There is some wiggle room in the proposal, D'Aprano pointed out, so that getting a 49% increase would still result in payment, perhaps on "some sort of sliding scale".
Angelico also asked if the $2 million might be better spent on improving PyPy. Shannon said that there is a difference in what partial success would mean for the different projects. The main hurdle for PyPy as a replacement for CPython is getting the C extensions to work; a partial solution to that problem only goes so far:
Partial success in getting PyPy to support C extensions well and perform well when it currently does, is much less valuable.
CPython that is "only" 2 or 3 times faster is a major improvement, but a PyPy that supports 80% of the C extensions that it currently does not is still not a replacement for CPython.
Paul Moore was generally in favor of the proposal if someone could be found to fund it. But it is important to ensure that the usual development workflow, including code review, pull requests, and so forth, is going to be followed, he said. While Shannon did not directly reply to that message, elsewhere in the thread he made it clear that he plans to handle these changes in the normal fashion. In fact, both he and Moore said that some of the money might be used to pay for code review.
Gregory P. Smith was favorably inclined toward the idea as well. Shannon is effectively proposing a path toward bringing ideas from his earlier HotPy (2) project to CPython "with an intent to help maintain them over the long term", Smith said. He believes the terms are quite reasonable, with regard to both the money and the risk protections; it is a good fit for CPython.
The Python steering council also weighed in. Thomas Wouters posted a lengthy reply in early November that relayed some of the council discussion of the idea. The council is not opposed to the idea at all, but does caution that there are a lot of details that need to be worked out before it could make a serious effort to find an entity or entities to fund phase one. There is also a question of prioritization:
And it may not be immediately obvious from Mark's plans, but as far as we can tell, the proposal is for speeding up pure-Python code. It will do little for code that is hampered, speed-wise, by CPython's object model, or threading model, or the C API. We have no idea how much this will actually matter to users. Making pure-Python code execution faster is always welcome, but it depends on the price. It may not be a good place to spend $500k or more, and it may even not be considered worth the implementation complexity.
Shannon replied that he had not yet made a proposal to the council but was glad to get the feedback. He agreed that maintenance is an important priority, which is why he included money for it in his proposal, but said that it will be easier to justify performance improvements to potential funding entities.
He also said that the total amount of code being added to CPython is not that large and that he hopes to offset any complexities in the implementation with better organization of the code base.
It is an ambitious proposal that was generally met with approval, even if the "show me the money" funding model is somewhat unorthodox. It is not entirely clear where things go from here, but it would seem a bit sad to miss out on an opportunity to make such a large difference in CPython performance. For many Python-using companies, $500,000 (or even $2 million) is not much of a stretch financially—gathering a few of them together should make that even easier. With luck, some discussions are already underway.
Index entries for this article
Python: CPython
Python: Performance
Posted Dec 16, 2020 21:09 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (33 responses)
Posted Dec 16, 2020 22:09 UTC (Wed)
by oliwarner (subscriber, #81320)
[Link]
Posted Dec 16, 2020 22:29 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (28 responses)
1. It's actually really hard to do. You would need to introduce lots of fine-grained locking absolutely everywhere, and do a lot of complicated reasoning about thread safety and which bits of the interpreter need to interact with which other bits.
2. Fine-grained locks are (usually) slower than big locks for single-threaded performance, particularly if the big lock can be turned off in single-threaded mode (as is the case with CPython).
3. Much existing Python code intentionally avoids depending on multi-threading, so single-threaded performance is arguably more important in the short term.
4. Much existing Python code tacitly assumes that only one bytecode instruction is executed at a time. For example, you can take a shallow copy of a list with x[:], and under the current version of CPython, this is atomic with respect to concurrent modifications of the list (it's ultimately a single BINARY_SUBSCR bytecode instruction, and the GIL must be held while this bytecode instruction is executed). As a result, your fine-grained locking needs to be pretty coarse-grained in practice, or else you need to break backwards compatibility all over the place.
5. Amdahl's law places a sharp upper bound on possible benefits from multithreading. You cannot simply divide the runtime of a program by the number of cores; the specific algorithm used must be amenable to concurrency.
6. Between the multiprocessing module, concurrent.futures.ProcessPoolExecutor, and C extensions, the GIL can (for most purposes) be worked around without an unreasonable level of effort.
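Purely as an illustration of the last point, here is a minimal sketch of working around the GIL with the standard library; the workload function is invented for the example:

    # Sketch: sidestepping the GIL for CPU-bound work by using processes.
    # Each worker process has its own interpreter (and its own GIL), so
    # the calls below can run in parallel across cores.
    from concurrent.futures import ProcessPoolExecutor

    def cpu_bound(n):                  # stand-in for real per-item work
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(cpu_bound, [2_000_000] * 8))
        print(results[0])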
Posted Dec 17, 2020 0:43 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (9 responses)
I can't even spin up two in-process interpreters to try and do per-thread interpreters to do some local Python code processing. If Python3 had gone and added a `pyctx` call to every API call so that the global state could at least be /scoped/ in some way, that'd have at least allowed someone to go and try a GIL-less context a lot easier.
Posted Dec 17, 2020 1:02 UTC (Thu)
by atnot (subscriber, #124910)
[Link] (2 responses)
Posted Dec 17, 2020 2:30 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Dec 17, 2020 16:28 UTC (Thu)
by ms-tg (subscriber, #89231)
[Link]
In the Ruby concept as I understand it today, only “frozen” (fully immutable) objects will be possible to send across ractors (which are like subinterpreters) in place - any mutable data must be essentially cloned to send to another ractor.
I wonder if Python's and Ruby's parallel evolution here will lead to a similar new focus on immutable structures in both languages?
Posted Dec 17, 2020 11:57 UTC (Thu)
by renejsum (guest, #124634)
[Link] (3 responses)
Posted Dec 17, 2020 12:52 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
I do see https://doc.pypy.org/en/latest/embedding.html but it being marked as deprecated without a link to a replacement isn't promising…
Even then, distribution of an embedded Python environment is a nightmare with CPython, but it's a (mostly) known one at this point. Unless PyPy makes that vastly easier, I don't want to have to slay another dragon :/ .
Posted Dec 24, 2020 10:13 UTC (Thu)
by njs (subscriber, #40338)
[Link] (1 responses)
Posted Dec 26, 2020 0:14 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link]
Posted Dec 17, 2020 13:40 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (1 responses)
John Ousterhout's Tcl had that figured out in the early 1990s. Those were the days.
Posted Dec 22, 2020 23:12 UTC (Tue)
by nkuitse (guest, #62915)
[Link]
Posted Dec 17, 2020 7:17 UTC (Thu)
by jayalane (subscriber, #133964)
[Link] (3 responses)
Posted Dec 17, 2020 22:03 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
* It is not impossible to write a parallel implementation of Bellman-Ford, as Google will happily tell you. But I've looked at several of the results, and I still only have a very rough idea of what the resulting algorithm would look like or how it would perform. The clearest result I could find seems to say that the serial component is at least O(|V|) in time complexity, but I'm having a hard time following the synchronization primitives which it describes, so I might be misinterpreting something. But it's definitely not an embarrassingly parallel algorithm.
Posted Dec 18, 2020 1:26 UTC (Fri)
by excors (subscriber, #95769)
[Link]
I guess it may look like the BGP routing protocol, since that's basically parallel Bellman-Ford. Each processor (i.e. router) is responsible for a single vertex and its associated edges, and the 'relax' step runs in parallel across all the processors. But instead of relaxing in lock-step |V| times, they all relax asynchronously and repeatedly and never explicitly stop, which makes it trickier to calculate bounds on convergence time.
Unfortunately BGP doesn't guarantee it will ever converge, because its metric is much more complex than just shortest-paths, and it can (and sometimes does) oscillate forever. If you simplify it enough to guarantee convergence, apparently it can still require O(|V|!) messages across the network before converging (https://dl.acm.org/doi/pdf/10.1145/347059.347428).
Posted Dec 24, 2020 15:55 UTC (Thu)
by ejr (subscriber, #51652)
[Link]
V. T. Chakaravarthy, F. Checconi, F. Petrini and Y. Sabharwal, "Scalable Single Source Shortest Path Algorithms for Massively Parallel Systems," 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, AZ, 2014, pp. 889-901, doi: 10.1109/IPDPS.2014.96.
A search through Google Scholar or something similar likely will turn up an open copy.
Posted Dec 17, 2020 11:44 UTC (Thu)
by k3ninho (subscriber, #50375)
[Link] (3 responses)
I think that the V8 js engine challenges Amdahl's Law. V8 is in Chromium (and derived web browsers), the Node.js server-side execution environment, and more. The use-case is focused on high network latency relative to local data access rates and local processing time, so V8 races to idle with as many parts executed in parallel as possible.
This is all in aid of getting a web page in front of a user with as little delay as possible between having data to work on and that work being complete. The up-front costs of parcelling up the work are paid as 'cost of doing business' for V8 and so the wide core counts, large memory spaces and billions of available instructions per second get used to arrive at an end result as fast as possible.
The optimisation problem that is constrained by Amdahl's Law contains the assumption that the algorithm could not be stated more efficiently before you spread out to parallel workers. This doesn't apply for the web ecosystem -- the web page and its bundled libraries are inefficient and the network latency is the worst aspect of the delay to the user experience.
This teaches us that we might harvest faster computation with enhanced parallelism.
K3n.
Posted Dec 17, 2020 20:10 UTC (Thu)
by miquels (guest, #59247)
[Link] (2 responses)
V8 might run the various JIT engines on separate threads in order to compile the program and get it to start as soon as possible with the least parse/compile time delay, but the actual javascript program is single-threaded and runs on 1 CPU core only.
Posted Dec 21, 2020 12:46 UTC (Mon)
by k3ninho (subscriber, #50375)
[Link] (1 responses)
Not in my experience: a bunch of program code at the same level of the callback hierarchy gets executed out-of-order by Node and the V8 engine such that the function returns part-calculated data before critical calculations are done.
It might be true that the event loop is single-threaded (funny how this is in a comment thread on the Python GIL and program sequencing), but once events are enqueued, program execution is as parallel as it can be in a race to idle such that the callback hierarchy is one pattern for ensuring that program flow is followed (and promises or async/await are the others) across many execution threads.
K3n.
Posted Feb 15, 2021 18:15 UTC (Mon)
by ksandstr (guest, #60862)
[Link]
tl;dr -- I don't suppose you have a benchmark we could use to replicate your experience, and/or the basis for your thoughts?
[0] "implied transactions", anyone? Grand f##kery right thur!
Posted Dec 17, 2020 11:55 UTC (Thu)
by renejsum (guest, #124634)
[Link] (6 responses)
Machine: MacBook Pro i9 8 cores + 8 hyper threads
Cpython: Uses one core 50-60%
Jython: Uses all 8+8 cores at 100% (and is much faster)
Jython uses fine grained locking, so it seems possible to do. I understand that WAY more money has been thrown at the development of the JavaVM than CPython, and that is probably the biggest difference.
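A minimal sketch of the kind of benchmark that shows the difference (CPU-bound work spread across threads) might look like the following; the workload and thread count are arbitrary:

    # Toy CPU-bound workload run in several threads.  Under CPython the
    # GIL keeps these threads on roughly one core; on Jython (or any
    # GIL-free implementation) the same code can use all cores.
    import threading, time

    def burn(n=5_000_000):
        total = 0
        for i in range(n):
            total += i * i
        return total

    threads = [threading.Thread(target=burn) for _ in range(8)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"8 threads took {time.perf_counter() - start:.2f}s")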
Posted Dec 17, 2020 12:06 UTC (Thu)
by pradyunsg (guest, #116784)
[Link]
Posted Dec 17, 2020 16:43 UTC (Thu)
by mb (subscriber, #50428)
[Link] (3 responses)
> Jython uses fine grained locking, so it seems possible to do.
Yes. But the single thread performance of Jython is much slower than CPython. That's the price you pay.
I'm wondering whether this could be made selectable at runtime by the user. Not by cluttering up the code flow with if-then-else, because that would also kill performance (caching). But by runtime patching of the interpreter. Just like Linux does for so many things (e.g. tracepoints). So if the user wants multicore, patch in all those fine grained locks. If the user wants full single core performance, remove all those fine grained locks (replace them with NOPs, which have virtually no runtime).
Posted Dec 17, 2020 21:56 UTC (Thu)
by renejsum (guest, #124634)
[Link] (2 responses)
The JIT in the JVM makes Jython faster in most cases. Only startup time is a bit slower...
Posted Dec 18, 2020 0:45 UTC (Fri)
by mb (subscriber, #50428)
[Link]
That's not what I actually noticed with Jython 2.7 some time ago.
2.7 was much slower in single thread performance than CPython 2.7 in my tests. I did not test any further 3.x development of Jython, because it's basically stuck at 2.7. So it's essentially dead from a practical point of view.
Posted Dec 18, 2020 10:09 UTC (Fri)
by LtWorf (subscriber, #124958)
[Link]
Posted Dec 19, 2020 8:34 UTC (Sat)
by jond (subscriber, #37669)
[Link]
Posted Feb 15, 2021 18:11 UTC (Mon)
by ksandstr (guest, #60862)
[Link] (2 responses)
Python's GIL is what happened when people who didn't properly know how went ahead and multithreaded something anyway. (BKL, anyone? It's good for Linus, it's good for Guido?) The counterargument involving fine-grained locking is similarly silly: consumers of a given data item can easily be changed over to a different static lock of lesser scope that, when it's necessary to take both, is dominated by the GIL. If it were only about the GIL, it'd only take a properly-designed locking hierarchy to enable concurrency between threads accessing unrelated data.
So it's not hard to do. You just need to know how, and the "what" is nearly anything that doesn't do pingpong for every instruction.
What is hard to do is to convert existing multithreaded Python code to a brave new world where individual instructions aren't atomic under the GIL. One might even suggest that a reasonable substitute for the GIL would be a userspace threading library since those can be hugely efficient (even more so on recent cores), preempt predictably[0], require next to no extra kernel interaction, and achieve dramatically higher throughputs when less processing is done to data arriving through a multiplexed set of channels than kernel-side thread switching (and worse, inter-core ping-pong!) costs; and these gains keep increasing as cores multiply.
[0] this is particularly appropriate for an interpreter, which can just call "pth_yield()" or something every 5000 instructions or so. This produces significant cache and scheduling advantages.
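CPython's existing scheduling knob is in a similar spirit, although it is time-based rather than instruction-count-based; the interval at which a running thread is asked to give up the GIL can be inspected and changed at runtime:

    import sys
    print(sys.getswitchinterval())   # 0.005 seconds by default
    sys.setswitchinterval(0.001)     # request more frequent GIL handoffs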
Posted Feb 15, 2021 18:17 UTC (Mon)
by corbet (editor, #1)
[Link] (1 responses)
Many of us have spent a lot of time complaining about the big kernel lock — and doing the work to get rid of it as well. That said, one shouldn't just poke at it this way. The BKL had a specific and well-defined purpose: allow most of the code in the kernel to continue to run as if it were on a uniprocessor machine, with all of the associated assumptions intact. It was, I believe, the only way to break the problem down into tractable steps.
Posted Feb 15, 2021 18:39 UTC (Mon)
by ksandstr (guest, #60862)
[Link]
The BKL was quick and nasty, and that didn't matter when the target hardware was two to four 386 on some crazy board with a bespoke multiprocessing architecture. Contrast with the GIL, which exists to push I/O multiplexing into the kernel and exploit concurrency in some FFI calls while (historically) avoiding concurrent reference counting misery.
Replacing the BKL is a problem involving primitives provided by the fixed processor architecture, which is necessarily unforgiving. On the other hand, replacing the GIL could be as simple as switching Python contexts in the interpreter, all on the same stack, and letting epoll (which didn't exist in 1996) handle the kernel part while FFI calls interact with their own concurrency models as they would in any other program.
Linus is not the goose, and Guido is not the gander. Heck, one of them isn't even warm-blooded.
Posted Dec 17, 2020 9:15 UTC (Thu)
by quietbritishjim (subscriber, #114117)
[Link] (2 responses)
In my experience, many Python programs use C extension modules heavily anyway, especially those that are CPU intensive (which are the only ones you probably multithread anyway, apart from those using blocking IO which also releases the GIL of course). Note that this includes not only third-party libraries, like numpy (as mentioned), scipy, OpenCV, etc., but also a lot of the CPU-heavy standard library modules e.g. zipfile (the compression code, not just file IO) and sqlite3 (again, the whole lot, not just IO). It might seem to defeat the point of Python a bit from the library writer's perspective, because they still need to write their code in C (or Cython) to be useful in this way, but it makes a lot of sense from the application writer's perspective: you only need to write a small amount of Python glue code to take advantage of other people's C code.
Where this falls down is in the granularity of the calls. If you're multiplying a small number of huge matrices from multiple threads then the GIL is just irrelevant. If you're multiplying a very large number of tiny matrices from multiple threads then it could become more of an issue.
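To make the granularity point concrete, here is a small sketch (not from the comment) using zlib, which releases the GIL while it compresses; with a few large buffers the threads can genuinely overlap, while with very many tiny buffers the per-call overhead, which happens under the GIL, dominates:

    # Sketch: compressing independent buffers from multiple threads.
    import os, zlib
    from concurrent.futures import ThreadPoolExecutor

    large = [os.urandom(4 * 1024 * 1024) for _ in range(8)]   # few, big
    tiny = [os.urandom(64) for _ in range(100_000)]           # many, small

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(zlib.compress, large))   # scales across cores
        list(pool.map(zlib.compress, tiny))    # mostly serialized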
Posted Dec 17, 2020 9:44 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
This works fine. But then you want to add an in-memory cache and that's where the problems start. If each process gets its own cache, then you have a massive RAM waste. And that's the only easy option (before you go off into using Redis or memcached with all the associated issues).
It's not really any better for a lot of computational code as well. In my company we have Python code that runs computations for each hour of the year, almost completely independently of each other. It was quite liberating to switch from Python to C++ and get true multi-threading, and a huge performance boost.
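One stock-library compromise for the shared-cache problem is a multiprocessing Manager, which hosts a single dictionary that all worker processes share; it avoids per-process copies at the price of IPC on every access. The cache-filling function here is invented for the example:

    # Sketch: one cache shared among worker processes via a Manager.
    # Every lookup and store goes through the manager process, so this is
    # far slower than local memory but is not duplicated per worker.
    from multiprocessing import Manager, Pool

    def expensive_compute(key):        # stand-in for the real computation
        return key * key

    def work(args):
        key, cache = args
        if key not in cache:           # benign race: worst case, recompute
            cache[key] = expensive_compute(key)
        return cache[key]

    if __name__ == "__main__":
        with Manager() as manager:
            cache = manager.dict()
            with Pool(4) as pool:
                pool.map(work, [(i % 10, cache) for i in range(100)])
            print(len(cache), "distinct entries cached")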
Posted Dec 17, 2020 19:20 UTC (Thu)
by Bluehorn (subscriber, #17484)
[Link]
I have fallen for this argument. Unfortunately.
This has been the argument for the GIL for as long as I can remember: Just optimize the inner loop in something like C++. This works if you have a tool that does some costly computation (solving of linear equations, FFT, scene rendering, ray tracing) at the core with some Python code doing the stuff around that is not time critical.
That is definitely a big domain and includes a lot of scientific computing (where Python seems to shine).
Unfortunately I maintain an application written in Python that does not have such a core but deals with lots and lots of small objects representing parts of a FE-model. And it is slow. When we started, we had small datasets so we did not worry too much - we'll just rewrite the relevant implementation details in C.
Only after I hit the performance wall did I notice that I basically had to rewrite most of the classes in C to replace the slow Python code. But since everything is held together by Python collections, manipulating those objects would either require holding the GIL or doing our own locking. Also, we would lose our database integration (which is using SQLAlchemy, no idea how to adapt this to C++ objects).
So we tried to use caching to avoid repeated computations. Which means we spent a lot of development time in fixing cache invalidation problems. Still the system was slow and customers (with 24 core systems) told us that they can't understand how the software performs that bad while not even loading the system. Just parallelize...
As rewriting in another language was out of the question, we implemented multi process concurrency. At last this improved things a bit (especially the user interface stays responsive even if the system is doing something), but at a hefty price. multiprocessing was unusable because this is a GUI application. Using fork wrecks the X11 connection, and the forkserver approach was not available in Python 2 when we started. So we wrote our own multi process IPC library.
I still love Python for writing scripts, the language is extremely expressive. But I am now wary about using it for applications that have to do simple actions on many diverse and small objects.
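On Python 3, the stock multiprocessing module can avoid the fork-breaks-the-GUI problem described above by using the forkserver (or spawn) start method, so that workers never inherit the X11 connection; a minimal sketch, with an invented worker function:

    # Sketch: using the "forkserver" start method so worker processes are
    # not forked from the GUI process and do not inherit its display state.
    # ("spawn" behaves similarly and also works on Windows and macOS.)
    import multiprocessing as mp

    def crunch(part):
        return sum(x * x for x in part)    # stand-in for per-object work

    if __name__ == "__main__":
        ctx = mp.get_context("forkserver")  # or "spawn"
        with ctx.Pool() as pool:
            print(sum(pool.map(crunch, [range(100_000)] * 8)))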
Posted Dec 17, 2020 11:38 UTC (Thu)
by norphonic (guest, #93563)
[Link] (1 responses)
That's a really oblique way of saying 'unoptimized'.
Posted Dec 17, 2020 17:11 UTC (Thu)
by randomguy3 (subscriber, #71063)
[Link]
Posted Dec 22, 2020 12:11 UTC (Tue)
by georgm (subscriber, #19574)
[Link] (1 responses)
real 0m2.023s
user 0m1.561s
sys 0m0.120s
For a simple dbus call + response, the rest is just loading stuff from files.
Ok, machine is not the newest (800MHz AMD G-T40E), but anyway...
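For small scripts like this, much of the wall-clock time tends to be interpreter startup and module imports rather than the D-Bus round trip itself; on Python 3.7 and later the import cost can be broken down with (script name is a placeholder):

    python3 -X importtime the_script.py    # per-module import times on stderr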
Posted Jan 2, 2021 5:26 UTC (Sat)
by wtarreau (subscriber, #51152)
[Link]