Speeding up CPython
Python, at least in the CPython reference implementation, is not a particularly speedy language. That is not at all surprising to anyone who has used it—the language is optimized for understandability and development speed, instead. There have been lots of efforts over the years to speed up various parts of the interpreter, compiler, and virtual-machine bytecode execution, though no comprehensive overhaul has been merged into CPython. An interesting new proposal could perhaps change that, though it is unclear at this point if it will take off.
Mark Shannon posted a message about his ideas to the python-dev mailing list in late October. He noted that CPython is slow, "yet little is done to fix it"; he would like to change that. He has a four-phase plan to get there on the technical side, along with some ideas on how to fund the effort. Beyond that, he had some thoughts on what was different from earlier efforts:
Here are three reasons:
- I already have working code for the first stage.
- I'm not promising a silver bullet. I recognize that this is a substantial amount of work and needs funding.
- I have extensive experience in VM [virtual machine] implementation, not to mention a PhD in the subject.
He pointed those interested to a GitHub repository where he had placed a few documents describing the plan. The overall idea is for each phase to increase CPython performance by 50%, which compounds to roughly a 5x speedup at the end of all four phases. The intent is for these improvements to apply widely: "Performance should improve for all code, not just loop-heavy and numeric code." He described a tiered execution model that would separate code into tiers based on how frequently it runs; each tier would be handled differently:
It is, of course, impossible to know in general which parts of a program will be "hot", that is run very frequently, and which parts will be "cold", that is [never] run or run only once. Not only that, but there is not simple distinction between hot and cold, but a range. Much code is "warm", that is run reasonably often, but considerably less than the hottest part of the program.
To address this range of runtime characteristics, several tiers of execution should be employed. We will number these 0 to 3. A minimum viable implementation would consist of tiers 0 and 1. Tier 2 can then be added later for improved performance. Tier 3 can then be added for better performance of longer running programs.
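The plan's documents spell out what each tier would actually do; purely as an illustration of the counter-based promotion that adaptive interpreters of this kind typically use, here is a toy sketch in Python (the thresholds, class, and function names are invented for the example, not taken from the proposal):

    # Toy illustration of tiered execution: a code object starts "cold"
    # (tier 0) and is promoted to warmer tiers as its call count crosses
    # thresholds.  Real tiers would swap in specialized bytecode, gather
    # type information, or JIT-compile; here promotion just prints.
    TIER_THRESHOLDS = {1: 10, 2: 1_000, 3: 100_000}   # invented values

    class TieredCode:
        def __init__(self, func):
            self.func = func      # the tier-0 (plain interpretation) path
            self.calls = 0
            self.tier = 0

        def __call__(self, *args, **kwargs):
            self.calls += 1
            for tier, threshold in TIER_THRESHOLDS.items():
                if self.calls == threshold:
                    self.tier = tier
                    print(f"promoting {self.func.__name__} to tier {tier}")
            return self.func(*args, **kwargs)

    @TieredCode
    def hot_loop(n):
        return sum(i * i for i in range(n))

    for _ in range(20):       # crosses the tier-1 threshold on call 10
        hot_loop(1_000)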
The funding plan is rather creative and different, but also seems to be drawing some scrutiny—and perhaps controversy. Each phase would require roughly $500,000; half of that would be used to pay for the development and the other half would be held by the organization that is coordinating the funding for use in the maintenance of CPython. He suggests the Python Software Foundation (PSF) as the obvious choice for that role.
In order to reduce the risks of a contractor for a specific phase being unable to deliver the specified performance increase, payment would only be made on delivery. For the first phase, Shannon has working code, so he is essentially suggesting that the PSF raise the funds to engage him as the contractor for that phase. But that approach raised the eyebrows of Steven D'Aprano, who asked what happens to the code if the PSF decides not to pursue the idea. Shannon did not directly address that question, but he is trying to negotiate a business deal of sorts:
I have this thing, e.g an iPhone, if you want it you must pay me. I think that speeding CPython 50% is worth a few hundred iPhones.
Chris Angelico wondered about Shannon's plan for phase two, which is envisioned as a grab bag of fairly small improvements: "you're proposing a number of small tweaks in stage 2 and expecting that (combined) they can give a 50% improvement in overall performance?" But Shannon pointed out that compounding helps here: "20 tweaks each providing a 2% [speedup] is a 49% speedup."
Also, as D'Aprano noted, each phase stands on its own. No commitment would need to be made for phase two until phase one has been delivered (and paid for). Then, if Shannon were unable to achieve the 50% performance increase for phase two, as Angelico worries, he would not be paid for that work. There is some wiggle room in the proposal, D'Aprano pointed out, so that getting a 49% increase would still result in payment, perhaps on "some sort of sliding scale".
Angelico also asked if the $2 million might be better spent on improving PyPy. Shannon said that there is a difference in what partial success would mean for the different projects. The main hurdle for PyPy as a replacement for CPython is getting the C extensions to work; a partial solution to that problem only goes so far:
Partial success in getting PyPy to support C extensions well and perform well when it currently does, is much less valuable.
CPython that is "only" 2 or 3 times faster is a major improvement, but a PyPy that supports 80% of the C extensions that it currently does not is still not a replacement for CPython.
Paul Moore was generally in favor of the proposal if someone could be found to fund it. But it is important to ensure that the usual development workflow, including code review, pull requests, and so forth, is going to be followed, he said. While Shannon did not directly reply to that message, elsewhere in the thread he made it clear that he plans to handle these changes in the normal fashion. In fact, both he and Moore said that some of the money might be used to pay for code review.
Gregory P. Smith was favorably inclined toward the idea as well. Shannon is effectively proposing a path toward bringing ideas from his earlier HotPy (2) project to CPython "with an intent to help maintain them over the long term", Smith said. He believes the terms are quite reasonable, with regard to both the money and the risk protections; it is a good fit for CPython.
The Python steering council also weighed in. Thomas Wouters posted a lengthy reply in early November that relayed some of the council discussion of the idea. The council is not opposed to the idea at all, but does caution that there are a lot of details that need to be worked out before it could make a serious effort to find an entity or entities to fund phase one. There is also a question of prioritization:
And it may not be immediately obvious from Mark's plans, but as far as we can tell, the proposal is for speeding up pure-Python code. It will do little for code that is hampered, speed-wise, by CPython's object model, or threading model, or the C API. We have no idea how much this will actually matter to users. Making pure-Python code execution faster is always welcome, but it depends on the price. It may not be a good place to spend $500k or more, and it may even not be considered worth the implementation complexity.
Shannon replied that he had not yet made a proposal to the council but was glad to get the feedback. He agreed that maintenance is an important priority, which is why he included money for it in his proposal, but said that it will be easier to justify performance improvements to potential funding entities.
He also said that the total amount of code being added to CPython is not that large and that he hopes to offset any complexities in the implementation with better organization of the code base.
It is an ambitious proposal that was generally met with approval, even if the "show me the money" funding model is somewhat unorthodox. It is not entirely clear where things go from here, but it would seem a bit sad to miss out on an opportunity to make such a large difference in CPython performance. For many Python-using companies, $500,000 (or even $2 million) is not much of a stretch financially—gathering a few of them together should make that even easier. With luck, some discussions are already underway.
Index entries for this article
Python: CPython
Python: Performance
Posted Dec 16, 2020 21:09 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (33 responses)
Posted Dec 16, 2020 22:09 UTC (Wed)
by oliwarner (subscriber, #81320)
[Link]
Posted Dec 16, 2020 22:29 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (28 responses)
1. It's actually really hard to do. You would need to introduce lots of fine-grained locking absolutely everywhere, and do a lot of complicated reasoning about thread safety and which bits of the interpreter need to interact with which other bits.
2. Fine-grained locks are (usually) slower than big locks for single-threaded performance, particularly if the big lock can be turned off in single-threaded mode (as is the case with CPython).
3. Much existing Python code intentionally avoids depending on multi-threading, so single-threaded performance is arguably more important in the short term.
4. Much existing Python code tacitly assumes that only one bytecode instruction is executed at a time. For example, you can take a shallow copy of a list with x[:], and under the current version of CPython, this is atomic with respect to concurrent modifications of the list (it's ultimately a single BINARY_SUBSCR bytecode instruction, and the GIL must be held while this bytecode instruction is executed). As a result, your fine-grained locking needs to be pretty coarse-grained in practice, or else you need to break backwards compatibility all over the place.
5. Amdahl's law places a sharp upper bound on possible benefits from multithreading. You cannot simply divide the runtime of a program by the number of cores; the specific algorithm used must be amenable to concurrency.
6. Between the multiprocessing module, concurrent.futures.ProcessPoolExecutor, and C extensions, the GIL can (for most purposes) be worked around without an unreasonable level of effort.
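Purely as an illustration of the last point, here is a minimal sketch of working around the GIL with the standard library; the workload function is invented for the example:

    # Sketch: sidestepping the GIL for CPU-bound work by using processes.
    # Each worker process has its own interpreter (and its own GIL), so
    # the calls below can run in parallel across cores.
    from concurrent.futures import ProcessPoolExecutor

    def cpu_bound(n):                  # stand-in for real per-item work
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(cpu_bound, [2_000_000] * 8))
        print(results[0])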
Posted Dec 17, 2020 0:43 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (9 responses)
I can't even spin up two in-process interpreters to try and do per-thread interpreters to do some local Python code processing. If Python3 had gone and added a `pyctx` call to every API call so that the global state could at least be /scoped/ in some way, that'd have at least allowed someone to go and try a GIL-less context a lot easier.
Posted Dec 17, 2020 1:02 UTC (Thu)
by atnot (subscriber, #124910)
[Link] (2 responses)
Posted Dec 17, 2020 2:30 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Dec 17, 2020 16:28 UTC (Thu)
by ms-tg (subscriber, #89231)
[Link]
In the Ruby concept as I understand it today, only “frozen” (fully immutable) objects will be possible to send across ractors (which are like subinterpreters) in place - any mutable data must be essentially cloned to send to another ractor.
I wonder if Python's and Ruby's parallel evolution here will lead to a similar new focus on immutable structures in both languages?
Posted Dec 17, 2020 11:57 UTC (Thu)
by renejsum (guest, #124634)
[Link] (3 responses)
Posted Dec 17, 2020 12:52 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
I do see https://doc.pypy.org/en/latest/embedding.html but it being marked as deprecated without a link to a replacement isn't promising…
Even then, distribution of an embedded Python environment is a nightmare with CPython, but it's a (mostly) known one at this point. Unless PyPy makes that vastly easier, I don't want to have to slay another dragon :/ .
Posted Dec 24, 2020 10:13 UTC (Thu)
by njs (subscriber, #40338)
[Link] (1 responses)
Posted Dec 26, 2020 0:14 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link]
Posted Dec 17, 2020 13:40 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (1 responses)
John Ousterhout's Tcl had that figured out in the early 1990s. Those were the days.
Posted Dec 22, 2020 23:12 UTC (Tue)
by nkuitse (guest, #62915)
[Link]
Posted Dec 17, 2020 7:17 UTC (Thu)
by jayalane (subscriber, #133964)
[Link] (3 responses)
Posted Dec 17, 2020 22:03 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
* It is not impossible to write a parallel implementation of Bellman-Ford, as Google will happily tell you. But I've looked at several of the results, and I still only have a very rough idea of what the resulting algorithm would look like or how it would perform. The clearest result I could find seems to say that the serial component is at least O(|V|) in time complexity, but I'm having a hard time following the synchronization primitives which it describes, so I might be misinterpreting something. But it's definitely not an embarrassingly parallel algorithm.
Posted Dec 18, 2020 1:26 UTC (Fri)
by excors (subscriber, #95769)
[Link]
I guess it may look like the BGP routing protocol, since that's basically parallel Bellman-Ford. Each processor (i.e. router) is responsible for a single vertex and its associated edges, and the 'relax' step runs in parallel across all the processors. But instead of relaxing in lock-step |V| times, they all relax asynchronously and repeatedly and never explicitly stop, which makes it trickier to calculate bounds on convergence time.
Unfortunately BGP doesn't guarantee it will ever converge, because its metric is much more complex than just shortest-paths, and it can (and sometimes does) oscillate forever. If you simplify it enough to guarantee convergence, apparently it can still require O(|V|!) messages across the network before converging (https://dl.acm.org/doi/pdf/10.1145/347059.347428).
Posted Dec 24, 2020 15:55 UTC (Thu)
by ejr (subscriber, #51652)
[Link]
V. T. Chakaravarthy, F. Checconi, F. Petrini and Y. Sabharwal, "Scalable Single Source Shortest Path Algorithms for Massively Parallel Systems," 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, AZ, 2014, pp. 889-901, doi: 10.1109/IPDPS.2014.96.
A search through Google Scholar or something similar likely will turn up an open copy.
Posted Dec 17, 2020 11:44 UTC (Thu)
by k3ninho (subscriber, #50375)
[Link] (3 responses)
I think that the V8 js engine challenges Amdahl's Law. V8 is in Chromium (and derived web browsers), the Node.js server-side execution environment, and more. The use-case is focused on high network latency relative to local data access rates and local processing time, so V8 races to idle with as many parts executed in parallel as possible.
This is all in aid of getting a web page in front of a user with as little delay as possible between having data to work on and that work being complete. The up-front costs of parcelling up the work are paid as 'cost of doing business' for V8 and so the wide core counts, large memory spaces and billions of available instructions per second get used to arrive at an end result as fast as possible.
The optimisation problem that is constrained by Amdahl's Law contains the assumption that the algorithm could not be stated more efficiently before you spread out to parallel workers. This doesn't apply for the web ecosystem -- the web page and its bundled libraries are inefficient and the network latency is the worst aspect of the delay to the user experience.
This teaches us that we might harvest faster computation with enhanced parallelism.
K3n.
Posted Dec 17, 2020 20:10 UTC (Thu)
by miquels (guest, #59247)
[Link] (2 responses)
V8 might run the various JIT engines on separate threads in order to compile the program and get it to start as soon as possible with the least parse/compile time delay, but the actual javascript program is single-threaded and runs on 1 CPU core only.
Posted Dec 21, 2020 12:46 UTC (Mon)
by k3ninho (subscriber, #50375)
[Link] (1 responses)
Not in my experience: a bunch of program code at the same level of the callback hierarchy gets executed out-of-order by Node and the V8 engine such that the function returns part-calculated data before critical calculations are done.
It might be true that the event loop is single-threaded (funny how this is in a comment thread on the Python GIL and program sequencing), but once events are enqueued, program execution is as parallel as it can be in a race to idle such that the callback hierarchy is one pattern for ensuring that program flow is followed (and promises or async/await are the others) across many execution threads.
K3n.
Posted Feb 15, 2021 18:15 UTC (Mon)
by ksandstr (guest, #60862)
[Link]
tl;dr -- I don't suppose you have a benchmark we could use to replicate your experience, and/or the basis for your thoughts?
[0] "implied transactions", anyone? Grand f##kery right thur!
Posted Dec 17, 2020 11:55 UTC (Thu)
by renejsum (guest, #124634)
[Link] (6 responses)
Machine: MacBook Pro i9 8 cores + 8 hyper threads
Cpython: Uses one core 50-60%
Jython: Uses all 8+8 cores at 100% (and is much faster)
Jython uses fine grained locking, so it seems possible to do. I understand that WAY more money has been thrown at the development of the JavaVM than CPython, and that is probably the biggest difference.
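A minimal sketch of the kind of benchmark that shows the difference (CPU-bound work spread across threads) might look like the following; the workload and thread count are arbitrary:

    # Toy CPU-bound workload run in several threads.  Under CPython the
    # GIL keeps these threads on roughly one core; on Jython (or any
    # GIL-free implementation) the same code can use all cores.
    import threading, time

    def burn(n=5_000_000):
        total = 0
        for i in range(n):
            total += i * i
        return total

    threads = [threading.Thread(target=burn) for _ in range(8)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"8 threads took {time.perf_counter() - start:.2f}s")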
Posted Dec 17, 2020 12:06 UTC (Thu)
by pradyunsg (guest, #116784)
[Link]
Posted Dec 17, 2020 16:43 UTC (Thu)
by mb (subscriber, #50428)
[Link] (3 responses)
> Jython uses fine grained locking, so it seems possible to do.
Yes. But the single thread performance of Jython is much slower than CPython. That's the price you pay.
I'm wondering whether this could be made selectable at runtime by the user. Not by cluttering up the code flow with if-then-else, because that would also kill performance (caching). But by runtime patching of the interpreter. Just like Linux does for so many things (e.g. tracepoints). So if the user wants multicore, patch in all those fine grained locks. If the user wants full single core performance, remove all those fine grained locks (replace them with NOPs, which have virtually no runtime).
Posted Dec 17, 2020 21:56 UTC (Thu)
by renejsum (guest, #124634)
[Link] (2 responses)
The JIT in the JVM makes Jython faster in most cases. Only startup time is a bit slower...
Posted Dec 18, 2020 0:45 UTC (Fri)
by mb (subscriber, #50428)
[Link]
That's not what I actually noticed with Jython 2.7 some time ago.
2.7 was much slower in single thread performance than CPython 2.7 in my tests. I did not test any further 3.x development of Jython, because it's basically stuck at 2.7. So it's essentially dead from a practical point of view.
Posted Dec 18, 2020 10:09 UTC (Fri)
by LtWorf (subscriber, #124958)
[Link]
Posted Dec 19, 2020 8:34 UTC (Sat)
by jond (subscriber, #37669)
[Link]
Posted Feb 15, 2021 18:11 UTC (Mon)
by ksandstr (guest, #60862)
[Link] (2 responses)
Python's GIL is what happened when people who didn't properly know how went ahead and multithreaded something anyway. (BKL, anyone? It's good for Linus, it's good for Guido?) The counterargument involving fine-grained locking is similarly silly: consumers of a given data item can easily be changed over to a different static lock of lesser scope that, when it's necessary to take both, is dominated by the GIL. If it were only about the GIL, it'd only take a properly-designed locking hierarchy to enable concurrency between threads accessing unrelated data.
So it's not hard to do. You just need to know how, and the "what" is nearly anything that doesn't do pingpong for every instruction.
What is hard to do is to convert existing multithreaded Python code to a brave new world where individual instructions aren't atomic under the GIL. One might even suggest that a reasonable substitute for the GIL would be a userspace threading library since those can be hugely efficient (even more so on recent cores), preempt predictably[0], require next to no extra kernel interaction, and achieve dramatically higher throughputs when less processing is done to data arriving through a multiplexed set of channels than kernel-side thread switching (and worse, inter-core ping-pong!) costs; and these gains keep increasing as cores multiply.
[0] this is particularly appropriate for an interpreter, which can just call "pth_yield()" or something every 5000 instructions or so. This produces significant cache and scheduling advantages.
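CPython's existing scheduling knob is in a similar spirit, although it is time-based rather than instruction-count-based; the interval at which a running thread is asked to give up the GIL can be inspected and changed at runtime:

    import sys
    print(sys.getswitchinterval())   # 0.005 seconds by default
    sys.setswitchinterval(0.001)     # request more frequent GIL handoffs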
Posted Feb 15, 2021 18:17 UTC (Mon)
by corbet (editor, #1)
[Link] (1 responses)
Many of us have spent a lot of time complaining about the big kernel lock — and doing the work to get rid of it as well. That said, one shouldn't just poke at it this way. The BKL had a specific and well-defined purpose: allow most of the code in the kernel to continue to run as if it were on a uniprocessor machine, with all of the associated assumptions intact. It was, I believe, the only way to break the problem down into tractable steps.
Posted Feb 15, 2021 18:39 UTC (Mon)
by ksandstr (guest, #60862)
[Link]
The BKL was quick and nasty, and that didn't matter when the target hardware was two to four 386 on some crazy board with a bespoke multiprocessing architecture. Contrast with the GIL, which exists to push I/O multiplexing into the kernel and exploit concurrency in some FFI calls while (historically) avoiding concurrent reference counting misery.
Replacing the BKL is a problem involving primitives provided by the fixed processor architecture, which is necessarily unforgiving. On the other hand, replacing the GIL could be as simple as switching Python contexts in the interpreter, all on the same stack, and letting epoll (which didn't exist in 1996) handle the kernel part while FFI calls interact with their own concurrency models as they would in any other program.
Linus is not the goose, and Guido is not the gander. Heck, one of them isn't even warm-blooded.
Posted Dec 17, 2020 9:15 UTC (Thu)
by quietbritishjim (subscriber, #114117)
[Link] (2 responses)
In my experience, many Python programs use C extension modules heavily anyway, especially those that are CPU intensive (which are the only ones you probably multithread anyway, apart from those using blocking IO which also releases the GIL of course). Note that this includes not only third-party libraries, like numpy (as mentioned), scipy, OpenCV, etc., but also a lot of the CPU-heavy standard library modules e.g. zipfile (the compression code, not just file IO) and sqlite3 (again, the whole lot, not just IO). It might seem to defeat the point of Python a bit from the library writer's perspective, because they still need to write their code in C (or Cython) to be useful in this way, but it makes a lot of sense from the application writer's perspective: you only need to write a small amount of Python glue code to take advantage of other people's C code.
Where this falls down is in the granularity of the calls. If you're multiplying a small number of huge matrices from multiple threads then the GIL is just irrelevant. If you're multiplying a very large number of tiny matrices from multiple threads then it could become more of an issue.
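To make the granularity point concrete, here is a small sketch (not from the comment) using zlib, which releases the GIL while it compresses; with a few large buffers the threads can genuinely overlap, while with very many tiny buffers the per-call overhead, which happens under the GIL, dominates:

    # Sketch: compressing independent buffers from multiple threads.
    import os, zlib
    from concurrent.futures import ThreadPoolExecutor

    large = [os.urandom(4 * 1024 * 1024) for _ in range(8)]   # few, big
    tiny = [os.urandom(64) for _ in range(100_000)]           # many, small

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(zlib.compress, large))   # scales across cores
        list(pool.map(zlib.compress, tiny))    # mostly serialized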
Posted Dec 17, 2020 9:44 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
This works fine. But then you want to add an in-memory cache and that's where the problems start. If each process gets its own cache, then you have a massive RAM waste. And that's the only easy option (before you go off into using Redis or memcached with all the associated issues).
It's not really any better for a lot of computational code as well. In my company we have Python code that runs computations for each hour of the year, almost completely independently of each other. It was quite liberating to switch from Python to C++ and get true multi-threading, and a huge performance boost.
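One stock-library compromise for the shared-cache problem is a multiprocessing Manager, which hosts a single dictionary that all worker processes share; it avoids per-process copies at the price of IPC on every access. The cache-filling function here is invented for the example:

    # Sketch: one cache shared among worker processes via a Manager.
    # Every lookup and store goes through the manager process, so this is
    # far slower than local memory but is not duplicated per worker.
    from multiprocessing import Manager, Pool

    def expensive_compute(key):        # stand-in for the real computation
        return key * key

    def work(args):
        key, cache = args
        if key not in cache:           # benign race: worst case, recompute
            cache[key] = expensive_compute(key)
        return cache[key]

    if __name__ == "__main__":
        with Manager() as manager:
            cache = manager.dict()
            with Pool(4) as pool:
                pool.map(work, [(i % 10, cache) for i in range(100)])
            print(len(cache), "distinct entries cached")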
Posted Dec 17, 2020 19:20 UTC (Thu)
by Bluehorn (subscriber, #17484)
[Link]
I have fallen for this argument. Unfortunately.
This has been the argument for the GIL for as long as I can remember: Just optimize the inner loop in something like C++. This works if you have a tool that does some costly computation (solving of linear equations, FFT, scene rendering, ray tracing) at the core with some Python code doing the stuff around that is not time critical.
That is definitely a big domain and includes a lot of scientific computing (where Python seems to shine).
Unfortunately I maintain an application written in Python that does not have such a core but deals with lots and lots of small objects representing parts of a FE-model. And it is slow. When we started, we had small datasets so we did not worry too much - we'll just rewrite the relevant implementation details in C.
Only after I hit the performance wall did I notice that I basically had to rewrite most of the classes in C to replace the slow Python code. But since everything is held together by Python collections, manipulating those objects would either require holding the GIL or doing our own locking. Also, we would lose our database integration (which is using SQLAlchemy, no idea how to adapt this to C++ objects).
So we tried to use caching to avoid repeated computations. Which means we spent a lot of development time in fixing cache invalidation problems. Still the system was slow and customers (with 24 core systems) told us that they can't understand how the software performs that bad while not even loading the system. Just parallelize...
As rewriting in another language was out of the question, we implemented multi process concurrency. At last this improved things a bit (especially the user interface stays responsive even if the system is doing something), but at a hefty price. multiprocessing was unusable because this is a GUI application. Using fork wrecks the X11 connection, and the forkserver approach was not available in Python 2 when we started. So we wrote our own multi process IPC library.
I still love Python for writing scripts, the language is extremely expressive. But I am now wary about using it for applications that have to do simple actions on many diverse and small objects.
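On Python 3, the stock multiprocessing module can avoid the fork-breaks-the-GUI problem described above by using the forkserver (or spawn) start method, so that workers never inherit the X11 connection; a minimal sketch, with an invented worker function:

    # Sketch: using the "forkserver" start method so worker processes are
    # not forked from the GUI process and do not inherit its display state.
    # ("spawn" behaves similarly and also works on Windows and macOS.)
    import multiprocessing as mp

    def crunch(part):
        return sum(x * x for x in part)    # stand-in for per-object work

    if __name__ == "__main__":
        ctx = mp.get_context("forkserver")  # or "spawn"
        with ctx.Pool() as pool:
            print(sum(pool.map(crunch, [range(100_000)] * 8)))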
Posted Dec 17, 2020 11:38 UTC (Thu)
by norphonic (guest, #93563)
[Link] (1 responses)
That's a really oblique way of saying 'unoptimized'.
Posted Dec 17, 2020 17:11 UTC (Thu)
by randomguy3 (subscriber, #71063)
[Link]
Posted Dec 22, 2020 12:11 UTC (Tue)
by georgm (subscriber, #19574)
[Link] (1 responses)
real 0m2.023s
user 0m1.561s
sys 0m0.120s
For a simple dbus call + response, the rest is just loading stuff from files.
Ok, machine is not the newest (800MHz AMD G-T40E), but anyway...
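For small scripts like this, much of the wall-clock time tends to be interpreter startup and module imports rather than the D-Bus round trip itself; on Python 3.7 and later the import cost can be broken down with (script name is a placeholder):

    python3 -X importtime the_script.py    # per-module import times on stderr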
Posted Jan 2, 2021 5:26 UTC (Sat)
by wtarreau (subscriber, #51152)
[Link]