A better story for multi-core Python
Running the standard Python interpreter in threads on multiple CPU cores has always resulted in a smaller performance gain than one might naively think—or hope for. Because of the CPython global interpreter lock (GIL), only one thread of execution can be running in the interpreter core at any given time. Removing the GIL has long been a topic of discussion in Python circles, and various alternative Python implementations have either removed or worked around the GIL. A recent discussion on the python-ideas mailing list looked at a different approach to providing a better multi-core story for Python.
In a post that was optimistically titled "solving multi-core Python", Eric Snow outlined an approach that does not rely on removing the GIL, but instead relies on "subinterpreters" and a mechanism to share objects between them. The multi-core problem is partly a public relations problem for the language, Snow said, but it needs solving for that and other, more technical reasons.
Subinterpreters
The basic principle behind Snow's proposal is to take the existing subinterpreter support and to expose it in the Python language as a concurrency mechanism. The subinterpreters would run in separate threads, but would not generally share data with each other, at least implicitly, unlike the typical use of threads. Data would only be exchanged explicitly via channels (similar to those in the Go language). One of the main influences for Snow's thoughts (and for Go's concurrency model) is Tony Hoare's "Communicating Sequential Processes".
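No such API existed in CPython at the time, so the following is a purely hypothetical sketch meant only to make the model concrete; the module name interpreters and the names Channel, create(), and run() are illustrative assumptions, not anything Snow had specified:

# Hypothetical sketch only: an imagined subinterpreter/channel API in the
# spirit of Snow's proposal (and of Go channels). None of these names are
# real; they exist only to illustrate the programming model.
import interpreters  # hypothetical module, not part of the standard library

def worker(channel):
    # Runs in its own subinterpreter with no implicitly shared state;
    # results are handed back explicitly over the channel.
    channel.send(sum(i * i for i in range(1_000_000)))

channel = interpreters.Channel()   # explicit communication only
interp = interpreters.create()     # a new subinterpreter in its own thread
interp.run(worker, channel)        # execute worker() there
print(channel.recv())              # receive the result in the main interpreter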
Handling objects shared between subinterpreters is one of the areas that requires more thought, Snow said. One way forward might be to only allow immutable objects to be shared between the subinterpreters. In order to do that, though, it probably makes sense to move the reference counts (used for garbage collection) out of the objects themselves and into a separate table. That would allow the objects themselves to be truly unchanging, which could also help performance in the multi-processing (i.e. fork()) case by avoiding page copies (via copy-on-write) of objects that are simply being referenced again, as Nick Coghlan pointed out.
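A small illustration (added here, not from the thread) of why refcounts stored inside objects defeat copy-on-write sharing: simply taking another reference to an object writes to that object's header, which would dirty the memory page a forked child otherwise shares with its parent.

import sys

data = tuple(range(1000))        # an otherwise immutable object
print(sys.getrefcount(data))     # e.g. 2
alias = data                     # no element is copied or changed, but...
print(sys.getrefcount(data))     # ...the count stored in the object header
                                 # was incremented, i.e. the object was written to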
Another area that needs to be considered is what the restrictions on subinterpreters would be. If, for example, subinterpreters were not allowed to start new threads, they would be single-threaded and not require a GIL. Or the GIL for subinterpreters could be replaced with a "local interpreter lock", with the main GIL used in the main interpreter and to mediate interaction between subinterpreters. There is also a question about using fork() in subinterpreters. In the initial email, Snow suggested disallowing that, but in the discussion that followed, he seemed to rethink that.
The proposal is clearly a kind of early-stage "request for comment" (or "a shot over the bow" as Snow put it), but it did spark quite a bit of discussion and some fairly favorable comments. Yury Selivanov was quite interested in the idea, for example, noting that just being able to share immutable objects would be useful.
Concerns
But Gregory Smith was concerned about the impact of each subinterpreter needing to re-import all of the modules used by the main interpreter, since those would not be shared. That would reduce the effectiveness of Snow's model. On the other hand, though, Smith sees a potential upside as well: "I think a result of it could be to make our subinterpreter support better which would be a good thing." Several suggestions were made for ways to speed up the startup time for subinterpreters or to share more state (such as modules) between the interpreters.
Several in the thread believed that the existing fork()-based concurrency was the right way forward, at least for POSIX systems; Devin Jeanpierre, for example, argued for its benefits.

While fork() does provide those benefits, it is only available on POSIX systems. It is also different from Snow's goal, which is "to make it obvious and undeniable that Python (3.6+) has a good multi-core story", which is partly a matter of public perception. The subinterpreter idea is just a means to that end, he said, and he would be happy to see a different solution if it fulfilled that goal. In the meantime, though, his proposal has some characteristics that multi-processing with fork() lacks.
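For reference, here is a minimal sketch (added for illustration, not from the thread) of the existing fork()-based approach being discussed; on Linux, multiprocessing forks worker processes by default, so CPU-bound work truly runs on multiple cores, at the cost of pickling arguments and results between processes.

from multiprocessing import Pool

def crunch(n):
    # CPU-bound work; each worker process has its own interpreter and GIL.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:            # four forked worker processes
        print(pool.map(crunch, [10**6] * 8))   # runs on multiple cores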
But Sturla Molden pointed to the lack of fork() for Windows as one of the real reasons behind Snow's proposal: "It then boils down to a workaround for the fact that Windows cannot fork, which makes it particularly bad for running CPython". But, as Snow said, Python cannot ignore Windows. Beyond that, though, even with the "superior" fork() solution available, the perception of multi-core Python is much different.
Molden replied with a long list of answers to the "FUD" that is promulgated about Python and the GIL, but that doesn't really change anything. That is why Snow's goal is to make multi-core support "obvious and undeniable". It also seems that Molden is coming from a scientific/numeric Python background, which is not generally where the complaints about Python's multi-core support originate, as Coghlan noted.
Shared data
The reasoning behind restricting the data shared between interpreters to immutable types (at least initially) can be seen from a question asked by Nathaniel Smith. He wondered how two subinterpreters could share a complicated data structure containing several different types of Python objects. Snow acknowledged that concern, and suggested avoiding the "trickiness involved" in handling that kind of data by sticking to immutable objects; there may be "some sort of support for mutable objects" added later, he said.
Coghlan summarized Snow's proposal as really being three separate things:
- Filing off enough of the rough edges of the subinterpreter support that we're comfortable giving them a public Python level API that other interpreter implementations can reasonably support
- Providing the primitives needed for safe and efficient message passing between subinterpreters
- Allowing subinterpreters to truly execute in parallel on multicore machines
All 3 of those are useful enhancements in their own right, which offers the prospect of being able to make incremental progress towards the ultimate goal of native Python level support for distributing across multiple cores within a single process.
In addition, Coghlan has published a summary of the state of multi-core Python that looks at the problem along with alternatives and possible solutions. It is an update from an earlier entry in his Python 3 Q&A and is well worth a read to get the background on the issues.
There seems to be enough interest in Snow's proposal that it could be on the radar for Python 3.6 (which is roughly 18 months off). There is a long road before that happens, though. A PEP will have to be written—as will a good bit of code. We also have yet to see what Guido van Rossum's thoughts on the whole idea are, though Snow did mention some discussions with Python's benevolent dictator for life in his initial post. As Nathaniel Smith put it, Snow's approach seems like the "least impossible" one. That is not the same as "possible", of course, but seems hopeful at least.
Posted Jul 10, 2015 14:49 UTC (Fri)
by Kwi (subscriber, #59584)
Calling the GIL a public relations problem is exactly right. It's more a PR problem than a technical problem, anyway. If you want concurrency in your Python application, you have a number of options.

"If you want your code to run faster, you should probably just use PyPy." — Guido van Rossum

Would it be easier for everybody if we could all just use threads and not worry about the GIL? Yes. But performance is rarely easy - not in Python, nor in any language.
Posted Jul 11, 2015 5:54 UTC (Sat)
by alankila (guest, #47141)
Do not undersell this. This is an actual technical problem, caused by design decisions appropriate for uniprocessor machines, and general neglect of execution performance. That there are solutions, such as using a different language or implementation, just underlines (C)Python's unsuitability for certain tasks where other languages that are still fairly similar to Python are far better. PyPy is perhaps the thing that allows Python to remain relevant -- from my outsider point of view, the whole Python community should simply abandon CPython as soon as possible and make PyPy the official way to run Python.
Posted Jul 11, 2015 12:30 UTC (Sat)
by Kwi (subscriber, #59584)
As for allowing Python to "remain relevant" and your outsider(?) PoV, that's exactly my point about the GIL being a PR problem.

For years, I've used Python for server administration and orchestration, web development, and even network packet processing. All of these things have been highly concurrent, and not once has the GIL been a problem.

And where is Python in danger of becoming irrelevant? Python might be facing fierce competition in web development from Rails and Node.js, but it's hardly because of the GIL (or performance at all, when it comes to Ruby). NumPy and SciPy are not going anywhere in the scientific/numeric world, and Python reigns supreme in Linux server administration, having even dethroned Perl.

Complaints about the GIL are many, but concrete examples of people having problems with the GIL in actual application code are rare. I'd like to hear more of those, because otherwise, it's just a bunch of people saying "I can't write threaded code like I'm used to in Python", and honestly, there are a lot of things you're used to which you can't (or shouldn't) do in Python. That's a feature, not a bug.
Posted Jul 13, 2015 13:59 UTC (Mon)
by arvidma (guest, #6353)
The problem itself is super easy to parallelize: the tree is static at the time of traversal, there are no cross-branch references, and you only modify branch-local data (in the nodes). You spin off a thread per first-level branch (or second-level branch if you want to use moar cores), wait for the recursion to trickle down and up again, and summarize each thread's result on the top node when they are finished.

Problem is, only one thread at a time is active, due to the GIL. Using the multiprocessing module is of no help, since that means you have to serialize the full tree in one chunk per process, copy each chunk to its new context, deserialize the chunks in every new context, reserialize them when you're finished, copy back to the main context, and then deserialize again... Unless the computations per node are extremely heavy, you lose way more from the overhead than you gain from the threads.

I would very much appreciate it if this type of problem could be handled efficiently in Python.
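A sketch of the pattern being described (added for illustration, not part of the comment): one thread per first-level branch. Under the GIL these threads still execute Python bytecode one at a time, so there is no multi-core speedup.

import threading

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def subtree_sum(node):
    # Plain recursive traversal of one branch.
    total = node.value
    for child in node.children:
        total += subtree_sum(child)
    return total

def parallel_sum(root):
    # One thread per first-level branch; each writes its own result slot.
    results = [0] * len(root.children)

    def worker(i, branch):
        results[i] = subtree_sum(branch)

    threads = [threading.Thread(target=worker, args=(i, b))
               for i, b in enumerate(root.children)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return root.value + sum(results)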
Posted Jul 13, 2015 15:41 UTC (Mon)
by Kwi (subscriber, #59584)
Quoting the parent comment: "I learnt about the GIL, when I was writing a piece of code that needed (frequently) to traverse [preferably in parallel] a (big) tree structure and perform some calculations on each node."

I agree that Python is not the best tool for that job. The answer here would be C or Cython. Even without the GIL, you'd probably have lock contention on the reference counters for any Python function called while processing your tree. All CPython objects, including functions, are reference counted; while executing a function, the reference count is increased (a small demonstration is reproduced at the end of the comments).

Reference counting is another reason why multithreaded CPython is bad for performance-critical stuff. Note that, unlike removing the GIL, removing reference counting would change the semantics of the language. That's why e.g. PyPy (which uses garbage collection) is not the "standard" interpreter. (Now, PyPy still has the GIL – except in the experimental STM branch – but my experience indicates that PyPy using a single thread is likely faster than a hypothetical GIL-free CPython using four cores.)

Python compromises performance in numerous places, by design, whether it's by allowing crazy monkey patching of modules at runtime or by rejecting tail call optimizations. Did you know that from module import func and calling func gives better performance than import module and calling module.func (in a tight loop)? It's obvious when you know Python, but it can be surprising to newcomers.

In the end, Python values other features higher than performance; and again, that's largely a design decision.
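The import-style difference mentioned above is easy to measure; something along these lines (an illustration added here, not part of the comment, with math.sqrt standing in for any frequently called function) shows the directly imported name winning in a tight loop:

import timeit

# Attribute lookup on the module object on every call...
print(timeit.timeit("math.sqrt(2.0)", setup="import math"))
# ...versus a direct reference to the function itself.
print(timeit.timeit("sqrt(2.0)", setup="from math import sqrt"))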
Posted Jul 13, 2015 16:19 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
We've learned that the hard way, while building a messaging system in Python. We had a similar problem - a largish shared object queue on which multiple consumers performed some operations. Python simply crashed and burned.
Posted Jul 13, 2015 16:56 UTC (Mon)
by Kwi (subscriber, #59584)
Sorry, I should've clarified that the data structure should be in C as well, for the reasons you give.

Anyway, the secondary point (besides "the GIL is just one of many problems for performance") is that all languages to some extent trade CPU hours for programmer hours... with the possible exception of assembler code (as written by humans), which often performs worse than the equivalent C.

The Java JITs are generally considered to provide excellent performance, while providing a high-level language. However, let's face it, outside microbenchmarks, the "Java performs better than C/C++" claims are completely bogus. Now, C++ is a high-performance and high-level language, but you'll find people making reasonable arguments that C performs slightly better, by stripping away the (very thin) abstractions provided by C++. And C is considered the king of performance, except that you'll find people making reasonable arguments that Fortran performs slightly better, by stripping away the (exceedingly thin) abstractions provided by C.

A trivial example of this trade-off is signed integer overflow, which goes from impossible (Python) to possible but well-defined (Java) to undefined behavior (C).

Now, if developer performance is the primary concern, I use Python, in which my performance is (rough estimate) 10x better than in Java, or 100x better than in C. If CPU performance is the primary concern, the reverse holds. (Except nowadays, I'd look into using Rust instead of C. And if both developer performance and CPU performance were a concern, I'd use Python and curse its performance, because I really don't like the Java language... but that's beside the point.)
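A two-line illustration of the Python end of that overflow spectrum (added here, not part of the comment): Python integers are arbitrary precision, so overflow simply cannot happen, with the corresponding cost that every addition goes through object machinery.

x = 2 ** 63 - 1   # the largest signed 64-bit value in C
print(x + 1)      # 9223372036854775808: no wrap-around, no undefined behavior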
Posted Jul 18, 2015 6:04 UTC (Sat)
by linuxrocks123 (subscriber, #34648)
What type of code are you writing where you think you get a 10x productivity boost by switching languages?
Posted Jul 19, 2015 17:09 UTC (Sun)
by Kwi (subscriber, #59584)
Maybe I'm just a bad C developer. ;-)

All joking aside: For a project where I can fully harvest the benefits of Python features like tuples, generators, memory safety, and the wide selection of readily available libraries, I routinely write in 5 lines what would have taken 50 in a language like C. (I'd put modern C++ – that is, C++11 or later – somewhere in the middle, let's say 3x faster than C and 3x slower than Python.) Not only does that save me the time it takes to type those lines, but several studies suggest that the bug density (bugs per line) is roughly independent of the choice of programming language*, which means I save the time needed to debug those lines.

Coming up with a simple example to demonstrate the benefits of a programming language is always difficult, but I'll try anyway. Here's a 5-line Python function (the code is reproduced at the end of the comments). The function depends on the standard library re (regular expression) module, and it's used with the built-in sorted function. If you count the docstring, it's 10 lines, but then you also have unit tests (python -m doctest natural_sort.py). And yes, I'll go out on a limb and say that that function is representative of maybe 80% of the Python code I write – except for the number of lines, of course. ;-)

If put to the challenge, I'm sure that someone can come up with a more or less equivalent C function in less than 50 lines (or less than 15 lines of C++). But it'll take them significantly longer than the 10 minutes it took me to write mine, and it won't be nearly as readable (YMMV).

*) I know, I know, it's nearly impossible to measure with any level of scientific rigor, and the research is highly contested. Still, some references:

Ray et al., 2014. A Large Scale Study of Programming Languages and Code Quality in Github. While the paper draws no conclusions, its data suggests that Python has roughly twice the bug density (bugs per SLOC) of C, C++, or Java. (Assuming Python has at most half as many SLOC as the equivalent C, that's still a win.)

Phipps, 1999. Comparing observed bug and productivity rates for Java and C++. Apparently (haven't read the study) suggests that C++ has 15–30% more defects per line than Java.
Posted Jul 19, 2015 17:41 UTC (Sun)
by Kwi (subscriber, #59584)
A "100x" boost between C and Python is overstating it, but I'm confident that 10x is a lower bound.
In the end, it's all fuzzy numbers, obviously. :-)
Posted Jul 20, 2015 1:00 UTC (Mon)
by Kwi (subscriber, #59584)
A quick Internet survey of implementations in other languages demonstrates my point. There's a 76-line C implementation and a related 63-line Java implementation. The large number of lines reflects that both C and Java are lacking in their native support for high-level list and string operations. I struggled to find an idiomatic C++ implementation (I found plenty of "C in C++" style implementations), though I did find one using Boost (39 lines). With C# we're finally getting somewhere; it can be done in 7 lines, plus a 17-line utility class that really ought to be in the standard library (but isn't). (C# in general seems to be a good fit if one wants a statically typed, compiled, and managed language with a level of expressiveness that approaches Python.)

Again, I'm sure that an experienced C/C++/Java developer could do it in fewer lines than the above, but according to Google, those examples are the best the Internet has to offer. Google also finds several Python implementations, all of them variations on the same 5 lines as I posted earlier. (I guess there's only one obvious way to do it.)
Posted Jul 17, 2015 11:51 UTC (Fri)
by arvidma (guest, #6353)
Thanks for a very informative response!
Posted Jul 17, 2015 15:51 UTC (Fri)
by Kwi (subscriber, #59584)
I should clarify that 99% of the time the performance hit is insignificant, but in a tight spot one may want to replace while ...: x.y() with the two-line rewrite reproduced at the end of the comments (binding x.y to a local name before the loop). This saves a lookup of the y attribute on every iteration, in favor of a much faster local variable access (assuming this is in a function body). The Python interpreter can't do this optimization automatically, because it'd change the semantics if one thread assigned to x.y while another was in the loop. It's just one example of the performance difficulties imposed by a highly dynamic language like Python (which doesn't have C's volatile keyword). But again, 99% of the time, you care more about the language than the performance. So don't go "optimize" every bit of Python code like this. :-)
Posted Jul 28, 2015 23:19 UTC (Tue)
by pboddie (guest, #50784)
Quoting the earlier comment: "I agree that Python is not the best tool for that job. The answer here would be C or Cython."

I heard this myself from a bunch of "scientific Python" people a few years ago. The response from a colleague of mine who isn't (or wasn't) really a Python user was, "Why not just write the code in C in the first place and just ignore Python?" That's a pretty hard question to answer even for those of us who feel moderately productive in Python.

The big problems with Python's evolution have been the denial that various disadvantages are "real enough" and that everything has to be tied somehow to CPython or not be completely legitimate (although some in the "scientific" community are slowly accepting things like PyPy after choosing to ignore it for years). Need to cross-compile Python in a sane way or target embedded systems or mobile devices? No-one needs to do that! Wasn't Python for Symbian Series 60 enough?! Thankfully, stuff like Micro Python has been developed and has presumably thrived by filling an otherwise neglected niche. Meanwhile, attempts to deliver CPython as a mobile platform seem to be stuck on repeat at the earliest stage. Plenty of examples of other domains exist if you care to look.

Quoting again: "In the end, Python values other features higher than performance; and again, that's largely a design decision."

People have been saying this for twenty years. Making a virtue of such things - that performance is a lost cause and that everyone should instead celebrate other things, including the tendency to make the language even more baroque - is precisely what has held the language back for a good long time, too. Such attitudes probably put Perl in the place it currently resides today, in case any lessons from history were needed.
Posted Jul 12, 2015 17:32 UTC (Sun)
by wahern (subscriber, #37304)
I don't think they've released their implementation yet, though.
Code referenced in the comments

The reference-counting demonstration referred to in Kwi's comments above (a function object's reference count rises while the function is executing):
>>> import sys
>>> def foo():
... print(sys.getrefcount(foo))
...
>>> print(sys.getrefcount(foo))
2
>>> foo()
4
The natural-sort key function from Kwi's comment above (the "5-line Python function"):
def natural(s, _re=re.compile('([0-9]+)')):
""" Provides a sort key for obtaining a natural collation of strings.
>>> sorted(['Figure 2', 'Figure 11a', 'Figure 7b'])
['Figure 11a', 'Figure 2', 'Figure 7b']
>>> sorted(['Figure 2', 'Figure 11a', 'Figure 7b'], key=natural)
['Figure 2', 'Figure 7b', 'Figure 11a']
"""
return tuple(
int(text) if text.isdigit() else text.lower()
for text in _re.split(s)
)
The loop rewrite from Kwi's comment above, hoisting the x.y attribute lookup out of the loop:
z = x.y
while ...: z()