A better story for multi-core Python
Running the standard Python interpreter in threads on multiple CPU cores has always resulted in a smaller performance gain than one might naively think—or hope for. Because of the CPython global interpreter lock (GIL), only one thread of execution can be running in the interpreter core at any given time. Removing the GIL has long been a topic of discussion in Python circles, and various alternative Python implementations have either removed or worked around the GIL. A recent discussion on the python-ideas mailing list looked at a different approach to providing a better multi-core story for Python.
In a post that was optimistically titled "solving multi-core Python", Eric Snow outlined an approach that does not rely on removing the GIL, but instead relies on "subinterpreters" and a mechanism to share objects between them. The multi-core problem is partly a public relations problem for the language, Snow said, but it needs solving for that and other, more technical reasons.
Subinterpreters
The basic principle behind Snow's proposal is to take the existing subinterpreter support and to expose it in the Python language as a concurrency mechanism. The subinterpreters would run in separate threads, but would not generally share data with each other, at least implicitly, unlike the typical use of threads. Data would only be exchanged explicitly via channels (similar to those in the Go language). One of the main influences for Snow's thoughts (and for Go's concurrency model) is Tony Hoare's "Communicating Sequential Processes".
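No such API existed in CPython at the time, so the following is a purely hypothetical sketch meant only to make the model concrete; the module name interpreters and the names Channel, create(), and run() are illustrative assumptions, not anything Snow had specified:

# Hypothetical sketch only: an imagined subinterpreter/channel API in the
# spirit of Snow's proposal (and of Go channels). None of these names are
# real; they exist only to illustrate the programming model.
import interpreters  # hypothetical module, not part of the standard library

def worker(channel):
    # Runs in its own subinterpreter with no implicitly shared state;
    # results are handed back explicitly over the channel.
    channel.send(sum(i * i for i in range(1_000_000)))

channel = interpreters.Channel()   # explicit communication only
interp = interpreters.create()     # a new subinterpreter in its own thread
interp.run(worker, channel)        # execute worker() there
print(channel.recv())              # receive the result in the main interpreter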
Handling objects shared between subinterpreters is one of the areas that requires more thought, Snow said. One way forward might be to only allow immutable objects to be shared between the subinterpreters. In order to do that, though, it probably makes sense to move the reference counts (used for garbage collection) out of the objects themselves and into a separate table. That would allow the objects themselves to be truly unchanging, which could also help performance in the multi-processing (i.e. fork()) case by avoiding page copies (via copy-on-write) of objects that are simply being referenced again, as Nick Coghlan pointed out.
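A small illustration (added here, not from the thread) of why refcounts stored inside objects defeat copy-on-write sharing: simply taking another reference to an object writes to that object's header, which would dirty the memory page a forked child otherwise shares with its parent.

import sys

data = tuple(range(1000))        # an otherwise immutable object
print(sys.getrefcount(data))     # e.g. 2
alias = data                     # no element is copied or changed, but...
print(sys.getrefcount(data))     # ...the count stored in the object header
                                 # was incremented, i.e. the object was written to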
Another area that needs to be considered is what the restrictions on subinterpreters would be. If, for example, subinterpreters were not allowed to start new threads, they would be single-threaded and not require a GIL. Or the GIL for subinterpreters could be replaced with a "local interpreter lock", with the main GIL used in the main interpreter and to mediate interaction between subinterpreters. There is also a question about using fork() in subinterpreters. In the initial email, Snow suggested disallowing that, but in the discussion that followed, he seemed to rethink that.
The proposal is clearly a kind of early-stage "request for comment" (or "a shot over the bow" as Snow put it), but it did spark quite a bit of discussion and some fairly favorable comments. Yury Selivanov was quite interested in the idea, for example, noting that just being able to share immutable objects would be useful.
Concerns
But Gregory Smith was concerned about the impact of each subinterpreter needing to re-import all of the modules used by the main interpreter, since those would not be shared. That would reduce the effectiveness of Snow's model. On the other hand, though, Smith sees a potential upside as well: "I think a result of it could be to make our subinterpreter support better which would be a good thing." Several suggestions were made for ways to speed up the startup time for subinterpreters or to share more state (such as modules) between the interpreters.
Several in the thread believed that the existing fork()-based concurrency was the right way forward, at least for POSIX systems; Devin Jeanpierre, for example, argued for its benefits.

While fork() does provide those benefits, it is only available on POSIX systems. It is also different from Snow's goal, which is "to make it obvious and undeniable that Python (3.6+) has a good multi-core story", which is partly a matter of public perception. The subinterpreter idea is just a means to that end, he said, and he would be happy to see a different solution if it fulfilled that goal. In the meantime, though, his proposal has some characteristics that multi-processing with fork() lacks.
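For reference, here is a minimal sketch (added for illustration, not from the thread) of the existing fork()-based approach being discussed; on Linux, multiprocessing forks worker processes by default, so CPU-bound work truly runs on multiple cores, at the cost of pickling arguments and results between processes.

from multiprocessing import Pool

def crunch(n):
    # CPU-bound work; each worker process has its own interpreter and GIL.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:            # four forked worker processes
        print(pool.map(crunch, [10**6] * 8))   # runs on multiple cores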
But Sturla Molden pointed to the lack of fork() for Windows as one of the real reasons behind Snow's proposal: "It then boils down to a workaround for the fact that Windows cannot fork, which makes it particularly bad for running CPython". But, as Snow said, Python cannot ignore Windows. Beyond that, though, even with the "superior" fork() solution available, the perception of multi-core Python is much different.
Molden replied with a long list of answers to the "FUD" that is promulgated about Python and the GIL, but that doesn't really change anything. That is why Snow's goal is to make multi-core support "obvious and undeniable". It also seems that Molden is coming from a scientific/numeric Python background, which is not generally where the complaints about Python's multi-core support originate, as Coghlan noted.
Shared data
The reasoning behind restricting the data shared between interpreters to immutable types (at least initially) can be seen from a question asked by Nathaniel Smith. He wondered how two subinterpreters could share a complicated data structure containing several different types of Python objects. Snow acknowledged that concern, and suggested avoiding the "trickiness involved" in handling that kind of data by sticking to immutable objects; there may be "some sort of support for mutable objects" added later, he said.
Coghlan summarized Snow's proposal as really being three separate things:
- Filing off enough of the rough edges of the subinterpreter support that we're comfortable giving them a public Python level API that other interpreter implementations can reasonably support
- Providing the primitives needed for safe and efficient message passing between subinterpreters
- Allowing subinterpreters to truly execute in parallel on multicore machines
All 3 of those are useful enhancements in their own right, which offers the prospect of being able to make incremental progress towards the ultimate goal of native Python level support for distributing across multiple cores within a single process.
In addition, Coghlan has published a summary of the state of multi-core Python that looks at the problem along with alternatives and possible solutions. It is an update from an earlier entry in his Python 3 Q&A and is well worth a read to get the background on the issues.
There seems to be enough interest in Snow's proposal that it could be on the radar for Python 3.6 (which is roughly 18 months off). There is a long road before that happens, though. A PEP will have to be written—as will a good bit of code. We also have yet to see what Guido van Rossum's thoughts on the whole idea are, though Snow did mention some discussions with Python's benevolent dictator for life in his initial post. As Nathaniel Smith put it, Snow's approach seems like the "least impossible" one. That is not the same as "possible", of course, but seems hopeful at least.
Posted Jul 10, 2015 14:49 UTC (Fri)
by Kwi (subscriber, #59584)
Calling the GIL a public relations problem is exactly right. It's more a PR problem than a technical problem, anyway. If you want concurrency in your Python application, you have a number of options.

"If you want your code to run faster, you should probably just use PyPy." — Guido van Rossum

Would it be easier for everybody if we could all just use threads and not worry about the GIL? Yes. But performance is rarely easy - not in Python, nor in any language.
Posted Jul 11, 2015 5:54 UTC (Sat)
by alankila (guest, #47141)
Do not undersell this. This is an actual technical problem, caused by design decisions appropriate for uniprocessor machines, and general neglect of execution performance. That there are solutions, such as using a different language or implementation, just underlines (C)Python's unsuitability for certain tasks where other languages that are still fairly similar to Python are far better. PyPy is perhaps the thing that allows Python to remain relevant -- from my outsider point of view, the whole Python community should simply abandon CPython as soon as possible and make PyPy the official way to run Python.
Posted Jul 11, 2015 12:30 UTC (Sat)
by Kwi (subscriber, #59584)
As for allowing Python to "remain relevant" and your outsider(?) PoV, that's exactly my point about the GIL being a PR problem.

For years, I've used Python for server administration and orchestration, web development, and even network packet processing. All of these things have been highly concurrent, and not once has the GIL been a problem.

And where is Python in danger of becoming irrelevant? Python might be facing fierce competition in web development from Rails and Node.js, but it's hardly because of the GIL (or performance at all, when it comes to Ruby). NumPy and SciPy are not going anywhere in the scientific/numeric world, and Python reigns supreme in Linux server administration, having even dethroned Perl.

Complaints about the GIL are many, but concrete examples of people having problems with the GIL in actual application code are rare. I'd like to hear more of those, because otherwise, it's just a bunch of people saying "I can't write threaded code like I'm used to in Python", and honestly, there are a lot of things you're used to which you can't (or shouldn't) do in Python. That's a feature, not a bug.
Posted Jul 13, 2015 13:59 UTC (Mon)
by arvidma (guest, #6353)
The problem itself is super easy to parallelize: the tree is static at the time of traversal, there are no cross-branch references, and you only modify branch-local data (in the nodes). You spin off a thread per first-level branch (or second-level branch if you want to use moar cores), wait for the recursion to trickle down and up again, and summarize each thread's result on the top node when they are finished.

Problem is, only one thread at a time is active, due to the GIL. Using the multiprocessing module is of no help, since that means you have to serialize the full tree in one chunk per process, copy each chunk to its new context, deserialize the chunks in every new context, reserialize them when you're finished, copy back to the main context, and then deserialize again... Unless the computations per node are extremely heavy, you lose way more from the overhead than you gain from the threads.

I would very much appreciate it if this type of problem could be handled efficiently in Python.
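A sketch of the pattern being described (added for illustration, not part of the comment): one thread per first-level branch. Under the GIL these threads still execute Python bytecode one at a time, so there is no multi-core speedup.

import threading

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def subtree_sum(node):
    # Plain recursive traversal of one branch.
    total = node.value
    for child in node.children:
        total += subtree_sum(child)
    return total

def parallel_sum(root):
    # One thread per first-level branch; each writes its own result slot.
    results = [0] * len(root.children)

    def worker(i, branch):
        results[i] = subtree_sum(branch)

    threads = [threading.Thread(target=worker, args=(i, b))
               for i, b in enumerate(root.children)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return root.value + sum(results)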
Posted Jul 13, 2015 15:41 UTC (Mon)
by Kwi (subscriber, #59584)
Quoting the parent comment: "I learnt about the GIL, when I was writing a piece of code that needed (frequently) to traverse [preferably in parallel] a (big) tree structure and perform some calculations on each node."

I agree that Python is not the best tool for that job. The answer here would be C or Cython. Even without the GIL, you'd probably have lock contention on the reference counters for any Python function called while processing your tree. All CPython objects, including functions, are reference counted; while executing a function, the reference count is increased (a small demonstration is reproduced at the end of the comments).

Reference counting is another reason why multithreaded CPython is bad for performance-critical stuff. Note that, unlike removing the GIL, removing reference counting would change the semantics of the language. That's why e.g. PyPy (which uses garbage collection) is not the "standard" interpreter. (Now, PyPy still has the GIL – except in the experimental STM branch – but my experience indicates that PyPy using a single thread is likely faster than a hypothetical GIL-free CPython using four cores.)

Python compromises performance in numerous places, by design, whether it's by allowing crazy monkey patching of modules at runtime or by rejecting tail call optimizations. Did you know that from module import func and calling func gives better performance than import module and calling module.func (in a tight loop)? It's obvious when you know Python, but it can be surprising to newcomers.

In the end, Python values other features higher than performance; and again, that's largely a design decision.
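The import-style difference mentioned above is easy to measure; something along these lines (an illustration added here, not part of the comment, with math.sqrt standing in for any frequently called function) shows the directly imported name winning in a tight loop:

import timeit

# Attribute lookup on the module object on every call...
print(timeit.timeit("math.sqrt(2.0)", setup="import math"))
# ...versus a direct reference to the function itself.
print(timeit.timeit("sqrt(2.0)", setup="from math import sqrt"))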
Posted Jul 13, 2015 16:19 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
We've learned that the hard way, while building a messaging system in Python. We had a similar problem - a largish shared object queue on which multiple consumers performed some operations. Python simply crashed and burned.
Posted Jul 13, 2015 16:56 UTC (Mon)
by Kwi (subscriber, #59584)
Sorry, I should've clarified that the data structure should be in C as well, for the reasons you give.

Anyway, the secondary point (besides "the GIL is just one of many problems for performance") is that all languages to some extent trade CPU hours for programmer hours... with the possible exception of assembler code (as written by humans), which often performs worse than the equivalent C.

The Java JITs are generally considered to provide excellent performance, while providing a high-level language. However, let's face it, outside microbenchmarks, the "Java performs better than C/C++" claims are completely bogus. Now, C++ is a high-performance and high-level language, but you'll find people making reasonable arguments that C performs slightly better, by stripping away the (very thin) abstractions provided by C++. And C is considered the king of performance, except that you'll find people making reasonable arguments that Fortran performs slightly better, by stripping away the (exceedingly thin) abstractions provided by C.

A trivial example of this trade-off is signed integer overflow, which goes from impossible (Python) to possible but well-defined (Java) to undefined behavior (C).

Now, if developer performance is the primary concern, I use Python, in which my performance is (rough estimate) 10x better than in Java, or 100x better than in C. If CPU performance is the primary concern, the reverse holds. (Except nowadays, I'd look into using Rust instead of C. And if both developer performance and CPU performance were a concern, I'd use Python and curse its performance, because I really don't like the Java language... but that's beside the point.)
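A two-line illustration of the Python end of that overflow spectrum (added here, not part of the comment): Python integers are arbitrary precision, so overflow simply cannot happen, with the corresponding cost that every addition goes through object machinery.

x = 2 ** 63 - 1   # the largest signed 64-bit value in C
print(x + 1)      # 9223372036854775808: no wrap-around, no undefined behavior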
Posted Jul 18, 2015 6:04 UTC (Sat)
by linuxrocks123 (subscriber, #34648)
What type of code are you writing where you think you get a 10x productivity boost by switching languages?
Posted Jul 19, 2015 17:09 UTC (Sun)
by Kwi (subscriber, #59584)
Maybe I'm just a bad C developer. ;-)

All joking aside: For a project where I can fully harvest the benefits of Python features like tuples, generators, memory safety, and the wide selection of readily available libraries, I routinely write in 5 lines what would have taken 50 in a language like C. (I'd put modern C++ – that is, C++11 or later – somewhere in the middle, let's say 3x faster than C and 3x slower than Python.) Not only does that save me the time it takes to type those lines, but several studies suggest that the bug density (bugs per line) is roughly independent of the choice of programming language*, which means I save the time needed to debug those lines.

Coming up with a simple example to demonstrate the benefits of a programming language is always difficult, but I'll try anyway. Here's a 5-line Python function (the code is reproduced at the end of the comments). The function depends on the standard library re (regular expression) module, and it's used with the built-in sorted function. If you count the docstring, it's 10 lines, but then you also have unit tests (python -m doctest natural_sort.py). And yes, I'll go out on a limb and say that that function is representative of maybe 80% of the Python code I write – except for the number of lines, of course. ;-)

If put to the challenge, I'm sure that someone can come up with a more or less equivalent C function in less than 50 lines (or less than 15 lines of C++). But it'll take them significantly longer than the 10 minutes it took me to write mine, and it won't be nearly as readable (YMMV).

*) I know, I know, it's nearly impossible to measure with any level of scientific rigor, and the research is highly contested. Still, some references:

Ray et al., 2014. A Large Scale Study of Programming Languages and Code Quality in Github. While the paper draws no conclusions, its data suggests that Python has roughly twice the bug density (bugs per SLOC) of C, C++, or Java. (Assuming Python has at most half as many SLOC as the equivalent C, that's still a win.)

Phipps, 1999. Comparing observed bug and productivity rates for Java and C++. Apparently (haven't read the study) suggests that C++ has 15–30% more defects per line than Java.
Posted Jul 19, 2015 17:41 UTC (Sun)
by Kwi (subscriber, #59584)
A "100x" boost between C and Python is overstating it, but I'm confident that 10x is a lower bound.
In the end, it's all fuzzy numbers, obviously. :-)
Posted Jul 20, 2015 1:00 UTC (Mon)
by Kwi (subscriber, #59584)
A quick Internet survey of implementations in other languages demonstrates my point. There's a 76-line C implementation and a related 63-line Java implementation. The large number of lines reflects that both C and Java are lacking in their native support for high-level list and string operations. I struggled to find an idiomatic C++ implementation (I found plenty of "C in C++" style implementations), though I did find one using Boost (39 lines). With C# we're finally getting somewhere; it can be done in 7 lines, plus a 17-line utility class that really ought to be in the standard library (but isn't). (C# in general seems to be a good fit if one wants a statically typed, compiled, and managed language with a level of expressiveness that approaches Python.)

Again, I'm sure that an experienced C/C++/Java developer could do it in fewer lines than the above, but according to Google, those examples are the best the Internet has to offer. Google also finds several Python implementations, all of them variations on the same 5 lines as I posted earlier. (I guess there's only one obvious way to do it.)
Posted Jul 17, 2015 11:51 UTC (Fri)
by arvidma (guest, #6353)
Thanks for a very informative response!
Posted Jul 17, 2015 15:51 UTC (Fri)
by Kwi (subscriber, #59584)
I should clarify that 99% of the time the performance hit is insignificant, but in a tight spot one may want to replace while ...: x.y() with the two-line rewrite reproduced at the end of the comments (binding x.y to a local name before the loop). This saves a lookup of the y attribute on every iteration, in favor of a much faster local variable access (assuming this is in a function body). The Python interpreter can't do this optimization automatically, because it'd change the semantics if one thread assigned to x.y while another was in the loop. It's just one example of the performance difficulties imposed by a highly dynamic language like Python (which doesn't have C's volatile keyword). But again, 99% of the time, you care more about the language than the performance. So don't go "optimize" every bit of Python code like this. :-)
Posted Jul 28, 2015 23:19 UTC (Tue)
by pboddie (guest, #50784)
Quoting the earlier comment: "I agree that Python is not the best tool for that job. The answer here would be C or Cython."

I heard this myself from a bunch of "scientific Python" people a few years ago. The response from a colleague of mine who isn't (or wasn't) really a Python user was, "Why not just write the code in C in the first place and just ignore Python?" That's a pretty hard question to answer even for those of us who feel moderately productive in Python.

The big problems with Python's evolution have been the denial that various disadvantages are "real enough" and that everything has to be tied somehow to CPython or not be completely legitimate (although some in the "scientific" community are slowly accepting things like PyPy after choosing to ignore it for years). Need to cross-compile Python in a sane way or target embedded systems or mobile devices? No-one needs to do that! Wasn't Python for Symbian Series 60 enough?! Thankfully, stuff like Micro Python has been developed and has presumably thrived by filling an otherwise neglected niche. Meanwhile, attempts to deliver CPython as a mobile platform seem to be stuck on repeat at the earliest stage. Plenty of examples of other domains exist if you care to look.

Quoting again: "In the end, Python values other features higher than performance; and again, that's largely a design decision."

People have been saying this for twenty years. Making a virtue of such things - that performance is a lost cause and that everyone should instead celebrate other things, including the tendency to make the language even more baroque - is precisely what has held the language back for a good long time, too. Such attitudes probably put Perl in the place it currently resides today, in case any lessons from history were needed.
Posted Jul 12, 2015 17:32 UTC (Sun)
by wahern (subscriber, #37304)
I don't think they've released their implementation yet, though.
Code referenced in the comments

The reference-counting demonstration referred to in Kwi's comments above (a function object's reference count rises while the function is executing):
>>> import sys
>>> def foo():
... print(sys.getrefcount(foo))
...
>>> print(sys.getrefcount(foo))
2
>>> foo()
4
The natural-sort key function from Kwi's comment above (the "5-line Python function"):
def natural(s, _re=re.compile('([0-9]+)')):
""" Provides a sort key for obtaining a natural collation of strings.
>>> sorted(['Figure 2', 'Figure 11a', 'Figure 7b'])
['Figure 11a', 'Figure 2', 'Figure 7b']
>>> sorted(['Figure 2', 'Figure 11a', 'Figure 7b'], key=natural)
['Figure 2', 'Figure 7b', 'Figure 11a']
"""
return tuple(
int(text) if text.isdigit() else text.lower()
for text in _re.split(s)
)
The loop rewrite from Kwi's comment above, hoisting the x.y attribute lookup out of the loop:
z = x.y
while ...: z()