A viable solution for Python concurrency
Posted Oct 14, 2021 15:57 UTC (Thu) by chris_se (subscriber, #99706)Parent article: A viable solution for Python concurrency
For example, say you have an extension with a function foo() that takes a dict as a parameter. Most extension functions don't drop the GIL themselves, which means that while they are running, no other thread can alter the dict that was passed in. If the GIL disappears, however, the dict may be altered from another thread while the function is still accessing it. And even if all of the PyDict_* functions are made thread-safe individually, people typically call several PyDict functions in succession, and if the object changes between calls, C extensions can easily misbehave.
This type of problem will be even more prominent with the custom classes that extensions create, because no extension currently performs any locking on its own objects (why would it?). On the other hand, certain kinds of concurrent access to objects from different threads might even be desirable -- if the structure doesn't change, different threads should be able to access numpy arrays simultaneously, for example, so the CPython API should provide some means of supporting this.
I still think it is worthwhile to get rid of the GIL and make Python much more useful on current CPUs (where core counts have grown far more in recent years than single-core speeds have), and I welcome the efforts made here in that direction. But it is clear that extension compatibility will go beyond just recompiling: the Python project will have to issue clear guidance to extension developers on how to handle this properly (perhaps by locking individual objects -- though once multiple objects are involved, e.g. multiple parameters passed to a function, you have to start thinking about deadlocks).
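The check-then-act hazard described above can be illustrated at the Python level. This is a hypothetical sketch (my own, not from the comment): the "other thread" is simulated by an interleave callback injected between two dict operations that a C extension might perform back to back, a pair that is effectively atomic today because the GIL is held across both.

```python
def racy_lookup(d, key, interleave=None):
    """Check-then-act on a dict, with an optional simulated interleaving."""
    if key in d:                  # first operation: membership test
        if interleave is not None:
            interleave()          # point where another thread could run
        return d[key]             # second operation: may now raise KeyError
    return None

d = {"size": 3}
print(racy_lookup(d, "size"))     # no interference: prints 3

try:
    # Simulate a concurrent deletion between the two operations.
    racy_lookup(d, "size", interleave=lambda: d.pop("size"))
except KeyError:
    print("key vanished between the two operations")
```

With the GIL, the interleaving cannot happen inside a C function that never releases the lock; without it, every such call sequence becomes a potential race.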
Posted Oct 14, 2021 17:50 UTC (Thu)
by azumanga (subscriber, #90158)
[Link]
We fixed many of the problems by having objects owned by a single thread at a time, but then all code needs to know about transferring objects around, which is very painful for highly recursive objects.
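One common shape for that single-owner discipline is a handoff through a queue: only the thread that currently holds the reference may touch the object, and ownership moves by passing the reference along. This is a hedged sketch of the general pattern, not the commenter's actual code.

```python
import queue
import threading

def producer(q):
    obj = {"items": []}           # owned exclusively by the producer thread
    for i in range(3):
        obj["items"].append(i)
    q.put(obj)                    # ownership transferred; the producer must
                                  # not touch obj after this point

def consumer(q, results):
    obj = q.get()                 # the consumer now owns obj exclusively
    results.append(sum(obj["items"]))

q = queue.Queue()
results = []
t1 = threading.Thread(target=producer, args=(q,))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start()
t2.start()
t1.join()
t2.join()
print(results)                    # prints [3]
```

The pain point the comment alludes to is visible even here: the discipline is purely a convention, and for an object graph with internal references ("highly recursive objects") every reachable object must change hands together.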
Posted Oct 14, 2021 19:13 UTC (Thu)
by iabervon (subscriber, #722)
[Link]
Posted Oct 14, 2021 19:21 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (4 responses)
This is not limited to C extensions. Consider for example:

>>> def foo(l, x):
...     l += [x]
...
>>> dis.dis(foo)
  2           0 LOAD_FAST                0 (l)
              2 LOAD_FAST                1 (x)
              4 BUILD_LIST               1
              6 INPLACE_ADD
              8 STORE_FAST               0 (l)
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE

Assuming that l is a list, this function atomically appends x to it. We know that it must be atomic because INPLACE_ADD is a single bytecode instruction, the GIL must be held for the entire execution of that instruction, and lists are implemented in C, so the instruction does not create an additional stack frame of Python bytecode. The STORE_FAST is a red herring; it just rebinds a local variable in a scope that is immediately destroyed (so the interpreter could have optimized it out without altering the function's semantics).
The problem is that I don't see a reasonable way to preserve that atomicity.
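Under current CPython, that atomicity is easy to confirm empirically. This sketch (mine, not from the comment) hammers one shared list from several threads; because the in-place add is carried out by a single C-level list operation under the GIL, no appends are ever lost:

```python
import threading

def foo(l, x):
    l += [x]          # one INPLACE_ADD; executed atomically under the GIL

shared = []
N = 10_000

def worker():
    for i in range(N):
        foo(shared, i)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))    # prints 40000: no appends are lost
```

A GIL-free interpreter has to provide this guarantee some other way (in the proposal under discussion, by locking the list around the mutating operation).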
Posted Oct 15, 2021 16:25 UTC (Fri)
by colesbury (subscriber, #137476)
[Link] (3 responses)
The design uses "automatic fine-grained locking". The locks are acquired around mutating operations, but not read-only operations. I estimate that the locks add about 1.5% overhead. (See the sections "Collection thread-safety" and "Performance")
https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfz...
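The "lock mutations, leave reads lock-free" split can be approximated in pure Python. This is a simplified illustration of the idea, not the actual nogil code (which does this automatically inside the interpreter, in C):

```python
import threading

class LockedList:
    """Toy container: mutating operations take a per-object lock, while
    read-only operations acquire nothing. A sketch of the concept only;
    the real design also needs the optimistic-retry machinery to make the
    unlocked reads safe against concurrent resizes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def append(self, x):          # mutating: serialized by the per-object lock
        with self._lock:
            self._items.append(x)

    def get(self, i):             # read-only: no lock acquired
        return self._items[i]

    def __len__(self):
        return len(self._items)

l = LockedList()

def worker():
    for i in range(1000):
        l.append(i)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(l))                     # prints 4000
```

Since most container operations in typical Python programs are reads, keeping them lock-free is what keeps the measured overhead down to the ~1.5% cited above.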
Posted Oct 16, 2021 2:40 UTC (Sat)
by tialaramex (subscriber, #21167)
[Link] (2 responses)
1. Load the version counter from the collection
2. Load the "backing array" from the collection
3. Load the address of the item (from the "backing array")
4. Increment the reference count of the item, if it is non-zero (otherwise retry)
5. Verify that the item still exists at the same location in the collection (otherwise retry)
6. Verify that the version counter did not change (otherwise retry)
7. Return the address of the item
Suppose you've got a shared dict foo, and thread A is constantly fiddling with it: adding, removing, and changing entries. But your thread B just reads foo['myValue'], which you're sure thread A has no reason to change. Today, under the GIL, this seems fine, but the program is slow because the GIL forbids threads A and B from both getting work done. With the seven-step procedure, reading foo['myValue'] will often fail at step 6, because thread A has meanwhile incremented the version counter. I believe you would then take a lock on the container to avoid starvation, even though you're performing a "read-only" operation (you do actually write to the reference counter)?
Also, all this bother is to allow concurrent writes to a shared data structure you're reading, which seems like a bad idea?
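The versioned-read procedure can be sketched in pure Python (illustrative only: the real implementation works on dict internals and reference counts in C, and steps 4-5 are omitted here because Python manages reference counts itself):

```python
import threading

class VersionedDict:
    """Toy model of optimistic, retrying reads against a version counter."""

    def __init__(self):
        self._lock = threading.Lock()   # taken by writers only
        self._version = 0
        self._data = {}

    def set(self, key, value):
        with self._lock:
            self._data[key] = value
            self._version += 1          # every mutation bumps the version

    def get(self, key):
        while True:
            v1 = self._version          # step 1: load the version counter
            try:
                item = self._data[key]  # steps 2-3: load the item
            except KeyError:
                item = None
            if self._version == v1:     # step 6: version unchanged?
                return item             # step 7: the read was consistent
            # Another thread mutated the dict mid-read: retry.

d = VersionedDict()
d.set("myValue", 42)
print(d.get("myValue"))                 # prints 42
```

The starvation worry above is visible in the get() loop: a writer that bumps the version continuously can force a reader to retry indefinitely, which is why falling back to taking the lock after repeated failures seems necessary.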
Posted Oct 17, 2021 2:35 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link]
Well, it depends on how you look at it. If you're, say, an application developer, it's easy enough to say "that seems like a bad idea, let's not do it." And in that case (i.e. the case where there's no actual mutation of shared state), the performance penalty seems pretty minor to me (if I'm understanding the process you describe correctly, and in comparison to the existing CPython overhead, which is quite large to start with). On the other hand, if you're a language developer, you don't really have the luxury of saying "this seems like a bad idea, let's not support it." Somebody may have already deployed a real-world application which depends on this behavior, so you can't unilaterally break it unless you want another backcompat flag day a la Python 3. Nobody wants to do that again.
Posted Oct 19, 2021 11:12 UTC (Tue)
by njs (subscriber, #40338)
[Link]
In this case the extra reads are pretty cheap, because the relevant cache lines can be kept loaded in "shared" state (in the MESI sense), and the branches are predictable. OTOH if reads had to acquire a lock, that would trigger cache line bouncing and be *very* expensive.
Python modules and classes are mutable data structures that are almost always read rather than mutated. This pattern of heavy cross-thread read traffic to a mostly-immutable object is therefore ubiquitous in Python, so it makes sense to optimize for it.
Posted Oct 14, 2021 21:07 UTC (Thu)
by fw (subscriber, #26023)
[Link] (6 responses)
At least a few years ago, the GIL was not specific to sub-interpreters. People have actually patched glibc to be able to load libpython as many times as they like to get more independent interpreters, each with its own global data structures and locks. CPython is very different from (for example) the Lua reference implementation in this regard.
Posted Oct 14, 2021 21:36 UTC (Thu)
by atnot (subscriber, #124910)
[Link]
Posted Oct 14, 2021 23:56 UTC (Thu)
by ms-tg (subscriber, #89231)
[Link] (1 responses)
https://github.com/ruby/ruby/blob/master/doc/ractor.md
I wonder if there should be a more socially encouraged expectation of cross-pollination of ideas like this between the language communities?
Posted Oct 15, 2021 16:52 UTC (Fri)
by colesbury (subscriber, #137476)
[Link]
Nathaniel already wrote below about the similarity between Ractors and sub-interpreters, but I'll give a few more examples specific to this project of ideas (or code) taken from other communities:
- Biased reference counting (originally implemented for Swift)
- mimalloc (originally developed for Koka and Lean)
- The design of the internal locks is taken from WebKit (https://webkit.org/blog/6161/locking-in-webkit/)
- The collection thread-safety adapts some code from FreeBSD (https://github.com/colesbury/nogil/blob/nogil/Python/qsbr.c)
- The interpreter took ideas from LuaJIT and V8's ignition interpreter (the register-accumulator model from ignition, fast function calls and other perf ideas from LuaJIT)
- The stop-the-world implementation is influenced by Go's design (https://github.com/golang/go/blob/fad4a16fd43f6a72b6917ef...)
Posted Oct 15, 2021 1:18 UTC (Fri)
by njs (subscriber, #40338)
[Link] (2 responses)
I don't really see the point – subinterpreters aren't meaningfully more efficient than subprocesses, and they're way more complicated and fragile. But some people are more hopeful.
Posted Oct 15, 2021 1:35 UTC (Fri)
by njs (subscriber, #40338)
[Link] (1 responses)
For python code, the VM itself can take care of isolating subinterpreters from each other – for python code this is "free".
But for C code, each C extension has to manually implement subinterpreter isolation for its own internal state. So here, the subinterpreter model doesn't really simplify anything – in fact it makes it more complicated.
Posted Oct 15, 2021 3:19 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link]
As far as I can tell, the most likely problem for "well designed" C extensions (those without substantial global mutable state) is the PyGILState_* API, which doesn't currently support sub-interpreters (and, by my read, is unlikely to gain such support without an API break, because there is no plausible way for it to infer which PyInterpreterState* you want it to use). You mostly only need those functions if you're calling into the Python interpreter from foreign threads, i.e. threads that were not created by Python.

The simplest solution is probably for the Python developers to provide a version of those functions that accepts an explicit PyInterpreterState* argument; currently, for example, PyGILState_Ensure() just grabs "the" interpreter state out of a global variable.[1] This would still require some plumbing on the C extension side, of course, because extensions would need to actually pass those arguments. In the meantime, you can use PyThreadState_New() and friends to do the same job by hand, but that is a lower-level API and more cumbersome to work with (especially if you want to tackle the reentrancy issues that PyGILState_* is intended to solve for you).
[1]: https://github.com/python/cpython/blob/main/Python/pystat...