A viable solution for Python concurrency
Posted Oct 14, 2021 15:57 UTC (Thu) by chris_se (subscriber, #99706)Parent article: A viable solution for Python concurrency
For example, say you have an extension with a function foo() that takes a dict as a parameter. Most extension functions don't drop the GIL themselves, which means that while they are running, no other thread can alter the dict that was passed in. If the GIL disappears, however, the dict may be altered from another thread while the function is still accessing it. And even if all of the PyDict_* functions are made thread-safe individually, people typically call several PyDict functions in succession, and if the object changes between calls, C extensions can easily misbehave.
This type of problem will be even more prominent with the custom classes that extensions create, because no extension currently performs any locking on its own objects (why would it?). On the other hand, certain kinds of concurrent access to objects from different threads might even be desirable -- if the structure doesn't change, different threads should be able to access numpy arrays simultaneously, for example, so the CPython API should provide some means of supporting this.
I still think it is worthwhile to get rid of the GIL and make Python much more useful on current CPUs (where core counts have grown far more in recent years than single-core speeds have), and I welcome the efforts made here in that direction. But it is clear that extension compatibility will go beyond just recompiling: the Python project will have to issue clear guidance to extension developers on how to handle this properly (perhaps by locking individual objects -- though once multiple objects are involved, e.g. multiple parameters passed to a function, you have to start thinking about deadlocks).
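The check-then-act hazard described above can be illustrated at the Python level. This is a hypothetical sketch (my own, not from the comment): the "other thread" is simulated by an interleave callback injected between two dict operations that a C extension might perform back to back, a pair that is effectively atomic today because the GIL is held across both.

```python
def racy_lookup(d, key, interleave=None):
    """Check-then-act on a dict, with an optional simulated interleaving."""
    if key in d:                  # first operation: membership test
        if interleave is not None:
            interleave()          # point where another thread could run
        return d[key]             # second operation: may now raise KeyError
    return None

d = {"size": 3}
print(racy_lookup(d, "size"))     # no interference: prints 3

try:
    # Simulate a concurrent deletion between the two operations.
    racy_lookup(d, "size", interleave=lambda: d.pop("size"))
except KeyError:
    print("key vanished between the two operations")
```

With the GIL, the interleaving cannot happen inside a C function that never releases the lock; without it, every such call sequence becomes a potential race.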
Posted Oct 14, 2021 17:50 UTC (Thu)
by azumanga (subscriber, #90158)
[Link]
We fixed many of the problems by having objects owned by a single thread at a time, but then all code needs to know about transferring objects around, which is very painful for highly recursive objects.
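One common shape for that single-owner discipline is a handoff through a queue: only the thread that currently holds the reference may touch the object, and ownership moves by passing the reference along. This is a hedged sketch of the general pattern, not the commenter's actual code.

```python
import queue
import threading

def producer(q):
    obj = {"items": []}           # owned exclusively by the producer thread
    for i in range(3):
        obj["items"].append(i)
    q.put(obj)                    # ownership transferred; the producer must
                                  # not touch obj after this point

def consumer(q, results):
    obj = q.get()                 # the consumer now owns obj exclusively
    results.append(sum(obj["items"]))

q = queue.Queue()
results = []
t1 = threading.Thread(target=producer, args=(q,))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start()
t2.start()
t1.join()
t2.join()
print(results)                    # prints [3]
```

The pain point the comment alludes to is visible even here: the discipline is purely a convention, and for an object graph with internal references ("highly recursive objects") every reachable object must change hands together.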
Posted Oct 14, 2021 19:13 UTC (Thu)
by iabervon (subscriber, #722)
[Link]
Posted Oct 14, 2021 19:21 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (4 responses)
This is not limited to C extensions. Consider for example:

>>> def foo(l, x):
...     l += [x]
...
>>> dis.dis(foo)
  2           0 LOAD_FAST                0 (l)
              2 LOAD_FAST                1 (x)
              4 BUILD_LIST               1
              6 INPLACE_ADD
              8 STORE_FAST               0 (l)
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE

Assuming that l is a list, this function atomically appends x to it. We know that it must be atomic because INPLACE_ADD is a single bytecode instruction, the GIL must be held for the entire execution of that instruction, and lists are implemented in C, so the instruction does not create an additional stack frame of Python bytecode. The STORE_FAST is a red herring; it just rebinds a local variable in a scope that is immediately destroyed (so the interpreter could have optimized it out without altering the function's semantics).
The problem is that I don't see a reasonable way to preserve that atomicity.
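Under current CPython, that atomicity is easy to confirm empirically. This sketch (mine, not from the comment) hammers one shared list from several threads; because the in-place add is carried out by a single C-level list operation under the GIL, no appends are ever lost:

```python
import threading

def foo(l, x):
    l += [x]          # one INPLACE_ADD; executed atomically under the GIL

shared = []
N = 10_000

def worker():
    for i in range(N):
        foo(shared, i)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))    # prints 40000: no appends are lost
```

A GIL-free interpreter has to provide this guarantee some other way (in the proposal under discussion, by locking the list around the mutating operation).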
Posted Oct 15, 2021 16:25 UTC (Fri)
by colesbury (subscriber, #137476)
[Link] (3 responses)
The design uses "automatic fine-grained locking". The locks are acquired around mutating operations, but not read-only operations. I estimate that the locks add about 1.5% overhead. (See the sections "Collection thread-safety" and "Performance")
https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfz...
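The "lock mutations, leave reads lock-free" split can be approximated in pure Python. This is a simplified illustration of the idea, not the actual nogil code (which does this automatically inside the interpreter, in C):

```python
import threading

class LockedList:
    """Toy container: mutating operations take a per-object lock, while
    read-only operations acquire nothing. A sketch of the concept only;
    the real design also needs the optimistic-retry machinery to make the
    unlocked reads safe against concurrent resizes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def append(self, x):          # mutating: serialized by the per-object lock
        with self._lock:
            self._items.append(x)

    def get(self, i):             # read-only: no lock acquired
        return self._items[i]

    def __len__(self):
        return len(self._items)

l = LockedList()

def worker():
    for i in range(1000):
        l.append(i)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(l))                     # prints 4000
```

Since most container operations in typical Python programs are reads, keeping them lock-free is what keeps the measured overhead down to the ~1.5% cited above.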
Posted Oct 16, 2021 2:40 UTC (Sat)
by tialaramex (subscriber, #21167)
[Link] (2 responses)
1. Load the version counter from the collection
2. Load the "backing array" from the collection
3. Load the address of the item (from the "backing array")
4. Increment the reference count of the item, if it is non-zero (otherwise retry)
5. Verify that the item still exists at the same location in the collection (otherwise retry)
6. Verify that the version counter did not change (otherwise retry)
7. Return the address of the item
Suppose you've got a shared dict foo, and thread A is constantly fiddling with it: adding, removing, and changing entries. But your thread B just reads foo['myValue'], which you're sure thread A has no reason to change. Today, under the GIL, this seems fine, but the program is slow because the GIL forbids threads A and B from both getting work done. With the seven-step procedure, reading foo['myValue'] will often fail at step 6, because thread A has meanwhile incremented the version counter. I believe you would then take a lock on the container to avoid starvation, even though you're performing a "read-only" operation (you do actually write to the reference counter)?
Also, all this bother is to allow concurrent writes to a shared data structure you're reading, which seems like a bad idea?
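The versioned-read procedure can be sketched in pure Python (illustrative only: the real implementation works on dict internals and reference counts in C, and steps 4-5 are omitted here because Python manages reference counts itself):

```python
import threading

class VersionedDict:
    """Toy model of optimistic, retrying reads against a version counter."""

    def __init__(self):
        self._lock = threading.Lock()   # taken by writers only
        self._version = 0
        self._data = {}

    def set(self, key, value):
        with self._lock:
            self._data[key] = value
            self._version += 1          # every mutation bumps the version

    def get(self, key):
        while True:
            v1 = self._version          # step 1: load the version counter
            try:
                item = self._data[key]  # steps 2-3: load the item
            except KeyError:
                item = None
            if self._version == v1:     # step 6: version unchanged?
                return item             # step 7: the read was consistent
            # Another thread mutated the dict mid-read: retry.

d = VersionedDict()
d.set("myValue", 42)
print(d.get("myValue"))                 # prints 42
```

The starvation worry above is visible in the get() loop: a writer that bumps the version continuously can force a reader to retry indefinitely, which is why falling back to taking the lock after repeated failures seems necessary.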
Posted Oct 17, 2021 2:35 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link]
Well, it depends on how you look at it. If you're, say, an application developer, it's easy enough to say "that seems like a bad idea, let's not do it." And in that case (i.e. the case where there's no actual mutation of shared state), the performance penalty seems pretty minor to me (if I'm understanding the process you describe correctly, and in comparison to the existing CPython overhead, which is quite large to start with). On the other hand, if you're a language developer, you don't really have the luxury of saying "this seems like a bad idea, let's not support it." Somebody may have already deployed a real-world application which depends on this behavior, so you can't unilaterally break it unless you want another backcompat flag day a la Python 3. Nobody wants to do that again.
Posted Oct 19, 2021 11:12 UTC (Tue)
by njs (subscriber, #40338)
[Link]
In this case the extra reads are pretty cheap, because the relevant cache lines can be kept loaded in "shared" state (in the MESI sense), and the branches are predictable. OTOH if reads had to acquire a lock, that would trigger cache line bouncing and be *very* expensive.
Python modules and classes are mutable data structures that are almost always read rather than mutated. This pattern of heavy cross-thread read traffic to a mostly-immutable object is therefore ubiquitous in Python, so it makes sense to optimize for it.
Posted Oct 14, 2021 21:07 UTC (Thu)
by fw (subscriber, #26023)
[Link] (6 responses)
At least a few years ago, the GIL was not specific to sub-interpreters. People have actually patched glibc to be able to load libpython as many times as they like to get more independent interpreters, each with its own global data structures and locks. CPython is very different from (for example) the Lua reference implementation in this regard.
Posted Oct 14, 2021 21:36 UTC (Thu)
by atnot (subscriber, #124910)
[Link]
Posted Oct 14, 2021 23:56 UTC (Thu)
by ms-tg (subscriber, #89231)
[Link] (1 responses)
https://github.com/ruby/ruby/blob/master/doc/ractor.md
I wonder if there should be a more socially encouraged expectation of cross-pollination of ideas like this between the language communities?
Posted Oct 15, 2021 16:52 UTC (Fri)
by colesbury (subscriber, #137476)
[Link]
Nathaniel already wrote below about the similarity between Ractors and sub-interpreters, but I'll give a few more examples specific to this project of ideas (or code) taken from other communities:
- Biased reference counting (originally implemented for Swift)
- mimalloc (originally developed for Koka and Lean)
- The design of the internal locks is taken from WebKit (https://webkit.org/blog/6161/locking-in-webkit/)
- The collection thread-safety adapts some code from FreeBSD (https://github.com/colesbury/nogil/blob/nogil/Python/qsbr.c)
- The interpreter took ideas from LuaJIT and V8's ignition interpreter (the register-accumulator model from ignition, fast function calls and other perf ideas from LuaJIT)
- The stop-the-world implementation is influenced by Go's design (https://github.com/golang/go/blob/fad4a16fd43f6a72b6917ef...)
Posted Oct 15, 2021 1:18 UTC (Fri)
by njs (subscriber, #40338)
[Link] (2 responses)
I don't really see the point – subinterpreters aren't meaningfully more efficient than subprocesses, and they're way more complicated and fragile. But some people are more hopeful.
Posted Oct 15, 2021 1:35 UTC (Fri)
by njs (subscriber, #40338)
[Link] (1 responses)
For python code, the VM itself can take care of isolating subinterpreters from each other – for python code this is "free".
But for C code, each C extension has to manually implement subinterpreter isolation for its own internal state. So here, the subinterpreter model doesn't really simplify anything – in fact it makes it more complicated.
Posted Oct 15, 2021 3:19 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link]
As far as I can tell, the most likely problem for "well designed" C extensions (those without substantial global mutable state) is the PyGILState_* API, which doesn't currently support sub-interpreters (and, by my read, is unlikely to gain such support without an API break, because there is no plausible way for it to infer which PyInterpreterState* you want it to use). You mostly only need those functions if you're calling into the Python interpreter from foreign threads, i.e. threads that were not created by Python.

The simplest solution is probably for the Python developers to provide a version of those functions that accepts an explicit PyInterpreterState* argument; currently, for example, PyGILState_Ensure() just grabs "the" interpreter state out of a global variable.[1] This would still require some plumbing on the C extension side, of course, because extensions would need to actually pass those arguments. In the meantime, you can use PyThreadState_New() and friends to do the same job by hand, but that is a lower-level API and more cumbersome to work with (especially if you want to tackle the reentrancy issues that PyGILState_* is intended to solve for you).
[1]: https://github.com/python/cpython/blob/main/Python/pystat...