A viable solution for Python concurrency
Posted Oct 14, 2021 19:21 UTC (Thu) by NYKevin (subscriber, #129325)
In reply to: A viable solution for Python concurrency by chris_se
Parent article: A viable solution for Python concurrency
This is not limited to C extensions. Consider for example:
>>> def foo(l, x):
...     l += [x]
...
>>> dis.dis(foo)
  2           0 LOAD_FAST                0 (l)
              2 LOAD_FAST                1 (x)
              4 BUILD_LIST               1
              6 INPLACE_ADD
              8 STORE_FAST               0 (l)
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE
Assuming that l is a list, this function atomically appends x to it. We know that it must be atomic, because INPLACE_ADD is a single bytecode instruction, the GIL must be held for the entire execution of that instruction, and lists are implemented in C, so this instruction does not create an additional stack frame of Python bytecode. The STORE_FAST is a red herring; it just rebinds a local variable in a scope which we immediately destroy (so the interpreter could have optimized it out without altering the semantics of the function).
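As a minimal sketch (my own illustration, not from any real codebase) of the kind of code that quietly depends on that atomicity, consider hammering the in-place append from several threads; under the GIL the final assert holds every time:

import threading

def foo(l, x):
    l += [x]    # the single INPLACE_ADD discussed above

shared = []

def worker(n):
    for i in range(10_000):
        foo(shared, (n, i))

threads = [threading.Thread(target=worker, args=(t,)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Under the GIL no appends are lost; without that atomicity,
# concurrent appends could race and drop elements.
assert len(shared) == 8 * 10_000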
The problem is that I don't see a reasonable way to preserve that atomicity:
- Automatic fine-grained locking would slow down the single-threaded case (see the sketch after this list).
- Automatic smart fine-grained locking, where the list is only locked if it's shared by multiple threads, is less likely to help than you might expect. Python has a lot of shared mutable dictionaries in its internal implementation (basically, every scope is a dict or dict-like thing), and they are subject to the same problem. So you still end up with a lot of unnecessary (or "probably unnecessary") locking in the multi-threaded case.
- You could do some kind of lock-free data structure, but this is a general problem affecting dicts, lists, sets, and probably other data structures as well. It would require a lot of work and complexity, and I'm not even sure if every instance of this problem is solvable that way.
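To illustrate the first option: "automatic fine-grained locking" effectively means the interpreter would wrap every mutation in a per-object lock. Here is a hand-written Python model of that idea (LockedList is invented for illustration; the real change would live in C, applied to every list whether or not the program uses threads):

import threading

class LockedList:
    # Invented model: a list whose mutations take a per-object lock.
    # The acquire/release on every += is exactly the overhead that
    # purely single-threaded programs would also pay.
    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def __iadd__(self, other):
        with self._lock:
            self._items.extend(other)
        return self

    def __len__(self):
        return len(self._items)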
A viable solution for Python concurrency
Posted Oct 15, 2021 16:25 UTC (Fri) by colesbury (subscriber, #137476)
The design uses "automatic fine-grained locking". The locks are acquired around mutating operations, but not read-only operations. I estimate that the locks add about 1.5% overhead. (See the sections "Collection thread-safety" and "Performance")
https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfz...
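As a rough Python-level sketch of that policy (illustration only; in the actual design the lock lives in C inside each collection, and reads go through the optimistic protocol quoted in the next comment), mutations take the object's lock while reads deliberately do not:

import threading

class SketchDict:
    # Invented model: mutating operations take the per-object lock;
    # read-only operations do not.
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def __setitem__(self, key, value):      # mutating: locked
        with self._lock:
            self._data[key] = value

    def __delitem__(self, key):             # mutating: locked
        with self._lock:
            del self._data[key]

    def __getitem__(self, key):             # read-only: no lock taken
        return self._data[key]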
A viable solution for Python concurrency
Posted Oct 16, 2021 2:40 UTC (Sat) by tialaramex (subscriber, #21167)
The design document describes the following procedure for reading an item from a collection without taking its lock:

1. Load the version counter from the collection
2. Load the “backing array” from the collection
3. Load the address of the item (from the “backing array”)
4. Increment the reference count of the item, if it is non-zero (otherwise retry)
5. Verify that the item still exists at the same location in the collection (otherwise retry)
6. Verify that the version counter did not change (otherwise retry)
7. Return the address of the item
Suppose you've got a shared dict foo, and thread A is constantly fiddling with it: adding stuff, removing stuff, changing stuff. But your thread B just reads foo['myValue'], which you're sure thread A has no reason to change. Today under the GIL this seems fine, but the program is slow because the GIL forbids threads A and B from both getting work done. With the seven-step procedure, reading foo['myValue'] will often fail at step 6 because thread A has meanwhile incremented the version counter. I believe you would then take a lock on the container, to avoid starvation, even though yours is a "read-only" operation (it actually writes to the item's reference count)?
Also, all this bother is to allow for concurrent writes to a shared data structure you're reading, which seems like a bad idea?
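For reference, here is that retry loop modeled very loosely in Python (all names here are invented for illustration; the real implementation manipulates C-level reference counts and a resizable backing array, not these fields):

import threading

class Item:
    # Invented stand-in for an object with a reference count.
    def __init__(self, value):
        self.value = value
        self.refcount = 1
        self._lock = threading.Lock()

    def try_incref(self):
        with self._lock:        # models an atomic increment-if-nonzero
            if self.refcount == 0:
                return False
            self.refcount += 1
            return True

    def decref(self):
        with self._lock:
            self.refcount -= 1

class Collection:
    # Invented stand-in: a version counter plus a backing array.
    def __init__(self, items):
        self.version = 0
        self.backing = list(items)

def optimistic_read(collection, index):
    while True:
        version = collection.version                  # step 1
        backing = collection.backing                  # step 2
        item = backing[index]                         # step 3
        if item is None or not item.try_incref():     # step 4: retry on dead item
            continue
        if collection.backing[index] is not item:     # step 5
            item.decref()
            continue
        if collection.version != version:             # step 6: a writer interfered
            item.decref()
            continue
        return item                                   # step 7

c = Collection([Item("a"), Item("b")])
print(optimistic_read(c, 1).value)                    # prints "b"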
A viable solution for Python concurrency
Posted Oct 17, 2021 2:35 UTC (Sun) by NYKevin (subscriber, #129325)
Well, it depends on how you look at it. If you're, say, an application developer, it's easy enough to say "that seems like a bad idea, let's not do it." And in that case (i.e. the case where there's no actual mutation of shared state), the performance penalty seems pretty minor to me (if I'm understanding the process you describe correctly, and in comparison to the existing CPython overhead, which is quite large to start with). On the other hand, if you're a language developer, you don't really have the luxury of saying "this seems like a bad idea, let's not support it." Somebody may have already deployed a real-world application which depends on this behavior, so you can't unilaterally break it unless you want another backcompat flag day a la Python 3. Nobody wants to do that again.
A viable solution for Python concurrency
Posted Oct 19, 2021 11:12 UTC (Tue) by njs (subscriber, #40338)
In this case the extra reads are pretty cheap, because the relevant cache lines can be kept loaded in "shared" state (in the MESI sense), and the branches are predictable. OTOH if reads had to acquire a lock, that would trigger cache line bouncing and be *very* expensive.
Python modules and classes are mutable data structures that are almost always read rather than mutated. This pattern of heavy cross-thread read traffic to a mostly-immutable object is therefore ubiquitous in Python, so it makes sense to optimize for it.
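To see how pervasive those reads are, disassemble any ordinary function (a sketch; exact opcode names vary across CPython versions): every global and attribute lookup is a read of a shared module or class dict.

import dis
import math

def hypot(x, y):
    return math.sqrt(x * x + y * y)

# The disassembly shows a LOAD_GLOBAL of "math" and an attribute load
# of "sqrt": both are reads of shared, mutable dicts (the function's
# module namespace and math's namespace), performed on every call.
dis.dis(hypot)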