CPython without a global interpreter lock
The global interpreter lock (GIL) has been a part of CPython since the beginning—nearly—but that seems likely to change over the next five or so years. As we described last week, the Python steering council has announced its intention to start moving toward a no-GIL CPython, potentially as soon as Python 3.13 in October 2024 for the preliminaries. The no-GIL version of CPython comes from Sam Gross, who introduced it as a proof-of-concept nearly two years ago; now, the idea has been formalized in a Python Enhancement Proposal (PEP) that describes no-GIL mode and how it interacts with the rest of the Python ecosystem.
The PEP
PEP 703 ("Making the Global Interpreter Lock Optional in CPython") was posted back in January, then revised in May. It proposes creating a second build of CPython using a new build-configuration switch (‑‑disable-gil) that would "run Python code without the global interpreter lock and with the necessary changes needed to make the interpreter thread-safe".
The GIL is a bottleneck for multi-threaded Python programs because it prevents more than one thread from executing Python at any given time. The PEP is not definitive about the eventual end state for the no-GIL build, but the steering council made it clear that its intent is to eventually have only a single build of CPython—either without the GIL if no-GIL works out, or rolling back to the with-GIL version if it does not.
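That serializing effect is easy to sketch: the threads below compute the right answer, but under the GIL only one of them executes Python bytecode at any instant, so CPU-bound work like this gains no parallel speedup (a minimal illustration, not a benchmark).

```python
import threading

# Sum four ranges in four threads; under the GIL the threads take
# turns executing bytecode, so CPU-bound work gains no parallelism.
results = [0] * 4

def worker(i):
    results[i] = sum(range(i * 1000, (i + 1) * 1000))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(results)
assert total == sum(range(4000))
```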
Gross obviously recognized that acceptance of the PEP might be something of a struggle; one of the ways he dealt with that was by giving PEP 703 one of the more extensive "Motivation" sections ever seen. It looks at multiple different Python use cases (AI, numerical libraries, GPU-heavy workloads) and gets quotes from Python developers and maintainers about the problems they have encountered because of the GIL—and the lengths they have had to go to in order to work around the GIL. One project has already switched to using Gross's experimental no-GIL fork in order to avoid communication bottlenecks in its data-acquisition system.
The core of the PEP is the "Specification" section, which describes the changes needed to CPython for no-GIL operation. It mentions two ways to control the no-GIL operation of the ‑‑disable-gil build. First, if the Py_mod_gil slot of an extension module that is being loaded is not set to Py_mod_gil_not_used (or is not present at all), the interpreter will pause any current threads and re-enable the GIL before resuming them. It will issue a warning when that happens, so that users are notified that their code has loaded a module that is not compatible with no-GIL operation.
But, of course, there may well be extensions that have not been updated to use the Py_mod_gil slot (these slots came from PEP 489 ("Multi-phase extension module initialization")) even though they would work fine without the GIL. The PYTHONGIL environment variable can be used to override the slot check; if it is set to zero, the GIL will be disabled, while a value of one forces the GIL to be enabled. That will allow testing modules that may work fine without a GIL, but there are other reasons the override is useful:
The PYTHONGIL=0 override is important because extensions that are not thread-safe can still be useful in multi-threaded applications. For example, one may want to use the extension from only a single thread or guard access by locks. For context, there are already some extensions that [are] not thread-safe even with the GIL, and users already have to take these sorts of steps.
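As an aside for readers on newer interpreters: CPython eventually grew a way to query the GIL state at run time (sys._is_gil_enabled(), added in Python 3.13, after this article was written). The sketch below treats that function's presence as an assumption and falls back gracefully on older versions.

```python
import sys

# sys._is_gil_enabled() exists only on CPython 3.13+; on older
# interpreters the GIL is always present, so default to True.
check = getattr(sys, "_is_gil_enabled", None)
gil_enabled = check() if check is not None else True
print("GIL enabled:", gil_enabled)
```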
Garbage collection
Most of the changes to CPython for PEP 703 relate to memory management—garbage collection, in particular. The techniques used have not changed all that much since the initial posting of the no-GIL project (and our article describing it); we will review some of that here, but this article will mostly focus on other details in the proposal.
Python's garbage-collection mechanism relies on reference counts (for the most part), but the maintenance of those counts is currently protected by the GIL, so no other locking is required. Multiple, concurrent accesses to these counts are a recipe for bugs and crashes, so those counts need to be protected some other way in the absence of the GIL. Operations on reference counts are ubiquitous in the interpreter, though, so adding atomic locking to each operation would be a performance nightmare.
The PEP proposes three techniques to make the reference counts thread-safe in a performant manner. The first is to use biased reference counting, "which is a thread-safe reference counting technique with lower execution overhead than plain atomic reference counting". It uses the fact that most objects are generally not actually shared between threads, so the owning thread uses its own reference count that it maintains without any locking. Any other threads have to use atomic operations on a separate shared reference count; those two counts are then used to determine when the object can be freed.
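The scheme can be modeled with a toy Python class (purely an illustration of the bookkeeping; in CPython the owner's count is a plain C field and the shared count uses atomic instructions rather than a lock):

```python
import threading

class BiasedObject:
    """Toy model of biased reference counting; illustration only, not
    CPython's actual C-level scheme."""
    def __init__(self):
        self.owner = threading.get_ident()
        self.local_count = 1           # owner's unsynchronized count
        self.shared_count = 0          # other threads' count
        self._lock = threading.Lock()  # stands in for atomic ops

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local_count += 1      # fast path: no synchronization
        else:
            with self._lock:
                self.shared_count += 1

    def decref(self):
        if threading.get_ident() == self.owner:
            self.local_count -= 1
        else:
            with self._lock:
                self.shared_count -= 1

    def is_dead(self):
        # Both counts together decide when the object can be freed.
        with self._lock:
            return self.local_count + self.shared_count == 0

obj = BiasedObject()
t = threading.Thread(target=obj.incref)  # a non-owner takes a reference
t.start(); t.join()
obj.decref()                             # owner drops its reference
assert not obj.is_dead()                 # shared count still holds it
t2 = threading.Thread(target=obj.decref)
t2.start(); t2.join()
assert obj.is_dead()
```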
Some Python objects, such as small integers, interned strings, True, False, and None, are immortal—they live as long as the interpreter does—so they do not need to participate in the reference-counting dance. These are marked as immortal objects, though the scheme used is slightly different from that in PEP 683 ("Immortal Objects, Using a Fixed Refcount"), which was accepted for Python 3.12. Because the no-GIL interpreter uses biased reference counts, it cannot use the same representation for immortal objects as in PEP 683. In any case, incrementing or decrementing the reference count of an immortal object is a no-op.
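The singletons involved are easy to observe from Python; the exact reference count reported for an immortal object is an implementation detail, so the sketch below only prints it:

```python
import sys

# None, True, False, and small integers are interpreter-wide
# singletons in CPython; every occurrence is the same object.
a, b = 256, 256
assert a is b                # small ints are cached singletons
assert (1 == 1) is True      # True itself is a singleton
# The count reported for an immortal object is an artificial,
# fixed sentinel value (PEP 683, Python 3.12+).
print(sys.getrefcount(None))
```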
While most objects get freed when their reference count drops to zero, there are some objects that have reference cycles that prevent the count from reaching zero. These are currently detected and freed during a garbage-collection pass that is protected by the GIL. PEP 703 proposes two "stop-the-world" passes that would pause all threads, first to identify the objects to be freed, and second to identify any that are left after the finalizers from the first round have completed.
Those two phases will also handle the third mechanism (after biased reference counts and immortal objects) that is being added: deferred reference counting. Some objects are generally long-lived, but not immortal, such as modules, top-level functions, and code objects; those objects are commonly accessed by multiple threads as well. Instead of performing an expensive atomic reference-count operation for them, they would instead be marked for deferred reference counting. When those objects are pushed and popped from the interpreter's stack, no reference-count operations will be performed, so the true state of references to those objects can only be calculated during a stop-the-world garbage-collection phase. In practice, it is not a lot different than how they are handled today:
Note that the objects that use deferred reference counting already naturally form reference cycles in CPython, so they would typically be deallocated by the garbage collector even without deferred reference counting. For example, top-level functions and modules form a reference cycle as do methods and type objects.
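Such a cycle can be built and reclaimed directly from Python: the reference counts never reach zero on their own, and only a collector pass frees the objects.

```python
import gc
import weakref

# A reference cycle keeps the counts above zero, so only the cyclic
# garbage collector can reclaim it.
class Node:
    pass

a, b = Node(), Node()
a.partner, b.partner = b, a      # cycle: a -> b -> a
probe = weakref.ref(a)

del a, b                         # counts stay nonzero due to the cycle
gc.collect()                     # the collector detects and frees it
assert probe() is None
```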
Allocation and locking
The pymalloc memory allocator, which is not thread-safe without the GIL, has been replaced with mimalloc, which has been modified somewhat to support the CPython use case. The mimalloc internal data structures can be used to replace the existing linked list that allows the garbage collector to find all of the Python objects that have been allocated. Mimalloc has also been modified to support something similar to read-copy update (RCU) that allows locks to be avoided when retrieving items from dict and list objects:
A few operations on dict and list optimistically avoid acquiring the per-object locks. They have a fast path operation that does not acquire locks, but may fall back to a slower operation that acquires the dictionary's or list's lock when another thread is concurrently modifying that container.[...] There are two motivations for avoiding lock acquisitions in these functions. The primary reason is that it is necessary for scalable multi-threaded performance even for simple applications. Dictionaries hold top-level functions in modules and methods for classes. These dictionaries are inherently highly shared by many threads in multi-threaded programs. Contention on these locks in multi-threaded programs for loading methods and functions would inhibit efficient scaling in many basic programs.
The secondary motivation for avoiding locking is to reduce overhead and improve single-threaded performance. Although lock acquisition has low overhead compared to most operations, accessing individual elements of lists and dictionaries are fast operations (so the locking overhead is comparatively larger) and frequent (so the overhead has more impact).
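The flavor of that optimistic fast path can be sketched in pure Python with a version counter standing in for the C-level machinery (a hypothetical illustration; the PEP's real scheme relies on mimalloc's memory-management guarantees rather than a counter like this):

```python
import threading

class OptimisticDict:
    """Sketch of a lock-avoiding read path; illustration only. Writers
    bump a version counter under the lock; readers take a lock-free
    fast path and fall back to the lock if a writer interfered."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()
        self._version = 0

    def set(self, key, value):
        with self._lock:
            self._data[key] = value
            self._version += 1       # signal concurrent readers

    def get(self, key):
        before = self._version       # fast path: no lock taken
        value = self._data.get(key)
        if self._version == before:  # no writer interfered
            return value
        with self._lock:             # slow path: retry under the lock
            return self._data.get(key)

d = OptimisticDict()
d.set("x", 1)
assert d.get("x") == 1
```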
In general, Python containers (dict, list, etc.) are protected from concurrent modification by the GIL, though there are some operations that even the GIL does not fully protect, as described in the PEP. For most container operations, the no-GIL interpreter uses per-object locking, which "aims for similar protections as the GIL", though, as mentioned above, read operations avoid locking at all if they can.
But per-object locking can lead to deadlocks:
Straightforward per-object locking could introduce deadlocks that were not present when running with the GIL. Threads may hold locks for multiple objects simultaneously because Python operations can nest. Operations on objects can invoke operations on other objects, acquiring multiple per-object locks. If threads try to acquire the same locks in different orders, they will deadlock.
Those deadlocks can be avoided using "Python critical sections". The idea is that a lock can only be held while an operation is being performed; if there is a nested operation, the lock is "suspended" by being released until the nested operation completes, when it must be reacquired. That suspension must also be done around blocking operations, such as I/O. As an optimization, the suspension is only done if the thread would block. "This reduces the number of lock acquisitions and releases for nested operations, while avoiding deadlocks."
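A rough Python model of the suspension idea is shown below; the CriticalSection helper is hypothetical (CPython implements this in C, and only suspends when the thread would actually block, whereas this sketch always suspends on nesting):

```python
import threading

class CriticalSection:
    """Hypothetical sketch of PEP 703-style critical sections, not the
    real CPython API: each thread keeps a stack of held sections, and
    entering a nested section suspends (releases) the enclosing one,
    which is reacquired when the nested section exits."""
    _stack = threading.local()

    def __init__(self, lock):
        self.lock = lock

    def __enter__(self):
        stack = getattr(self._stack, "held", None)
        if stack is None:
            stack = self._stack.held = []
        if stack:                       # suspend the enclosing section
            stack[-1].lock.release()
        self.lock.acquire()
        stack.append(self)
        return self

    def __exit__(self, *exc):
        stack = self._stack.held
        stack.pop()
        self.lock.release()
        if stack:                       # resume the enclosing section
            stack[-1].lock.acquire()

a, b = threading.Lock(), threading.Lock()
with CriticalSection(a):
    with CriticalSection(b):
        # inside the nested section, lock `a` has been suspended
        assert b.locked() and not a.locked()
    assert a.locked()                   # reacquired on nested exit
```

Because no thread ever holds two of these locks at once, two threads nesting operations on the same objects in opposite orders cannot deadlock.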
Backward compatibility
As the PEP notes, the vast majority of compatibility concerns with the existing CPython ecosystem are related to the C API. To start with, extensions built for today's CPython will not be ABI compatible with no-GIL, so they will require a recompile at minimum. The bigger problem is that the existence of the GIL has masked concurrency problems that exist in the C code of many extensions.
Even extension developers who wanted to develop thread-safe extensions had no real way to test them until no-GIL came along. By the sounds of it, testing extensions with no-GIL is ongoing, especially for the larger, active extensions that have been chafing under the constraints of the GIL for many years. There is a long tail of extensions, however; not breaking those with no-GIL is important, thus the Py_mod_gil slot test. Beyond that, Gross plans to write a compatibility HOWTO that should help the process along.
As noted in last week's article, the lack of a GIL has some negative effects on the ongoing Faster CPython work, which is described in PEP 659 ("Specializing Adaptive Interpreter"). The no-GIL PEP mentions some of those problems and points to a specific specialization problem as an open issue for the no-GIL interpreter. For now, it looks like those problems are seen as challenges by the Faster CPython team, who are looking to work with Gross and others toward a no-GIL interpreter without sacrificing too much single-threaded performance.
Single-threaded performance is another area that the (quite comprehensive) PEP 703 touches on. Since the vast majority of Python code is single-threaded now, which is something that will only start to change slowly once no-GIL gets going, it is imperative that measures be taken to ensure that the performance of those programs does not regress. As Faster CPython developer Mark Shannon said, research will need to be done on "the optimizations necessary to make Python fast in a free-threading environment", but he and other members of the team seem up for the task.
While the numbers are somewhat disputed, PEP 703 gives a performance cost of 5-8% for the no-GIL changes relative to the in-progress Python 3.12. Those numbers are strictly the cost of the changes for no-GIL and do not reflect the gains that will come for multi-threaded Python programs that are not restricted by the GIL.
In conclusion
Even this fairly lengthy look only scratches the surface of the full contents of the PEP; it is well worth a read for those who are interested. One important thing to keep in mind, though, is that the steering council made it quite clear that the process will play out rather deliberately—slowly—over five years or more. There will be lots of opportunities to test and help fix no-GIL Python over that time frame, as well as to work on making extensions thread-safe without the GIL. To a large extent, the success of the no-GIL project is going to depend, at least in part, on the Python community—not just the core developers and the teams from various companies—pulling together to help make it succeed. It will be interesting to see (and report on) how it all goes.
| Index entries for this article | |
|---|---|
| Python | CPython |
| Python | Global interpreter lock (GIL) |
| Python | Python Enhancement Proposals (PEP)/PEP 703 |
Posted Aug 9, 2023 23:10 UTC (Wed)
by andresfreund (subscriber, #69562)
[Link] (6 responses)
> This proposal relies on a scheme that mostly avoids acquiring locks when accessing individual elements in lists and dictionaries. Note that this is not “lock free” in the sense of “lock-free” and “wait-free” algorithms that guarantee forward progress. It simply avoids acquiring locks (mutexes) in the common case to improve parallelism and reduce overhead.
I do wish the terminology around this wasn't overloaded...
Posted Aug 10, 2023 12:16 UTC (Thu)
by tialaramex (subscriber, #21167)
[Link] (5 responses)
Deferred reclamation schemes like RCU are expensive but deliver wait-freedom for readers (and typically lock-freedom for a writer). We lose peak performance but gain certainty, every one of our readers will make forward progress.
This approach, whatever it is, doesn't deliver that. It is optimistic, and when its optimism is well founded it'll go fast, when it isn't it may stall [but hopefully not deadlock?]. I think the PEP probably shouldn't mention RCU at all, or it should be in a footnote for anybody who thought "Oh, like RCU?" to clarify that you don't get RCU's guarantees. I can see why it was on the author's mind but I don't think mentioning it helps understanding.
Reading the PEP did clarify that the plan really is as cavalier as I feared, that environment variable is a YOLO feature and it's going to be left in plain sight like a tempting red button at toddler height. I don't expect that to go well.
Posted Aug 10, 2023 15:18 UTC (Thu)
by iabervon (subscriber, #722)
[Link]
FWIW, I think starting with schemes that are optimistic about there not being contention is the right way to go, so long as the schemes can be reworked without changing the C extension ABI. Since existing programs are generally single-threaded, there just can't be any contention in the current common case. Once there are actual programs that could perform better if contention was handled more efficiently, it'll make sense to spend peak performance on it (possibly only if the program ever spawns threads or uses a flag to get that mode without a stop-the-world event when first spawning a thread).
Posted Aug 10, 2023 16:59 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
I know I previously said that I hoped they would provide stronger guarantees than this, but honestly, I'm having a hard time getting excited about it. If you go fiddling around with the interpreter's settings and cause your codebase to break, well, you get to keep both pieces. That's maybe not the safest way of doing things, but it's a valid perspective IMHO. This is more or less the same reasoning that leads Python to have no real encapsulation, for example.
Posted Aug 13, 2023 11:01 UTC (Sun)
by tialaramex (subscriber, #21167)
[Link] (2 responses)
But this sort of work is happening because people are trying to write Python at scale and having problems. Jim writes tricky code that's fine assuming the GIL, Sarah uses Jim's code in a sub-routine for her team's new project. Mark uses that sub-routine with Sarah's permission from the new Project Foo, and the Project Foo lead Andy sets the environment variable because it's a "known workaround" for a problem they have, but now about one run in twenty has corrupt results and nobody knows why.
Whose fault is the corruption? Jim? Sarah? Mark? Andy? It is likely that there's no intersection between the people who would have known this won't work and the people who did it anyway.
Maybe it's fine, but I wouldn't want to be the Python community finding that out the hard way.
Posted Aug 15, 2023 5:41 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
If Jim's code is unmaintained, then it's... well, not necessarily Andy's *fault*, but it is Andy's *problem* because it's his project. Such is an occupational hazard of relying on unmaintained code, in any language, but especially in Python. Python has had a very well-established practice of slowly, carefully deprecating and removing old APIs, both before and after the 2-to-3 transition. 2-to-3 was not an anomaly because it broke backcompat, it was an anomaly because it did it abruptly in a single release, and on a much larger scale than has otherwise been typical. This problem sounds much more like a classic Python deprecation than a "break the world" flag day.
Posted Aug 16, 2023 16:57 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link]
My guess is that if it became normal to write environment variable checks and block execution, the response from CPython would be to change the name of the environment variable. This is an intentional foot gun, not that they'd accept that description, de-fanging it would be contrary to their purpose in offering it.
Posted Aug 10, 2023 2:29 UTC (Thu)
by Paf (subscriber, #91811)
[Link]
This is really remarkable work. Congratulations to those doing it.
Posted Aug 11, 2023 4:00 UTC (Fri)
by alkbyby (subscriber, #61687)
[Link] (6 responses)
Posted Aug 11, 2023 13:19 UTC (Fri)
by mb (subscriber, #50428)
[Link]
Even if you made these reference-counting interfaces no-ops, there are certainly many C modules expecting proper counting for resource freeing.
Posted Aug 11, 2023 15:30 UTC (Fri)
by Karellen (subscriber, #67644)
[Link] (1 responses)
Posted Aug 11, 2023 16:10 UTC (Fri)
by alkbyby (subscriber, #61687)
[Link]
So, IMHO refcounting was and is wrong choice (for python-like use cases; e.g. kernel-space is a different matter).
(Not so) fun fact: gcc's libstdc++ still has this horrible "optimization" where the shared_ptr code detects at runtime whether multi-threading is active, inlined at nearly every place a shared_ptr is copied. And the compiler adds this run-time check, plus both the single-threaded and the multi-threaded "atomic-ful" refcounting code, everywhere! Quick godbolt proof: https://godbolt.org/z/o9b9hxfMP
Of course this is me cherry-picking one annoying arguably performance bug in gcc. But imho it adds nicely to the topic of: "no, refcounting is not anywhere as straightforward as people tend to think".
Posted Aug 11, 2023 16:49 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Aug 11, 2023 22:04 UTC (Fri)
by khim (subscriber, #9252)
[Link] (1 responses)
Nope. It wouldn't. Or, rather: it may work on your system, but if you were to give that code to someone else then pretty soon your tracker would be overflowing with messages about how nothing works. Because AV software and “security” software (like this abomination) would keep your file around for a few seconds to “investigate” it. The only way to deal with it is to catch OSError and repeat the operation after some time (with exponential back-off). And that saves you when GC is used, too.
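That catch-and-retry approach can be sketched as follows (the helper name, attempt count, and delays are illustrative, not from any library):

```python
import os
import tempfile
import time

def unlink_with_retry(path, attempts=5, delay=0.05):
    """Retry os.unlink() with exponential back-off, for systems where
    a scanner may briefly hold the file; illustrative sketch only."""
    for attempt in range(attempts):
        try:
            os.unlink(path)
            return
        except OSError:
            if attempt == attempts - 1:
                raise          # give up after the last attempt
            time.sleep(delay * (2 ** attempt))

# usage: create a scratch file, then remove it with retries
fd, path = tempfile.mkstemp()
os.close(fd)
unlink_with_retry(path)
assert not os.path.exists(path)
```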
Posted Aug 11, 2023 22:14 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I had quite a few tools that have these kinds of call sequences, and I don't remember any problems from the third-party tools locking them. Especially when we're talking about files in $TMPDIR.
Posted Aug 18, 2023 8:14 UTC (Fri)
by SLi (subscriber, #53131)
[Link] (1 responses)
Or is there some other care that GIL-free supporting extensions need to take besides not explicitly using the GIL?
Posted Aug 18, 2023 13:33 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link]
But the C-API has reference counting baked in very deeply.

A lot of Python code out there depends on deterministic destruction for resource cleanup. It has always been a somewhat bad idea, but it IS the case right now.

Moving to a full GC will subtly break this code. For example, a sequence like this will work _most_ of the time on Windows:

    def do_something(name):
        fl = open(name)
        fl.read()
        ...

    do_something("blah")
    os.unlink("blah")

If the GC is fast enough to immediately clean up the `fl` descriptor, then everything will work. But sometimes GC will not have time to run, and `os.unlink` will fail with a "file locked" error.