
CPython without a global interpreter lock

By Jake Edge
August 9, 2023

The global interpreter lock (GIL) has been a part of CPython since the beginning—nearly—but that seems likely to change over the next five or so years. As we described last week, the Python steering council has announced its intention to start moving toward a no-GIL CPython, potentially as soon as Python 3.13 in October 2024 for the preliminaries. The no-GIL version of CPython comes from Sam Gross, who introduced it as a proof-of-concept nearly two years ago; now, the idea has been formalized in a Python Enhancement Proposal (PEP) that describes no-GIL mode and how it interacts with the rest of the Python ecosystem.

The PEP

PEP 703 ("Making the Global Interpreter Lock Optional in CPython") was posted back in January, then revised in May. It proposes creating a second build of CPython using a new build configuration switch (‑‑disable-gil) that would "run Python code without the global interpreter lock and with the necessary changes needed to make the interpreter thread-safe". The GIL is a bottleneck for multi-threaded Python programs because it prevents more than one thread from executing Python at any given time. The PEP is not definitive about the eventual end state for the no-GIL build, but the steering council made it clear that its intent is to eventually have only a single build of CPython—either without the GIL if no-GIL works out, or rolling back to the with-GIL version if it does not.

Gross obviously recognized that acceptance of the PEP might be something of a struggle; one of the ways he dealt with that was by giving PEP 703 one of the more extensive "Motivation" sections ever seen. It looks at multiple different Python use cases (AI, numerical libraries, GPU-heavy workloads) and gets quotes from Python developers and maintainers about the problems they have encountered because of the GIL—and the lengths they have had to go to in order to work around the GIL. One project has already switched to using Gross's experimental no-GIL fork in order to avoid communication bottlenecks in its data-acquisition system.

The core of the PEP is the "Specification" section, which describes the changes needed to CPython for no-GIL operation. It mentions two ways to control the no-GIL operation of the ‑‑disable-gil build. First, if the Py_mod_gil slot of an extension module that is being loaded is not set to Py_mod_gil_not_used (or is not present at all), the interpreter will pause any current threads and re-enable the GIL before resuming them. It will issue a warning when that happens, so that users are notified that their code has loaded a module that is not compatible with no-GIL operation.
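For extension authors, opting in amounts to adding one slot to the module definition. The slot names below come from PEP 703 (and the multi-phase initialization machinery of PEP 489); the module name is hypothetical, and building this fragment requires a CPython with the PEP 703 headers:

```c
#include <Python.h>

static PyMethodDef example_methods[] = {
    {NULL, NULL, 0, NULL}   /* module methods would go here */
};

static PyModuleDef_Slot example_slots[] = {
    /* Declare that this extension is safe to run without the GIL;
     * omitting the slot causes the interpreter to re-enable the GIL
     * (with a warning) when the module is loaded. */
    {Py_mod_gil, Py_mod_gil_not_used},
    {0, NULL}
};

static struct PyModuleDef example_module = {
    PyModuleDef_HEAD_INIT,
    .m_name = "example",          /* hypothetical module name */
    .m_methods = example_methods,
    .m_slots = example_slots,
};

PyMODINIT_FUNC
PyInit_example(void)
{
    return PyModuleDef_Init(&example_module);
}
```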

But, of course, there may well be extensions that have not been updated to use the Py_mod_gil slot (these slots came from PEP 489 ("Multi-phase extension module initialization")) even though they would work fine without the GIL. The PYTHONGIL environment variable can be used to override the slot check; if it is set to zero, the GIL will be disabled, while a value of one forces the GIL to be enabled. That will allow testing modules that may work fine without a GIL, but there are other reasons the override is useful:

The PYTHONGIL=0 override is important because extensions that are not thread-safe can still be useful in multi-threaded applications. For example, one may want to use the extension from only a single thread or guard access by locks. For context, there are already some extensions that [are] not thread-safe even with the GIL, and users already have to take these sorts of steps.

Garbage collection

Most of the changes to CPython for PEP 703 relate to memory management—garbage collection, in particular. The techniques used have not changed all that much since the initial posting of the no-GIL project (and our article describing it); we will review some of that here, but this article will mostly focus on other details in the proposal.

Python's garbage-collection mechanism relies on reference counts (for the most part), but the maintenance of those counts is currently protected with the GIL, so no other locking is required. Multiple, concurrent accesses to these counts are a recipe for bugs and crashes, so those counts need to be protected some other way in the absence of the GIL. Operations on reference counts are ubiquitous in the interpreter, though, so adding atomic locking to each operation would be a performance nightmare.

The PEP proposes three techniques to make the reference counts thread-safe in a performant manner. The first is to use biased reference counting, "which is a thread-safe reference counting technique with lower execution overhead than plain atomic reference counting". It uses the fact that most objects are generally not actually shared between threads, so the owning thread uses its own reference count that it maintains without any locking. Any other threads have to use atomic operations on a separate shared reference count; those two counts are then used to determine when the object can be freed.
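The mechanism can be modeled in pure Python: the owning thread bumps a private counter with no synchronization, other threads go through a lock-protected shared counter (standing in here for atomic operations), and the object is considered dead only when the merged total reaches zero. This is a simplified illustration of the idea, not CPython's actual implementation:

```python
import threading

class BiasedRefCounted:
    def __init__(self):
        self.owner = threading.get_ident()  # the creating thread owns the object
        self.local = 1        # owner-only count: no synchronization needed
        self.shared = 0       # everyone else's count, guarded by a lock
        self.lock = threading.Lock()        # stands in for atomic operations
        self.freed = False

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                 # fast path: plain increment
        else:
            with self.lock:
                self.shared += 1            # slow path: "atomic" increment

    def decref(self):
        if threading.get_ident() == self.owner:
            self.local -= 1
        else:
            with self.lock:
                self.shared -= 1
        self._maybe_free()

    def _maybe_free(self):
        with self.lock:
            # The two counts are merged to decide whether the object is dead.
            if self.local + self.shared == 0:
                self.freed = True

obj = BiasedRefCounted()

def other_thread():
    obj.incref()    # a non-owning thread uses the shared, locked counter
    obj.decref()

t = threading.Thread(target=other_thread)
t.start()
t.join()
obj.decref()        # drop the owner's initial reference
print(obj.freed)    # True: both counts reached zero
```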

Some Python objects, such as small integers, interned strings, True, False, and None, are immortal—they live as long as the interpreter does—so they do not need to participate in the reference-counting dance. These are marked as immortal objects, though the scheme used is slightly different from the one in PEP 683 ("Immortal Objects, Using a Fixed Refcount"), which was accepted for Python 3.12. Because the no-GIL interpreter uses biased reference counts, it cannot use the same representation for immortal objects as in PEP 683. In any case, incrementing or decrementing the reference count of an immortal object is a no-op.
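The singletons involved are easy to observe from Python: CPython caches small integers and interns many strings, so independently computed values can turn out to be the very same object. Note that these identity checks are CPython implementation details, not language guarantees:

```python
import sys

a = 256
b = 250 + 6
print(a is b)            # True: small integers are cached singletons in CPython

print((1 < 2) is True)   # True: True, False, and None are singletons

# None's reference count is enormous; under PEP 683 (Python 3.12+) it is a
# fixed sentinel, since refcount operations on immortal objects are no-ops.
print(sys.getrefcount(None) > 100)
```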

While most objects get freed when their reference count drops to zero, there are some objects that have reference cycles that prevent the count from reaching zero. These are currently detected and freed during a garbage-collection pass that is protected by the GIL. PEP 703 proposes two "stop-the-world" passes that would pause all threads, first to identify the objects to be freed, and second to identify any that are left after the finalizers from the first round have completed.

Those two phases will also handle the third mechanism (after biased reference counts and immortal objects) that is being added: deferred reference counting. Some objects are generally long-lived, but not immortal, such as modules, top-level functions, and code objects; those objects are commonly accessed by multiple threads as well. Instead of performing an expensive atomic reference-count operation for them, those objects would be marked for deferred reference counting. When those objects are pushed onto and popped from the interpreter's stack, no reference-count operations will be performed, so the true state of references to those objects can only be calculated during a stop-the-world garbage-collection phase. In practice, it is not a lot different from how they are handled today:

Note that the objects that use deferred reference counting already naturally form reference cycles in CPython, so they would typically be deallocated by the garbage collector even without deferred reference counting. For example, top-level functions and modules form a reference cycle as do methods and type objects.
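The cycle the PEP describes is easy to see: a top-level function keeps its module's global namespace alive through __globals__, and that namespace in turn contains the function. A quick demonstration:

```python
def top_level():
    pass

# The function holds a reference to the namespace it was defined in...
g = top_level.__globals__

# ...and that namespace holds a reference back to the function: a cycle,
# so the object would be reclaimed by the cycle collector anyway.
print(g["top_level"] is top_level)   # True
```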

Allocation and locking

The pymalloc memory allocator, which is not thread-safe without the GIL, has been replaced with mimalloc, which has been modified somewhat to support the CPython use case. The mimalloc internal data structures can be used to replace the existing linked list that allows the garbage collector to find all of the Python objects that have been allocated. Mimalloc has also been modified to support something similar to read-copy update (RCU) that allows locks to be avoided when retrieving items from dict and list objects:

A few operations on dict and list optimistically avoid acquiring the per-object locks. They have a fast path operation that does not acquire locks, but may fall back to a slower operation that acquires the dictionary's or list's lock when another thread is concurrently modifying that container.

[...] There are two motivations for avoiding lock acquisitions in these functions. The primary reason is that it is necessary for scalable multi-threaded performance even for simple applications. Dictionaries hold top-level functions in modules and methods for classes. These dictionaries are inherently highly shared by many threads in multi-threaded programs. Contention on these locks in multi-threaded programs for loading methods and functions would inhibit efficient scaling in many basic programs.

The secondary motivation for avoiding locking is to reduce overhead and improve single-threaded performance. Although lock acquisition has low overhead compared to most operations, accessing individual elements of lists and dictionaries are fast operations (so the locking overhead is comparatively larger) and frequent (so the overhead has more impact).
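One common way to structure such an optimistic fast path is a version counter: a reader samples the version, reads without locking, then re-checks the version and falls back to the lock only if a writer intervened in between. This is a simplified sketch of that general idea, not CPython's actual scheme (which also leans on mimalloc's internal metadata to make unlocked reads safe):

```python
import threading

class OptimisticDict:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()
        self._version = 0                 # bumped on every mutation

    def set(self, key, value):
        with self._lock:
            self._version += 1            # tell readers something changed
            self._data[key] = value

    def get(self, key, default=None):
        # Fast path: read without taking the lock...
        before = self._version
        value = self._data.get(key, default)
        # ...and keep the result only if no writer ran meanwhile.
        if self._version == before:
            return value
        # Slow path: a concurrent write happened; retry under the lock.
        with self._lock:
            return self._data.get(key, default)

d = OptimisticDict()
d.set("x", 1)
print(d.get("x"))   # 1, without any lock acquisition on the read
```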

In general, Python containers (dict, list, etc.) are protected from concurrent modification by the GIL, though there are some operations that even the GIL does not fully protect, as described in the PEP. For most container operations, the no-GIL interpreter uses per-object locking, which "aims for similar protections as the GIL", though, as mentioned above, read operations avoid locking at all if they can. But per-object locking can lead to deadlocks:

Straightforward per-object locking could introduce deadlocks that were not present when running with the GIL. Threads may hold locks for multiple objects simultaneously because Python operations can nest. Operations on objects can invoke operations on other objects, acquiring multiple per-object locks. If threads try to acquire the same locks in different orders, they will deadlock.

Those deadlocks can be avoided using "Python critical sections". The idea is that a lock can only be held while an operation is being performed; if there is a nested operation, the lock is "suspended" by being released until the nested operation completes, when it must be reacquired. That suspension must also be done around blocking operations, such as I/O. As an optimization, the suspension is only done if the thread would block. "This reduces the number of lock acquisitions and releases for nested operations, while avoiding deadlocks."
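The suspend-and-reacquire behavior can be modeled with a small helper: each thread tracks the critical-section lock it currently holds, and a nested (or blocking) operation releases that lock around its work, reacquiring it afterward. This is a toy model of the concept only; the real implementation lives in C, and as the PEP notes, it only actually suspends when the thread would otherwise block:

```python
import threading

_held = threading.local()   # the critical-section lock held by this thread

class CriticalSection:
    def __init__(self, lock):
        self.lock = lock

    def __enter__(self):
        self.outer = getattr(_held, "lock", None)
        self.lock.acquire()
        _held.lock = self.lock
        return self

    def __exit__(self, *exc):
        self.lock.release()
        _held.lock = self.outer

def suspend_for_nested(operation):
    # Release the current critical-section lock around a nested operation,
    # then reacquire it; since no lock is held across the nested call,
    # inconsistent acquisition orders cannot deadlock.
    lock = getattr(_held, "lock", None)
    if lock is not None:
        lock.release()
    try:
        return operation()
    finally:
        if lock is not None:
            lock.acquire()

lock_a = threading.Lock()
with CriticalSection(lock_a):
    # The nested operation observes that lock_a was temporarily released.
    result = suspend_for_nested(lambda: lock_a.locked())
print(result)   # False: the lock was suspended during the nested call
```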

Backward compatibility

As the PEP notes, the vast majority of compatibility concerns with the existing CPython ecosystem are related to the C API. To start with, extensions built for today's CPython will not be ABI compatible with no-GIL, so they will require a recompile at minimum. The bigger problem is that the existence of the GIL has masked concurrency problems that exist in the C code of many extensions.

Even extension developers who wanted to develop thread-safe extensions had no real way to test them until no-GIL came along. By the sounds of it, testing extensions with no-GIL is ongoing, especially for the larger, active extensions that have been chafing under the constraints of the GIL for many years. There is a long tail of extensions, however; not breaking those with no-GIL is important, thus the Py_mod_gil slot test. Beyond that, Gross plans to write a compatibility HOWTO that should help the process along.

As noted in last week's article, the lack of a GIL has some negative effects on the ongoing Faster CPython work, which is described in PEP 659 ("Specializing Adaptive Interpreter"). The no-GIL PEP mentions some of those problems and points to a specific specialization problem as an open issue for the no-GIL interpreter. For now, it looks like those problems are seen as challenges by the Faster CPython team, who are looking to work with Gross and others toward a no-GIL interpreter without sacrificing too much single-threaded performance.

Single-threaded performance is another area that the (quite comprehensive) PEP 703 touches on. Since the vast majority of Python code is single-threaded now, and that will only start to change slowly once no-GIL gets going, it is imperative that measures be taken to ensure that the performance of those programs does not regress. As Faster CPython developer Mark Shannon said, research will need to be done on "the optimizations necessary to make Python fast in a free-threading environment", but he and other members of the team seem up for the task.

While the numbers are somewhat disputed, PEP 703 gives a performance cost of 5-8% for the no-GIL changes relative to the in-progress Python 3.12. Those numbers are strictly the cost of the changes for no-GIL and do not reflect the gains that will come for multi-threaded Python programs that are not restricted by the GIL.

In conclusion

Even this fairly lengthy look only scratches the surface of the full contents of the PEP; it is well worth a read for those who are interested. One important thing to keep in mind, though, is that the steering council made it quite clear that the process will play out rather deliberately—slowly—over five years or more. There will be lots of opportunities to test and help fix no-GIL Python over that time frame, as well as to work on making extensions thread-safe without the GIL. To a large extent, the success of the no-GIL project is going to depend, at least in part, on the Python community—not just the core developers and the teams from various companies—pulling together to help make it succeed. It will be interesting to see (and report on) how it all goes.


Index entries for this article
Python: CPython
Python: Global interpreter lock (GIL)
Python: Python Enhancement Proposals (PEP)/PEP 703



CPython without a global interpreter lock

Posted Aug 9, 2023 23:10 UTC (Wed) by andresfreund (subscriber, #69562) [Link] (6 responses)

Nice summary!

> This proposal relies on a scheme that mostly avoids acquiring locks when accessing individual elements in lists and dictionaries. Note that this is not “lock free” in the sense of “lock-free” and “wait-free” algorithms that guarantee forward progress. It simply avoids acquiring locks (mutexes) in the common case to improve parallelism and reduce overhead.

I do wish the terminology around this wasn't overloaded...

CPython without a global interpreter lock

Posted Aug 10, 2023 12:16 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (5 responses)

After reading this through I think the responsibility for the muddy water lies with the PEP author.

Deferred reclamation schemes like RCU are expensive but deliver wait-freedom for readers (and typically lock-freedom for a writer). We lose peak performance but gain certainty, every one of our readers will make forward progress.

This approach, whatever it is, doesn't deliver that. It is optimistic, and when its optimism is well founded it'll go fast, when it isn't it may stall [but hopefully not deadlock?]. I think the PEP probably shouldn't mention RCU at all, or it should be in a footnote for anybody who thought "Oh, like RCU?" to clarify that you don't get RCU's guarantees. I can see why it was on the author's mind but I don't think mentioning it helps understanding.

Reading the PEP did clarify that the plan really is as cavalier as I feared, that environment variable is a YOLO feature and it's going to be left in plain sight like a tempting red button at toddler height. I don't expect that to go well.

CPython without a global interpreter lock

Posted Aug 10, 2023 15:18 UTC (Thu) by iabervon (subscriber, #722) [Link]

Considering that the steering committee announced their intent to accept that PEP, rather than just accepting it, I expect it to get revisions from the other people who are working on the project now that it's officially the future direction.

FWIW, I think starting with schemes that are optimistic about there not being contention is the right way to go, so long as the schemes can be reworked without changing the C extension ABI. Since existing programs are generally single-threaded, there just can't be any contention in the current common case. Once there are actual programs that could perform better if contention was handled more efficiently, it'll make sense to spend peak performance on it (possibly only if the program ever spawns threads or uses a flag to get that mode without a stop-the-world event when first spawning a thread).

CPython without a global interpreter lock

Posted Aug 10, 2023 16:59 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (3 responses)

> Reading the PEP did clarify that the plan really is as cavalier as I feared, that environment variable is a YOLO feature and it's going to be left in plain sight like a tempting red button at toddler height. I don't expect that to go well.

I know I previously said that I hoped they would provide stronger guarantees than this, but honestly, I'm having a hard time getting excited about it. If you go fiddling around with the interpreter's settings and cause your codebase to break, well, you get to keep both pieces. That's maybe not the safest way of doing things, but it's a valid perspective IMHO. This is more or less the same reasoning that leads Python to have no real encapsulation, for example.

CPython without a global interpreter lock

Posted Aug 13, 2023 11:01 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (2 responses)

This sentiment makes lots of sense for Bob's home grown Python program, sole maintainer Bob, sole user Bob, chief executive and bottle washer also Bob. If it breaks it's Bob's fault. Duh.

But this sort of work is happening because people are trying to write Python at scale and having problems. Jim writes tricky code that's fine assuming the GIL, Sarah uses Jim's code in a sub-routine for her team's new project. Mark uses that sub-routine with Sarah's permission from the new Project Foo, and the Project Foo lead Andy sets the environment variable because it's a "known workaround" for a problem they have, but now about one run in twenty has corrupt results and nobody knows why.

Whose fault is the corruption? Jim? Sarah? Mark? Andy? It is likely that there's no intersection between the people who would have known this won't work and the people who did it anyway.

Maybe it's fine, but I wouldn't want to be the Python community finding that out the hard way.

CPython without a global interpreter lock

Posted Aug 15, 2023 5:41 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

If Jim's code is maintained, he can just add a startup check for the offending environment variable, and cleanly fail the import at that time with an explanatory message. If Jim refuses to do that, then it's his fault. Checking one environment variable is much cheaper than importing an entire module, so you just do it at import time and it's effectively free.
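A hedged sketch of what such a startup check could look like (the environment variable name comes from the PEP; the function and error message are invented for illustration):

```python
import os

def check_gil_override(environ=os.environ):
    """Fail fast if the GIL has been force-disabled via PYTHONGIL=0."""
    if environ.get("PYTHONGIL") == "0":
        raise ImportError(
            "this package is not thread-safe without the GIL; "
            "remove PYTHONGIL=0 from the environment before importing it")

# A module would simply call this at the top of its __init__.py:
check_gil_override()
```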

If Jim's code is unmaintained, then it's... well, not necessarily Andy's *fault*, but it is Andy's *problem* because it's his project. Such is an occupational hazard of relying on unmaintained code, in any language, but especially in Python. Python has had a very well-established practice of slowly, carefully deprecating and removing old APIs, both before and after the 2-to-3 transition. 2-to-3 was not an anomaly because it broke backcompat, it was an anomaly because it did it abruptly in a single release, and on a much larger scale than has otherwise been typical. This problem sounds much more like a classic Python deprecation than a "break the world" flag day.

CPython without a global interpreter lock

Posted Aug 16, 2023 16:57 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

I would not characterise Python's approach as "careful", on the whole. "Slapdash" is the word which comes to mind. There are some good ideas in their compatibility approach, but they're inconsistently applied, and so I wouldn't rely on them.

My guess is that if it became normal to write environment variable checks and block execution, the response from CPython would be to change the name of the environment variable. This is an intentional foot gun, not that they'd accept that description, de-fanging it would be contrary to their purpose in offering it.

CPython without a global interpreter lock

Posted Aug 10, 2023 2:29 UTC (Thu) by Paf (subscriber, #91811) [Link]

I know some folks involved in this are reading this, so:

This is really remarkable work. Congratulations to those doing it.

CPython without a global interpreter lock

Posted Aug 11, 2023 4:00 UTC (Fri) by alkbyby (subscriber, #61687) [Link] (6 responses)

Hi all. Perhaps someone could explain. Why not simply do full honest GC? All those reference counting complications go away automagically.

CPython without a global interpreter lock

Posted Aug 11, 2023 13:19 UTC (Fri) by mb (subscriber, #50428) [Link]

One of the main problems is that the CPython C-API shall remain compatible to avoid another 2to3 scenario for all C modules.
But the C-API has reference counting baked in very deeply.

Even if you made these reference counting interfaces no-ops, then there certainly are many C modules expecting proper counting for resource freeing.

CPython without a global interpreter lock

Posted Aug 11, 2023 15:30 UTC (Fri) by Karellen (subscriber, #67644) [Link] (1 responses)

Isn't ref-counting a lot more performant than GC? If you can ref-count the majority of the easy cases, and leave GC to only mop up the smaller set of hard cases which remain, isn't that going to be a big win?

CPython without a global interpreter lock

Posted Aug 11, 2023 16:10 UTC (Fri) by alkbyby (subscriber, #61687) [Link]

I think that is one of the bigger debates in our trade (e.g. recall the famous paper by Boehm from as early as the '90s). But as noted in this article too, refcounting is okay for fully single-threaded cases (arguably, even that can be debated). But when it comes to multi-threaded cases, it quickly becomes a burden. Especially if you compare with modern advanced GCs of today.

So, IMHO refcounting was and is wrong choice (for python-like use cases; e.g. kernel-space is a different matter).

(Not so) fun fact: gcc's libstdc++ still has this horrible "optimization" where the shared_ptr bits detect at runtime if multi-threading is active, inlined at ~every place shared_ptr is copied. And the compiler adds this run-time check and both single-threaded and multi-threaded "atomic-ful" refcounting codes everywhere! Quick godbolt proof: https://godbolt.org/z/o9b9hxfMP

Of course this is me cherry-picking one annoying arguably performance bug in gcc. But imho it adds nicely to the topic of: "no, refcounting is not anywhere as straightforward as people tend to think".

CPython without a global interpreter lock

Posted Aug 11, 2023 16:49 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

A lot of Python code out there depends on deterministic destruction for resource cleanup. It has always been a somewhat bad idea, but it IS the case right now. Moving to a full GC will subtly break this code. For example, a sequence like this will work _most_ of the time on Windows:

import os

def do_something(name):
    fl = open(name)
    fl.read()
    ...

do_something("blah")
os.unlink("blah")

If the GC is fast enough to immediately clean up the `fl` descriptor, then everything will work. But sometimes GC will not have time to run, and `os.unlink` will fail with a "file locked" error.

CPython without a global interpreter lock

Posted Aug 11, 2023 22:04 UTC (Fri) by khim (subscriber, #9252) [Link] (1 responses)

> If the GC is fast enough to immediately clean up the `fl` descriptor, then everything will work.

Nope. It wouldn't. Or, rather: it may work on your system, but if you would try to give that code to someone then pretty soon your tracker would be overflowing with messages about how nothing works. Because AV software and “security” software (like this abomination) would keep your file around for a few seconds to “investigate” it.

The only way to deal with it is to catch OSError and repeat the operation after some time (with exponential back-off). And that saves you when GC is used, too.

CPython without a global interpreter lock

Posted Aug 11, 2023 22:14 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Good antiviruses don't lock the file, and anyway, you can still do: "do_something("blah"); time.sleep(10); os.unlink("blah");" with the same effect.

I had quite a few tools that have these kinds of call sequences, and I don't remember any problems from the third-party tools locking them. Especially when we're talking about files in $TMPDIR.

CPython without a global interpreter lock

Posted Aug 18, 2023 8:14 UTC (Fri) by SLi (subscriber, #53131) [Link] (1 responses)

I know I should look at the Python source code before asking this, but since extensions are shared objects, could "not using the GIL" in some cases be detected from that .so not referencing a certain symbol?

Or is there some other care that GIL-free supporting extensions need to take besides not explicitly using the GIL?

CPython without a global interpreter lock

Posted Aug 18, 2023 13:33 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

An extension not mentioning the GIL may assume that the GIL is always locked. If it mentions the GIL, it is (usually) to drop it while calling into non-Python code (so that other Python code can make progress while, say, multiplying your 100000x100000 matrices). I think I'd actually expect a shared module mentioning the GIL to be more OK with it being a no-op. But all kinds of code could be using GIL-based assumptions of "these will never run concurrently" when manipulating shared state, so a "looks for the GIL" signal doesn't sound like a very strong signal to me.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds