
Python subinterpreters and free-threading

By Jake Edge
August 20, 2024

PyCon

At PyCon 2024 in Pittsburgh, Pennsylvania, Anthony Shaw looked at the various kinds of parallelism available to Python programs. There have been two major developments on the parallel-execution front over the last few years, with the effort to provide subinterpreters, each with its own global interpreter lock (GIL), along with the work to remove the GIL entirely. In the talk, he explored the two approaches to try to give attendees a sense of how to make the right choice for their applications.

Shaw began by pointing attendees to his talk notes, which have a list of prerequisites that should be met or "you need to leave immediately". That list included multiple talks to review, some of which have been covered here (Eric Snow on subinterpreters and Brandt Bucher on a Python JIT compiler), and one that was taking place at the same time as Shaw's talk "that you need to have watched", he said to laughter. He also recommended the chapter on parallelism from his CPython Internals book and his master's thesis "if you're really, really bored".

Parallel mechanisms

Parallelism comes down to the ability to segment a task into pieces that can be worked on at the same time, he said, but not all tasks can be. If it takes him an hour to solve a Rubik's Cube, he asked, how much time would it take four people to solve one Rubik's Cube? Since four (or even two) people cannot sensibly work on the same puzzle at once, adding more participants does not really help.

[Anthony Shaw]

There is another aspect, which he tried to demonstrate in a physical way. He got three volunteers to draw a picture of a dog with him, but before they could do so, he needed to provide them with the work to do (by balling up a piece of paper to throw to them). That is akin to the serialization of data that a computer might need to do (via, say, pickle) to parcel out a task to multiple cores. While he (as "core zero") is sending out that data, he is not drawing a dog, so he can only start drawing after that is complete. Meanwhile, the other "cores" have to unpack the pickle, draw the dog, repack the pickle and send it back. So, if it takes each core ten seconds to draw a dog, drawing four dogs in parallel takes a bit more than that because of the coordination required.

He said that there are four different types of parallel execution available in Python: threads, coroutines, multiprocessing, and subinterpreters. Three of them execute in parallel (or will with the GIL-removal work), which means they can run at the same time on separate cores, while coroutines execute concurrently, so they cooperatively share the same core. Each has a different amount of start-up time required, Shaw said, with coroutines having the smallest and multiprocessing the largest.

He showed a simple benchmark that created a NumPy array of 100,000 random integers from zero to 100, then looped over the array calculating the distance of each entry from 50 (i.e. abs(x-50)). Shaw said that "if I split that in half and executed between two threads in Python 3.12, it would take twice as long". That is mainly because of the GIL.
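Shaw's actual code is linked from his talk notes; a minimal stdlib-only sketch of the same experiment (using a plain Python list where the talk used a NumPy array) makes the point concrete: the work is CPU bound, so with the GIL the two threads cannot make progress at the same time.

```python
import random
import threading

# 100,000 random integers from 0 to 100, standing in for the NumPy array
arr = [random.randint(0, 100) for _ in range(100_000)]

def distance_from_50(block, out, i):
    # CPU-bound pure-Python loop; with the GIL held, threads running
    # this serialize rather than running on separate cores.
    out[i] = [abs(x - 50) for x in block]

# Split the array in half and hand each half to a thread
mid = len(arr) // 2
results = [None, None]
threads = [
    threading.Thread(target=distance_from_50, args=(arr[:mid], results, 0)),
    threading.Thread(target=distance_from_50, args=(arr[mid:], results, 1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(len(r) for r in results)
print(total)  # 100000
```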

In Python 3.11, there was one GIL per Python process, which is why multiprocessing is used for "fully parallel code execution". In Python 3.12, subinterpreters got their own GIL, so there was a GIL per interpreter. That means that two things can be running at once in a single Python process by having multiple interpreters each running in its own thread. The operating system will schedule the execution on different cores to provide the parallelism between the interpreters.
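There is no stable Python-level API for this yet (PEP 734 proposes a public interpreters module); in CPython 3.12 the functionality is reachable through an internal, provisional module whose name and signatures have since changed, so a sketch has to guard the import:

```python
# CPython 3.12's internal module; renamed and reshaped in later
# releases, so treat this as provisional and guard the import.
try:
    import _xxsubinterpreters as interpreters
except ImportError:
    interpreters = None  # not available under this name on this build

ran = False
if interpreters is not None:
    interp = interpreters.create()  # a new interpreter with its own GIL
    try:
        # The script runs in the subinterpreter's own isolated namespace
        interpreters.run_string(interp, "total = sum(range(1000))")
        ran = True
    finally:
        interpreters.destroy(interp)
print("subinterpreter ran:", ran)
```

Running such interpreters in separate threads is what lets the operating system schedule them on different cores.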

He presented some numbers to show, generally, that the workload size has a major effect on the speed of operations. When using threads (with the GIL), calculating pi to 200 digits takes 0.04s, which is roughly the same as with subinterpreters (0.05s). It turns out that calculating pi to that precision is quite fast, so what is being measured is the overhead in starting up threads versus subinterpreters. When 2000 digits are calculated, the thread (plus GIL) test goes to 2.37s, while subinterpreters take only 0.63s. Meanwhile, multiprocessing has much higher overhead (0.3s for 200 digits) but is closer to subinterpreters for 2000 (0.89s). The quick conclusion that can be drawn is that "subinterpreters are really like multiprocessing with less overhead and shared memory", he said.
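The talk's pi benchmark itself is not reproduced in the article; a stand-in for it (a Machin-formula computation using the stdlib decimal module, which is an assumption here, not Shaw's actual code) shows the shape of the workload and why splitting it across GIL-bound threads does not help:

```python
import threading
import time
from decimal import Decimal, getcontext

def arctan_inv(x: int, digits: int) -> Decimal:
    """arctan(1/x) by its Taylor series, to roughly `digits` digits."""
    eps = Decimal(10) ** -(digits + 5)
    xd = Decimal(x)
    x2 = xd * xd
    power = Decimal(1) / xd  # 1/x^(2k+1)
    total = power
    k, sign = 1, -1
    while power > eps:
        power /= x2
        total += sign * power / (2 * k + 1)
        k += 1
        sign = -sign
    return total

def pi_digits(digits: int) -> str:
    """pi via Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)."""
    getcontext().prec = digits + 10
    pi = 16 * arctan_inv(5, digits) - 4 * arctan_inv(239, digits)
    return str(+pi)[: digits + 2]  # "3." plus `digits` digits

# Run the same work in two threads; with the GIL, this takes about as
# long as running both computations back to back on one core.
t0 = time.perf_counter()
threads = [threading.Thread(target=pi_digits, args=(200,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - t0:.3f}s")
```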

Moving on to the free-threading interpreter (which is the preferred term for the no-GIL version), he noted that Python 3.13 beta 1, which had been released in early May, a week before the talk, allows the GIL to be disabled. His earlier example of splitting a task up into multiple threads—which ran slower because of the GIL—will now run faster. Meanwhile, Python 3.13 has a bunch of optimizations to make single-threaded code run faster.
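Whether a given interpreter is a free-threaded build, and whether the GIL is actually enabled at run time, can be checked from Python; sys._is_gil_enabled() is a 3.13 addition, so on older versions the attribute simply does not exist:

```python
import sys
import sysconfig

# Build-time flag: was this CPython configured with --disable-gil?
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
print("free-threaded build:", free_threaded_build)

# Run-time state (3.13+): even on a free-threaded build the GIL can be
# re-enabled, for example via the PYTHON_GIL=1 environment variable.
if hasattr(sys, "_is_gil_enabled"):
    print("GIL enabled:", sys._is_gil_enabled())
```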

The GIL exists to provide thread-safety, which is important to keep in mind. The free-threading version introduces a new ABI for Python, so you need a special build of the CPython interpreter in order to use it; in addition, extension modules that use the C API will need to be built for the free-threaded CPython. Those are important things to consider because they affect compatibility with the existing Python ecosystem.

The steps needed to remove the GIL "are really quite straightforward and I'm not sure why it's taken so long to do so", he said wryly—and to laughter. Step one is to remove the GIL; step two is to change the memory allocator and the garbage collector, which is still a work-in-progress. The final step is to fix the C code that assumes that the GIL exists, which means that it is essentially not thread-safe. That final step "is probably going to drag on for a while".

The compatibility with the C-based extensions is the problem area, both for free-threading and subinterpreters, but also for any other attempt at a fully parallel Python over the years. From what Shaw has seen, pure Python code works fine with either mechanism, "because the core Python team have got sort of full control over what the interpreter is doing". The irony is that one of the main reasons for the existence of all of the C extensions is to optimize Python programs because they could not be run in parallel.

Faster?

He reused the benchmark of pi to 2000 digits on the newly released interpreter, both with and without the GIL. As would be expected, removing the GIL reduced the threaded numbers; on a four-core system, they went from 2.41s with the GIL to 0.75s without. The subinterpreters and multiprocessing versions were slower, however. With the GIL, subinterpreters took 0.76s (and multiprocessing took 1.2s); without it was 0.99s for subinterpreters and 1.57s for multiprocessing. The difference is that regular Python code runs slower without the GIL, Shaw said; there are a number of specializations and optimizations that are available when the GIL is enabled, but are not yet thread-safe, so they are disabled when the GIL is turned off. He expects the performance of the threaded and subinterpreter versions to "actually be really close in the future".

"Hands up if your production workload includes calculating the number of digits in pi", he said with a chuckle. Benchmarks like the one he showed are not representative of real Python code because they are CPU bound. The performance of web and data-science applications has a major I/O component, so the improvements are not as clear-cut as they are in calculating pi. He thinks a "more realistic example is to look at the way that web applications are architected in Python".

For example, web applications that use Gunicorn typically run in a multi-worker, multithreaded mode, which is managed by Gunicorn. There are multiple Python processes, each of which has its own GIL, of course; typically there is one process per core on the system. Meanwhile, there are generally at least two threads per process, which actually helps because the GIL is dropped for many I/O operations, including reading and writing from network sockets. "So it does actually make sense to have multiple threads, because they can be running at the same time, even though the GIL is there".
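The multi-worker, multithreaded model he described corresponds to an invocation along these lines (myapp.wsgi:application is a placeholder for an actual WSGI entry point):

```shell
# Four worker processes (roughly one per core), two threads each:
# eight request handlers, of which at most four can be executing
# Python bytecode at any instant, while threads blocked on socket
# I/O have released the GIL.
gunicorn --workers 4 --threads 2 myapp.wsgi:application
```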

He wondered if it made sense to shift that model to have a single process, but with multiple interpreters, each of which had multiple threads. He showed a moderately complicated demo that did just that, though it was somewhat difficult to follow. Interested readers can watch the YouTube video of the talk to see the demo in action. Links to the code can be found in the talk notes.

The upshot of the demo is that the new features make multiple different architectures possible. One question that often comes up is whether it makes sense to switch to a single process with a single interpreter and multiple threads now that a free-threading interpreter is available. The answer is that, as with most architectural questions, it depends on the problem being solved. In addition, "code doesn't use threading out of the box", so the application code will need to be tailored to fit.

Another question that comes up often in light of these changes is which to choose, subinterpreters or free-threading. It is not one or the other, Shaw said, as the two can coexist; each has its own strengths. There are applications or parts of them that can benefit from the isolation provided by subinterpreters, while CPU-intensive workloads can be parallelized using threads.

Something to note is that asyncio does not use threads, he said. That is something that is often misunderstood, but the coroutines that are used by asyncio are not assigned to threads, so the GIL is not interfering with them. He talked with someone who tried benchmarking an asyncio application with the GIL disabled; they were surprised there was no change, but should not have been.

In conclusion, Shaw said, he thinks that "subinterpreters are a better solution for multiprocessing" using long-running workers; "I think for a web app, this is a perfect solution." Free-threading will be a better choice for CPU-heavy tasks that can be easily segmented and only need to do a "quick data exchange between them".

But he is concerned that the Python ecosystem has been implicitly relying on the GIL for thread safety. "Removing it is going to break things in unpredictable ways for some time to come." Python cannot keep putting off that change, though, so he thinks it needs to be done. His conclusions came with a lengthy list of "terms and conditions" (which can be seen in the talk notes as well) that enumerated some of the remaining challenges for both subinterpreters and free-threading.

Questions

One attendee asked about the datetime module, which Shaw had listed as not being thread-safe in the list of terms and conditions; because of that, the Django web framework cannot be used with subinterpreters. Shaw pointed the attendee at the link from the talk notes to a GitHub issue (which was worked on and fixed during the sprints the week following PyCon). In answer to another question, he said that his list would be changing and that many of the problem areas would likely be fixed before Python 3.13 is released in October.

Another attendee noted that the vast majority of C extensions simply make synchronous calls into a C library; they do not use any Python objects, reference counting, or any of the other potential problem areas. He wondered if there had been any thought about migration tools to help the authors of those types of extensions. Shaw said that there has been discussion of that and it is an area that needs work, though people are aware of it; "the GIL only got disabled like last week". He suggested the sprints would be a good place to work on that and that the CPython team would be appreciative; that was met with somewhat scattered applause, likely from members of said team.

Will there always be two Python versions, with and without the GIL, an attendee asked. Shaw suggested that the person standing behind them in line at the microphone would be better able to address that. Guido van Rossum, creator of Python who now works on the Faster CPython project, said that he could not speak for the steering council but that his understanding is that free-threading was accepted experimentally; if it gets fully accepted, "eventually that's all you get". In his question, Van Rossum wondered if the optimal model for using asyncio and subinterpreters is to have a single thread running the asyncio event loop in each subinterpreter and to have one subinterpreter per core; Shaw agreed and noted that it was part of what he was showing in his demo.

The last question that was asked had to do with data transfer between subinterpreters; what are the plans there? Shaw said that "the goal is to provide interfaces to share information between subinterpreters without pickling", which is "extremely slow". You can share immutable types right now, as well as sharing memory using memoryview objects. If you need continuous data exchange, though, threading pools may be a better approach.

[I would like to thank the Linux Foundation, LWN's travel sponsor, for travel assistance to Pittsburgh for PyCon.]


Index entries for this article
Conference: PyCon/2024
Python: Global interpreter lock (GIL)
Python: Subinterpreters



Python & Guido

Posted Aug 21, 2024 2:51 UTC (Wed) by Paf (subscriber, #91811) [Link]

I’m not really a Python person, but I am always really charmed by articles talking about that community and especially Guido’s continuing role in it. They (we?) are lucky to have him.

2 threads or 100 threads?

Posted Aug 21, 2024 3:30 UTC (Wed) by wujj123456 (guest, #84680) [Link] (4 responses)

> "if I split that in half and executed between two threads in Python 3.12, it would take twice as long"

The code and comment indicate it's splitting into 100 threads to get 2x slower. This kind of overhead is more plausible than having only 2 threads be 2x slower.

2 threads or 100 threads?

Posted Aug 22, 2024 3:10 UTC (Thu) by kenmoffat (subscriber, #4807) [Link] (1 responses)

You have not clarified which of the links you are commenting on, and I find some of them hard to read. But if you split something into threads and it takes longer, then what is the point?

I assume whatever you commented on illustrated a problem which either has now been, or needs to be, fixed. But for those of us who have little experience in an area, context is always useful.

2 threads or 100 threads?

Posted Aug 22, 2024 7:20 UTC (Thu) by wujj123456 (guest, #84680) [Link]

The context is in the article, and there is only one link that could plausibly be the one I'm referring to in the paragraph containing the sentence I quoted: https://gist.github.com/tonybaloney/24d545ed855a3c90f8442...

You don't need to understand the code snippet either, because it was commented with "Split array into blocks of 100 and start a thread for each", which is in conflict with the quoted sentence in the article. My question is just asking to confirm whether it's 2 threads or 100 threads causing the 2x slowdown, because that points to very different per-thread synchronization overhead from the GIL.

2 threads or 100 threads?

Posted Aug 22, 2024 16:11 UTC (Thu) by jake (editor, #205) [Link] (1 responses)

I think Anthony was just making a general statement about what *would* happen if he were to do that. I checked the video and the quote is correct, though I think he misstates what he means ... the part that he showed in his slides is just the upper "Crude sample" part from that link. Now that you point it out, though, I suspect he did not actually mean that it took twice as long, just that it would take the same amount of time as the simple non-threaded version because the GIL would not be released so (any number of) threads would serialize.

jake

2 threads or 100 threads?

Posted Aug 24, 2024 6:37 UTC (Sat) by wujj123456 (guest, #84680) [Link]

Got it. Somehow I missed the link to video. I agree with you given Anthony was just stating that the threads couldn't run in parallel. Thanks for the continued coverage on this interesting topic.

Garbage Collector changes

Posted Aug 22, 2024 20:21 UTC (Thu) by StandingPad (subscriber, #171211) [Link] (1 responses)

The article mentions that part of removing the GIL involves changing the garbage collector. I'm aware that, at the moment, CPython uses reference counting (which isn't parallel friendly), so does the free-threading build use a different method like mark-and-sweep, or does it use some variant of reference counting that is parallel friendly?

Garbage Collector changes

Posted Aug 23, 2024 8:00 UTC (Fri) by bluss (guest, #47454) [Link]

See the relevant PEP 703 for an explanation of this: https://peps.python.org/pep-0703/#reference-counting


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds