MemHive: sharing immutable data between Python subinterpreters
Immutable data makes concurrent access easier, since it eliminates the data races that can plague multithreaded programs. At PyCon 2024, Yury Selivanov introduced an early-stage project called MemHive, which uses Python subinterpreters and immutable data to overcome the thread serialization imposed by the language's Global Interpreter Lock (GIL). Recent developments in the Python world have opened up different strategies for avoiding the longstanding problems with the GIL.
Selivanov began by displaying the output of top showing what was running on his laptop: a single Python process running ten asyncio event loops. Each event loop ran concurrently in its own thread, and together they saturated the ten CPU cores of his laptop. They exchanged values through a single shared mapping data structure (i.e. similar to a dictionary) with one million keys and values.
Fast services
If you want to build a fast Python service, he asked, how do you do it? You can use asyncio in a single Python process and that will allow servicing lots of requests, but on a 64-core server, "this will make it sad" because it can only use one core. Another possibility is multiprocessing, which "is battle-tested [and] industry-standard", but sharing data via a database adds latency; sharing in memory is always going to be faster, but processes do not share their address space. Another alternative is to use threads; "if you have a galaxy brain, if you know how to control threads, if you know how to write thread-safe code, maybe you can do that—except in Python we have GIL".
He asked how many attendees wanted to see the GIL be removed; when faced with the large number of hands up, he said that he might be in the wrong room because "I actually kind of like GIL". He believes that "the complexity of multithreaded programming is absolutely ridiculous". The GIL makes that moot, currently, but there is an effort to remove it, though it will likely take a couple of years for that work to stabilize and for all of the ecosystem to be compatible with it, he said.
"So what do we do now?" He suggested that subinterpreters, which fairly recently each got their own GIL, might be the right approach. Subinterpreters run completely independently of each other and cannot exchange anything between them, "except some very primitive information". But this mechanism actually allows for parallel execution in a single Python process. In addition, the subinterpreters all share the same memory space.
One could imagine an architecture where there is one main subinterpreter that has a pool of worker subinterpreters; perhaps there would be queues that are used to transmit work to these workers. But, in order to send data to those workers, the Python objects need to be serialized using, say, pickle, which is slow, Selivanov said.
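As a rough illustration of the cost, crossing the boundary with pickle means serializing the entire object and rebuilding it on the other side, even if the worker only needs a few keys:

    import pickle

    # A large mapping, like the shared cache in the talk's example
    data = {i: str(i) for i in range(1_000_000)}

    blob = pickle.dumps(data)    # O(n) serialization in the sender
    copy = pickle.loads(blob)    # O(n) reconstruction in the receiver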
So there is a need to avoid using pickle. He asked: "What if there was a way to actually have safe shared state" so that all of the subinterpreters can access it efficiently? He built the MemHive library to provide that functionality. It has three levels of API; level one provides a simple protocol with just a few functions, which allows sending data across the subinterpreter boundary, while level two is at a higher level, adding queues and synchronization primitives. Level three is an "asyncio bridge", which he showed briefly in an example.
The example creates an async worker function that is added ten times to an AsyncMemHive using its add_async_worker() method; each of those will be run in a separate subinterpreter. The worker functions receive a parameter that allows them to access the shared state and receive messages from the main subinterpreter. The shared state is created in the main subinterpreter as a memhive.Map object; the main subinterpreter can also broadcast messages to all of the workers or queue work items, which will then be sent to a single worker. All of that can be seen in the YouTube video of the talk.
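MemHive's API is not yet documented but, based on the names mentioned in the talk, the example presumably had a shape something like the following sketch; the worker's parameter and everything after the workers are added are guesses for illustration, not MemHive's actual API:

    import memhive

    async def worker(sub):
        # 'sub' gives the worker access to the shared state and to
        # messages from the main subinterpreter (exact API guessed)
        ...

    hive = memhive.AsyncMemHive()
    state = memhive.Map({'answer': 42})   # immutable mapping to share
    for _ in range(10):
        hive.add_async_worker(worker)     # each runs in its own subinterpreter
    ...                                   # attach the state, broadcast or queue work, run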
Selivanov said that his talk could have been about the API, how to build things with it, and so on, but he thinks it is a pretty minimal API that people can figure out for themselves for the most part. Instead, he wanted to talk about how it is implemented "so that you don't think that this is some witchcraft magic"; he wanted attendees to understand how it works so that they can use it, or apply the techniques to their own code.
Immutability
Python already has a few immutable data types, he said, including strings, integers, floating-point numbers, bytes objects, and tuples. For example, the first character of a string cannot be changed in place; instead, a new string must be created. The same is true for tuples and the others, but there is no immutable dictionary in Python, which "is a shame because it would be nice to have it". He imagined how it might work with an immutable Map type:
    m1 = Map({'a': 1})
    m2 = m1.set('b', 2)   # create a new Map with both 'a' and 'b'
That code would initialize a Map using a dictionary (or it could perhaps also take key-value pairs), then a new Map can be created to add another entry to it. That API is pretty much the same as what is used for tuples in Python:
    t1 = (x, y)
    t2 = t1 + (z,)   # builds an entirely new tuple, copying x and y
There is a need for efficiency, however. The tuple example has O(n) complexity, where "n" is the number of elements in the tuple; it requires CPU time proportional to the tuple's length. Typically, tuples are used for small records with a few elements, so that cost is not onerous, but maps, which may be used as a cache for a web application, can have millions of elements. Paying that kind of price just to add one entry to such a cache is not viable; O(n) is "unacceptably slow" for the use case.
Fortunately, there are algorithms that can make these changes with O(log n) complexity, which grows much more slowly than O(n); for a map with one million entries, log₂ n is only about 20. He handwaved a bit to reduce the complexity order down to O(meh), at least for him, noting that for maps with only a few tens of thousands of elements, the operation is "fairly quick". It was (obviously) not particularly rigorous and he did not really address the gap between "a few tens of thousands" and the "millions" he mentioned earlier.
The algorithm works by using a tree data structure to represent the map; the API to the map does not expose the tree to users, but it underlies the map. When adding a new entry, there will be many nodes in the tree that do not need to change at all, thus they do not need to be copied; since tree nodes are immutable, they must be copied when their contents change. However, only the nodes that are directly affected by the change, or that refer to those nodes, need to be copied and updated. In general, just a small handful of nodes in a huge tree will need adjustment for any given key-value change. Switching back to the use case of millions of entries, he suggested that adding an element might only affect five nodes out of, say, 10,000 in the tree.
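To see how this works without the full complexity of a HAMT, consider a toy persistent binary search tree in Python (not MemHive's actual data structure): setting a key copies only the nodes between the root and the change, while every untouched subtree is shared with the old version.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Node:
        key: int
        value: object
        left: Optional["Node"] = None
        right: Optional["Node"] = None

    def set_key(node: Optional[Node], key: int, value: object) -> Node:
        if node is None:
            return Node(key, value)
        if key == node.key:
            return Node(key, value, node.left, node.right)
        if key < node.key:
            # copy this node, share the entire right subtree
            return Node(node.key, node.value,
                        set_key(node.left, key, value), node.right)
        # copy this node, share the entire left subtree
        return Node(node.key, node.value, node.left,
                    set_key(node.right, key, value))

    t1 = set_key(set_key(set_key(None, 5, 'a'), 2, 'b'), 8, 'c')
    t2 = set_key(t1, 1, 'd')     # copies only the path 5 -> 2
    assert t2.right is t1.right  # the subtree rooted at 8 is shared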
This technique is known as "structural sharing", he said. The algorithm used by MemHive is a hash array mapped trie (HAMT). He had already implemented HAMT for CPython (as hamt.c), "but it is kind of hidden"; it is used by contextvars in the standard library. More information can be found in Selivanov's PEP 567 ("Context Variables"). He noted that the Python source file has a huge comment that describes HAMT in great detail; "it's a fascinating data structure" that is worth taking a look at.
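The HAMT is also available outside of CPython's internals: Selivanov's immutables package on PyPI provides it as a standalone Map type with essentially the API imagined above.

    from immutables import Map   # pip install immutables

    m1 = Map({'a': 1})
    m2 = m1.set('b', 2)          # O(log n); m2 shares most of its nodes with m1
    assert m1.get('b') is None   # m1 is unchanged
    assert m2['a'] == 1 and m2['b'] == 2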
Using efficient immutability
This technique can be used by the subinterpreter pool to efficiently share a map between workers and the main subinterpreter, since they all share the same memory space. A worker's "copy" can simply reference all of the HAMT nodes that live in the main subinterpreter's memory. Since the data is immutable, no locks are needed to access the entries. An even more "mind-blowing" feature is that the worker can add new entries by creating new nodes in its version for only those nodes that are affected by a given change; all of the other nodes are still just referenced from the main subinterpreter's version. When the worker passes the mapping back, only the changed nodes need to be copied into the main subinterpreter.
Though there is no locking required, there are some additional constraints. The values in the map need to be lazily copied to a new object when they are accessed in a worker, because Python objects cannot be shared directly between subinterpreters. Doing so only requires a memcpy(), "which is extremely fast", he said.
MemHive currently implements a shareable map, as described; it could also implement lists, Selivanov said, because there is an algorithm similar to HAMT that could be applied to them. It also implements lazy copies of things like integers, strings, tuples, and so on, though more of those could be added as well.
He then turned to benchmarks. He used a map with one-million entries that gets passed to a worker, which changes N keys, then hands the map back to the main subinterpreter. The piece that he was measuring was that latter step of returning the changes, which can be done using pickle or "all of this structural-sharing magic". The graph he showed had two flat lines, one around 1500ms for the pickle solution and another at zero for the MemHive technique; the number of keys changed ranged from zero to 100. Each time, the entire map had to be serialized and then deserialized using pickle, so those times remained the same, while the amount of time spent by MemHive was so minimal that it was lost in the scale of the graph.
To remedy that, he displayed a graph of just MemHive with a scale from zero to 0.2ms; it showed that the amount of time did grow from around 0.01ms for one change, to around 0.18ms for 100. So returning a map with a million keys with a small number of changes "is a sub-millisecond thing" using MemHive. Even with smaller maps, using immutable data is a significant win over pickle; with ten entries and up to ten changes, pickle took around 30µs for each, while MemHive ranged from zero to less than 10µs. Even when changing the entire map, the copying that MemHive does is quite a bit faster than using pickle.
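The pickle baseline is easy to approximate; a rough sketch (with timings that will vary by machine) shows that the full serialization cost is paid no matter how few keys changed:

    import pickle
    import time

    data = {i: i for i in range(1_000_000)}

    start = time.perf_counter()
    copy = pickle.loads(pickle.dumps(data))   # full round trip for any change
    elapsed = time.perf_counter() - start
    print(f"pickle round trip: {elapsed * 1000:.0f}ms")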
He had some doubts about showing his next slide, since several prominent core Python developers had warned him that some people would not like it. In fact, he is immediately skeptical when he sees something like it himself. But it is true that the speedup of using an immutable map over using pickle is, as his slide said, "6x - 150,000x", though it came with a "your mileage may vary" disclaimer. As would be expected, the speedup depends on how many changes are made to the shared map; if all of the map is updated, it will be six times faster, but if just one key is updated, "it can be this ridiculous number times faster".
He would love to talk more about the implementation details, he said, but was running out of time; it would make for a full talk of its own or perhaps a blog post at some point. Meanwhile, he wanted to point out that MemHive is not truly open source, by his definition, even though the code is available under the Apache License, version 2. He had only created the GitHub repository the previous day, but, to him, open source means more than just making the code available.
He is not planning to take pull requests or answer issues that are filed in the repository; in addition, the code is not stable, as it is just a prototype, though it does implement "the hardest part of it all". There are still bugs that need fixing and "a few new APIs have to be added". He added a statement about not using MemHive in production, "OR YOU WILL BE FIRED" to the project's README. He is building MemHive for his company, EdgeDB, which provides a database with "a fairly big Python component in it". He and his colleagues are going to be working on MemHive, testing it, and "dogfooding it" because the product needs it "to optimize our cloud deployment". Perhaps that is why he added "UNLESS YOU WORK AT EDGEDB" to the README warning sometime after the talk.
Q&A
The first question was about the most common task that would benefit from the MemHive approach. Selivanov said that the simplest task that would benefit is similar to what he described: multiple asyncio programs sharing a cache. Another might be for an application that does a lot of number crunching based on network submissions; there could be a pool of subinterpreters doing that work with other subinterpreters running asyncio event loops to handle the requests and to dispatch them to the number crunchers.
Another attendee asked about support for lists: is that something that he and EdgeDB need and will be working on? It is hard to say for sure at this point, Selivanov said, but if he does work on it, he would use the same source for the list data structure as he did for HAMT: the Clojure language. If he does not end up doing it, someone else could easily enough; it will "take about a week of programming, it's very doable".
Python tuples are immutable, but they can contain mutable entries, an attendee said; what happens if a tuple with mutable entries is shared using MemHive? After calling it "an excellent question", Selivanov said that MemHive prohibits accessing mutable elements, so if a tuple containing some integers and a list is passed, the worker can access the integers in the tuple, but not the list. He is not sure if he will keep the current behavior or create a tuple-equivalent that will only allow approved data types to be stored in it.
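For reference, a tuple's immutability is only skin-deep, which is what makes the question tricky:

    t = (1, 2, [3, 4])   # the tuple itself is immutable...
    t[2].append(5)       # ...but the list inside it can still be mutated
    print(t)             # (1, 2, [3, 4, 5])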
The final question was about the benchmarks, which used a map with integer keys, from zero up to the maximum, and then changed the values of the first N keys. The attendee thought that those keys would likely end up in the same branch of the underlying trie; does it make a difference if the keys to be changed are randomly chosen throughout the range? Selivanov said that he has run the benchmark many times and that the Python hash function changes on each run, so the first N keys end up all over the trie. In the end, though, it does not matter much, he said; if you have a map with one million entries, the HAMT behind it is "very diverse, so one way or another you are probably going to be mutating the same number of branches".
It will be interesting to see where MemHive, and the general idea of structural sharing between subinterpreters, goes. Even once the GIL penalty is removed, avoiding the synchronization headaches of standard multithreaded programs certainly has some advantages. Though somewhat overshadowed by the "big news" of the free-threading (no GIL) Python interpreter, subinterpreters seemingly have a lot to offer as well.
[I would like to thank the Linux Foundation, LWN's travel sponsor, for travel assistance to Pittsburgh for PyCon.]
| Index entries for this article | |
|---|---|
| Conference | PyCon/2024 |
| Python | Subinterpreters |
