MemHive: sharing immutable data between Python subinterpreters
Immutable data makes concurrent access easier, since it eliminates the data races that can plague multithreaded programs. At PyCon 2024, Yury Selivanov introduced an early-stage project called MemHive, which uses Python subinterpreters and immutable data to overcome the thread-serialization problems caused by the language's Global Interpreter Lock (GIL). Recent developments in the Python world have opened up different strategies for avoiding the longstanding problems with the GIL.
Selivanov began by displaying the output of top showing what was running on his laptop: a single Python process running ten asyncio event loops. Each event loop was running concurrently in its own thread, and together they were saturating his laptop's ten CPU cores. They were sharing a single mapping data structure (similar to a dictionary) between them to exchange values; the mapping had one million keys and values.
Fast services
If you want to build a fast Python service, he asked, how do you do it? You can use asyncio in a single Python process, which will allow servicing lots of requests, but on a 64-core server, "this will make it sad" because it can only use one core. Another possibility is multiprocessing, which "is battle-tested [and] industry-standard", but sharing data via a database adds latency; sharing in memory is always going to be faster, but processes do not share their address space. Another alternative is to use threads; "if you have a galaxy brain, if you know how to control threads, if you know how to write thread-safe code, maybe you can do that—except in Python we have GIL".
He asked how many attendees wanted to see the GIL removed; when faced with the large number of hands up, he said that he might be in the wrong room because "I actually kind of like GIL". He believes that "the complexity of multithreaded programming is absolutely ridiculous". The GIL makes that moot, currently, but there is an effort to remove it, though it will likely take a couple of years for that work to stabilize and for all of the ecosystem to be compatible with it, he said.
![Yury Selivanov](https://static.lwn.net/images/2024/pycon-selivanov-sm.png)
"So what do we do now?" He suggested that subinterpreters, which each got their own GIL fairly recently (in Python 3.12, via PEP 684), might be the right approach. Subinterpreters run completely independently of each other and cannot exchange anything between them, "except some very primitive information". But this mechanism allows for parallel execution in a single Python process. In addition, the subinterpreters all share the same memory space.
One could imagine an architecture where there is one main subinterpreter that has a pool of worker subinterpreters; perhaps there would be queues that are used to transmit work to these workers. But, in order to send data to those workers, the Python objects need to be serialized using, say, pickle, which is slow, Selivanov said.
So there is a need to avoid using pickle. He asked: "What if there was a way to actually have safe shared state" so that all of the subinterpreters can access it efficiently? He built the MemHive library to provide that functionality. It has three levels of API; level one provides a simple protocol with just a few functions, which allows sending data across the subinterpreter boundary, while level two is at a higher level, adding queues and synchronization primitives. Level three is an "asyncio bridge", which he showed briefly in an example.
The example creates an async worker function that is added ten times to an AsyncMemHive using its add_async_worker() method; each of those will be run in a separate subinterpreter. The worker functions receive a parameter that allows them to access the shared state and receive messages from the main subinterpreter. The shared state is created in the main subinterpreter as a memhive.Map object; the main subinterpreter can also broadcast messages to all of the workers or queue work items, which will then be sent to a single worker. All of that can be seen in the YouTube video of the talk.
Selivanov said that his talk could have been about the API, how to build things with it, and so on, but he thinks it is a pretty minimal API that people can figure out for themselves for the most part. Instead, he wanted to talk about how it is implemented "so that you don't think that this is some witchcraft magic"; he wanted attendees to understand how it works so that they can use it, or apply the techniques to their own code.
Immutability
Python already has a few immutable data types, he said, including strings, integers, floating-point numbers, byte arrays, and tuples. For example, the first character of a string cannot be changed in place; instead, a new string must be created. The same is true for tuples and the others, but there is no immutable dictionary in Python, which "is a shame because it would be nice to have it". He imagined how it might work with an immutable Map type:
```python
m1 = Map({'a': 1})
m2 = m1.set('b', 2)  # create a new Map with a and b
```
That code would initialize a Map using a dictionary (or it could perhaps also take key-value pairs), then a new Map can be created to add another entry to it. That API is pretty much the same as what is used for tuples in Python:
```python
t1 = (x, y)
t2 = t1 + (z,)
```
There is a need for efficiency, however. The tuple example has an O(n) complexity, where "n" is the number of elements in the tuple; that means it requires CPU time proportional to the length of the tuple. Typically, tuples are used for small records with a few elements, so that cost is not onerous, but maps, which may be used as a cache for a web application, may have millions of elements. That kind of price cannot be paid for adding an entry to a cache of that sort; O(n) is "unacceptably slow" for the use case.
Fortunately, there are algorithms that can be used to make these changes with O(log n) complexity; O(log n) grows much more slowly than O(n). He handwaved a bit to reduce the complexity order down to O(meh), at least for him, noting that for maps with only a few tens of thousands of elements, the operation is "fairly quick". It was (obviously) not particularly rigorous and he did not really address the gap between "a few tens of thousands" and the "millions" he mentioned earlier.
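Some rough arithmetic (mine, not Selivanov's) shows why the logarithm matters for the million-entry case:

```python
import math

n = 1_000_000  # entries in the map from the talk

# Copying the whole container on every update touches every element:
full_copy = n                           # 1,000,000 operations

# A balanced binary tree touches only one root-to-leaf path:
binary_path = math.ceil(math.log2(n))   # 20 nodes

# A HAMT branches 32 ways per level, so the path is shorter still:
hamt_path = math.ceil(math.log(n, 32))  # 4 nodes

print(full_copy, binary_path, hamt_path)
```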
The algorithm works by using a tree data structure to represent the map; the API to the map does not expose the tree to users, but it underlies the map. When adding a new entry, there will be many nodes in the tree that do not need to change at all, thus they do not need to be copied; since tree nodes are immutable, they must be copied when their contents change. However, only the nodes that are directly affected by the change, or that refer to those nodes, need to be copied and updated. In general, just a small handful of nodes in a huge tree will need adjustment for any given key-value change. Switching back to the use case of millions of entries, he suggested that adding an element might only affect five nodes out of, say, 10,000 in the tree.
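The node-copying idea can be sketched in a few lines of Python. This is not MemHive's data structure (its HAMT branches 32 ways per level and is written in C); it is a deliberately tiny binary trie keyed on hash bits, just to show that set() rebuilds only the path to the changed leaf while every other node is shared:

```python
# Toy persistent map: set() copies only the root-to-leaf path for the
# affected key; all untouched subtrees are shared with the old map.

class Leaf:
    def __init__(self, key, value):
        self.key, self.value = key, value

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right

def _set(node, h, depth, key, value):
    if node is None:
        return Leaf(key, value)
    if isinstance(node, Leaf):
        if node.key == key:
            return Leaf(key, value)            # overwrite: fresh leaf
        # Split: push the existing leaf one level down, then retry.
        bit = (hash(node.key) >> depth) & 1
        branch = Node(None, node) if bit else Node(node, None)
        return _set(branch, h, depth, key, value)
    if (h >> depth) & 1:                       # only one child is rebuilt;
        return Node(node.left, _set(node.right, h, depth + 1, key, value))
    return Node(_set(node.left, h, depth + 1, key, value), node.right)

class Map:
    def __init__(self, root=None):
        self._root = root
    def set(self, key, value):
        return Map(_set(self._root, hash(key), 0, key, value))
    def get(self, key):
        node, h, depth = self._root, hash(key), 0
        while isinstance(node, Node):
            node = node.right if (h >> depth) & 1 else node.left
            depth += 1
        if node is not None and node.key == key:
            return node.value
        raise KeyError(key)

m1 = Map()
for i in range(1_000):
    m1 = m1.set(i, i * 2)
m2 = m1.set(7, 99)                      # new map; m1 is untouched

print(m1.get(7), m2.get(7))             # 14 99
# hash(7) has its low bit set, so only the right spine was copied; the
# whole left subtree is shared, by identity, between the two maps:
print(m2._root.left is m1._root.left)   # True
```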
This technique is known as "structural sharing", he said. The algorithm used by MemHive is a hash array mapped trie (HAMT). He had already implemented HAMT for CPython (as hamt.c), "but it is kind of hidden"; it is used by contextvars in the standard library. More information can be found in Selivanov's PEP 567 ("Context Variables"). He noted that the Python source file has a huge comment that describes HAMT in great detail; "it's a fascinating data structure" that is worth taking a look at.
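That hidden HAMT can be observed from pure Python through contextvars: copy_context() is cheap because the snapshot shares the tree that backs the live context rather than copying its entries one by one.

```python
import contextvars

user = contextvars.ContextVar("user", default="nobody")
user.set("alice")

# copy_context() does not duplicate the entries; the snapshot shares
# the HAMT backing the live context, so it is cheap even when many
# variables are set.
snapshot = contextvars.copy_context()

user.set("bob")          # visible only in the live context

print(snapshot[user])    # alice: the snapshot kept its version
print(user.get())        # bob
```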
Using efficient immutability
This technique can be used by the subinterpreter pool to efficiently share a map between workers and the main subinterpreter, since they all share the same memory space. A worker can simply reference all of the HAMT nodes in the main subinterpreter in its "copy" by looking into the memory of the main subinterpreter. Since it is immutable, no locks are needed to access the entries. An even more "mind-blowing" feature is that the worker can add new entries by creating new nodes in its version for any nodes that are affected by a given change; all of the other nodes are still just referenced from the version in the main subinterpreter. When the worker passes the mapping back, only the changed nodes need to be copied into the main subinterpreter.
Though there is no locking required, there are some additional constraints. The values in the map need to be lazily copied to a new object when they are accessed in a worker, because Python objects cannot be shared directly across subinterpreters. Doing so only requires a memcpy(), "which is extremely fast", he said.
MemHive currently implements a shareable map, as described; it could also implement lists, Selivanov said, because there is an algorithm similar to HAMT that could be applied to them. It also implements lazy copies of things like integers, strings, tuples, and so on, though more of those could be added as well.
He then turned to benchmarks. He used a map with one million entries that gets passed to a worker, which changes N keys, then hands the map back to the main subinterpreter. The piece he was measuring was that latter step of returning the changes, which can be done using pickle or "all of this structural-sharing magic". The graph he showed had two flat lines: one around 1500ms for the pickle solution and another at zero for the MemHive technique; the number of keys changed ranged from zero to 100. Each time, the entire map had to be serialized and then deserialized using pickle, so those times remain the same, while the amount of time spent by MemHive was so minimal that it was lost in the scale of the graph.
To remedy that, he displayed a graph of just MemHive with a scale from zero to 0.2ms; it showed that the amount of time did grow from around 0.01ms for one change, to around 0.18ms for 100. So returning a map with a million keys with a small number of changes "is a sub-millisecond thing" using MemHive. Even with smaller maps, using immutable data is a significant win over pickle; with ten entries and up to ten changes, pickle took around 30µs for each, while MemHive ranged from zero to less than 10µs. Even when changing the entire map, the copying that MemHive does is quite a bit faster than using pickle.
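The flat pickle line is easy to reproduce without MemHive: the serialized size, and therefore the serialization cost, tracks the size of the whole map rather than the number of changed keys. A small sketch, with a 100,000-entry map standing in for the talk's million-entry one:

```python
import pickle

# A 100,000-entry map standing in for the talk's million-entry one.
base = {i: i for i in range(100_000)}

one_change = dict(base)
one_change[0] = -1            # a single modified value

many_changes = dict(base)
for i in range(1, 101):       # a hundred modified values
    many_changes[i] = -i

sizes = [len(pickle.dumps(m)) for m in (base, one_change, many_changes)]
print(sizes)
# All three pickles are essentially the same size: the whole map is
# serialized every time, which is why the pickle line in the benchmark
# stays flat no matter how few keys changed.
```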
He had some doubts about showing his next slide, since several prominent core Python developers had warned him that some people would not like it. In fact, he is immediately skeptical when he sees something like it. But it is true that the speedup of using an immutable map over using pickle is, as his slide said, "6x - 150,000x", though it came with a "your mileage may vary" disclaimer. As would be expected, the speedup will depend on how many changes are made to the shared map; if all of the map is updated, it will be six times faster. If just one key is updated, though, "it can be this ridiculous number times faster".
He would love to talk more about the implementation details, he said, but was running out of time; it would make for a full talk of its own or perhaps a blog post at some point. Meanwhile, he wanted to point out that MemHive is not truly open source, by his definition, even though the code is available under the Apache License, version 2. He had only created the GitHub repository the previous day, but, to him, open source means more than just making the code available.
He is not planning to take pull requests or answer issues that are filed in the repository; in addition, the code is not stable, as it is just a prototype, though it does implement "the hardest part of it all". There are still bugs that need fixing and "a few new APIs have to be added". He added a statement about not using MemHive in production, "OR YOU WILL BE FIRED" to the project's README. He is building MemHive for his company, EdgeDB, which provides a database with "a fairly big Python component in it". He and his colleagues are going to be working on MemHive, testing it, and "dogfooding it" because the product needs it "to optimize our cloud deployment". Perhaps that is why he added "UNLESS YOU WORK AT EDGEDB" to the README warning sometime after the talk.
Q&A
The first question was about the most-common task that would benefit from the MemHive approach. Selivanov said that the simplest task that would benefit is similar to what he described: multiple asyncio programs sharing a cache. Another might be for an application that does a lot of number crunching based on network submissions; there could be a pool of subinterpreters doing that work with other subinterpreters running asyncio event loops to handle the requests and to dispatch them to the number crunchers.
Another attendee asked about support for lists; is that something that he and EdgeDB need, so they will be working on it? It is hard to say for sure at this point, Selivanov said, but if he does, he would use the same source for the list data structure as he did for HAMT: the Clojure language. If he does not end up working on it, someone else can easily enough; it will "take about a week of programming, it's very doable".
Python tuples are immutable, but they can contain mutable entries, an attendee said, what happens if a tuple with mutable entries is shared using MemHive? After calling it "an excellent question", Selivanov said that MemHive prohibits accessing mutable elements, so if a tuple with some integers and a list as entries is passed, the worker can access the integers in the tuple, but not the list. He is not sure if he will keep the current behavior or create a tuple-equivalent that will only allow approved data types to be stored in it.
The final question was about the benchmarks, which used a map with integer keys, from zero up to the max, and then changed the values of the first N keys. The attendee thought that those keys would likely end up in the same branch of the underlying trie; does it make a difference if the keys to be changed are randomly chosen throughout the range? Selivanov said that he has run the benchmark many times and that the Python hash function changes on each run so that the first N keys end up all over the trie. In the end, though, it does not matter much, he said; if you have a map with one million entries, the HAMT behind it is "very diverse, so one way or another you are probably going to be mutating the same number of branches".
It will be interesting to see where MemHive, and the general idea of structural sharing between subinterpreters, goes. Even once the GIL penalty is removed, avoiding the synchronization headaches of standard multithreaded programs certainly has some advantages. Though somewhat overshadowed by the "big news" of the free-threading (no-GIL) Python interpreter, subinterpreters seemingly have a lot to offer as well.
[I would like to thank the Linux Foundation, LWN's travel sponsor, for travel assistance to Pittsburgh for PyCon.]
Index entries for this article

| Topic | Entry |
|---|---|
| Conference | PyCon/2024 |
| Python | Subinterpreters |
Comments

GC?
Posted Aug 28, 2024 22:26 UTC (Wed) by nickodell (subscriber, #125165)

GC?
Posted Aug 29, 2024 19:19 UTC (Thu) by malefic (guest, #37306)

what-about-ism
Posted Aug 29, 2024 7:22 UTC (Thu) by k3ninho (subscriber, #50375)

K3n.

Multithreading is not hard
Posted Aug 29, 2024 9:15 UTC (Thu) by mb (subscriber, #50428)

> if you know how to write thread-safe code, maybe you can do that
> [..]
> He believes that "the complexity of multithreaded programming is absolutely ridiculous".

Multithreading is really simple, not complex, safe and fast in Rust.

Hate to say it, but please try Rust. Maybe Python can adapt some of the Rust techniques? What about channels, where object ownership is cheaply transferred from one thread to another?

Multithreading is not hard
Posted Aug 29, 2024 20:34 UTC (Thu) by rc00 (guest, #164740)

Isn't the same true for Go? Go is simpler, in fact.

Multithreading is not hard
Posted Aug 29, 2024 21:22 UTC (Thu) by intelfx (subscriber, #130118)

> Multithreading is really simple, not complex, safe and fast in Rust.

Not really. Go "handwaves away" hard problems in concurrency; Rust solves them rigorously and statically.

Multithreading is not hard
Posted Aug 31, 2024 19:37 UTC (Sat) by LtWorf (subscriber, #124958)

I guess Go might be easier if you stay within its pseudo actor-model sandbox (with the goroutines and channels), but once you outgrow it and start requiring some sort of shared state... it's no better than anything else (if not worse, because the apparent simplicity of the managed runtime makes you complacent).

Multithreading is not hard
Posted Aug 31, 2024 19:48 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)

But once you step away from that, it's not at all fun. This is slowly changing, with the advent of generics, but it's still nowhere close in safety to Rust.

And I'm talking as someone who _loves_ writing Go code.

Concurrency
Posted Sep 1, 2024 23:02 UTC (Sun) by tialaramex (subscriber, #21167)

Go is simple if you're simple. If you try to be clever, well, too bad, now the world is on fire and it's your fault.

This does make the code reviews for Google's large teams of Go programmers easier. "This seems clever, don't do that, rejected." But what if we *need* to be clever?

At Google, their answer would be "Don't use Go", which is how we got to this conversation. I actually have two suggestions; one you already heard: use Rust.

But the other might fit Python's historical place as an educational language better: structured concurrency. Today we take it for granted that our control flow is structured, and you'd probably be outraged if Python proposed to add a goto statement. But we write unstructured concurrent software all the time; maybe we shouldn't. Unlike control flow, we don't have decades of experience with languages that prefer structure for their concurrency, but we can't get that experience if nobody does it. And it's not a weird research niche: C++ is kinda sorta dipping its toes in this water, and Python could do that too.

Concurrency
Posted Sep 3, 2024 13:06 UTC (Tue) by anselm (subscriber, #2796)

Since we're talking about Go, it seems appropriate to quote Brian Kernighan: "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

Concurrency
Posted Sep 3, 2024 14:37 UTC (Tue) by Wol (subscriber, #4433)

Cheers,
Wol

Concurrency
Posted Sep 3, 2024 15:08 UTC (Tue) by tialaramex (subscriber, #21167)

But I did mean it; sometimes we need to be clever. "I'm not smart enough to debug this" is a reason your $5 CRUD app just shouldn't do it, but maybe this $100M government crash programme can hire somebody even smarter to debug it? Or six people who together are smart enough?

There's a difference between how "clever" the code in this timetable app should be, and how "clever" the code in the operating system which shovels pixels from the timetable app onto my screen in real time can afford to be. The timetable app was given a brief glance by a colleague; hopefully that pixel-shovelling code was stared at by several experts who've seen similar scary code before, and it might even have been properly tested if we're lucky.

Two different variants of "clever"
Posted Sep 3, 2024 15:36 UTC (Tue) by farnz (subscriber, #17727)

Note, too, that "clever" can mean two different things: code that is clever because the underlying problem is genuinely hard to understand, and code that is clever in the sense of being needlessly tricky.

Go very much takes the perspective that all clever code falls under the second category of "clever"; the XOR trick for swapping integers, for example, is usually an example of the second category. But there are some problems that are genuinely hard to solve (routing parcel-delivery vehicles, for example), where some components of the solution will be "clever" in the first sense - it's genuinely not easy to understand what's involved.

Multithreading is not hard
Posted Aug 31, 2024 7:49 UTC (Sat) by bluss (guest, #47454)

This is a vastly safer approach to parallelism in Python than the alternatives, and leaves the door open for things that you mention - channels - and structured parallelism and other things!

The other alternative (not championed by the host) is to remove the GIL entirely, which is called free threading. That is of course exciting for the features it allows, but from a thread-safety standpoint it is vastly more challenging - both for the interpreter developers (CPython developers) and for Python developers themselves.

Must be this tall
Posted Sep 1, 2024 22:50 UTC (Sun) by tialaramex (subscriber, #21167)

Every time I read that story it briefly confuses me, and then I remember that David Baron is a common name. Professor David Baron was one of my CS lecturers; he was partly responsible for CPL (and thus eventually BCPL, and from there B and then C).

But no, I don't expect Python will adopt Rust's approach. You need some pretty heavy lifting from your type system, and that's not really Python's cup of tea.

Multithreading is not hard
Posted Sep 4, 2024 7:51 UTC (Wed) by jengelh (guest, #33263)

The amount of "why not Rust"-ish comments under topics of *various* languages is too damn high!

If the only tool you have is a hammer, you tend to see every problem as a nail.
Posted Sep 4, 2024 11:02 UTC (Wed) by Wol (subscriber, #4433)

But if you've got a toolbox full of tools, and the best tool is a hammer, then the problem almost certainly IS a nail!!!

EVERYBODY should at least learn a smattering of Rust. Likewise a smattering of Forth and Guile/Scheme. The more tools you have, the more you know what those tools are capable of, the more you'll be able to do.

And for what it's worth, I've seen a LOT of people say that learning Rust made them a better C / C++ programmer. I'M NOT SURPRISED.

I don't think I've seen a single person say learning Rust was a waste of time. I have, unfortunately, seen a lot of people say they don't have time to learn it, or they need a project (which they don't have) to help them get to grips with it. I'd put myself in that boat, actually ...

So go out and learn Rust. Whether you use it in anger, or not, everything here makes me believe the effort will be worth it.

Cheers,
Wol

Rust is different
Posted Sep 4, 2024 13:08 UTC (Wed) by kleptog (subscriber, #1183)

Maybe because Rust is one of the few new languages that actually brings something new to the table. After C you have C++/Java/Go/C#, which are all variations on a theme. They have their differences, but if you know one you basically know them all. It's just a matter of learning the idiosyncrasies and standard library.

With Rust I actually get the feeling that the compiler is taking a load off my shoulders.

Looking more like the Actor model
Posted Aug 29, 2024 13:57 UTC (Thu) by kleptog (subscriber, #1183)

It's a different style of programming though, which doesn't really fit into Python's style, so it feels a little weird. But if it means you can get more done without getting stuck on the GIL, yay, I guess. OTOH, if you wanted performance, why are you using Python at all?

Trie was new to me
Posted Aug 30, 2024 12:17 UTC (Fri) by smitty_one_each (subscriber, #28989)