An introduction to asynchronous Python
In his PyCon 2017 talk, Miguel Grinberg wanted to introduce asynchronous programming with Python to complete beginners. There is a lot of talk about asynchronous Python, especially with the advent of the asyncio module, but there are multiple ways to create asynchronous Python programs, many of which have been available for quite some time. In the talk, Grinberg took something of a step back from the intricacies of those solutions to look at what asynchronous processing means at a higher level.
He started by noting that while he does a lot of work on the Flask Python-based web microframework, this talk would not be about Flask. He did write the Flask Mega-Tutorial (and a book on Flask), but he would be trying to mention it less than ten times during the talk—a feat that he managed admirably. He has also developed a Python server for Socket.IO that started out as something for "that framework", but has since "taken on a life of its own".
![Miguel Grinberg](https://static.lwn.net/images/2017/pycon-grinberg-sm.jpg)
He asked attendees if they had heard people say that "async makes your code go fast". If so, he said, his talk would explain why people say that. He started with a simple definition of "async" (as "asynchronous" is often shortened): it is one way of doing concurrent programming, which means doing many things at once. He was not referring only to asyncio here, as there are many ways to have Python do more than one thing at once.
He then reviewed those mechanisms. First up was multiple processes, where the operating system (OS) does all the work of multi-tasking. In CPython (the reference Python implementation), that is the only way to use all of the cores in the system. Another way to do more than one thing at once is by using multiple threads, which also has the OS handle the multi-tasking, but Python's Global Interpreter Lock (GIL) prevents multi-core concurrency. Asynchronous programming, on the other hand, does not require OS participation; there is a single process and thread, but the program can get multiple things done at once. He asked: "what's the trick?"
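As a rough illustration (mine, not from the talk), the standard concurrent.futures module makes the difference visible: the same CPU-bound function run in a process pool can use every core, while a thread pool stays serialized by the GIL:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def busy(n):
    # CPU-bound work; the GIL allows only one thread at a time to
    # execute Python bytecode, so threads cannot parallelize this
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:    # spreads across all cores
        print(sum(pool.map(busy, [10**6] * 4)))
    with ThreadPoolExecutor() as pool:     # GIL-bound; effectively one core
        print(sum(pool.map(busy, [10**6] * 4)))
```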
Chess
He turned to a real-world example of how this works: a chess exhibition, where a chess master takes on, say, 24 opponents simultaneously. "Before computers killed the fun out of chess", these kinds of exhibitions were done regularly, but he is not sure if they still are. If each game takes around 30 move pairs to complete, the master would require twelve hours to finish the matches if they were played consecutively (at one minute per move pair). By sequentially making moves in each game, though, the whole exercise can be completed in an hour. The master simply makes a move at a board (in, say, five seconds) and then goes on to the next, leaving the opponent lots of time to move before the master returns (after making 23 other moves). The master will "cream everyone" in that time, Grinberg said.
It is "this kind of fast" that people are talking about for async programming. The chess master is not optimized to go faster, the work is arranged so that they do not waste time waiting. "That is the complete secret" to asynchronous programming, he said, "that's how it works". In that case, the CPU is the chess master and it waits the least amount of time possible.
But attendees are probably wondering how that can be done using just one process and one thread. How is async implemented? One thing that is needed is a way for functions to suspend and resume their execution. They will suspend when they are waiting and resume when the wait is over. That sounds like a hard thing to do, but there are four ways to do that in Python without involving the OS.
The first way is with callback functions, which is "gross", he said; so gross, in fact, that he was not even going to give an example of that. Another is using generator functions, which have been a part of Python for a long time. More recent Pythons, starting with 3.5, have the async and await keywords, which can be used for async programs. There is also a third-party package, greenlet, that has a C extension to Python to support suspend and resume.
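A tiny generator-based sketch (my example, not one of Grinberg's) shows the suspend-and-resume primitive in action:

```python
def hello():
    print('Hello')
    yield             # suspend here; control returns to the caller
    print('World!')

gen = hello()
next(gen)             # runs up to the yield and prints "Hello"
# ... the caller is free to do other work while hello() is suspended ...
next(gen, None)       # resumes after the yield and prints "World!"
```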
There is another piece needed to support asynchronous programming: a scheduler that keeps track of suspended functions and resumes them at the right time. In the async world, that scheduler is called an "event loop". When a function suspends, it returns control to the event loop, which finds another function to start or resume. This is not a new idea; it is effectively the same as "cooperative multi-tasking" that was used in old versions of Windows and macOS.
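A toy scheduler along these lines (a sketch of the idea, not asyncio's actual implementation) might keep a queue of suspended generators and resume each one when its wait has expired:

```python
import time
from collections import deque

def hello(n):
    print(n, 'Hello')
    yield 3            # ask the loop to resume us in three seconds
    print(n, 'World!')

# the "event loop": resume each suspended task once its wait is over
# (a real loop would sleep until the next deadline instead of spinning)
tasks = deque((hello(i), 0.0) for i in range(3))
while tasks:
    gen, ready_at = tasks.popleft()
    if time.time() < ready_at:
        tasks.append((gen, ready_at))   # not ready yet; check again later
        continue
    try:
        delay = next(gen)               # run the task to its next yield
        tasks.append((gen, time.time() + delay))
    except StopIteration:
        pass                            # the task has finished
```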
Examples
Grinberg created examples of a simple "hello world" program using some of the different mechanisms. He did not get to all of them in the presentation and encouraged the audience to look at the rest. He started with a simple synchronous example that had a function that slept for three seconds between printing "Hello" and "World!". If he called that in a loop ten times, it would take 30 seconds to complete since each function would run back to back.
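The synchronous version, reconstructed from his description, looks something like this:

```python
import time

def hello():
    print('Hello')
    time.sleep(3)
    print('World!')

if __name__ == '__main__':
    for _ in range(10):
        hello()        # each call blocks for three seconds: 30 in all
```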
He then showed two examples using asyncio. They were essentially the same, but one used the @coroutine decorator for the function and yield from in the body (the generator function style), while the other used async def for the function and await in the body. Both used the asyncio version of the sleep() function to sleep for three seconds between the two print() calls. Beyond those differences, and some boilerplate to set up the event loop and call the function from it, the two functions had the same core as the original example. The non-boilerplate differences are by design; asyncio makes the places where code suspends and resumes "very explicit".
The two programs are shown below:
```python
# async/await version
import asyncio

loop = asyncio.get_event_loop()

async def hello():
    print('Hello')
    await asyncio.sleep(3)
    print('World!')

if __name__ == '__main__':
    loop.run_until_complete(hello())
```

```python
# @coroutine decorator version
import asyncio

loop = asyncio.get_event_loop()

@asyncio.coroutine
def hello():
    print('Hello')
    yield from asyncio.sleep(3)
    print('World!')

if __name__ == '__main__':
    loop.run_until_complete(hello())
```
Running the program gives the expected result (three seconds between the two strings), but it gets more interesting when the event loop runs multiple instances of the function concurrently. With ten instances, the result will be ten "Hello" strings, a three-second wait, then ten "World!" strings.
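The article does not show his exact loop code; one way to get that behavior (the choice of asyncio.gather() here is mine) is to hand the event loop ten coroutines at once:

```python
import asyncio

async def hello():
    print('Hello')
    await asyncio.sleep(3)
    print('World!')

loop = asyncio.get_event_loop()
# all ten coroutines start, suspend at the same sleep, and resume
# together: ten "Hello"s, one three-second wait, ten "World!"s
loop.run_until_complete(asyncio.gather(*(hello() for _ in range(10))))
```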
There are other examples for mechanisms beyond asyncio, including for greenlet and Twisted. The greenlet examples look almost exactly the same as the synchronous example, just using a different sleep(). That is because greenlet tries to make asynchronous programming transparent, but hiding those differences can be a blessing and a curse, Grinberg said.
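For instance, with gevent (one of the greenlet-based frameworks) the function body is untouched except for the sleep call; this sketch is mine, not one of his examples:

```python
import gevent

def hello():
    print('Hello')
    gevent.sleep(3)    # the only change from the synchronous version
    print('World!')

# ten greenlets sleep concurrently: about three seconds in total
gevent.joinall([gevent.spawn(hello) for _ in range(10)])
```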
Pitfalls
There are some pitfalls in asynchronous programming and people "always trip on these things". If there is a task that requires heavy CPU use, nothing else will be done while that calculation is proceeding. In order to let other things happen, the computation needs to release the CPU periodically. That could be done by sleeping for zero seconds, for example (using await asyncio.sleep(0)).
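For example, a CPU-heavy coroutine can yield to the event loop every so often; expensive_step() below is a hypothetical stand-in for the real computation:

```python
import asyncio

async def crunch(chunks):
    total = 0
    for i, chunk in enumerate(chunks):
        total += expensive_step(chunk)   # hypothetical CPU-bound helper
        if i % 1000 == 0:
            await asyncio.sleep(0)       # release the CPU to other tasks
    return total
```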
Much of the Python standard library is written in blocking fashion, however, so the socket, subprocess, and threading modules (and other modules that use them) and even simple things like time.sleep() cannot be used in async programs. All of the asynchronous frameworks provide their own non-blocking replacements for those modules, but that means "you have to relearn how to do these things that you already know how to do", Grinberg said.
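asyncio's stream API is one such replacement for the blocking socket module; a minimal sketch (my example, and it assumes network access to example.com):

```python
import asyncio

async def first_line(host):
    # non-blocking replacement for socket.create_connection()
    reader, writer = await asyncio.open_connection(host, 80)
    writer.write(b'HEAD / HTTP/1.0\r\nHost: ' + host.encode() + b'\r\n\r\n')
    line = await reader.readline()       # suspends instead of blocking
    writer.close()
    return line.decode().strip()

loop = asyncio.get_event_loop()
print(loop.run_until_complete(first_line('example.com')))
```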
Eventlet and gevent, which are built on greenlet, both monkey patch the standard library to make it async compatible, but that is not what asyncio does. It is a framework that does not try to hide the asynchronous nature of programs. asyncio wants you to think about asynchronous programming as you design and write your code.
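With gevent, for example, the monkey patching is a one-time call made before anything else imports the standard library modules (eventlet's eventlet.monkey_patch() works similarly):

```python
# gevent: patch blocking standard-library modules in place
from gevent import monkey
monkey.patch_all()          # must run before other imports use the stdlib

import time
import gevent

# time.sleep() is now cooperative: both greenlets sleep concurrently
gevent.joinall([gevent.spawn(time.sleep, 3) for _ in range(2)])
```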
Comparison
He concluded his talk with a comparison of processes, threads, and async in a number of different categories. All of the techniques optimize the waiting periods; processes and threads have the OS do it for them, while async programs and frameworks do it for themselves. Only processes can use all of the cores in the system, however; threads and async programs cannot. That leads some to write programs that combine one process per core with threads and/or async functions, which can work quite well, he said.
Scalability is "an interesting one". Running multiple processes means having multiple copies of Python, the application, and all of the resources used by both in memory, so the system will run out of memory after a fairly small number of simultaneous processes (tens of processes are a likely limit), Grinberg said. Threads are more lightweight, so there can be more of those, on the order of hundreds. But async programs are "extremely lightweight", such that thousands or tens of thousands of simultaneous tasks can be handled.
The blocking standard library functions can be used from both processes and threads, but not from async programs. The GIL only interferes with threads; processes and async coexist with it just fine. But, he noted, even for threads there is only "some" interference from the GIL in his experience; when threads are blocked on I/O, they are not holding the GIL, so the OS will give the CPU to another thread.
There are not many things that are better for async in that comparison. The main advantage to asynchronous programs for Python is the massive scaling they allow, Grinberg said. So if you have servers that are going to be super busy and handle lots of simultaneous clients, async may help you avoid going bankrupt from buying servers. The async programming model may also be attractive for other reasons, which is perfectly valid, but looking strictly at the processing advantages shows that scaling is where async really wins.
A YouTube video of Grinberg's talk is available; the Speaker Deck slides are similar, but not the same as what he used.
[I would like to thank The Linux Foundation for travel assistance to Portland for PyCon.]
| Index entries for this article | |
|---|---|
| Conference | PyCon/2017 |
| Python | Async |
Posted Jun 29, 2017 16:45 UTC (Thu)
by willy (subscriber, #9762)
[Link] (5 responses)
Ooh, no. There will only be one copy of the Python interpreter text segment in memory. There will be separate data segments for each invocation, of course. And maybe that's what he meant. I'm not entirely sure what is meant by "resources", but if that's (for example) a read-only data file being processed, then there's only one copy of that too (unless python does something awful like read() it into a userspace buffer instead of mmap() it). Either way, dozens of processes being the limit seems unlikely.
Posted Jun 29, 2017 16:58 UTC (Thu)
by zlynx (guest, #2285)
[Link] (2 responses)
I noticed this on SpamAssassin on 256 MB boxes years ago. It used an initialize-and-fork model, obviously copied from a C application, perhaps Apache. It should have been very memory efficient. However, as soon as an SA worker began to work, its memory quickly unshared and started to overload the box.

And of course there are C++ apps using shared_ptr and std::string which do just as badly at this.
Posted Jun 29, 2017 19:51 UTC (Thu)
by epa (subscriber, #39769)
[Link] (1 responses)
Posted Jun 29, 2017 22:49 UTC (Thu)
by excors (subscriber, #95769)
[Link]
(Apparently constructing a shared_ptr via std::make_shared is special - that does a single allocation to contain both the refcount and the object, which is usually a good idea, but in this case you'd need to implement it differently, which should be easy enough.)
Posted Jun 29, 2017 17:48 UTC (Thu)
by dtlin (subscriber, #36537)
[Link] (1 responses)
If you fork off of a main process after loading libraries, reference counting unshares the data pretty quickly.
Posted Jun 30, 2017 5:50 UTC (Fri)
by epa (subscriber, #39769)
[Link]
A more subtle tweak would be to set the reference counts on all objects to some magic value like -1 which marks an object as used and stops the count being updated further. The parent process could call that as a one-off just before forking the workers. Then all existing objects would stay shared, but allocations made in the children (or further things allocated in the parent) would be garbage collected as usual.
Posted Jun 30, 2017 12:43 UTC (Fri)
by excors (subscriber, #95769)
[Link] (2 responses)
I think the main advantage of the async model over threading may be that you don't have to understand synchronisation - it avoids all those mutexes, semaphores, events, conditions, atomics, implicitly atomic operations in a particular interpreter implementation, GIL, ... Instead, all your code is guaranteed to run atomically except where there's a clearly-visible "await". Given that essentially no human beings are capable of understanding synchronisation perfectly in any non-trivial cases, that's a substantial benefit.
Posted Jul 1, 2017 0:25 UTC (Sat)
by neilbrown (subscriber, #359)
[Link] (1 responses)
That's a bold claim!

I think it much more likely that we don't have, or are not using, suitable semantic tools to enable us to think about synchronization in a reliable way. By "semantic tools" I mean things like "loop invariants", which I personally find make it much easier to think accurately about loops.

I think a lot of synchronization errors come about because people are either not informed about the locking requirements, or think they can take a short cut without justifying it. This suggests that it isn't a lack of capability, but a lack of tools and training.

Your point still stands, though, that it may be easier to train people in asynchrony than in synchrony.
Posted Jul 7, 2017 17:16 UTC (Fri)
by HelloWorld (guest, #56129)
[Link]