Making CPython faster

Posted Jun 3, 2021 19:06 UTC (Thu) by smurf (subscriber, #17840)
In reply to: Making CPython faster by anton
Parent article: Making CPython faster

Umm, no. As a simple counterexample, consider a server that needs to do some nontrivial computation on some nontrivial data stored in the server's RAM. While serving other clients.

As of today, this is an unsolvable problem in "pure" CPython.

Making CPython faster

Posted Jun 3, 2021 21:28 UTC (Thu) by anton (subscriber, #25547) [Link] (10 responses)

This is supposed to be a counterexample of what?

Especially on a server under load that has many concurrent clients, single-thread optimization of the existing code reduces CPU load, while splitting the program into multiple threads usually increases the amount of work CPUs have to do. E.g., for the embarrassingly parallel 700x700 matrix multiplication example, on a 2-core/4-thread Ivy Bridge OpenBLAS takes 130M CPU cycles (in 92ms) with a single thread, 192M CPU cycles (in 69ms) with two threads, and 438M CPU cycles (in 84ms) with 4 threads.

Making CPython faster

Posted Jun 3, 2021 23:10 UTC (Thu) by Paf (subscriber, #91811) [Link] (5 responses)

And *even in your example* the problem is solved faster. Parallelism isn’t free, but the other cores in my system weren’t free either.

Making CPython faster

Posted Jun 3, 2021 23:13 UTC (Thu) by Paf (subscriber, #91811) [Link] (2 responses)

Ah, I see it’s slower with 4.

Still, there is literally an entire field of scientific computation that exists because this “slower with more threads” is ... really not universal. (Though it is often less efficient, yes.)

Making CPython faster

Posted Jun 5, 2021 6:16 UTC (Sat) by mkbosmans (subscriber, #65556) [Link] (1 responses)

This is not a particularly large matrix, so perhaps it should not come as a surprise that parallel scalability is limited here.
For a typical 'HPC' workload you would have a much larger working set, either a single bigger matrix, or lots of these small to medium sized matrices. In the latter case you would parallelize over the matrices and do each individual matrix multiplication in a single thread.

Making CPython faster

Posted Jun 5, 2021 9:02 UTC (Sat) by anton (subscriber, #25547) [Link]

All 490,000 elements of the result matrix can be computed independently, i.e., in parallel. Parallel scalability in this case is limited by resource constraints and by parallel overhead (starting more threads, telling them their jobs, and waiting for all of them to complete their jobs).

Of course typical HPC workloads take more than 92ms on one CPU, otherwise nobody would bother to build parallel systems for them and nobody would bother with parallelizing the code for these CPUs. But the question is if CPython programs behave like typical HPC workloads.

Making CPython faster

Posted Jun 4, 2021 6:28 UTC (Fri) by anton (subscriber, #25547) [Link]

If multi-threaded was always worse in every respect, nobody would do it.

But my point is that if you serve many concurrent clients on CPU-intensive jobs, that alone will load the cores, and you don't want to increase the cycles needed by multi-threading each job. By contrast, successful single-thread optimization will reduce the cycles needed for each job, which will be more useful in this kind of setting.

Making CPython faster

Posted Jun 4, 2021 17:22 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

That example was specifically selected to be embarrassingly parallel. If it wasn't faster, then *nothing* would *ever* be faster, and multithreading would not exist.

Making CPython faster

Posted Jun 4, 2021 20:18 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

> for the embarrassingly parallel 700x700 matrix multiplication example, on a 2-core/4-thread Ivy Bridge OpenBLAS takes 130M CPU cycles (in 92ms) with a single thread, 192M CPU cycles (in 69ms) with two threads, and 438M CPU cycles (in 84ms) with 4 threads.

I don't think matrix multiplication counts as embarrassingly parallel. That term usually refers to problems that can be easily split into many processes that are almost entirely independent, with little communication between them. The naive multiplication algorithm (where each output element is computed as the dot product of a row and column) can be split into one process per output element, so superficially it might be embarrassingly parallel, but in practice the processes are nowhere near independent because they're competing for memory bandwidth (which is the bottleneck in that algorithm; the actual computation is trivial). A divide-and-conquer algorithm will make better use of caches to massively reduce memory bandwidth, but then the processes are even less independent because they're explicitly passing intermediate results to other processes, as well as still competing for memory bandwidth.

Matrix multiplication is the exact opposite of easy to parallelise - people have spent decades trying to optimise the algorithms on CPUs, then moved to GPUs (which have a more suitable memory architecture), then built dedicated matrix-multiplication ASICs (like Google's TPU).

There's plenty of other algorithms that are arithmetic-bound rather than memory-bound, where performance can scale almost linearly with the number of CPU cores.

Making CPython faster

Posted Jun 5, 2021 9:49 UTC (Sat) by anton (subscriber, #25547) [Link]

Computing each element of the result matrix is entirely independent of all other computations during matrix multiplication, with no communication between the computations.

Yes, you experience resource constraints when mapping the problem to a real machine, but that's also the case for other embarrassingly parallel problems. Concerning memory bandwidth, for the naive algorithm, the 4.4 cycles/multiply-add mostly stem from the 4 cycles of FP addition latency; if transparent huge pages fail to work, you see the page table walker latency. For other algorithms, it's actually the fact that you can organize matrix multiplication to be compute limited rather than memory-bandwidth limited that makes matrix multiply nontrivial wrt getting close-to-optimal performance. But even in single-threaded implementation matrix multiply has significant optimization opportunities, as the speedup of single-threaded OpenBLAS over the naive algorithm shows.

Does the presence of additional optimization opportunities mean it is not embarassingly parallel? Not in my book.

BTW, before anyone interprets too much in my Ivy Bridge results (which are flawed by the influence of SMT), I have made 700x700 runs on a 4 core/4-thread Sandy Bridge with OpenBLAS and varying number of threads:

thr ms cycles
1   49 101M  
2   32 110M
4   20 121M

So, without SMT (aka Hyperthreading) the rise in total CPU cycles is much less than in my earlier results with an SMT-enabled CPU, at least for this embarrassingly parallel problem; if you need significant synchronization, you will probably see a bigger rise in CPU cycles even without SMT.

Making CPython faster

Posted Jun 5, 2021 18:06 UTC (Sat) by rghetta (subscriber, #39444) [Link] (1 responses)

Sure, single thread performance is important, but I don't think your matrix multiplication example is appropriate for reasoning about the GIL. It's a cpu bound program, sure, but working in a tight loop and using, say, 50MB of memory or something like that.
A better one could be a client/server program, with thousands of largely indipendent big, composite objects, say each using dozens of KB of memory, for several GB total. You get random requests, starting some sort of evaluation of one or more of these objects (again, chosen at random).
Without the GIL a reasonably designed multithreaded program could handle these requests indipendently much more efficently than a single threaded one (remember, these are almost indipendent objects).
And multiprocessing here is not a good option, because transferring those big objects from a process to another, or even worse, loading them from storage is costly. Yes, you can mitigate the cost by random partitioning and so on, but multithreading usually is cheaper and less resource intensive, even if you have some lock contention.

But with the GIL multithreading is simply impossibile, because the GIL is not just a global lock, but a global interpreter lock, threads just serialize.
And from a processor viewpoint each thread takes the GIL for a long time, so even if you use C modules to speed up, chances are that you don't get much real benefit.
When you return from your fast C routine you're likely to end up still waiting in the queue for the GIL, at least unless your C routine executes in more or less a GIL tick (and your thread is luckily chosen to run).
Those workloads in python are horribly slow, and even doubling single thread performance will not change things significantly.

Making CPython faster

Posted Jun 6, 2021 13:25 UTC (Sun) by gracinet (guest, #89400) [Link]

> But with the GIL multithreading is simply impossibile, because the GIL is not just a global lock, but a global interpreter lock, threads just serialize.
> And from a processor viewpoint each thread takes the GIL for a long time, so even if you use C modules to speed up, chances are that you don't get much real benefit.

This is not 100% true. In fact this is only the worst case, and yes, it is common enough to be a concern.

C extensions can (and should!) release the GIL when they don't need to access Python objects (including the memory they reference).

One obvious case is waiting for I/O, another would be performing CPU-bound computations in directly allocated memory. I believe the likes of numpy do it (never checked that myself). Of course the standard library modules implemented in C are also expected to do it properly. It's not really complicated to implement either, but I hear it's often overlooked.

To be clear, I'm not saying the GIL isn't a problem, it's just not as bad as serializing the threads.