Making CPython faster
Over the last month or so, there has been a good bit of news surrounding the idea of increasing the performance of the CPython interpreter. At the 2021 Python Language Summit in mid-May, Guido van Rossum announced that he and a small team are being funded by Microsoft to work with the community on getting performance improvements upstream into the interpreter—crucially, without breaking the C API so that the ecosystem of Python extensions (e.g. NumPy) continues to work. Another talk at the summit looked at Cinder, which is a performance-oriented CPython fork that is used in production at Instagram. Cinder was recently released as open-source software, as was another project to speed up CPython that originated at Dropbox: Pyston.
There have been discussions on and development of performance enhancements for CPython going back quite a ways; it is a perennial topic at the yearly language summit, for example. More recently, Mark Shannon proposed a plan that could, he thought, lead to a 5x speedup for the language by increasing its performance by 50% in each of four phases. It was an ambitious proposal, and one that required significant monetary resources, but it seemed to go nowhere after it was raised in October 2020. It now seems clear that there were some discussions and planning going on behind the scenes with regard to Shannon's proposal.
Faster CPython
After an abbreviated retirement, Van Rossum went to work at Microsoft toward the end of 2020 and he got to choose a project to work on there. He decided that project would be to make CPython faster; to that end, he has formed a small team that includes Shannon, Eric Snow, and, possibly, others eventually. The immediate goal is even more ambitious than what Shannon laid out for the first phase in his proposal: double the speed of CPython 3.11 (due October 2022) over that of 3.10 (now feature-frozen and due in October).
The plan, as described in the report on the talk by Joanna Jablonski (and outlined in Van Rossum's talk slides), is to work with the community in the open on GitHub. The faster-cpython repositories are being used to house the code, ideas, tools, issue tracking, feature discussions, and so on. The work is to be done in collaboration with the other core developers in the normal, incremental way that changes to CPython are made. There will be "no surprise 6,000 line PRs [pull requests]" and the team will be responsible for support and maintenance of any changes.
The main constraint is to preserve ABI and API compatibility so that the extensions continue to work. Keeping even extreme cases functional (the example given is pushing a million items onto the stack) and ensuring that the code remains maintainable are both important as well.
While the team can consider lots of different changes to the language implementation, the base object type and the semantics of reference counting for garbage collection will need to stay the same. But things like the bytecode, stack-frame layout, and the internals of non-public objects can all be altered for better performance. Beyond that, the compiler that turns source code into bytecode and the interpreter, which runs the bytecode on the Python virtual machine (VM), are fair game as well.
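For those who want to poke at the layer being discussed, the standard library's dis module will disassemble any function into the bytecode that the interpreter loop executes. The exact instructions vary between CPython versions; on 3.10 a simple function like the one below compiles down to a handful of instructions such as LOAD_FAST and BINARY_MULTIPLY (the function and its name are just an illustration):

```python
import dis

def area(width, height):
    return width * height

# Show the bytecode that CPython's compiler produced for area(); these are
# the instructions that the interpreter executes one at a time.
dis.dis(area)
```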
In a "meta"
issue in the GitHub tracker, Van Rossum outlined the three main pieces of the plan
for 3.11. They all revolve around the idea of speeding up the
bytecode interpreter through speculative
specialization, which adapts the VM to run faster on some code because
the object being operated on is of a known and expected type (or has some
other attribute that can be determined with a simple test).
Shannon further described what is being proposed in PEP 659 ("Specializing Adaptive Interpreter"), which he announced on the python-dev mailing list in mid-May. The "Motivation" section of the PEP explains the overarching idea:
Typical optimizations for virtual machines are expensive, so a long "warm up" time is required to gain confidence that the cost of optimization is justified. In order to get speed-ups rapidly, without [noticeable] warmup times, the VM should speculate that specialization is justified even after a few executions of a function. To do that effectively, the interpreter must be able to optimize and deoptimize continually and very cheaply. By using adaptive and speculative specialization at the granularity of individual virtual machine instructions, we get a faster interpreter that also generates profiling information for more sophisticated optimizations in the future.
In order to do these optimizations, Python code objects will be modified in a process called "quickening" once they have been executed a few times. The code object will get a new, internal array to store bytecode that can be modified on-the-fly for a variety of optimization possibilities. In the GitHub issue tracking the quickening feature, Shannon lists several of these possibilities, including switching to "super instructions" that do much more (but more specialized) work than existing bytecode instructions. The instructions in this bytecode array can also be changed at run time in order to adapt to different patterns of use.
During the quickening process, adaptive versions of instructions that can benefit from specialization are placed in the array instead of the regular instructions; the array is not a Python object, but simply a C array containing the code in the usual bytecode format (8-bit opcode followed by 8-bit operand). The adaptive versions determine whether to use the specialization or not:
CPython bytecode contains many bytecodes that represent high-level operations, and would benefit from specialization. Examples include CALL_FUNCTION, LOAD_ATTR, LOAD_GLOBAL and BINARY_ADD. By introducing a "family" of specialized instructions for each of these instructions allows effective specialization, since each new instruction is specialized to a single task. Each family will include an "adaptive" instruction, that maintains a counter and periodically attempts to specialize itself. Each family will also include one or more specialized instructions that perform the equivalent of the generic operation much faster provided their inputs are as expected. Each specialized instruction will maintain a saturating counter which will be incremented whenever the inputs are as expected. Should the inputs not be as expected, the counter will be decremented and the generic operation will be performed. If the counter reaches the minimum value, the instruction is deoptimized by simply replacing its opcode with the adaptive version.
The PEP goes on to describe two of these families (CALL_FUNCTION and LOAD_GLOBAL) and the kinds of specializations that could be created for them. For example, there could be specialized versions to call builtin functions with one argument or to load a global object from the builtin namespace. It is believed that 25-30% of Python instructions could benefit from specialization. The PEP only gives a few examples of the kinds of changes that could be made; the exact set of optimizations, and which instructions will be targeted, are still to be determined.
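As a rough illustration of the counter mechanism, here is a toy Python model of how a specialized instruction in, say, the BINARY_ADD family might behave. The real implementation lives in C inside the interpreter loop; the class name, thresholds, and structure here are invented purely for the sketch:

```python
class SpecializedIntAdd:
    """Toy model of a specialized instruction with a saturating counter."""

    MAX_COUNT = 7  # saturation limits; the real values are an implementation detail
    MIN_COUNT = 0

    def __init__(self):
        self.counter = self.MAX_COUNT

    def execute(self, a, b):
        if type(a) is int and type(b) is int:
            # Inputs are as expected: take the fast path and bump the counter.
            if self.counter < self.MAX_COUNT:
                self.counter += 1
            return a + b  # stands in for a direct C-level integer add
        # Unexpected inputs: perform the generic operation and decay the counter.
        self.counter -= 1
        if self.counter <= self.MIN_COUNT:
            self.deoptimize()
        return a + b  # stands in for the generic BINARY_ADD behavior

    def deoptimize(self):
        # In CPython this would rewrite the opcode in the quickened bytecode
        # array back to the adaptive version; here it is just a placeholder.
        pass
```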
Other CPythons
In his summit talk about Cinder, Dino Viehland described a feature called "shadow bytecode" that is similar to what is being proposed in PEP 659. Cinder, though, is not being run as an open-source project; the code is used in production, however, and has been made available so that parts of it can potentially be adapted for upstream CPython. Some parts of Cinder have already been added to CPython, including two enhancements (bpo-41756 and bpo-42085) that simplified coroutines to eliminate the use of the StopIteration exception: "On simple benchmarks, this was 1.6 times faster, but it was also a 5% win in production."
Pyston takes a somewhat different approach than what is being proposed, but there are overlaps (e.g. quickening). As described in its GitHub repository, Pyston uses the dynamic assembler (DynASM) from the LuaJIT project to build a just-in-time (JIT) compiler with "very low overhead" as one of its techniques. Using Pyston provides around 30% better performance on web applications.
Both Cinder and Pyston are based on Python 3.8, so any features that are destined for upstream will likely need updating. The intent of the PEP 659 work is to work within the community directly, which is not something either of the other two projects were able to do; both started as internal closed-source "skunkworks" projects that have only recently seen the light of day. How much of that work will be useful in the upstream CPython remains to be seen.
It will be interesting to watch the work of Van Rossum's team as it tries to reach a highly ambitious goal. Neither of the other two efforts achieved performance boosts anywhere close to the 100% increase targeted by the specializing adaptive interpreter team, though Shannon said that he had working code to fulfill the 50% goal for his first phase back in October. Building on top of that code makes 2x (or 100% increase) seem plausible at least and, if that target can be hit, the overall 5x speedup Shannon envisioned might be reached as well. Any increase is welcome, of course, but those kinds of numbers would be truly eye-opening—stay tuned ...
Index entries for this article
Python: CPython
Python: Performance
Posted Jun 1, 2021 23:14 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) (33 responses)
Posted Jun 1, 2021 23:18 UTC (Tue) by malefic (guest, #37306) (31 responses)
Posted Jun 2, 2021 0:15 UTC (Wed) by interalia (subscriber, #26615) (1 responses)
Python speed improvements sound great though, if they have in-project support for it.
Posted Jun 15, 2021 7:01 UTC (Tue) by cpitrat (subscriber, #116459)
Posted Jun 2, 2021 11:04 UTC (Wed) by smurf (subscriber, #17840) (27 responses)
Also I do question their focus on keeping the C API intact. It's way past showing its age and prevents many optimizations, especially when you use it to call back into Python.
Posted Jun 2, 2021 12:53 UTC (Wed) by Conan_Kudo (subscriber, #103240) (1 responses)
Posted Jun 2, 2021 13:18 UTC (Wed) by mathstuf (subscriber, #69389)
Posted Jun 2, 2021 16:22 UTC (Wed) by Otus (subscriber, #67685)
Posted Jun 2, 2021 18:27 UTC (Wed) by pallas (guest, #128204) (2 responses)
Posted Jun 3, 2021 4:05 UTC (Thu) by jf (guest, #152547)
Posted Jun 14, 2021 0:47 UTC (Mon) by bartoc (guest, #124262)
Posted Jun 2, 2021 20:10 UTC (Wed) by NYKevin (subscriber, #129325) (19 responses)
I think those definitions really do need to be discussed here. I can think of at least four interpretations:
1. In the general case, multithreading provides a greater win than any other optimization.
2. In the general case, writing a multithreaded program is easier (less work, less error prone, etc.) than any single-threaded optimization.
3. In the specific case of CPython, removing the GIL provides a greater win than any other optimization.
4. In the specific case of CPython, removing the GIL is easier (less work, less error prone, etc.) than any single-threaded optimization.
The problem is that interpretations 2-4 are (mostly) false:
2. It's really hard to write correct multithreaded code. Humans are demonstrably awful at reasoning about concurrency and thread safety. Threads are well known as a major source of bugs in software. Single-threaded performance optimizations can still cause issues (e.g. cache invalidation is hard), but with one major exception (Spectre), they tend to be much easier to reason about and deal with.
3. CPython has a ton of low-hanging single-thread performance fruit. Its compiler-level optimizations are extremely basic compared to most other languages (constant folding, minor peephole optimizations, and that's about it), and (currently) it doesn't even JIT the resulting bytecode. Python also has a ton of preexisting CPU-bound single-threaded code, and very little preexisting CPU-bound multithreaded code, so improving multithreaded performance will accomplish exactly nothing until existing application code is refactored to use threads.
4. Removing the GIL is actually really hard, if you want to hit all of your backwards-compatibility requirements. PyPy tried to do it with software transactional memory, but even they found the technical challenges too great.
Interpretation 1 is also mostly false. Unless you have a program that's already single-thread-optimized, that solves an embarrassingly parallel problem, and you have lots of cores at your disposal, single-thread optimizations provide more speedup than multi-threading.
Posted Jun 3, 2021 16:46 UTC (Thu) by anton (subscriber, #25547) (18 responses)
As an example, even for an embarrassingly parallel problem like 700x700 matrix multiplication, a naive implementation on an Ivy Bridge takes 4.4cycles/multiply-add if transparent huge pages work (23.3 if they don't), while OpenBLAS with 1 thread takes 0.36cycles/multiply-add (factor 12 or 65). So you would need to put a multi-threaded naive program on at least that many threads to compete with single-thread OpenBLAS; plus using SMT produces a slowdown in this case (so you need one core/thread), and on modern CPUs you tend to get higher clock rates if you use only one core instead of all of them.
Spectre is a hardware bug. Software workarounds for this bug may be hard to reason about (and they slow down programs), but that has no particular connection to single-threaded software optimization. If you don't optimize, your program can still be vulnerable to Spectre.
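For readers who want a feel for the single-thread gap described above, here is a rough, self-contained sketch. It is not the benchmark from the comment (that compared a naive C implementation against OpenBLAS on an Ivy Bridge); instead it pits a pure-Python triple loop against NumPy's BLAS-backed multiply, with the matrix size shrunk so the slow version finishes quickly. The names and sizes are my own.

```python
import time
import numpy as np

n = 300  # smaller than 700 so the pure-Python version finishes in seconds
a = np.random.rand(n, n)
b = np.random.rand(n, n)

def naive_matmul(x, y):
    # Textbook triple loop over Python lists: one multiply-add at a time.
    size = len(x)
    out = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for k in range(size):
            xik = x[i][k]
            yk = y[k]
            row = out[i]
            for j in range(size):
                row[j] += xik * yk[j]
    return out

xs, ys = a.tolist(), b.tolist()

t0 = time.perf_counter()
naive_matmul(xs, ys)
t1 = time.perf_counter()
a @ b  # a single call into the BLAS library that NumPy was built against
t2 = time.perf_counter()
print(f"pure Python: {t1 - t0:.3f}s  NumPy/BLAS: {t2 - t1:.5f}s")
```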
Posted Jun 3, 2021 19:06 UTC (Thu) by smurf (subscriber, #17840) (11 responses)
As of today, this is an unsolvable problem in "pure" CPython.
Posted Jun 3, 2021 21:28 UTC (Thu) by anton (subscriber, #25547) (10 responses)
Especially on a server under load that has many concurrent clients, single-thread optimization of the existing code reduces CPU load, while splitting the program into multiple threads usually increases the amount of work CPUs have to do. E.g., for the embarrassingly parallel 700x700 matrix multiplication example, on a 2-core/4-thread Ivy Bridge OpenBLAS takes 130M CPU cycles (in 92ms) with a single thread, 192M CPU cycles (in 69ms) with two threads, and 438M CPU cycles (in 84ms) with 4 threads.
Posted Jun 3, 2021 23:10 UTC (Thu) by Paf (subscriber, #91811) (5 responses)
Posted Jun 3, 2021 23:13 UTC (Thu) by Paf (subscriber, #91811) (2 responses)
Still, there is literally an entire field of scientific computation that exists because this “slower with more threads” is ... really not universal. (Though it is often less efficient, yes.)
Posted Jun 5, 2021 6:16 UTC (Sat) by mkbosmans (subscriber, #65556) (1 responses)
Posted Jun 5, 2021 9:02 UTC (Sat) by anton (subscriber, #25547)
Of course typical HPC workloads take more than 92ms on one CPU, otherwise nobody would bother to build parallel systems for them and nobody would bother with parallelizing the code for these CPUs. But the question is if CPython programs behave like typical HPC workloads.
Posted Jun 4, 2021 6:28 UTC (Fri) by anton (subscriber, #25547)
But my point is that if you serve many concurrent clients on CPU-intensive jobs, that alone will load the cores, and you don't want to increase the cycles needed by multi-threading each job. By contrast, successful single-thread optimization will reduce the cycles needed for each job, which will be more useful in this kind of setting.
Posted Jun 4, 2021 17:22 UTC (Fri) by NYKevin (subscriber, #129325)
Posted Jun 4, 2021 20:18 UTC (Fri) by excors (subscriber, #95769) (1 responses)
I don't think matrix multiplication counts as embarrassingly parallel. That term usually refers to problems that can be easily split into many processes that are almost entirely independent, with little communication between them. The naive multiplication algorithm (where each output element is computed as the dot product of a row and column) can be split into one process per output element, so superficially it might be embarrassingly parallel, but in practice the processes are nowhere near independent because they're competing for memory bandwidth (which is the bottleneck in that algorithm; the actual computation is trivial). A divide-and-conquer algorithm will make better use of caches to massively reduce memory bandwidth, but then the processes are even less independent because they're explicitly passing intermediate results to other processes, as well as still competing for memory bandwidth.
Matrix multiplication is the exact opposite of easy to parallelise - people have spent decades trying to optimise the algorithms on CPUs, then moved to GPUs (which have a more suitable memory architecture), then built dedicated matrix-multiplication ASICs (like Google's TPU).
There's plenty of other algorithms that are arithmetic-bound rather than memory-bound, where performance can scale almost linearly with the number of CPU cores.
Posted Jun 5, 2021 9:49 UTC (Sat) by anton (subscriber, #25547)
Yes, you experience resource constraints when mapping the problem to a real machine, but that's also the case for other embarrassingly parallel problems. Concerning memory bandwidth, for the naive algorithm, the 4.4 cycles/multiply-add mostly stem from the 4 cycles of FP addition latency; if transparent huge pages fail to work, you see the page table walker latency. For other algorithms,
it's actually the fact that you can organize matrix multiplication to be compute limited rather than memory-bandwidth limited that makes matrix multiply nontrivial wrt getting close-to-optimal performance. But even in single-threaded implementation matrix multiply has significant optimization opportunities, as the speedup of single-threaded OpenBLAS over the naive algorithm shows.
Does the presence of additional optimization opportunities mean it is not embarrassingly parallel? Not in my book.
BTW, before anyone interprets too much in my Ivy Bridge results (which are flawed by the influence of SMT), I have made 700x700 runs on a 4 core/4-thread Sandy Bridge with OpenBLAS and varying number of threads:
thr ms cycles
1 49 101M
2 32 110M
4 20 121M
So, without SMT (aka Hyperthreading) the rise in total CPU cycles is much less than in my earlier results with an SMT-enabled CPU, at least for this embarrassingly parallel problem; if you need significant synchronization, you will probably see a bigger rise in CPU cycles even without SMT.
Posted Jun 5, 2021 18:06 UTC (Sat) by rghetta (subscriber, #39444) (1 responses)
Posted Jun 6, 2021 13:25 UTC (Sun) by gracinet (guest, #89400)
This is not 100% true. In fact this is only the worst case, and yes, it is common enough to be a concern.
C extensions can (and should!) release the GIL when they don't need to access Python objects (including the memory they reference).
One obvious case is waiting for I/O, another would be performing CPU-bound computations in directly allocated memory. I believe the likes of numpy do it (never checked that myself). Of course the standard library modules implemented in C are also expected to do it properly. It's not really complicated to implement either, but I hear it's often overlooked.
To be clear, I'm not saying the GIL isn't a problem, it's just not as bad as serializing the threads.
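A small self-contained demonstration of the distinction being made here (my own example, not from the comment): pure-Python bytecode is serialized by the GIL, while a C-level call that releases the GIL runs in parallel across threads. Here time.sleep stands in for I/O or a well-behaved extension routine; the counts and thread numbers are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def busy(n=2_000_000):
    total = 0
    for i in range(n):      # pure-Python loop: the bytecode holds the GIL
        total += i
    return total

def waiting(seconds=0.5):
    time.sleep(seconds)     # C-level call that releases the GIL while blocked
    return seconds

def timed(fn, workers=4):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: fn(), range(workers)))
    return time.perf_counter() - start

print("CPU-bound in 4 threads:", timed(busy))        # roughly 4x a single call
print("GIL-releasing in 4 threads:", timed(waiting)) # roughly one call's time
```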
Posted Jun 3, 2021 23:01 UTC (Thu) by Paf (subscriber, #91811) (3 responses)
Multi-threading is a really powerful option to get greater performance, especially (and increasingly) on modern hardware.
It is not necessarily opposed to single threaded optimizations. (Eg, OpenBLAS probably sees a large percentage of its usage, perhaps a majority, in *multithreaded* programs!)
And, basically, it sucks that Python has this option mostly closed to it. It doesn’t make Python a bad language, it’s not fatal, etc. But it is certainly not a positive aspect of the language (even if the GIL enables many of the good things).
Posted Jun 3, 2021 23:08 UTC (Thu) by Paf (subscriber, #91811) (2 responses)
But they’re talking about hopefully eventually achieving a 5x speedup. This is great, and I want them to do it. But as someone with a partial scientific computing background... I sure wish I could use the other 15 cores on my desktop in CPython too! To say nothing of the 64 cores on individual nodes of the compute cluster.
There are a ton of computational folks who’d love to do more of their work in Python, but do not/cannot effectively because of the lack of parallelism in CPython.
I’m not saying it should be priority A1. But let’s not pretend it isn’t a hindrance!
Posted Jun 4, 2021 7:15 UTC (Fri) by LtWorf (subscriber, #124958) (1 responses)
Posted Jun 4, 2021 17:24 UTC (Fri) by NYKevin (subscriber, #129325)
Even if Python dropped the GIL tomorrow, those libraries would continue to be used. Native code is so much faster than interpreted code that it's not worth even trying to make Python fast enough to compete with them.
Posted Jun 5, 2021 20:26 UTC (Sat) by NYKevin (subscriber, #129325) (1 responses)
The reason I called out Spectre is that Spectre arises from pipelining and speculative execution, which are single-threaded optimizations (that just so happen to be extremely widely deployed).
Posted Jun 11, 2021 8:30 UTC (Fri) by anton (subscriber, #25547)
I don't think that is the case. The hardware designers managed to handle architectural state (registers, memory) correctly in various implementations of speculative execution: They buffer changes to that state in internal buffers (e.g., in the store buffer) until the instruction is no longer speculative, and then commit these changes to permanently-visible architectural state. For microarchitectural state such as caches, they changed the state already speculatively, because, after all, it's not architectural: from an architectural correctness point of view, it does not matter if the cache contains the permanent value of one memory location, or the permanent value of a different memory location.
It's only when you consider side channels that it makes a difference, and apparently they had not thought about that; and apparently before 2017 nobody else has thought about that either. So you might say that side channels are hard to reason about.
Or maybe it's because few people tend to think about side channels: On the software side we knew mitigation techniques for a number of side channels for non-speculatively executed code, and they are so arduous that we use them only for secret-key handling code. On the hardware side, cache side channels have been known for a long time, but we do not want to do without the speed advantage that caches give us, so hardware designers design caches and leave it to software people to use mitigations; however, there have been other side channels (IIRC due to resource constraints) that have been eliminated in more recent hardware with a different balance of resources; my dim memory says that the Ivy Bridge was affected and Haswell was not, but I may be misremembering. It's not clear whether that was an intentional or accidental fix.
Posted Jun 2, 2021 20:40 UTC (Wed) by strombrg (subscriber, #2178)
I want to see the C Extension API die. I think HPY is our best hope for this. But the fact remains that CPython has many C extension modules written to the C Extension API, and killing those off is not in the best interest of the Python community in the near term.
Posted Jun 2, 2021 20:36 UTC (Wed) by strombrg (subscriber, #2178)
I heard the GIL was removed once - but single threaded performance was so much slower that the patches were rejected.
But the GIL is in significant part responsible for the great proliferation of Python libraries in C, because they don't have to futz with details as much with a GIL.
If you need concurrency in Python, you can use multiprocessing, or an alternative implementation like IronPython, Jython, or possibly Micropython.
I actually think that getting HPY off the ground and into critical mass is more important than removing the GIL. Then we should see an explosion of alternative Python implementations and adoption.
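For completeness, a minimal sketch of the multiprocessing route mentioned above: each worker runs in its own interpreter process with its own GIL, so CPU-bound work is not serialized. The worker count and workload are arbitrary example values.

```python
from multiprocessing import Pool

def cpu_bound(n):
    # Pure-Python work; in separate processes it runs on separate cores.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(cpu_bound, [2_000_000] * 4)
    print(results)
```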
Posted Jun 2, 2021 16:31 UTC (Wed) by mb (subscriber, #50428)
Trading single thread performance for a possible increase in multi thread performance is really bad for many use cases. A solution would have to make sure that at least the existing single threaded programs would not suffer from this in any way.
Posted Jun 2, 2021 11:31 UTC (Wed) by eru (subscriber, #2753) (2 responses)
Posted Jun 2, 2021 16:34 UTC (Wed) by mb (subscriber, #50428)
Posted Jun 2, 2021 20:43 UTC (Wed) by smurf (subscriber, #17840)
Let's take a trivial example: "a+b". You have special codes for int, float, and string. Each of these special instructions shall test for "in case of int+int do this, else find a.__add__ and do it the slow way". Another special instruction tests for floats and a third tests for strings.
You cannot cram the whole decision list into a single special code. That would slow down the general case too much.
Now if a and b are typed this gets easier, as you can optimize up front. But Python has dynamic types (including the ability to subtype integers …), thus you still need to check.
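To see why that run-time check is unavoidable, disassembling a + b shows a single generic instruction with no type information; on CPython 3.10 and earlier it is BINARY_ADD whether the operands turn out to be ints, floats, or strings (output is version-dependent, and the function is only an illustration):

```python
import dis

def add(a, b):
    return a + b

# One generic instruction handles every type; specialization would replace
# it at run time based on the operand types that actually show up.
dis.dis(add)
```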
Posted Jun 2, 2021 11:54 UTC (Wed) by jezuch (subscriber, #52988) (3 responses)
Indeed; I'm a little skeptical how much you can squeeze out of an interpreter, so I'll be impressed if they make it
Posted Jun 2, 2021 18:52 UTC (Wed) by tsavola (subscriber, #37605)
https://github.com/wasm3/wasm3/blob/main/docs/Performance.md
Posted Jun 2, 2021 20:45 UTC (Wed) by strombrg (subscriber, #2178) (1 responses)
Posted Jun 8, 2021 8:39 UTC (Tue) by jezuch (subscriber, #52988)
That would probably be territory for Python 4, which everyone is seemingly quite scared to do.
Meh, multi-threaded concurrency being a win depends on the workload. There can be huge hidden overhead in inter-thread coördination, not to mention added complexity. I'd suggest reading Scalability! But at what COST?, which looked at several real-world multi-threaded workloads that were faster when re-implemented as single-threaded processes.
This is supposed to be a counterexample of what?
For a typical 'HPC' workload you would have a much larger working set, either a single bigger matrix, or lots of these small to medium sized matrices. In the latter case you would parallelize over the matrices and do each individual matrix multiplication in a single thread.
All 490,000 elements of the result matrix can be computed independently, i.e., in parallel. Parallel scalability in this case is limited by resource constraints and by parallel overhead (starting more threads, telling them their jobs, and waiting for all of them to complete their jobs).
If multi-threaded was always worse in every respect, nobody would do it.
Computing each element of the result matrix is entirely independent of all other computations during matrix multiplication, with no communication between the computations.
Sure, single thread performance is important, but I don't think your matrix multiplication example is appropriate for reasoning about the GIL. It's a CPU-bound program, sure, but working in a tight loop and using, say, 50MB of memory or something like that.
A better one could be a client/server program, with thousands of largely independent big, composite objects, say each using dozens of KB of memory, for several GB total. You get random requests, starting some sort of evaluation of one or more of these objects (again, chosen at random).
Without the GIL a reasonably designed multithreaded program could handle these requests independently and much more efficiently than a single-threaded one (remember, these are almost independent objects).
And multiprocessing here is not a good option, because transferring those big objects from one process to another, or even worse, loading them from storage is costly. Yes, you can mitigate the cost by random partitioning and so on, but multithreading usually is cheaper and less resource intensive, even if you have some lock contention.
But with the GIL multithreading is simply impossible, because the GIL is not just a global lock, but a global interpreter lock; threads just serialize.
And from a processor viewpoint each thread takes the GIL for a long time, so even if you use C modules to speed up, chances are that you don't get much real benefit.
When you return from your fast C routine you're likely to end up still waiting in the queue for the GIL, at least unless your C routine executes in more or less a GIL tick (and your thread is luckily chosen to run).
Those workloads in Python are horribly slow, and even doubling single thread performance will not change things significantly.
> And from a processor viewpoint each thread takes the GIL for a long time, so even if you use C modules to speed up, chances are that you don't get much real benefit.
There are *tons* of cases that multithread pretty well, where optimized single threaded code can be repeated across multiple cores. (Again, single threaded performance optimization and multi threading are not necessarily opposed, though they do conflict in many cases.)
Spectre is a bug in the implementation of speculative execution. So apparently what you wanted to say is that hardware designers found it hard to reason about speculative execution.
In part due to the then required use of fine grained locking and atomic instructions for refcounting.
Profiling runs of existing Python code should give a good idea of what common operations would benefit from replacing the usual byte code sequence with a special instruction.