Multi-threaded emulation for QEMU
When I first started working on the QEMU machine emulator and virtualizer, one of the first things I noticed was its approach to threading for its virtual CPUs (vCPUs). For system emulation you emulate everything from the hardware up. This means a single vCPU represents a single execution core of the emulated system. Each of these vCPUs is run in turn in a round-robin manner in a single host thread. As a result, you often see one busy core pegging at 100% while the rest of your multi-core system has little to do.
For emulating simple systems with single-core ARM chips, this hasn't been a major problem. However, the systems that are now being emulated (e.g. phones and servers) are a lot more powerful. Growth in per-core clock speed has plateaued, so CPU manufacturers are increasing performance by increasing the number of cores instead. As a result, the limitations of the single-threaded emulation engine are starting to become a performance bottleneck.
I gave a talk at last year's KVM Forum [YouTube] where I outlined the challenges involved in making the system safe for multi-threading. This work is on the part of QEMU known as the Tiny Code Generator (TCG) translation engine, which is responsible for generating host code to emulate the guest; thus, the project is known as the Multi-Threaded TCG (MTTCG).
It is worth pointing out that QEMU is already a multi-threaded application depending on how it is invoked. Most people come across QEMU when it is being used in conjunction with Linux's Kernel Virtual Machine (KVM). This requires hardware support to directly run guest code, so it only works if the guest is the same architecture as the host. In this mode each vCPU does run in its own thread and control is only passed back to QEMU when KVM can't handle something. This generally happens for hardware emulation or I/O processing. A lot of work has been done over the years to reduce contention of the Big QEMU Lock (BQL) to ensure this scales well.
There is another use of QEMU: linux-user mode. This allows user-space applications compiled for another architecture to be run on the host system. It avoids the need to emulate any additional peripheral hardware and concentrates on translating code up until it makes a system call. QEMU then does the appropriate mangling to make the host system call and pass the result back to the translated program. In this mode, QEMU follows the threading model of the program it is translating, so each thread the application spawns results in another QEMU thread. Some of the work done to improve the stability of threads in linux-user mode has been used in MTTCG. However, it was never a complete solution; some of the work of the MTTCG project also benefits the stability of linux-user mode.
A TCG primer
When QEMU is operating in TCG mode it behaves very much like a JIT (just-in-time) engine. The main run loop consists of looking up a translated block of code based on the current guest program counter (PC). If it can't find one, it needs to invoke the translator. The translator reads the guest instructions of a basic block of execution and converts them into TCGOps, which are simple operations which make up TCG's intermediate representation. This is very similar to what a compiler does.
After some optimization of the TCGOps, the back-end then emits a series of host instructions to emulate the guest instructions. Once the block has been translated and added to the code buffer, QEMU jumps to the newly created code. As new blocks are added, the old blocks are patched so they jump directly between translated code blocks.
There are two main ways that execution leaves the translated code. The first is an exit back to the run loop to handle some condition (e.g. a computed jump, jumping to an untranslated bit of code, handling an interrupt or other system event). The other is for the slow path of the SoftMMU (QEMU's virtual memory component) that occurs when the address being used by a load or store isn't in the SoftMMU's translation lookaside buffer (TLB). Typically, the slow path will refill the TLB with the needed value, although forcing the slow path is also used to detect self-modifying code and intercept access to I/O memory.
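To make that loop concrete, here is a minimal, self-contained toy of it. Every type, name, and the pretend four-byte "translation" in it is an illustrative stand-in rather than QEMU's actual API, and block chaining is not modeled, but the lookup-translate-execute shape is the one described above:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define TB_CACHE_SIZE 64

/* Stand-in for a translated block; real blocks point into a host-code buffer. */
typedef struct TranslationBlock {
    uint64_t pc;        /* guest PC the block starts at */
    uint64_t next_pc;   /* where execution continues afterwards */
    int valid;
} TranslationBlock;

static TranslationBlock tb_cache[TB_CACHE_SIZE];   /* stand-in for the hash table */

static TranslationBlock *tb_lookup(uint64_t pc)
{
    TranslationBlock *tb = &tb_cache[pc % TB_CACHE_SIZE];
    return (tb->valid && tb->pc == pc) ? tb : NULL;
}

static TranslationBlock *tb_translate(uint64_t pc)
{
    /* Real QEMU decodes guest instructions into TCGOps, optimizes them and
     * emits host code; the toy just pretends every block is four bytes long. */
    TranslationBlock *tb = &tb_cache[pc % TB_CACHE_SIZE];
    tb->pc = pc;
    tb->next_pc = pc + 4;
    tb->valid = 1;
    printf("translated block at pc=0x%" PRIx64 "\n", pc);
    return tb;
}

static uint64_t tb_execute(TranslationBlock *tb)
{
    /* Real QEMU jumps into the generated host code here. */
    return tb->next_pc;
}

int main(void)
{
    uint64_t pc = 0x1000;

    /* The real loop runs until an interrupt or other event forces an exit. */
    for (int i = 0; i < 8; i++) {
        TranslationBlock *tb = tb_lookup(pc);
        if (!tb) {
            tb = tb_translate(pc);   /* slow path: invoke the translator */
        }
        pc = tb_execute(tb);         /* fast path: run the generated code */
    }
    return 0;
}
```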
Work starts
Thanks to its open-source nature, QEMU has been used for a number of academic papers that have explored the approaches one can take to use multiple threads to safely model a multi-core device. Generally, however, no attempt has been made to upstream the changes into the master tree.
The current project's story starts in the summer of 2014 when Frederic Konrad of GreenSocs announced that he was experimenting with a multi-threaded version of the QEMU TCG-based system emulation. By the start of 2015, he had posted the first RFC patches to the list. Since then, in true open-source style, people from a number of different companies and institutions have been involved. Multiple implementations of various approaches to solving a multitude of associated problems have now been posted and the developers are starting to form a plan for eventually upstreaming full MTTCG support.
Some of this work has already had a positive impact on the master source tree. For example, a number of clean-ups to the TCG code have been merged, resulting in a better understanding of critical points in the code. This mostly involved identifying areas containing shared state that needs to be protected when there are multiple threads of execution.
A major improvement merged before the 2.7 feature freeze was Emilio Cota's QEMU hash table (QHT) implementation. The hash table is used to look up previous translations of code for a given PC address. While QHT improved on the distribution characteristics of the previous hash, the real improvement comes from its lockless design. This will be important for system emulation, as exits to the main run loop happen more often. This is because QEMU doesn't allow translated blocks to jump directly to generated code for a different MMU page in the guest's memory, since pages can be mapped in and out. The ability to do the lookup without lock contention in the hot path will be very useful.
Current state
The work to support MTTCG is currently split into two patch sets.
The recently updated v4 of the base enabling patches includes all the common support in TCG to safely serialize code generation and modification between the various threads. It uses the work done for linux-user emulation and KVM to ensure that critical sections, such as code generation and device access, are protected by locks. With this version, simple test cases can be run using a thread-per-vCPU as long as the test case does not make any assumptions about atomic operations or memory ordering.
In practice, the memory ordering constraints of weakly ordered systems don't generally cause a problem when running on an x86 host, because x86 is not weakly ordered. As a result, on those host systems, all that is needed is an MTTCG-compliant atomic emulation implementation. You can find a branch with Cota's cmpxchg-based atomics merged on GitHub. With this extra functionality, an ARM Linux kernel with a Debian user space can be booted.
While a lot of the enabling work has been aimed at ARM guest CPUs, the intention is for MTTCG to be as generic as possible and minimize the changes needed to support each guest architecture. For example, there is an alpha quality port to PPC in the works, which is now being tested against the latest patch series. If you are interested in porting a particular guest architecture, now is the time to get involved with reviewing the core patch sets and proving the solutions are indeed generic.
There are two big remaining hurdles to overcome before MTTCG can be enabled for the masses. These are two related but different problems: modeling atomic behavior and memory coherency.
Atomics
When the translated instructions run in a single thread, questions of instruction atomicity are fairly simple to solve. A translation of an atomic instruction can safely be broken into many host instructions as long as they complete before the next vCPU is scheduled. Because individual instruction translations are never split across basic execution blocks and execution never switches to another vCPU in the middle of a basic block, the sequence of host instructions implementing a single guest instruction is either executed completely or not at all. As a result, the translator can get away with translating atomic guest instructions into simple host instructions and get atomicity "for free".
While the linux-user targets have had to deal with this problem when there are multiple threads, the solutions to date have been very target specific. MTTCG requires a mechanism that can be used across all targets to meet the various demands of each target's atomic semantics.
There is a complication introduced by the fact that a number of RISC architectures implement atomic behavior with paired instruction sequences (e.g. load-link/store-conditional or load-exclusive/store-exclusive in ARM terminology). This means any solution has to take into account that the host system may have a very different mechanism for providing atomic support than the emulated guest. There are a number of potential solutions that have been posted so far.
Save load value/address and check on store
This is the solution currently used by ARM user-mode emulation. A guest load-exclusive instruction is implemented by translating it into a call to a helper function, which is a piece of C code called directly from the translated code. The helper stores the address and the loaded value. When the conditional store is executed, all vCPUs are stopped and the address and the current value are compared with the saved values. If they have not changed, the store is allowed to complete and report success.
- Pros: Simple to implement, only need to instrument LL/SC instructions
- Cons: Doesn't detect ABA conditions, currently architecture specific
This approach doesn't sufficiently deal with ABA conditions, where a second vCPU is scheduled between the time that the first vCPU reads and writes back an updated value to a memory location. If the second vCPU initially changes the value in the memory location, but then writes back the original value, it will fool the first vCPU into thinking that the memory has not been touched since the initial read of that location. So far, this doesn't seem to cause problems.
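A compressed, single-threaded sketch of that weakness is shown below; the interleaving that would normally come from a second vCPU is written out inline, and all of the names are made up for illustration rather than taken from the QEMU code:

```c
#include <stdint.h>
#include <stdio.h>

/* State the helper keeps per vCPU in the save-and-compare scheme. */
static uint64_t excl_addr;
static uint32_t excl_val;

static void load_exclusive(uint32_t *addr)
{
    excl_addr = (uint64_t)(uintptr_t)addr;
    excl_val = *addr;                       /* remember address and value */
}

static int store_exclusive(uint32_t *addr, uint32_t new_val)
{
    /* The real implementation stops all vCPUs around this check. */
    if ((uint64_t)(uintptr_t)addr == excl_addr && *addr == excl_val) {
        *addr = new_val;
        return 0;                           /* success */
    }
    return 1;                               /* failure, guest retries */
}

int main(void)
{
    uint32_t lock = 0;

    load_exclusive(&lock);                  /* "vCPU 0" reads 0 */

    lock = 1;                               /* "vCPU 1" takes the lock ... */
    lock = 0;                               /* ... and releases it again (ABA) */

    /* The value matches what was saved, so the store succeeds even though
     * the location was modified in between. */
    printf("store-exclusive %s\n",
           store_exclusive(&lock, 1) ? "failed" : "succeeded");
    return 0;
}
```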
Load-link/store-conditional instruction support via SoftMMU
Alvise Rigo has a patch series that adds support for emulating LL/SC instructions in a generic manner for any architecture that needs it. This takes advantage of the SoftMMU code to force a slow path whenever stores take place in the same region of memory that is currently involved in exclusive operations. This means stores to other regions of memory will continue at normal speed and the performance penalty of checking if accesses intersect with the exclusive region is mitigated.
The original patch set was aimed at enabling a more expressive emulation in single-threaded mode. With some additional work it can be used to support the MTTCG case. However, as it does mess with internals of the TCG vCPU emulation, care has to be taken to ensure operations occur in the right order and race conditions are not introduced.
- Pros: Expressive TCG operations, doesn't affect all stores
- Cons: Tricky locking/sequencing, needs SoftMMU, scalability concerns
The SoftMMU is the mechanism used to emulate virtual addressing in system-mode emulation. It uses the TLB to map from guest addresses to host system addresses. This is slower than the linux-user memory operations, which simply use a fixed offset from guest to host addresses as well as memory layout tricks to ensure QEMU is kept out of the way of the guest application. However, the SoftMMU also makes it easy to cause accesses to some addresses to be handled specially, which is important for system emulation.
This approach may not scale well to many vCPUs because it requires stopping execution of all vCPUs while flushing each vCPU's TLB structures. This is not a major problem as long as it happens infrequently. However, stopping all vCPUs on every atomic guest operation may turn out to be too expensive.
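The following toy sketches the general idea under heavy simplification: a single "exclusive page" variable stands in for the real per-TLB-entry state, none of the cross-vCPU locking the actual patches need is modeled, and all names are invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096

static uint64_t excl_page = UINT64_MAX;   /* page with an active exclusive access */
static int excl_broken;                   /* set if someone stored to that page */

static void set_exclusive(uint64_t addr)
{
    excl_page = addr / PAGE_SIZE;
    excl_broken = 0;
}

static void guest_store(uint64_t addr, uint32_t val)
{
    if (addr / PAGE_SIZE == excl_page) {
        /* Slow path: record that the exclusive region was disturbed. */
        excl_broken = 1;
        printf("slow path: store to 0x%llx hit the exclusive page\n",
               (unsigned long long)addr);
    }
    /* Fast path: ordinary stores to other pages proceed at full speed. */
    (void)val;
}

static int store_conditional(uint64_t addr, uint32_t val)
{
    (void)addr;
    (void)val;
    return excl_broken ? 1 : 0;   /* fail if the region was written to */
}

int main(void)
{
    set_exclusive(0x1000);        /* load-exclusive from 0x1000 */
    guest_store(0x8000, 42);      /* unrelated store: fast path */
    guest_store(0x1004, 7);       /* same page: slow path, breaks exclusivity */
    printf("store-conditional %s\n",
           store_conditional(0x1000, 1) ? "failed" : "succeeded");
    return 0;
}
```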
Link helpers and instrumented stores
The last option could be considered a hybrid of the other two. The link instruction sets an exclusive zone, which is then checked by instrumenting every store in the system. This can be done outside of the SoftMMU code with helpers. The store helper will check to see if it is writing to an exclusive zone; if so, it sets a flag that is checked when the store conditional is run.
- Pros: Simpler to implement locking/sequencing
- Cons: Big performance impact by having a helper run for every store
Whatever final solution is chosen, the choice will depend on some heavy testing and benchmarking. It may end up being a hybrid approach or even a totally new and novel solution to the problem. Since this is an ongoing discussion, anyone with an interest in the topic is invited to contribute.
Memory coherency
Memory coherency is concerned with when the effects of memory operations (e.g. loads and stores) are visible across domains of execution (e.g. CPU cores). In the common case, the problem is to ensure that multiple CPUs observe memory accesses in the order intended by the programmer. For the single-threaded case, this is solved trivially, because all operations will appear in program order.
To allow the programmer to control the visibility of operations across cores, architectures have introduced barrier instructions, which make explicit statements about when the effects of loads and stores become visible across the system. If you want to read up in more detail, I can recommend the Linux kernel's memory-barriers.txt document or Will Deacon's ELC2013 talk [YouTube] and slides [PDF] on the subject.
The x86 architecture has a strong memory model that ensures a consistent view of the state of memory across multiple cores. However, even the x86 will reorder non-aliased stores past a load unless constrained by a barrier. In fact, you can write an ARM executable that shows this when run in QEMU with MTTCG on an x86 host.
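The classic "store buffering" litmus test makes this visible on an ordinary x86 host. The self-contained program below uses plain C11 atomics and POSIX threads (nothing QEMU-specific); whenever a round ends with both threads having read zero, a load has effectively been reordered ahead of the store that preceded it in program order:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;              /* the two shared locations */
static int r1, r2;                   /* per-round results */
static sem_t start1, start2, done;

static void *thread1(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&start1);
        atomic_store_explicit(&x, 1, memory_order_relaxed);   /* store x ... */
        __asm__ __volatile__("" ::: "memory");   /* compiler barrier only; the CPU may still reorder */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);  /* ... then load y */
        sem_post(&done);
    }
    return NULL;
}

static void *thread2(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&start2);
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        __asm__ __volatile__("" ::: "memory");
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        sem_post(&done);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    int reorders = 0;

    sem_init(&start1, 0, 0);
    sem_init(&start2, 0, 0);
    sem_init(&done, 0, 0);
    pthread_create(&a, NULL, thread1, NULL);
    pthread_create(&b, NULL, thread2, NULL);

    for (int i = 0; i < 1000000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        sem_post(&start1);               /* release both threads for one round */
        sem_post(&start2);
        sem_wait(&done);
        sem_wait(&done);
        if (r1 == 0 && r2 == 0) {
            reorders++;                  /* neither thread saw the other's store */
        }
    }
    printf("store->load reordering observed %d times\n", reorders);
    return 0;                            /* worker threads die with the process */
}
```

Built with something like gcc -O2 -pthread on a multi-core x86 machine, this should report a non-zero count; inserting a full barrier between the store and the load makes the count drop to zero.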
One of this year's suggested Google Summer of Code (GSoC) projects was to tackle this problem and the student, Pranith Kumar, has already posted a number of RFC patch series to add memory barriers to TCG. The basic idea is a simple one: the TCG gains a new TCGOp to represent a memory barrier operation. When the TCGOp is translated, the back-end emits the relevant host instructions to ensure that the minimum memory synchronization requirements are met.
Although the idea is simple, there are some implementation details to worry about. While an x86 host has no trouble ensuring that the ARM memory model is met (weak-on-strong), the reverse (strong-on-weak) won't be nearly as efficient, since every x86 memory operation has an implied barrier that would slow down the generated ARM code a lot.
A related open question is: how should QEMU handle acquire/release semantics in this scheme? Acquire/release instructions are load/store instructions that come with implied barriers as part of the memory operation. They have been introduced because they are more efficient than issuing separate barrier instructions. There is concern that by using a simple scheme of discrete TCGOp memory barriers when translating the guest code, the resulting generated host code will be very inefficient. This is an active topic of discussion on the mailing list.
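As a rough illustration of the trade-off, the self-contained C11 program below publishes a value to another thread twice: once with acquire/release ordering attached to the accesses themselves, as acquire/release instructions do, and once with relaxed accesses plus discrete fences, which is roughly what translating a guest load-acquire as a separate barrier TCGOp next to an ordinary load would look like. Both forms are correct; the question for MTTCG is how cheaply the second form can be mapped onto the host.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;                 /* data being handed between threads */
static atomic_int ready;            /* flag that publishes the data */

/* Ordering attached to the accesses themselves, the way load-acquire and
 * store-release instructions work. */
static void publish_acq_rel(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

static int consume_acq_rel(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire)) { }
    return payload;
}

/* The same protocol with relaxed accesses plus discrete fences. A naive
 * translation along these lines may emit stronger (and slower) host barriers
 * than the single guest instruction actually required. */
static void publish_fenced(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

static int consume_fenced(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed)) { }
    atomic_thread_fence(memory_order_acquire);
    return payload;
}

static void *producer_acq_rel(void *arg)
{
    (void)arg;
    publish_acq_rel(42);
    return NULL;
}

static void *producer_fenced(void *arg)
{
    (void)arg;
    publish_fenced(43);
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, producer_acq_rel, NULL);
    printf("acquire/release: consumed %d\n", consume_acq_rel());
    pthread_join(&t, NULL);

    atomic_store(&ready, 0);        /* reset the flag for the second round */

    pthread_create(&t, NULL, producer_fenced, NULL);
    printf("discrete fences: consumed %d\n", consume_fenced());
    pthread_join(&t, NULL);
    return 0;
}
```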
Next steps
Several Linaro members are interested in this work as it provides a faster, more accurate, and more convenient development environment for multiple types of ARM systems to any ARM developer. QEMU system emulation is used as a modeling and bring-up tool, as the basis for the Android Emulator, and as a tool for kernel and toolchain developers.
While GreenSocs did the initial work to get this started, Linaro has invested resources in continuing this work and ensuring a usable, scalable, and efficient solution will be merged upstream. It has coordinated with multiple community members without whom this work would not have progressed with its current impressive momentum. I would like to especially mention Cota, Rigo, and Paolo Bonzini, as well as fellow Linaro engineer Sergey Fedorov and the many others who have offered review comments online.
Regular posting and review is essential to keep the development momentum going and ensure that out-of-tree implementations remain relevant and usable. I will be attending KVM Forum 2016 in late August to finalize a plan for the remaining steps to get MTTCG merged upstream and enabled.