A QEMU case study in grappling with software complexity

October 12, 2021

This article was contributed by Kashyap Chamarthy

There are many barriers to producing software that is reliable and maintainable over the long term. One of those is software complexity. At the recently concluded 2021 KVM Forum, Paolo Bonzini explored this topic, using QEMU, the open source emulator and virtualizer, as a case study. Drawing on his experience as a maintainer of several QEMU subsystems, he made some concrete suggestions on how to defend against undesirable complexity. Bonzini used QEMU as a running example throughout the talk, hoping to make it easier for future contributors to modify QEMU. However, the lessons he shared are equally applicable to many other projects.

Why is software complexity even a problem? For one, unsurprisingly, it leads to bugs of all kinds, including security flaws. It also makes code review harder, and it makes contributing to and maintaining the project more painful. Obviously, none of these are desirable.

The question that Bonzini aimed to answer is "to what extent can we eliminate complexity?"; to do that he started by distinguishing between "essential" and "accidental" complexity. The notion of these two types of complexity originates from the classic 1987 Fred Brooks paper, "No Silver Bullet". Brooks himself was looking back to Aristotle's notion of essence and accident.

Essential complexity, as Bonzini put it, is "a property of the problem that a software program is trying to solve". Accidental complexity, instead, is "a property of the program that is solving the problem at hand" (i.e. the difficulties are not inherent to the problem being solved). To explain the concepts further, he identified the problems that QEMU is solving, which constitute the essential complexity of QEMU.

Essence and accidents of QEMU

QEMU has a large set of requirements in terms of portability, configurability, performance, and security. Besides emulating guest devices and providing ways to save and restore their state, it has a powerful storage layer and also embeds a few network servers, such as a VNC server. QEMU also has to make sure that the CPU and device models exposed to the guest remain stable, regardless of whether the underlying hardware or QEMU itself is updated. For many users it's important to use QEMU with a distribution kernel rather than a custom-built kernel. Being able to boot non-Linux operating systems is a necessary feature for many QEMU users, as well; it counts as essential complexity.

QEMU provides a management interface, usually called the monitor. In fact, there are two: HMP (human monitor protocol) and QMP (QEMU monitor protocol). External programs manage QEMU through the JSON-based interface provided by QMP, but human users need an easier way to interact with the monitor, which is what HMP offers. Therefore, QEMU contains an object model and a code generator that handles the marshaling and unmarshaling of C structures. Thanks to this code generator, the same code can easily operate on either JSON dictionaries or command-line options.
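
The idea of one description serving two input formats can be illustrated with a minimal sketch. This is not QEMU's code generator: the `NetdevOptions` schema, its fields, and the helper names are all hypothetical, chosen only to show a single typed description being filled from either a JSON object or a comma-separated key=value string.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class NetdevOptions:
    """Hypothetical schema; the field names are illustrative only."""
    id: str
    port: int
    enabled: bool = True


def _coerce(kind, raw):
    """Convert a raw value (typed JSON or a command-line string) to `kind`."""
    if kind is bool:
        if isinstance(raw, bool):
            return raw
        if raw in ("on", "off"):
            return raw == "on"
        raise ValueError(f"expected on/off, got {raw!r}")
    return kind(raw)


def from_dict(cls, raw):
    """Shared back end: validate keys and types against the schema."""
    known = {f.name: f.type for f in fields(cls)}
    unknown = set(raw) - set(known)
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    return cls(**{key: _coerce(known[key], value) for key, value in raw.items()})


def from_json(cls, text):
    """Front end for a JSON object, as a management program might send."""
    return from_dict(cls, json.loads(text))


def from_cli(cls, text):
    """Front end for a 'key=value,key=value' command-line argument."""
    return from_dict(cls, dict(item.split("=", 1) for item in text.split(",")))


via_json = from_json(NetdevOptions, '{"id": "net0", "port": 5555, "enabled": false}')
via_cli = from_cli(NetdevOptions, "id=net0,port=5555,enabled=off")
```

Both calls produce the same object, so any code that consumes `NetdevOptions` does not need to know which interface the values arrived through.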

Developers also see another face of complexity, which is brought in by the tools that are part of the build process. Tools make common tasks easier, but they also make debugging harder when they break. For example, QEMU once had a manual configuration mechanism that required listing all of the devices, one by one, in the board that it is emulating. These days, only the board needs to be specified, and the build system will automatically enable the devices that are supported by it. It also ensures that impossible configurations don't build — this is useful, but, of course, developers have to learn how to deal with those failures.

Sources of complexity

For the sake of his presentation, Bonzini identified two main sources of accidental complexity. The first is "incomplete transitions" (inspired by a paper on GCC maintenance), which occur when a new and better way to do something is introduced, but it is not applied consistently across the codebase. This can be due to any number of reasons: the developer might not have the time or relevant expertise; or they simply fail to discover the remaining occurrences.

As an example, he cited two contrasting ways to report errors in QEMU: the propagation-based API, and ad hoc functions (e.g. error_report()) that write errors to the standard error stream. The propagation-based API was introduced to report errors to the QMP interface. It has two advantages: it separates the point where an error happens from the point where it is reported, and it allows for graceful error recovery. Another example of an incomplete transition is that, even though these days the QEMU build system mostly uses Meson, there are preexisting compilation tests that are written in the Bourne shell and are part of QEMU's configure script.
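
The difference between the two styles can be sketched in a few lines. QEMU's real API is written in C; here Python exceptions stand in for the propagation mechanism, and `ConfigError`, `parse_port()`, and the two callers are hypothetical names. The JSON reply is only shaped like a QMP success or error response.

```python
import json
import sys


class ConfigError(Exception):
    """Hypothetical error type carrying a human-readable description."""


def _valid_port(text):
    return text.isdecimal() and 0 < int(text) < 65536


def parse_port_adhoc(text):
    """Ad hoc style: report at the point of failure, then give up."""
    if not _valid_port(text):
        print(f"invalid port: {text}", file=sys.stderr)
        return None
    return int(text)


def parse_port(text):
    """Propagation style: describe the failure and let the caller decide."""
    if not _valid_port(text):
        raise ConfigError(f"invalid port: {text}")
    return int(text)


def handle_cli(text):
    """A command-line caller prints to standard error and exits non-zero."""
    try:
        return parse_port(text)
    except ConfigError as err:
        print(f"error: {err}", file=sys.stderr)
        raise SystemExit(1)


def handle_monitor(text):
    """A monitor-like caller turns the same error into a JSON reply."""
    try:
        return json.dumps({"return": parse_port(text)})
    except ConfigError as err:
        return json.dumps({"error": {"class": "GenericError", "desc": str(err)}})
```

With the ad hoc function, a monitor client would never see the message, because it has already gone to the process's standard error stream; with the propagating version, each caller chooses where the report goes and whether to recover.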

However, QEMU also has a decent track record of completing transitions. Several of these were done using Coccinelle, a pattern-matching and source transformation tool that allows the creation of a "semantic patch" that it can apply uniformly across the codebase. For instance, Coccinelle was used to replace obsolete APIs, to simplify code that was jumping through unnecessary hoops, or even to introduce whole new APIs (as was the case for the creation and "realization" of devices).

The second source of accidental complexity is duplicated logic and missing abstractions. There is a trade-off between writing code that is ad hoc, or designing reusable data structures and APIs. Bonzini pointed to the command-line parsing and contrasted ad hoc code, using functions such as strtol() or sscanf(), with QEMU-specific APIs such as QemuOpts or keyval. The latter ensure a level of consistency in the command line, and sometimes take care of printing help messages.
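
To make the trade-off concrete, here is a small sketch contrasting a hand-rolled parser with a shared one. Neither function is QEMU code, and `parse_keyval()` is not the keyval API itself; the function names and the dotted-key nesting are assumptions made for the example.

```python
def parse_size_adhoc(arg):
    """Ad hoc: hand-rolled syntax that only this one option understands."""
    # "size=64" works; "64" raises IndexError and "size=big" a bare
    # ValueError, so every option ends up with its own failure modes.
    return int(arg.split("=")[1])


def parse_keyval(text):
    """Shared parser: 'a=1,b.c=2' becomes {'a': '1', 'b': {'c': '2'}}."""
    result = {}
    for item in text.split(","):
        key, sep, value = item.partition("=")
        if not key or not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *groups, leaf = key.split(".")
        node = result
        for name in groups:
            node = node.setdefault(name, {})
            if not isinstance(node, dict):
                raise ValueError(f"{name!r} is used as both a value and a group")
        node[leaf] = value
    return result


# Two unrelated (hypothetical) options get the same syntax and the same
# error messages without any option-specific parsing code.
machine = parse_keyval("type=pc,smp.cpus=4")
memory = parse_keyval("size=64")
```

The shared parser costs more to write once, but each additional option then inherits consistent syntax and diagnostics instead of re-implementing them.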

Another example is the recent effort to organize more parts of QEMU into shared objects that can be installed separately. As the number of such modules grew, a new mechanism was put in place to list a module's provided functionality and its dependencies in the same source file as the implementation, rather than having them scattered around the QEMU source code. As soon as a reviewer notices excessive duplication, or functionality scattered across many files, they should make a plan on how to eliminate that, Bonzini suggested.

Complexity on the QEMU command line

The talk proceeded with a case study of accidental complexity in QEMU, namely the command-line processing code. QEMU has 117 options, implemented in approximately 3000 lines of code that has "some essential complexity, but way too much accidental complexity". Bonzini outlined ways to simplify things, or how not to make them worse when working on QEMU's command-line parsing code. He began by asking: what exactly is causing accidental complexity in QEMU's command-line options? The many options vary a lot in their implementation, so the talk grouped them into six categories, and went through them in order of increasing accidental complexity: flexible, command, combo, shortcut, one-off, and legacy.

Flexible options are the most complicated, since they cater to a wide range of needs. They provide access to large parts of QEMU's essential complexity, and new features in QEMU are usually enabled through these options. Flexible options work by delegating as much as possible to generic QEMU APIs, so that enabling new features does not require writing or modifying any command-line parsing code. This is how a single option, ‑object, can configure secrets such as encryption keys, certificates for TLS, the association of virtual machines to NUMA nodes on the host, and so on. Three options, ‑cpu, ‑device, and ‑machine, configure almost all aspects of the virtual hardware. However, these options are not immune to accidental complexity; there are at least four parsers for them: QemuOpts, keyval, a JSON parser, and a bespoke parser that is used by the ‑cpu option. "Four parsers are at least two more than there should be."

A command option is specified on the QEMU command line, but it also typically corresponds to one of the QMP commands that can be invoked at run time. An example is the option to not start the vCPU at guest boot up (qemu-kvm -S on the command line; or stop at run time), but start it only when asked to do so (via QMP cont, for "continue"). Another example is ‑loadvm to start QEMU from a file with guest state saved in it; or ‑trace to enable trace points (this assumes QEMU is built with one of the available tracing backends). These options put a relatively small burden on the QEMU maintainer; but Bonzini suggested keeping a high bar for adding new command-line options: it's easier to invoke the options from the QMP interface at run time.
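
The run-time half of a command option can be sketched as the JSON messages a management client would send. The `execute` field and the `qmp_capabilities`, `stop`, and `cont` commands are part of QMP; the `qmp_command()` helper is a hypothetical name, and the sketch only builds the messages without opening a connection to a running QEMU.

```python
import json


def qmp_command(name, arguments=None):
    """Serialize one QMP command; `arguments` is an optional dict."""
    message = {"execute": name}
    if arguments:
        message["arguments"] = arguments
    return json.dumps(message)


# A guest started with -S stays paused until the client asks it to continue.
session = [
    qmp_command("qmp_capabilities"),  # capability negotiation comes first
    qmp_command("cont"),              # resume the paused vCPUs
    qmp_command("stop"),              # pause them again at run time
]
```

Because the behavior is reachable through the monitor, the command-line flag is just a convenient way of requesting the same state at startup.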

With combo options, "we start our descent into accidental complexity hell": these options create both the frontend and backend of a device in a single command-line option. For example, QEMU's ‑drive option creates a device such as virtio-blk and a disk image for the guest in a single option. The more verbose variants of the options are unwieldy enough for casual users that the combo options do serve a genuine use case, but they have a high maintenance burden. The parsing code is complex and the options also tend to have ramifications in the rest of the code — both the backend code and the virtual-chipset creation code. These options make QEMU's code less modular, so that one cannot add support for a new board without knowing some details about the command line.
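
As a rough illustration of what a combo option bundles together, the sketch below builds the separate backend and frontend arguments that correspond to a single ‑drive argument. It assumes a PCI-based machine and a qcow2 image; the two forms are only approximately equivalent, since defaults can differ, and `split_drive()` and the `disk0` node name are hypothetical.

```python
def split_drive(path, fmt="qcow2", node="disk0"):
    """Return (backend_args, frontend_args) for one virtio disk."""
    backend = [
        "-blockdev",
        f"driver={fmt},node-name={node},file.driver=file,file.filename={path}",
    ]
    # The frontend refers to the backend only through its node name.
    frontend = ["-device", f"virtio-blk-pci,drive={node}"]
    return backend, frontend


# The combo form: image and guest device described in one argument.
combo = ["-drive", "file=disk.qcow2,format=qcow2,if=virtio"]

# The long form: the same pieces, spelled out separately.
backend, frontend = split_drive("disk.qcow2")
```

The long form is more verbose, but the disk image and the guest-visible device are independent objects, which is what keeps the board and backend code free of command-line details.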

Shortcut options are syntactic sugar for the previous three groups. For example, ‑kernel path is short for ‑machine pc,kernel=path. They are handy — many users may not even realize that the longer forms exist — and they have a small maintenance burden because their implementation lives entirely within the command-line parsing code. However, given the sheer number of options that already exist, it's better to not add more.
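
A sketch of why shortcuts are cheap to maintain: the expansion can be a pure rewrite of the argument list that finishes before any other code sees the options. The table and `expand_shortcuts()` are hypothetical; the ‑kernel entry follows the example above (leaving out the machine type), and the ‑initrd and ‑append entries are assumed to follow the same pattern.

```python
# shortcut option -> (long-form option, template for its argument)
SHORTCUTS = {
    "-kernel": ("-machine", "kernel={}"),
    "-initrd": ("-machine", "initrd={}"),
    "-append": ("-machine", "append={}"),
}


def expand_shortcuts(argv):
    """Rewrite shortcut options into their long forms; pass the rest through."""
    expanded = []
    args = iter(argv)
    for arg in args:
        if arg not in SHORTCUTS:
            expanded.append(arg)
            continue
        option, template = SHORTCUTS[arg]
        value = next(args, None)
        if value is None:
            raise ValueError(f"{arg} requires an argument")
        expanded += [option, template.format(value)]
    return expanded


long_form = expand_shortcuts(["-kernel", "bzImage", "-m", "512"])
```

Nothing outside this function needs to know that the shortcut exists, which is the property that distinguishes a shortcut from a combo or one-off option.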

Then there are one-off options; these are essential but their implementation is often suboptimal. Typically, they write a value to a global variable, or call a function that is not available via the QEMU monitor at run time. Bonzini pleaded with developers to avoid creating new ones and instead to refactor the existing ones into shortcut or command options, which he has been doing on and off over the past year.

Finally, with the legacy command-line options, "we hit rock bottom". Many of them are failed experiments (e.g. the ‑readconfig and ‑writeconfig options) or things that should not be in QEMU at all. For example, instead of ‑daemonize that daemonizes the QEMU process after initialization, users are better served by tools such as libvirt. The way forward for these is to deprecate and ultimately remove them.

Ways to fight back

What lessons does the QEMU command line teach and what guidance can a developer derive? "Do not design in a void", he said — exploit the existing essential complexity. Before embarking on adding a new command-line flag, ask yourself if it is necessary. Perhaps one of the existing integrations in QEMU such as the QEMU API and QMP commands could be used. This way, one can make the most of the existing interactions between QEMU's subsystems.

Second, Bonzini highlighted the responsibilities of patch reviewers: understand the essential part of the complexity, and do not mistake it for an accident — this is a prerequisite to identify rising accidental complexity. And don't let the accidental complexity take over the project. For those working on refactoring large codebases, he encouraged learning Coccinelle.

Incomplete transitions are not always to be feared: transitioning from an old API to a new and better API is a natural part of how software is improved. In QEMU's case, sometimes a new feature requires a transition period anyway, because it affects the command line or a management tool, and thus requires a deprecation cycle. In such cases, take advantage of the incomplete transition, and work in phases. Identify the smallest chunks of work that can be considered an improvement, and plan for what comes later.

Further, ensure that the new and recommended way to perform a development task, or to use a feature, is documented: "there should be one obvious way to do a task. If not, one documented way to do it." Incomplete or piece-wise transitions should not deter one from making improvements to a program. Evaluate the trade-offs between duplicating code and adding more abstractions. Some situations may warrant code duplication; but when things are turning for the worse, do not aggravate the situation.

Conclusion

Building essentially complex and maintainable software is hard enough as it is. Problems can compound over time if the elements of accidental complexity discussed here (incomplete transitions, excessive abstractions, ill-defined logical boundaries between components, and tooling complexity) are not reined in. The lessons distilled here from QEMU's experience provide ample guidance for other projects confronted with similar obstacles.

[I'd like to thank Paolo Bonzini for substantial reviews of earlier drafts of this article.]

Index entries for this article
GuestArticles	Chamarthy, Kashyap
Conference	KVM Forum/2021

Posted Oct 13, 2021 16:12 UTC (Wed) by marcH (subscriber, #57642)

> The many options vary a lot in their implementation, so the talk grouped them into six categories, and went through them in order of increasing accidental complexity: flexible, command, combo, shortcut, one-off, and legacy.

I wish this classification had been available last time I tried to make sense of the command line interface. Any plan to actually classify / tag each option and make all that part of the official documentation?

> Further, ensure that the new and recommended way to perform a development task, or using a feature is documented — "there should be one obvious way to do a task. If not, one documented way to do it."

Posted Oct 13, 2021 17:43 UTC (Wed) by pbonzini (subscriber, #60935)

It hadn't occurred to me that this distinction would be useful to users as well, but in retrospect it's not surprising at all! The only snag is that the classification, which I made based on the implementation, would likely require some adjustment to become useful to users.

For example, "shortcut" includes some fairly common options such as -smp. "One-off" also would need a different name ("other"?) and it would also include some very common options (such as "-m").

Posted Oct 13, 2021 18:35 UTC (Wed) by marcH (subscriber, #57642)

> The only snag is that the classification, which I made based on the implementation, would likely require some adjustment to become useful to users.

Agreed and thanks for considering this.

Posted Oct 13, 2021 19:48 UTC (Wed) by kashyap (guest, #55821)

I didn't mention it out loud during the reviews, but I agree—as a user, the classification was useful for me too, despite being familiar with many of the options. So, documenting this upstream, with appropriate caveats, can be really handy. Especially for those navigating the intimidating man page for the first time. (If you haven't already sent the patch, I can make a to-do to take a stab at the first draft :-))

Posted Oct 13, 2021 21:11 UTC (Wed) by pm215 (subscriber, #98099)

Would it be helpful also if the 'shortcut' and 'combo' options documentation always included documentation of the long-form equivalents?

Posted Oct 14, 2021 6:52 UTC (Thu) by pbonzini (subscriber, #60935)

For the combo options, that's complicated. Generally it's documented in docs/qdev-device-use.txt but that file is not yet part of the nice rST manuals.

For the shortcuts yeah, they should mention the long form.

Posted Oct 14, 2021 7:33 UTC (Thu) by pm215 (subscriber, #98099)

It's the cases where it's complicated that are most useful to document :-)

(I've been putting off rstifying qdev-device-use because it's a big bag of stuff, half of which doesn't have an immediately obvious home in an existing bit of the rst manual...)

Posted Oct 14, 2021 16:06 UTC (Thu) by chris_se (subscriber, #99706)

> Finally, with the legacy command-line options, "we hit rock bottom". Many of them are failed experiments (e.g. the ‑readconfig and ‑writeconfig options) [...]

Is -readconfig actually deprecated? -writeconfig: sure. But I consider -readconfig to be actually quite useful...

Posted Oct 14, 2021 16:49 UTC (Thu) by pbonzini (subscriber, #60935)

-readconfig is not deprecated, but it *is* a very messy experiment:

- it does not support many of the one-off options (but it does support some)

- it is untyped (everything is a string), while these days QEMU configuration knows (somewhat) in advance what fields are integers and which are strings - or at least would be able to emit decent error messages in case of confusion.

- it has undocumented weirdnesses, for example -smp corresponds to the [smp-opts] section of the configuration file instead of [smp], for no particular reason other than "people always forget about -readconfig during both coding and review"

So -readconfig itself is not deprecated, but I would like to deprecate or remove large chunks of the configuration file format. For example, these days -smp is a shortcut option, so one can also write e.g. smp.cpus = "4" in the [machine] section. Enforcing this would remove the weirdness of [smp-opts].

Posted Oct 24, 2021 20:12 UTC (Sun) by Hi-Angel (guest, #110915)

> For those working on refactoring large codebases, he encouraged learning Coccinelle

Please don't. As someone who spent hours and hours of my life on Coccinelle, I advise you to learn pyparsing instead. Coccinelle might have been a good idea, except that for anything harder than a straightforward variable rename you'll get stuck with code that doesn't work. Sometimes it prints vague errors, other times it just does not do the conversion. In both cases you will spend hours trying to make it work. I have reported bugs to Coccinelle about it printing bad error descriptions, and what I've heard back is that it is by design. The problem is that Coccinelle is written as a yacc-based parser¹. What I gather is that improving debuggability of Coccinelle would require a complete rewrite.

My suggestion: use `pyparsing`. It is a Python module that lets you match patterns, and there's nothing specific to the C language in it. That is both a bad and a good thing. It's bad because it can't be as "smart" as Coccinelle claims to be (but it really doesn't matter because of the amount of time you're going to spend trying to make Coccinelle actually work), but good because you can refactor other languages too.

1: https://github.com/coccinelle/coccinelle/issues/242#issue...