LWN.net Weekly Edition for October 31, 2024
Welcome to the LWN.net Weekly Edition for October 31, 2024
This edition contains the following feature content:
- OSI readies controversial Open AI definition: the definition does not require open training data, leaving some unhappy with the result.
- An update on Apple M1/M2 GPU drivers: OpenGL and Vulkan conformance have been achieved; lots more games are playable too.
- The performance of the Rust compiler: reducing compile time for Rust programs is an explicit goal of the project; progress has been made, with more to come.
- realloc() and the oversize importance of zero-size objects: changing the longstanding behavior of realloc() in the GNU C library is somewhere between difficult and impossible.
- Kernel optimization with BOLT: a technique to reorder the kernel binary using profiling information in order to reduce instruction-cache misses and improve performance.
- AutoFDO and Propeller: profile-guided mechanisms that use the compiler and linker to produce faster kernel binaries.
- A new approach to validating test suites: a way to apply mutation to Rust programs in order to find gaps in their test coverage.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
OSI readies controversial Open AI definition
The Open Source Initiative (OSI) has been working on defining Open Source AI—that is, what constitutes an AI system that can be used, studied, modified, and shared for any purpose—for almost two years. Its board will be voting on the Open Source AI Definition (OSAID) on Sunday, October 27, with the 1.0 version slated to be published on October 28. It is never possible to please everyone in such an endeavor, and it would be folly to make that a goal. However, a number of prominent figures in the open-source community have voiced concerns that OSI is setting the bar too low with the OSAID—and that doing so will undo decades of community work to cajole vendors into adhering to, or at least respecting, the original Open Source Definition (OSD).
Defining Open Source AI
OSI executive director Stefano Maffulli announced the organization's intent to provide a definition for open-source AI in June 2023. He took exception to announcements of "large language models, foundational models, tooling, services all claiming to be 'open' or 'Open Source'" while adding restrictions that run afoul of the OSD. A survey of large-language-model (LLM) systems in 2023 found that ostensibly open-source LLMs did not live up to the name.
The problem is not quite as simple as saying "use an OSD-compliant license" for LLMs, because there are many more components to consider. The original OSD is understood to apply to the source code of a program in "the preferred form in which a programmer would modify the program". A program is not considered open source if a developer cannot study, use, modify, and share it, and a license is not OSD-compliant if it does not preserve those freedoms. A program can depend on non-free data and still be open source, though. For example, the game Quake III Arena (Q3A) is available under the GPLv2; that distribution does not include the pak files that contain the maps, textures, and other content required to actually play the commercial game. Despite that, others can still use the Q3A code to create their own games, such as Tremulous.
When discussing an "AI system", however, things are much more complicated. There is more than just the code that is used to run the models to do work of some kind, and the data is not something that can be wholly separate from the system in the way that it can be with a game. When looking at, say, LLMs, there is the model architecture, the code used to train models, model parameters, the techniques and methodologies used for training, the procedures for labeling training data, the supporting libraries, and (of course) the data used to train the models.
OSI has been working on its definition since last year. It held a kickoff meeting on June 21, 2023 at the Mozilla headquarters in San Francisco. It invited participation afterward via a regular series of in-person and online sessions, and with a forum for online discussions. LWN covered one of the sessions, held at FOSDEM 2024, in February.
The current draft of the OSAID takes its definition of an AI system from the Organisation for Economic Co-operation and Development (OECD) Recommendation of the Council on Artificial Intelligence:
A machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.
This includes source code for training and running the system, model parameters "such as weights or other configuration settings", as well as "sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system".
Preferred form to make modifications
Those elements must all be available under OSI-approved licenses, according to the proposed definition, which seems perfectly in line with what we've come to expect when something is called "open source". There is an exception, though, for things like the data information and model parameters, which must be available under "OSI-approved terms". The definition of OSI-approved terms is not supplied yet.
There is no requirement to make the training data available. To be compliant with the current draft of the OSAID, an AI system need only provide "detailed information" about the data, but not the data itself.
The OSI published version 0.0.9 on August 22. It acknowledged then that "training data is one of the most hotly debated parts of the definition". However, the OSI was choosing not to require training data:
After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go.
Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.
As it stands, some feel that the OSAID falls short of allowing the four freedoms that it is supposed to ensure. For example, julia ferraioli wrote that, without including data, the only things that the OSAID guarantees are the ability to use and distribute an AI system. "They would be able to build on top of it, through methods such as transfer learning and fine-tuning, but that's it."
Tom Callaway has written at length on LinkedIn about why open data should be a requirement. He acknowledges that there are good reasons that distributors of an AI system may not want, or be able, to distribute training data. For example, the data itself may have a high monetary value on its own, and a vendor may be unwilling or unable to share it. Acme Corp might license a data set and have permission to create an AI system using it, but not the ability to distribute the data itself. The data might have legal issues, ranging from confidentiality (e.g., medical data sets) to a desire to avoid lawsuits from using copyrighted data.
All of those are understandable reasons for not distributing data with an AI system, he said, but they don't argue for crafting a definition that allows companies to call their system open:
If we let the Open Source AI definition contain a loophole that makes data optional, we devalue the meaning of "open source" in all other contexts. While there are lots of companies who would like to see open source mean less, I think it's critical that we not compromise here, even if it means there are less Open Source AI systems at first.
Objections to lack of training data are more than an attachment to the original meaning of open source. Giacomo Tesio posted a list of issues he considered unaddressed in the RC2 version of the OSAID, including a claim that there is inherent insecurity due to the ability to plant undetectable backdoors in machine-learning models.
Others weigh in
The Free Software Foundation (FSF) announced that it was working on "a statement of criteria for free machine learning applications", which would spell out what it takes to call something a free (or libre) machine-learning application. The FSF says that it is close to a definition, and is working on the exact text. However, it adds that "we believe that we cannot say a ML application 'is free' unless all its training data and the related scripts for processing it respect all users, following the four freedoms".
However, the FSF makes a distinction between non-free and unethical in this case:
It may be that some nonfree ML have valid moral reasons for not releasing training data, such as personal medical data. In that case, we would describe the application as a whole as nonfree. But using it could be ethically excusable if it helps you do a specialized job that is vital for society, such as diagnosing disease or injury.
The Software Freedom Conservancy has announced an "aspirational statement" about LLM-backed generative AI for programming called "Machine-Learning-Assisted Programming that Respects User Freedom". Unlike the OSAID, this target focuses solely on computer-assisted programming, and was developed in response to GitHub Copilot. The announcement did not directly name the OSI or the OSAID effort, but said "we have avoided any process that effectively auto-endorses the problematic practices of companies whose proprietary products are already widely deployed". It describes an ideal LLM system built only with FOSS, with all components available, and only for the creation of FOSS.
Response to criticisms
I emailed Maffulli about some of the criticisms of the current OSAID draft, and asked why OSI appears to be "lowering the bar" when the OSI has never budged on source availability and use restrictions. He replied:
I'll be blunt: you mention "source redistribution" in your question and that's what leads people like [Callaway] into a mental trap [...]
There are some groups believing that more components are required to guarantee more transparency. Other groups instead believe that model parameters and architecture are enough to modify AI. The Open Source AI Definition, developed publicly with a wide variety of stakeholders worldwide, with deep expertise on building AI (see the list of endorsers), found that while those approaches are legitimate, neither is optimal. The OSAID grants users the rights (with licenses) and the tools (with the list of required components) to meaningfully collaborate and innovate on (and fork, if required) AI systems. We have not compromised on our principles: we learned many new things from actual AI experts along the way.
Maffulli objected to the idea that the OSAID was weaker or making concessions, and said that the preferred form for modifying ML systems was what is in the OSAID: "it's not me nor OSI board saying that, it's in the list of endorsers and in [Carnegie Mellon University's] comment". He added that OSI had synthesized input from "AI builders, users, and deployers, content creators, unions, ethicists, lawyers, software developers from all over the world" to arrive at the definition. A "simple translation" of the OSD, he said, would not work.
Stephen O'Grady, founder of the RedMonk analyst firm, also makes the case that the OSD does not easily translate to AI projects. But he does not believe that the term open source "can or should be extended into the AI world", as he wrote in a blog post on October 22:
At its heart, the current deliberation around an open source definition for AI is an attempt to drag a term defined over two decades ago to describe a narrowly defined asset into the present to instead cover a brand new, far more complicated future set of artifacts.
O'Grady makes the case that the OSI has set out on a pragmatic path to define open-source AI, which requires nuance. Open source has succeeded, in part, because the OSD removes nuance. Does a license comply with the OSD or doesn't it? It's pretty easy to determine. Less so with the OSAID. The pragmatic path, he said:
Involves substantial compromise and, more problematically, requires explanation to be understood. And as the old political adage advises: "If you're explaining, you're losing."
It would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate. That seems unlikely now, he said, after two years of trying to "thread the needle between idealism and capitalism to arrive at an ideologically sound and yet commercially acceptable" definition.
Indeed, it seems likely that the OSI board will move forward with the current draft of the OSAID or something close to it. The impact that will have is much less certain.
An update on Apple M1/M2 GPU drivers
The kernel graphics driver for the Apple M1 and M2 GPUs is, rather famously, written in Rust, and it has achieved conformance with various graphics standards, which is also noteworthy. At the X.Org Developers Conference (XDC) 2024, Alyssa Rosenzweig gave an update on the status of the driver, along with some news about the kinds of games it can support (YouTube video, slides). There has been lots of progress since her talk at XDC last year (YouTube video), with, of course, still more to come.
It is something of an XDC tradition, since she began it in Montreal in 2019 (YouTube video), for Rosenzweig to give her presentations dressed like a witch. This year's edition was no exception, though this time she started her talk in French, which resulted in some nervous chuckles from attendees. After a few sentences, she switched to English, "I'm just kidding", and continued with her talk.
Updates and tessellation
Last year at XDC, she and Asahi Lina reported that the driver had reached OpenGL ES 3.1 conformance. They also talked about geometry shaders, because "that was the next step". Since then, the driver has become OpenGL 4.6 conformant. That meant she was going to turn to talking about tessellation shaders, "as I threatened to do at the end of last year's talk".
Tessellation, which is a technique that "allows detail to be dynamically added and subtracted" from a scene, is required for OpenGL 4.0, and there is a hardware tessellator on the Apple GPU—but, "we can't use it". The hardware is too limited to implement any of the standards; "it is missing features that are hard required for OpenGL, Vulkan, and Direct3D". That makes it "pretty much useless to anybody who is not implementing Metal". Apple supports OpenGL 4.1, though it is not conformant, but if you use any of the features that the hardware does not support, it simply falls back to software; "we are not going to do that".
![Alyssa Rosenzweig](https://static.lwn.net/images/2024/xdc-rosenzweig-sm.png)
As far as Rosenzweig is aware, the hardware lacks point mode, where points are used instead of the usual triangles; it also lacks isoline support, but those two things can be emulated. The real problem comes with transform feedback and geometry shaders, neither of which is supported by the hardware, but the driver emulates them with compute shaders. However, the hardware tessellator cannot be used at all when those are being emulated because minute differences in the tessellation algorithms used by the hardware and the emulation would result in invariance failures. She is not sure whether that is a problem in practice or not, "but the spec says not to do it", so she is hoping not to have to go that route.
Instead, "we use software
". In particular, Microsoft released a
reference tessellator a decade or more ago, which was meant to show
hardware vendors what they were supposed to implement when tessellation was
first introduced. It is "a giant pile of 2000 lines of C++
" that
she does not understand, despite trying multiple times; "it is
inscrutable
". The code will tessellate a single patch, which gave the
driver developers an idea: "if we can run that code, we can get the
tessellation outputs and then we can just draw the triangles or the lines
with this index buffer
".
There are some problems with that approach, however, starting with the fact that the developers are writing a GPU driver; "famously, GPUs do not like running 2000 lines of C++". But, she announced, "we have conformant OpenCL 3.0 support" thanks to Karol Herbst, though it has not yet been released. OpenCL C is "pretty much the same as regular CPU C", though it has a few limitations and some extensions for GPUs. So the idea would be to turn the C++ tessellation code into OpenCL C code; "we don't have to understand any of it, we just need to not break anything when we do the port".
That works, but "tessellator.cl is the most unhinged file of my career"; doing things that way was also the most unhinged thing she has done in her career, "and I'm standing up here in a witch hat for the fifth year in a row". The character debuted in the exact same room in 2019 when she was 17 years old, she recalled.
The CPU tessellator only operates on a single patch at a time, but a scene might have 10,000 patches—doing them all serially will be a real problem. GPUs are massively parallel, though, so having multiple threads each doing tessellation is "pretty easy to arrange". There is a problem with memory allocation; the CPU tessellator just allocates for each operation sequentially, but that will not work for parallel processing. Instead, the driver uses the GPU atomic instructions to manage the allocation of output buffers.
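As a rough illustration of that allocation scheme, here is a minimal C sketch (using C11 atomics in place of the OpenCL atomic built-ins, with invented names and sizes); it is not code from the Asahi driver or from tessellator.cl, just the bump-allocation pattern described above: each worker reserves a slice of a shared index buffer by atomically advancing a counter, so patches can be processed in parallel without per-patch locks.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical shared output buffer for all tessellated patches. */
struct output_heap {
    _Atomic uint32_t next;    /* next free slot in indices[] */
    uint32_t capacity;        /* total number of slots */
    uint32_t *indices;        /* the shared index buffer */
};

/* Reserve count slots; returns the starting offset, or UINT32_MAX if full. */
static uint32_t heap_alloc(struct output_heap *heap, uint32_t count)
{
    uint32_t start = atomic_fetch_add(&heap->next, count);

    if (start + count > heap->capacity)
        return UINT32_MAX;    /* out of space */
    return start;
}

/* One worker: tessellate a single patch and write its indices into the
 * slice of the shared buffer it just reserved.  On a GPU, one of these
 * would run per work-item rather than in a CPU-side loop. */
static void tessellate_patch(struct output_heap *heap, uint32_t patch_id)
{
    const uint32_t index_count = 6;    /* pretend each patch emits 6 indices */
    uint32_t base = heap_alloc(heap, index_count);

    if (base == UINT32_MAX)
        return;
    for (uint32_t i = 0; i < index_count; i++)
        heap->indices[base + i] = patch_id * index_count + i;
}

int main(void)
{
    struct output_heap heap = {
        .capacity = 1024,
        .indices = calloc(1024, sizeof(uint32_t)),
    };

    if (heap.indices == NULL)
        return 1;
    for (uint32_t patch = 0; patch < 100; patch++)
        tessellate_patch(&heap, patch);
    printf("used %u of %u slots\n", atomic_load(&heap.next), heap.capacity);
    free(heap.indices);
    return 0;
}
```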
In order to draw the output of the tessellators, though, there is a need to use draw instructions with packed data structures as specified by the GPU. That is normally done from the C driver code using functions that are generated by the GenXML tool. Since the tessellators are simply C code, "thanks to OpenCL", the generated functions can be included into the code that runs on the GPU. Rosenzweig went into more detail, which fills in the holes (and likely inaccuracies) of the above description; those interested in the details should look at the presentation video and her slides.
"Does it work? Yes, it does."
She showed an image of terrain tessellation from a Vulkan demo. It was run on an M2 Mac with "entirely OpenCL-based tessellation". There is also the question of "how is the performance of this abomination?" The answer is that it is "okay". On that system, software-only terrain tessellation runs at less than one frame per second (fps), which "is not very fun for playing games"; with OpenCL, it runs at 265fps, which is "pretty good" and is unlikely to be the bottleneck for real games. The hardware can do 820fps; "I did wire up the hardware tessellator just to get a number for this talk." There is still room for improvement on the driver's numbers, she said.
Vulkan and gaming
She also announced Vulkan 1.3 conformance for the Honeykrisp M1/M2 GPU driver. It started by copying the NVK Vulkan driver for NVIDIA GPUs, "smashed against the [Open]GL 4.6 [driver]", which started passing the conformance test suite "in about a month". That was six months ago and, since then, she has added geometry and tessellation shaders, transform feedback, and shader objects. The driver now supports every feature needed for multiple DirectX versions.
There are a lot of problems "if we want to run triple-A (AAA) games on this system", however. A target game runs on DirectX and Windows on an x86 CPU with 4KB pages, but "our target hardware is running literally none of those things". What is needed is to somehow translate DirectX to Vulkan, Windows to Linux, x86 to Arm64, and 4KB pages to 16KB pages. The first two have a well-known solution in the form of the DXVK driver and Wine, which are "generally packaged into Proton for Steam gaming".
Going from x86 to Arm64 also has off-the-shelf solutions: FEX-Emu or Box64. She has a bias toward FEX-Emu; "when I am not committing Mesa crimes, I am committing FEX-Emulation crimes". The big problem, though, is the page-size difference.
FEX-Emu requires 4KB pages; Box64 has a "hack to use 16KB pages, but it doesn't work for Wine, so it doesn't help us here". MacOS can use 4KB pages for the x86 emulation, but "this requires very invasive kernel support"; Asahi Linux already has around 1000 patches that are making their way toward the mainline kernel, but "every one of those thousand is a challenge". Making changes like "rewriting the Linux memory manager" is not a reasonable path.
It turns out that, even though Linux does not support heterogeneous page sizes between different processes, it does support them between different kernels; "what I mean by that is virtualization". A KVM guest kernel can have a different page size than the host kernel. So, "this entire mess", consisting of FEX-Emu, Wine, DXVK, Honeykrisp, Steam, and the game, "we are going to throw that into a virtual machine, which is running a 4KB guest kernel".
There is some overhead, of course, but it is hardware virtualization, so that should have low CPU overhead. The problem lies with the peripherals, she said. So, instead of having Honeykrisp in the host kernel, it runs in the guest using virtgpu native contexts; all of the work to create the final GPU command buffer is done in the guest and handed to the host, rather than making all of the Vulkan calls individually traverse the virtual-machine boundary. The VirGL renderer on the host then hands that to the GPU, which "is not 100% native speed, but definitely well above 90%", Rosenzweig said.
The good news is that the overheads for the CPU and GPU do not stack, since the two run in parallel. "So all the crap overhead we have in the CPU is actually crap that is running in parallel to the crap overhead on the GPU, so we only pay the cost once."
"'Does it work?' is the question you all want to know." It does, she said; it runs games like Portal and Portal 2. She also listed a number of others: Castle Crashers, The Witcher 3, Fallout 4, Control, Ghostrunner, and Cyberpunk 2077.
All of the different pieces that she mentioned were made available on October 10, the day of the talk. For those running the Fedora Asahi Remix distribution, she suggested immediately updating to pick up the pieces that she had described. Before taking questions, she launched Steam, which took some time to come up, in part because of the virtual machine and the x86 emulation. Once it came up, she launched Control, which ran at 45fps on an M1 Max system.
There was a question about resources from someone who has a Mac with 8GB of RAM. Rosenzweig said that the high-end gaming titles are only likely to work on systems with 16GB or more. She noted that she was playing Castle Crashers on an 8GB system during the conference, so some games will play; Portal will also work on that system. She hopes that the resources required will drop over time.
Another question was about ray-tracing support, since Control can use that feature. Rosenzweig suggested that patches were welcome but that she did not see that as a high priority ("frankly, I think ray tracing is a bit of a gimmick feature"). Apple hardware only supports it with the M3, and the current driver is for M1 and M2 GPUs, though she plans to start working on M3 before long. The session concluded soon after that, though Rosenzweig played Control, admittedly poorly, as time ran down.
[ I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Montreal for XDC. ]
The performance of the Rust compiler
Sparrow Li presented virtually at RustConf 2024 about the current state of and future plans for the Rust compiler's performance. The compiler is relatively slow to compile large programs, although it has been getting better over time. The next big performance improvement to come will be parallelizing the compiler's parsing, type-checking, and related operations, but even after that, the project has several avenues left to explore.
As the projects written using Rust get larger, compilation latency has become increasingly important. The Rust compiler's design causes slow compilation, Li said, when compared to languages with implementations suitable for fast prototyping such as Python or Go. This slowness does have a cause — the compiler performs many sophisticated analyses and transformations — but it would be easier to develop with Rust if the language could be compiled faster.
To give some idea of how badly Rust performs compared to other languages, Li presented a graph showing the compile time of programs with different numbers of functions across several languages. Rust's line was noticeably steeper. For small numbers of functions, it was no worse than other languages, but compile times increased much more as the Rust programs grew larger. Improving compile times has become an area of focus for the whole community, Li said.
This year, the speed of the compiler has more than doubled. Most of this is due to small, incremental changes by a number of developers. Part of this improvement is also from work on tools like rustc-perf, which shows developers how their pull requests to the Rust compiler affect compile times. The efforts of the parallel-rustc working group (in which Li participates) are now available in the nightly language builds, and end up providing a 30% performance improvement in aggregate. Those improvements should land in stable Rust by next year.
There are a handful of other big improvements that will be available soon. Profile-guided optimization provides (up to) a 20% speedup on Linux. BOLT, a post-link optimizer, (which was the topic of a talk that LWN covered recently) can speed things up by 5-10%, and link-time optimization can add another 5%, Li said. The biggest improvement at this point actually comes from switching to the lld linker, however, which can improve end-to-end compilation time by 40%.
The majority of the speedups have not been from big-ticket items, however. "It is the accumulation of small things" that most improves performance, Li said. These include introducing laziness to some areas, improved hashing and hash tables, changing the layout of the types used by the compiler, and more. Unfortunately, code optimizations like these require a good understanding of the compiler, and become rarer over time as people address the lower-hanging fruit.
So, to continue improving performance, at some point the community will need to address the overall design of the compiler, Li said. Parallelization is one promising avenue. Cargo, Rust's build tool, already builds different crates in parallel, but it is often the case that a build will get hung up waiting for one large, commonly-used crate to build, so parallelism inside the compiler is required as well. There is already some limited parallelism in the compiler back-end (the part that performs code generation), but the front-end (which handles parsing and type-checking) is still serial. The close connection between different front-end components makes parallelizing it difficult — but the community would still like to try.
Proposed in 2018 and established in 2019, the parallel-rustc working group has been planning how to tackle the problem of parallelizing the front-end of the compiler for some years. The group finally resolved some of the technical difficulties in 2022, and landed an implementation in the nightly version of the compiler in 2023. The main focus of its work has been the Rust compiler's query system, which is now capable of taking advantage of multiple threads. The main technical challenge to implementing that was a performance loss when running in a single-threaded environment. The solution the working group came up with was to create data structures that can switch between single-threaded and thread-safe implementations at run time. This reduced the single-threaded overhead to only 1-3%, which was judged to be worth it in exchange for the improvement to the multithreaded case.
The query system was not the only component that needed to adapt, however. The working group also needed to change the memory allocator — while the default allocator was thread-safe, having too many threads was causing contention. The group solved this by using separate allocation pools for each thread, but Li warned that this wasn't the only source of cross-thread contention. When the compiler caches the result of a query, that result can be accessed from any thread, so there is still a good deal of cross-thread memory traffic there. The working group was able to partially address that by sharding the cache, but there's still room for improvements.
While performance improves when using up to eight threads in testing, Li said, performance actually goes down between eight and sixteen threads; there are too many inefficiencies from synchronization. The project has considered switching to a work-stealing implementation so that threads do not spend as much time waiting for each other, but that is hard to do without introducing deadlocks. Finally, some front-end components are still single-threaded — particularly macro expansion. At some point, those will need to be parallelized as well.
In all, though, Li thought that Rust's compilation performance has "a shiny future". There are a handful of approaches that still need to be explored, such as improving the linker, using thin link-time optimization, and so on, but the community is working hard on implementing these. Rust's incremental compilation could also use some improvement, Li said. C++ does a better job there, because Rust's lexing, parsing, and macro expansion are not yet incremental.
Li finished the talk with a few areas that are not currently being looked into that might be fruitful to investigate: better inter-crate sharing of compilation tasks, cached binary crates, and refactoring the Rust compiler to reduce the amount of necessary global context. These "deserve a lot of deep thought", Li said. Ultimately, while the Rust compiler's performance is worse than that of many other languages, it's something that the project is aware of and actively working on. Performance is improving, and despite the easy changes becoming harder to find, it is likely to continue improving.
realloc() and the oversize importance of zero-size objects
Small objects can lead to large email threads. In this case, the GNU C Library (glibc) community has been having an extensive debate over the handling of zero-byte allocations. Specifically, what should happen when a program calls realloc() specifying a size of zero? This is, it seems, a topic about which some people, at least, have strong feelings.
The job of realloc() is to change the size of an existing allocation; its prototype is:
void *realloc(void *ptr, size_t size);
This call will replace the memory object at ptr with a new object of the given size and return a pointer to its replacement; that new object will contain the same data, up to the minimum of the old and new sizes. The old memory will be freed. Calls to realloc() are useful when an allocated block of memory needs to be resized on the fly.
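As a reminder of how the interface is normally used, here is a small, self-contained example (illustrative only, not code from the discussion) that grows an array with realloc() as elements are appended; the return value is checked before the old pointer is overwritten because, for a nonzero size, a failed realloc() leaves the original block intact.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t capacity = 4;
    size_t count = 10;
    int *values = malloc(capacity * sizeof(*values));

    if (values == NULL)
        return 1;

    /* Append count integers, doubling the buffer whenever it fills up. */
    for (size_t i = 0; i < count; i++) {
        if (i == capacity) {
            size_t new_capacity = capacity * 2;
            int *grown = realloc(values, new_capacity * sizeof(*grown));

            if (grown == NULL) {    /* the old block is still valid here */
                free(values);
                return 1;
            }
            values = grown;
            capacity = new_capacity;
        }
        values[i] = (int)i * 3;
    }

    for (size_t i = 0; i < count; i++)
        printf("%d ", values[i]);
    putchar('\n');
    free(values);
    return 0;
}
```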
An interesting question arises, though, when size is zero. If a program calls malloc() requesting a zero-byte object, the result is (in glibc) well defined: "If size is 0, then malloc() returns a unique pointer value that can later be successfully passed to free()." Needless to say, any attempt to store data using the returned pointer is unlikely to lead to joy, but the caller will get a pointer value back. This appears to be common behavior across malloc() implementations.
The situation is not quite as clear with realloc() in either the relevant standards or existing implementations; if it is called with a size of zero, it could do any of:
- Behave like malloc() and return a pointer to a zero-byte object.
- Behave like free(), release the object, and return NULL.
- Behave like a spoiled brat, refuse to do anything, store an error number in errno, and return NULL.
Alejandro Colomar recently brought this question to the glibc development list. His opinion, expressed in unambiguous terms, was that realloc() should behave in the same way as malloc() when passed a size of zero; it should, in other words, free the passed-in object and return a pointer to a zero-byte object. But that is not what glibc's realloc() does; instead, it takes option 2 above: it frees the object and returns NULL. This behavior is not new; it dates back to 1999, when the behavior was changed to align with the C99 standard (or, at least, one reading thereof).
This change, Colomar asserted, should be reverted. That would make glibc conformant to his interpretation of C99 and in line with what the BSD systems do. He copied the discussion widely and rallied some support to this cause. None other than Douglas McIlroy put in an appearance to describe glibc's behavior as "a diabolical invention". Paul Eggert agreed, saying that "one must always be able to shrink an allocation, even if the shrinkage is [to] zero".
In truth, there were few defenders for glibc's behavior, but there is still resistance to changing it. As Joseph Myers pointed out, there are almost certainly programs that rely on the current behavior:
Given that any change in implementation behavior would introduce memory leaks or double frees depending on what an implementation currently does that applications rely on, it's not reasonable to try now to define it more precisely.
Colomar, though, disagreed strongly. Any memory leaks, he said, would be of zero-byte objects and could thus be ignored; Siddhesh Poyarekar pointed out, though, that a zero-byte object still requires the equivalent of two size_t values for tracking, so real memory would be leaked. Colomar also doubted the double-free case; Myers explained how that could come about. While there is disagreement over how many programs would be affected by this sort of subtle change, it is hard to argue that the number would be zero. That is going to make the glibc developers reluctant to change realloc() in this way.
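To make the hazard concrete, here is a hedged C illustration (hypothetical code, not taken from any of the cited programs) of the two failure modes discussed above; which function is buggy depends entirely on which realloc(ptr, 0) behavior the C library provides.

```c
#include <stdlib.h>

/* Pattern 1: use realloc(p, 0) as a free() and discard the result.  That is
 * fine with glibc, which frees p and returns NULL, but under a malloc()-like
 * implementation the returned zero-size object (and its bookkeeping
 * overhead) is leaked. */
void shrink_to_zero_discard(char *p)
{
    (void)realloc(p, 0);
}

/* Pattern 2: treat a NULL return as an allocation failure and clean up the
 * old block.  That is the usual idiom for nonzero sizes, but with a size of
 * zero under glibc, p has already been freed by the time free() runs below,
 * so this is a double free.  Under a malloc()-like implementation, the
 * non-NULL return means this path is normally never taken. */
char *shrink_to_zero_check(char *p)
{
    char *q = realloc(p, 0);

    if (q == NULL) {
        free(p);    /* double free with glibc semantics */
        return NULL;
    }
    return q;
}

int main(void)
{
    /* Only the first pattern is exercised here; running the second one on
     * glibc would abort with a double-free error. */
    shrink_to_zero_discard(malloc(16));
    return 0;
}
```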
Poyarekar, instead, attempted to find a compromise with this patch adding a new tunable that would allow the behavior of realloc() to be changed while leaving the current behavior as the default. Florian Weimer disliked the patch, saying that any applications that would benefit from the tunable are simply buggy in their current form; he later added that distributions would have to do a lot of work to support this tunable properly. Andreas Schwab argued that a tunable is not appropriate, since a given program's needs will not change at run time; the behavior, if it changes at all, needs to be set at compile time.
Hanging over this whole discussion is another important detail: the C23 standard explicitly defines calling either malloc() or realloc() with a size of zero as undefined behavior. That is a change from C17, which called it implementation-defined. Colomar had mentioned that change in his initial message, but dismissed it, saying "let's ignore that mistake". The glibc project cannot entirely ignore the standards it implements, though. So it is not surprising that, among others, Poyarekar also suggested updating the GNU tools and sanitizers to enforce this standard in the long term.
The conversation wound down without any definitive conclusions. One might surmise, though, that a change to the well-documented behavior of realloc() for the last 25 years would be unlikely to happen, even without the C23 change. Instead, glibc will have a hard time resisting the push from the standards committee to eliminate the existence of zero-size objects entirely. At some point, attempting to allocate such an object may well become an error, though this behavior will surely be controlled by a feature-test macro, as usual, to avoid breaking existing programs. Meanwhile, though, chances are good that we will see further ado over the allocation of nothing.
Kernel optimization with BOLT
A pair of talks in the toolchains track at the 2024 Linux Plumbers Conference covered different tools that can be used to optimize the kernel. First up was Maksim Panchenko to describe the binary optimization and layout tool (BOLT) that Meta uses on its production kernels. It optimizes the kernel binary by rearranging it to improve its code locality for better performance. A subsequent article will cover the second talk, which looked at automatic feedback-directed optimization (AutoFDO) and other related techniques that are used to optimize Google's kernels.
Panchenko began with a slide showing a handful of companies and projects that use BOLT, which can be seen in his slides or the YouTube video of the talk. It was designed at first for large applications at Meta, but it turned out to also accelerate compilers. So, for example, it is used by Python since version 3.12; it is also used by LLVM, Rust, and others.
If you look back to ten years ago in the open-source world, getting the maximum performance from an application was a matter of using GCC or Clang with -O3, profile-guided optimization (PGO), and link-time optimization (LTO). But applying PGO was "kind of painful", he said, so only those who cared a lot about performance were using it. Applying PGO to the kernel was even more painful, so most companies just used -O2 or -O3 on their kernels.
The main Meta application is a virtual machine running PHP code; around 2015, developers there came up with the idea of a binary optimizer that would work with both GCC and Clang to speed up the generated code. BOLT turned out to exceed developers' expectations; it gave large gains and was able to accelerate the compilers themselves. It has been used at Meta since 2016; "most of the cycles are spent in binaries optimized by BOLT" throughout the Meta fleet. BOLT was released as open source, under the Apache 2.0 license, in 2018 and has been part of LLVM since 2022.
![Maksim Panchenko](https://static.lwn.net/images/2024/lpc-panchenko-sm.png)
Generally, developers think about how the data structures for their programs will be arranged in memory; it is less common for them to consider how the code is arranged. There are exceptions, including the Linux kernel developers, but most times the focus is on the data cache. The instruction cache is much smaller than the data cache and has not grown much over time, maybe doubling from 32KB to 64KB on Intel CPUs over the last 20 years. But, for large applications that do not spend most of their time in tight loops, the layout of code in memory matters a lot, so BOLT can make a major difference, he said.
Compilers do not have enough information to optimally place code, even when they have profiling information; it turns out that inlining functions changes the profile. The profile information may point to a function foo() that is called a lot, but when it is inlined, that information is not present. BOLT operates on the binary to observe the code that is being frequently executed so that all of it can be placed close together in memory. Once the BOLT paper was released in 2019, he said, other efforts, including Propeller, have come about; they are generally tied to a single toolchain, though, while BOLT can be used with GCC or Clang and with different linkers.
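A tiny, hypothetical C example may make the inlining problem clearer; the helper below is inlined into its caller, so samples that the profile once attributed to foo() end up inside bar()'s machine code, and a later compile has no reliable way to map them back. That is the gap a binary-level tool like BOLT fills by profiling and reordering the final machine code itself.

```c
#include <stdio.h>

/* Hypothetical hot helper; with optimization (or PGO) the compiler will
 * typically inline it into bar(). */
static inline int foo(int x)
{
    return x * 3 + 1;
}

/* After inlining, no call to foo() remains in the binary: the profiler's
 * samples for foo()'s instructions show up in the middle of bar(), so
 * function-level profile data about foo() loses its meaning. */
int bar(const int *data, int n)
{
    int sum = 0;

    for (int i = 0; i < n; i++)
        sum += foo(data[i]);
    return sum;
}

int main(void)
{
    int data[] = { 1, 2, 3, 4 };

    printf("%d\n", bar(data, 4));
    return 0;
}
```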
He showed a sort of heat map of the memory layout of the HHVM runtime, which is what is used by Meta for most of its workloads. One image showed hot code spread all over, while the post-BOLT image showed all of the hot code confined to the same small region of memory. That drastically reduces instruction-cache misses, translation-lookaside buffer (TLB) misses, and CPU time; for HHVM, BOLT produced a 7% performance increase, Panchenko said.
BOLT is a post-link optimizer that runs on ELF binaries, such as vmlinux. Even though it is part of the LLVM project, BOLT still supports GCC code; it also supports the "most popular" architectures: x86_64, Arm64, and RISC-V.
Applying BOLT to the kernel took some time, mostly in the form of ensuring that the resulting kernel would run and not crash. One of the big problems encountered was in finding a good benchmark to use to measure the impact of BOLT. There are lots of different micro-benchmarks available, but "I couldn't find any scalable, large-scale benchmark". In the end, he used the RocksDB db_bench fillseq benchmark, which showed a 2.5% improvement just by switching to a BOLT-optimized kernel. He and others at Meta ran one of the company's main services on a BOLT-optimized kernel that produced a 2% queries-per-second (QPS) improvement, "which was quite significant".
BOLT only changes branches and the location of the basic blocks in the binary to achieve its improvements. Most of the time, there is no need to recompile applications in order to apply BOLT; the exception is for programs built with split functions, which BOLT can do better than the compiler. The application does need to be relinked in order to produce relocation information. With that, and a profile of the kernel running a representative workload, BOLT will only take around four seconds to optimize vmlinux, he said.
The profile can be generated with a variety of mechanisms, including the last branch record (LBR) feature on Intel platforms and similar branch-sampling features on other architectures. If that is not available, the code can be instrumented to gather the needed information, but that has higher overhead than using LBR and the like. There are other options, but the profile quality, thus the BOLT optimizations, will not be as good.
BOLT needs an unstripped vmlinux binary because it uses the symbol names for code discovery, Panchenko said. BOLT can easily identify the boundaries of code and data in the ELF binary using the text and data segments that are defined. The segments are further divided into sections and BOLT uses the symbol-table information to identify the individual functions therein.
Then BOLT disassembles the functions, though, unlike objdump, it will "symbolize the operands of the instructions". Distinguishing between constants and addresses in the instructions can be a problem, however. The relocation information that is inserted by the linker helps with that; it can be used to "effectively do symbolic disassembly" of the code.
The resulting instruction stream can be optimized using normal techniques, such as peephole optimization, but "this is not very efficient". In order to do more optimizations on the code, an intermediate representation (IR) of some kind is needed. The IR used is "essentially a control-flow graph on top of MC instructions", Panchenko said; he was referring to the LLVM machine code (MC) project, which provides a number of tools and libraries that BOLT uses. The instructions look much like the assembly code, but some may be annotated or modified to identify tail calls versus other kinds of jumps, for example. "So if you look at BOLT disassembly, you will have a much better idea of what's happening in your application compared to regular objdump".
BOLT uses the profile information for the basic blocks in the control-flow graph in order to determine where the hot code resides. To make the best code-layout decisions, though, having weights on the edges, rather than just execution counts for the basic blocks, would be useful. The LBR profiling can provide that information, but BOLT can recover some information about edge weights even without it.
Then the graph is used to optimize the code; "The main optimization that gives us most of the gains is code reordering." The basic blocks can be grouped together to reduce the instruction-cache footprint of the code. That is done by breaking up the functions into fragments, some of which are hot code and others that are rarely or never executed (e.g. error-handling code). Compilers already do reordering, but on the function level; BOLT takes it one step further and reorders these function fragments.
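As a rough picture of what gets split, consider the following hypothetical C function (not from the kernel). Nothing in the source marks the fragments, but a profile will show that the error path is essentially never executed, so a tool like BOLT can keep the hot blocks of many such functions packed together in the binary and move the cold error-handling fragments out of the way; the GCC/Clang __builtin_expect() hint below merely stands in for the knowledge a profile would provide.

```c
#include <stdio.h>
#include <stdlib.h>

/* The bounds check almost never fails, so everything after the label is
 * cold; the return statement is the hot path executed on every call. */
int table_lookup(const int *table, size_t len, size_t index)
{
    if (__builtin_expect(index >= len, 0))
        goto out_of_range;

    return table[index];        /* hot: runs on every call */

out_of_range:                   /* cold: error handling */
    fprintf(stderr, "table_lookup: index %zu out of range (len %zu)\n",
            index, len);
    abort();
}

int main(void)
{
    static const int table[] = { 3, 1, 4, 1, 5, 9 };
    long sum = 0;

    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        sum += table_lookup(table, 6, i);
    printf("sum = %ld\n", sum);
    return 0;
}
```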
Once the new code is generated, there is a question about where to put it, he said. Unless there is a big concern about disk space, it is more efficient to simply create a new text segment that contains the hot code, which will generally be quite a bit smaller (for a binary that is huge, "hundreds of megabytes", 20MB or less of hot code would end up in the new segment). So it is not much overhead, in terms of disk space, and "you get a much faster application".
Adding another text segment to the kernel binary may not be viable, he said, so he turned to an alternative that is "much more feasible". BOLT will simply rewrite the existing functions in the binary; those functions are already ordered by the compiler based on its analysis, so BOLT effectively uses that. BOLT could do a bit better function ordering based on the profile, but that small gain can be sacrificed in order to easily get the basic-block reordering, he said. It also helps avoid over-specializing the code for only the workload measured by the profile; for other workloads that were not measured, some of the cold code will be executed, but it will still be located nearby due to the compiler choices.
The kernel provides other challenges, because its code is modified at boot time and also while it is running. He spent a good chunk of the talk going into the details of how that type of code is handled. Things like static calls, SMP locks (that are patched out on uniprocessor systems), static keys, and alternative instructions for different subarchitectures are handled with annotations in the disassembled code, which is part of the metadata that BOLT uses to do its job. For some, BOLT simply does not operate on them, while others have mechanisms that the optimizer can use on them. All of that metadata can be dumped using BOLT tools.
Panchenko said that he was skipping over some topics, such as continuous profiling, which can be applied to BOLT-optimized binaries as they run; a new version of a binary can be produced that will reflect changes in the code and the workload. He also did not cover any of the other optimizations that BOLT applies. He finished by showing some of the output that running BOLT produces, noting that a four-second demo of it operating would not be all that interesting.
[ I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Vienna for the Linux Plumbers Conference. ]
AutoFDO and Propeller
Rong Xu and Han Shen described the kernel-optimization techniques that Google uses in the toolchains track at the 2024 Linux Plumbers Conference. They talked about automatic feedback-directed optimization (AutoFDO), which can be used with the Propeller optimizer to produce kernels with better performance using profile information gathered from real workloads. There is a fair amount of overlap between these tools and the BOLT post-link optimizer, which was the subject of a talk that directly preceded this session.
AutoFDO
Xu started by saying that he would be covering AutoFDO, then Shen would talk about further performance enhancements using Propeller on top of AutoFDO. At Google, where Xu and Shen work, the amount of time spent executing in the kernel on its fleet was measured. Xu put up a slide showing the CPU-cycle percentage of the kernel, with idle time excluded, for a variety of workloads, which can be seen in the slides or YouTube video of the talk. Anywhere from 17-60% of the CPU cycles were spent in the kernel; the fleet-wide average is about 20%.
There have been lots of efforts toward optimizing the Linux kernel, he said; at Google, one of the main techniques is feedback-directed optimization (FDO), which uses run-time information to improve code generation. The core idea is to gather profile information on real workloads to allow the compiler to focus on the optimizations that have the most impact based on how the program runs.
FDO has proven to be effective and is used on nearly all of the applications at Google, for Android, ChromeOS, and in the data center, he said. It improves instruction-cache usage, reduces TLB misses, and improves branch performance. The overall improvement can be up to 20% for the run time, but there is another benefit; FDO usually also reduces code size.
![Rong Xu](https://static.lwn.net/images/2024/lpc-xu-sm.png)
There are two major types of FDO: instrumentation-based and sample-based. The instrumentation-based variant (also known as iFDO) is the original one; it uses an instrumented kernel to run tests that represent the real workload, which result in the profile to be used. The profile is then fed to the compiler to build an optimized kernel that can be deployed to production. The instrumented kernel provides accurate profiles, without depending on performance monitoring unit (PMU) hardware, but the resulting kernel is up to ten times slower, so it cannot be used in production. The load tests still need to be representative, however. Beyond that, another disadvantage is that iFDO requires kernel changes for the instrumentation.
Sample-based FDO (or AutoFDO) starts with a kernel built with debug symbols, then the perf tool is used to measure a representative workload. The raw perf data is converted to an AutoFDO profile that is fed to the compiler to build the optimized kernel. Because the perf-based data collection has such low overhead, the test kernel can be used in production with real workloads, which eliminates the concern about having truly representative test cases. But AutoFDO does not optimize quite as well as iFDO because the profile data is not as complete; "so if you use the same load test, you will get close performance, but not as good". In addition, AutoFDO requires support for both a hardware PMU and perf.
For fast-moving projects with incremental releases, the suggested model is to use an iterative process where the profile information gathered running the kernel in production for a particular release is fed to the compiler building the test kernel for the next release. That release is run in production and the profile gathered is fed in to build the following release—and so on. There are code mismatches, but AutoFDO uses line numbers relative to the start of functions; if there is no major refactoring in the release, "we find this works pretty well". The developers think this mode will work well for minor releases of the kernel, he said.
AutoFDO requires last branch record (LBR) or similar support in the processor; AMD branch sampling (BRS) and Arm embedded trace macrocell (ETM) can provide the same kind of information. With that information, AutoFDO can create a profile that allows multiple types of optimizations to be made. There are three that usually have the most impact: function inlining, increasing branch fall-through, and indirect-call promotion.
Inlining helps reduce both the size and run time of the code. Falling through branches is faster than taking a branch even if it is correctly predicted; in addition, that optimization groups hot basic blocks together for better instruction-cache utilization. Lastly, indirect calls can be promoted to direct calls, which allows them to be inlined. Other optimization passes can query the profile data to retrieve counts for basic blocks, branches taken, and so on, so those optimizations will be improved as well.
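Here is a hedged sketch, in C, of what indirect-call promotion looks like conceptually; the compiler performs this transformation itself using the profile, but writing it out by hand (with made-up function names) shows why the promoted form can then be inlined.

```c
#include <stdio.h>

int square(int x) { return x * x; }
int negate(int x) { return -x; }

/* Before: the call target is only known at run time, so the compiler cannot
 * inline it and the CPU must predict an indirect branch. */
int apply(int (*op)(int), int x)
{
    return op(x);
}

/* After (conceptually): the profile shows that op is almost always square,
 * so the hot target is tested for and called directly -- a call that can be
 * inlined -- while the indirect call remains as a fallback for cold targets. */
int apply_promoted(int (*op)(int), int x)
{
    if (op == square)
        return square(x);
    return op(x);
}

int main(void)
{
    printf("%d %d\n", apply(square, 7), apply_promoted(negate, 7));
    return 0;
}
```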
Xu gave a brief report of some performance numbers. In nearly all of the tests (a mixture of micro-benchmarks and workload tests), iFDO improved performance slightly more than AutoFDO, and most were fairly modest gains (2-10%) overall. In a workload test of a Google database application, AutoFDO performance improved by 2.6%, while iFDO saw 2.9%. He worked with some Meta engineers who reported a roughly 5% performance increase in one of their services using AutoFDO and around 6% with iFDO.
Overall, the path for using AutoFDO on the kernel has been fairly smooth. "The testing turns out to be the most challenging part of the work"; finding representative workloads and benchmarks is difficult. Along the way, the developers have learned that profiles from Intel machines work well on AMD machines. For even better performance, he recommended using link-time optimization (LTO) and ThinLTO.
Propeller
Shen then stepped up to describe using Propeller to build kernels. He started by describing what a post-link optimizer is, but noted that the BOLT talk just prior had covered it well. A post-link optimizer only requires the binary and a profile of it running on a representative workload, it does not usually need the source code. It uses a disassembler to convert the machine code into a compiler intermediate representation (IR), then does analysis and optimization, and finally converts back to machine code for the final output.
Conceptually, Propeller is a post-link optimizer, but it takes a different approach; it uses the compiler and linker to do the transformations. As with other post-link optimizers, it takes in a binary and a profile to identify a series of optimizations, but it does not directly apply them. Instead, it writes compiler and linker profiles, then redoes the build using the same source code. The compiler and linker do their usual jobs, but use the profile information to apply the optimizations as part of the process. The result is the same, an optimized binary, but the internal steps are different.
![Han Shen](https://static.lwn.net/images/2024/lpc-shen-sm.png)
There are a few reasons why this approach was used. While working on post-link optimization, the developers found that the disassembler does not work on all binaries. For example, it does not work on binaries with read-only data intertwined with code, or on binaries that must comply with standards such as FIPS. Beyond that, there are scalability problems: "the single-machine multithreaded design of the post-link optimizer leads to excessive processing time and memory usage". That led them to shift some of that burden to the compiler and linker, he said.
There are a few advantages to the Propeller mechanism. First, the optimization transformations are "done with the uttermost accuracy" because they are handled by the compiler and linker. It also "opens the door for parallelizing because each compilation job is independent". A distributed build cache can be used; if no optimization instructions are found for a particular source file, it does not need to be rebuilt and the cached object file can be used. It also means that Propeller itself does not need to deal with the kernel metadata that gets built; that is all handled by the existing build process. Propeller does require the source code, however, and the rebuild step, which are disadvantages of the approach.
There are currently two major optimizations that are made by Propeller: basic-block layout and path cloning. There is also active work on adding inter-procedure register allocation, Shen said.
Propeller can be combined with AutoFDO by building an AutoFDO profile that is used to build a kernel that is then profiled for Propeller. The profile used to build the AutoFDO kernel is also used to rebuild the kernel again with the Propeller profiles. Each kernel is run in production to generate the profiles needed. The final result is a kernel that has been optimized by both AutoFDO and Propeller.
Unlike other post-link optimizers, Propeller can optimize kernel modules. Post-link optimizers operate on executables and shared libraries, but kernel modules are relocatable objects. Propeller can resolve the relocation information in the module to link to the proper symbols.
As mentioned earlier, AutoFDO can tolerate minor source-code changes and small differences in the build options, but that is not true for Propeller. The source code and build options of the kernel used for generating the profile must be identical to those that Propeller uses. The developers "are working hard to try to mitigate this limitation", he said.
Propeller requires specific software and hardware support to do its job. It depends on a tool that lives outside of the LLVM source tree. The developers are currently working on migrating the functionality into LLVM. It also needs branch information from the CPU, which comes from features that vary across architectures (LBR for Intel, BRS or LBRv2 for AMD, and SPE or ETM for Arm). For internal applications, Propeller has been verified on each architecture (though not each branch-tracking feature); for the kernel, only Intel LBR has been validated at this point.
Shen showed some heat maps and graphs of PMU stats for no FDO, iFDO, AutoFDO, and the latter two with ThinLTO and Propeller. As might be guessed, there were improvements in nearly all of the categories for Propeller.
Currently, patches enabling AutoFDO and Propeller for the kernel have been sent upstream for review, Shen said. Internally, the developers are doing "large-scale production tests to measure the performance company-wide"; they are also looking at customer kernels based on specific workloads, with the idea of building different kernels for individual applications.
In summary, he said that FDO, using either iFDO or AutoFDO, "improves kernel performance significantly". AutoFDO integrates well with the kernel build and allows profiling in production, which is a huge benefit over iFDO. Their advice is to add Propeller on top to get the best possible performance.
During Q&A, Peter Zijlstra asked why there was a need for multiple profile runs; the AutoFDO profile should already have the needed information, he said. Shen said that the first-pass optimizations change the code such that the profile information is out of date, which is where the Propeller profiling picks up. Zijlstra noted that the code is transformed again at that point, so another profile would be needed; Shen agreed that it is an iterative process and that more profiling and rebuilding passes improve the results.
Zijlstra also complained that calling Propeller a post-link optimizer was not accurate. The optimization happens during the build process. Another audience member agreed, saying that the profile information was gathered post-link, but that it was simply another form of profile-guided optimization; Zijlstra suggested that Propeller should simply be called that to avoid the confusion.
There is, it seems, a lot of work going into producing optimized kernels using LLVM, though BOLT can also operate on kernels built with GCC. That this work is being done by companies with large fleets of cloud systems is no real surprise; a tiny increase in performance multiplied by the fleet size makes for a rather large impact on CPU usage, thus power requirements, number of systems needed, and so on—all of which boils down to spending less money, of course.
[ I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Vienna for the Linux Plumbers Conference. ]
A new approach to validating test suites
The first program that Martin Pool ever wrote, he said, had bugs; the ones he's writing now most likely have bugs too. The talk Pool gave at RustConf this year was about a way to try to write programs with fewer bugs. He has developed a tool called cargo-mutants that highlights gaps in test coverage by identifying functions that can be broken without causing any tests to fail. This can be a valuable complement to other testing techniques, he explained.
There are lots of ways to write programs with fewer bugs, Pool said. Many people at RustConf probably jump immediately to thinking of type systems or lifetime annotations. Those tools are great, and can eliminate entire classes of bugs, but they can't help with semantic bugs, where the programmer tells the computer to do the wrong thing, he said. The best way to catch those bugs is to write tests.
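As a contrived illustration (not an example from the talk), consider a function where the types are all correct but the logic is not; only a test can pin down the intended meaning:

```rust
// Compiles cleanly and keeps the borrow checker happy, but is still wrong:
// nothing in the types says that an area is a product rather than a sum.
fn area(width: u32, height: u32) -> u32 {
    width + height // semantic bug; should be width * height
}

#[cfg(test)]
mod tests {
    use super::*;

    // This test encodes the intended meaning and fails against the buggy code.
    #[test]
    fn area_of_a_3_by_4_rectangle() {
        assert_eq!(area(3, 4), 12);
    }
}
```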
![Martin Pool](https://static.lwn.net/images/2024/martin-pool-small.png)
Tests don't catch everything either — "I do write tests, and I still have bugs" — but they help. Sometimes, however, "you look at a bug and ask, how was this not caught?". The problem is that tests themselves can be buggy.
Sometimes they just don't cover some area of the code; sometimes they do
call the code, but don't verify the output strictly enough. Sometimes the tests
were comprehensive when they were written, but now the program has grown and the
tests no longer cover everything. A significant fraction of bugs could, in
principle, be caught by tests, Pool said. It would be nice if there were a tool
that could point out where tests are insufficient, so that we can avoid
introducing new bugs into existing software.
In order to explain how cargo-mutants does that, Pool first briefly covered the
process of writing tests: the programmer starts out knowing what the program
should do, and then tries to write tests that tie what it should do to what it
actually does. That tie should be robust — the tests should check all of the
properties the programmer cares about, and they should fail if those properties
don't hold. Good tests should also generalize, since exhaustively testing every
possible behavior is impossible. So one way to think about finding weaknesses in a set of tests is to ask "what behaviors does the program have that are not checked by any tests?"
That is exactly what cargo-mutants does — it uncovers code that is buggy, unnecessary, underspecified, or just hard to test, by finding gaps in the test suite. The core algorithm is to make a copy of a project's source code, deliberately insert a bug, run the test suite, and expect the tests to fail. If they don't, report it as a "missed mutant" — that is, a buggy version of the program that wouldn't be caught by the tests, which needs some human attention to figure out the best way to address it.
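That loop is simple enough to sketch. The following is a minimal, hypothetical rendering of the algorithm as described, not cargo-mutants' actual implementation; the scratch-directory name, the Mutant type, and the single hard-coded mutation are all stand-ins:

```rust
use std::fs;
use std::process::Command;

// Hypothetical description of one synthetic bug: which file to rewrite in the
// copied tree and what the mutated source should contain.
struct Mutant {
    file: &'static str,
    mutated_source: &'static str,
}

fn main() -> std::io::Result<()> {
    // Assumed: the project has already been copied into ./mutants-scratch and
    // a list of mutants has been generated (both steps elided here). Restoring
    // the original file between mutants is also left out for brevity.
    let scratch = "mutants-scratch";
    let mutants = [Mutant {
        file: "src/lib.rs",
        mutated_source: "pub fn is_symlink(_p: &std::path::Path) -> bool { true }",
    }];

    for mutant in &mutants {
        // Deliberately insert the bug into the copied tree...
        fs::write(format!("{scratch}/{}", mutant.file), mutant.mutated_source)?;

        // ...then run the test suite there and expect it to fail.
        let status = Command::new("cargo")
            .arg("test")
            .current_dir(scratch)
            .status()?;
        if status.success() {
            // The tests passed with a bug in place: a missed mutant.
            println!("MISSED: {} can be broken without failing the tests", mutant.file);
        }
    }
    Ok(())
}
```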
Pool gave a specific example of a bug that was found by cargo-mutants. In one of his programs, which expected to receive symbolic links and needed to handle them, he wrote an is_symlink() function. Cargo-mutants showed that if the body of the function were replaced with true, the tests did not catch it — or, in other words, the tests didn't cover the possibility of something not being a symbolic link. This is actually pretty common, Pool said, since humans often forget to write negative test cases. In any case, he added a test and immediately found that the original implementation was wrong as well.
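A hedged reconstruction of that situation might look like the following; the function name comes from the talk, but the body and the test are illustrative guesses rather than Pool's actual code:

```rust
use std::fs;
use std::path::Path;

// Illustrative version of the function in question: report whether a path is
// a symbolic link, without following it.
fn is_symlink(path: &Path) -> bool {
    fs::symlink_metadata(path)
        .map(|meta| meta.file_type().is_symlink())
        .unwrap_or(false)
}

// cargo-mutants tries replacing the whole body with plausible values such as
// `true` and `false`. If every existing test only ever feeds in symlinks, the
// `true` mutant survives; the missing negative case is something like this:
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn regular_file_is_not_reported_as_symlink() {
        assert!(!is_symlink(Path::new("Cargo.toml")));
    }
}
```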
For a somewhat larger example, he selected the semver crate, which provides functions for working with semantic version numbers. He isn't involved in the semver project, so it represents an unbiased test. He ran cargo-mutants on it, and found a handful of places where the test suite could be improved. This is typical of what is found among high-quality crates, Pool said. In semver's case, if the hashing code was broken, the tests would still pass. It's a good crate — not buggy — but there are still some gaps in the coverage that could make it easier to unknowingly introduce bugs.
That's not to say that every single missed mutant is necessarily a problem, Pool
explained. People have limited time; they should focus on the most important
things, instead of spending all of their time polishing their test suite. The
findings from cargo-mutants are just things to think about. "The computer is not the boss of you." What Pool recommends is skimming the list of missed mutants, looking for code that is important or that should be thoroughly tested. Then, one should consider whether the introduced mutation is really incorrect, whether the original code was actually correct, and what test should have caught it.
Ultimately, it's important not to stress too much about the mutants the tool finds, Pool said. Cargo-mutants generates synthetic bugs — but the goal is always to catch and prevent real bugs. Pool compared cargo-mutants to vaccines: deliberately introducing a weakened version of a problem in a controlled way, so that you can beef up the system that is supposed to catch the real problem.
With the audience hopefully convinced to at least give cargo-mutants a try, Pool switched to talking about how the tool works under the hood. The most interesting part, he said, is actually generating synthetic bugs. The generated mutants have a few requirements: they should be valid Rust code, they should point to real gaps, and they should not be too repetitive. They should also, ideally, be generated deterministically, because that makes them easier to reproduce. There's lots of existing research on how to do this, he said, but it turns out that small, local changes to the code are both easy to generate and close to what could really change in the code base.
So cargo-mutants uses cargo-metadata and syn to find and parse the Rust code in a project. For each Rust module, it pattern-matches on the syntax tree and applies two different kinds of changes: replacing a whole function with a stub, and changing arithmetic operations. In the future, the tool could add more, but just those two turn out to be "really powerful". When replacing a function, it generates likely return values in a type-driven way; even quite complicated nested types can be generated by first generating their constituent parts, so functions returning complex types can still be stubbed out.
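As a rough illustration of those two kinds of changes (the specific replacement values shown here are typical of this style of mutation, but this is a sketch, not an authoritative list of what the tool generates):

```rust
// Original function in the project under test.
fn total_price(unit: u32, quantity: u32) -> u32 {
    unit * quantity
}

// Kind 1: replace the whole body with a likely value for the return type,
// chosen in a type-driven way; for a u32, plausible stubs include:
//
//     fn total_price(unit: u32, quantity: u32) -> u32 { 0 }
//     fn total_price(unit: u32, quantity: u32) -> u32 { 1 }
//
// Kind 2: change an arithmetic operation, for example:
//
//     fn total_price(unit: u32, quantity: u32) -> u32 { unit + quantity }

#[cfg(test)]
mod tests {
    use super::*;

    // A test that only checked total_price(2, 2) == 4 would miss the `+`
    // mutant; this assertion distinguishes all of the variants above.
    #[test]
    fn multiplies_rather_than_adds() {
        assert_eq!(total_price(3, 4), 12);
    }
}
```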
Pool has been working on cargo-mutants for a few years. He runs it on his own crates, and also on other open-source crates. With that experience, he can say that getting to complete coverage of every mutant can be a bit of work, because the tool points out gaps that can be hard to address. It can be tempting to suppress a mutant, but sometimes they matter. Mostly, the tool finds gaps in the tests, but occasionally it finds a real bug as well, he said. The most difficult cases are when there is a test for something, but the test is flaky. When introducing cargo-mutants to a new project, he thinks it is easiest to turn it on for one file, module, or crate at a time, and gradually expand through the code base.
Cargo-mutants has 100% coverage of itself, naturally. Once you get to that level, Pool said, it's pretty easy to stay there, since the list of new mutants created by any given change is small. Getting to that point has revealed lots of interesting edge-cases, though.
One other difficulty has been parallelization. Cargo-mutants runs hundreds or thousands of incremental builds and tests, which eats a lot of CPU time. Since each mutant's build-and-test run is independent of the others, though, the tool can run them in parallel — even spread across multiple machines. To give an idea of the performance, Pool used numbers from cargo-mutants itself. The crate is 11,000 lines of code, including tests. Building it takes 12 seconds, while running the tests takes 55 seconds. Running cargo-mutants on itself generates 575 mutants (all of which are covered by the test suite, and so are not missed mutants), but because of incremental builds and parallelization, running the tests on all of them only takes about 15 minutes.
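That independence is what makes the parallelism straightforward. A minimal sketch of the idea (again, not the tool's actual implementation) might give each worker its own copy of the tree, assumed here to already exist as mutants-job-0, mutants-job-1, and so on:

```rust
use std::process::Command;
use std::thread;

fn main() {
    let jobs = 4;
    thread::scope(|scope| {
        for job in 0..jobs {
            scope.spawn(move || {
                // Each worker owns a separate copy of the (mutated) source
                // tree, so incremental builds and test runs never interfere.
                let dir = format!("mutants-job-{job}");
                let status = Command::new("cargo")
                    .arg("test")
                    .current_dir(&dir)
                    .status()
                    .expect("failed to run cargo");
                let outcome = if status.success() {
                    "passed (missed mutant)"
                } else {
                    "failed (mutant caught)"
                };
                println!("job {job}: tests {outcome}");
            });
        }
    });
}
```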
Pool finished by comparing cargo-mutants to other testing practices such as code
coverage and fuzzing. Code coverage checks that code was executed, he explained.
Cargo-mutants checks that the results of the test suite depend on running the
code, which is a stronger guarantee. Cargo-mutants also tells you exactly what
the uncaught bug is, which can be easier to handle. The downside is that
cargo-mutants is significantly slower than existing code-coverage tools. A
developer could easily run both, Pool said.
Another common point of comparison is fuzzing or property testing. Fuzzing is a good way to find problems by "wiggling the inputs"; cargo-mutants finds problems by "wiggling the code".
There are a lot of tools that purport to help programmers write more correct software. The choice of which ones to use is often a tradeoff between correctness and time. Still, nearly every project has at least some tests — so an easy-to-use tool like cargo-mutants that helps improve a project's test suite without too much additional effort seems like it could be a valuable option.
Page editor: Jonathan Corbet