LWN.net Weekly Edition for October 31, 2024
Welcome to the LWN.net Weekly Edition for October 31, 2024
This edition contains the following feature content:
- OSI readies controversial Open AI definition: the definition does not require open training data, leaving some unhappy with the result.
- An update on Apple M1/M2 GPU drivers: OpenGL and Vulkan conformance have been achieved; lots more games are playable too.
- The performance of the Rust compiler: reducing compile time for Rust programs is an explicit goal of the project; progress has been made, with more to come.
- realloc() and the oversize importance of zero-size objects: changing the longstanding behavior of realloc() in the GNU C library is somewhere between difficult and impossible.
- Kernel optimization with BOLT: a technique to reorder the kernel binary using profiling information in order to reduce instruction-cache misses and improve performance.
- AutoFDO and Propeller: profile-guided mechanisms that use the compiler and linker to produce faster kernel binaries.
- A new approach to validating test suites: a way to apply mutation to Rust programs in order to find gaps in their test coverage.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
OSI readies controversial Open AI definition
The Open Source Initiative (OSI) has been working on defining Open Source AI—that is, what constitutes an AI system that can be used, studied, modified, and shared for any purpose—for almost two years. Its board will be voting on the Open Source AI Definition (OSAID) on Sunday, October 27, with the 1.0 version slated to be published on October 28. It is never possible to please everyone in such an endeavor, and it would be folly to make that a goal. However, a number of prominent figures in the open-source community have voiced concerns that OSI is setting the bar too low with the OSAID—and that doing so will undo decades of community work to cajole vendors into adhering to, or at least respecting, the original Open Source Definition (OSD).
Defining Open Source AI
OSI executive director Stefano Maffulli announced the organization's intent to provide a definition for open-source AI in June 2023. He took exception to announcements of "large language models, foundational models, tooling, services all claiming to be 'open' or 'Open Source'" while adding restrictions that run afoul of the OSD. A survey of large-language-model (LLM) systems in 2023 found that ostensibly open-source LLMs did not live up to the name.
The problem is not quite as simple as saying "use an OSD-compliant license" for LLMs, because there are many more components to consider. The original OSD is understood to apply to the source code of a program in "the preferred form in which a programmer would modify the program". A program is not considered open source if a developer cannot study, use, modify, and share it, and a license is not OSD-compliant if it does not preserve those freedoms. A program can depend on non-free data and still be open source, though. For example, the game Quake III Arena (Q3A) is available under the GPLv2; that distribution does not include the pak files that contain the maps, textures, and other content required to actually play the commercial game. Despite that, others can still use the Q3A code to create their own games, such as Tremulous.
When discussing an "AI system", however, things are much more complicated. There is more than just the code that is used to run the models to do work of some kind, and the data is not something that can be wholly separate from the system in the way that it can be with a game. When looking at, say, LLMs, there is the model architecture, the code used to train models, model parameters, the techniques and methodologies used for training, the procedures for labeling training data, the supporting libraries, and (of course) the data used to train the models.
OSI has been working on its definition since last year. It held a kickoff meeting on June 21, 2023 at the Mozilla headquarters in San Francisco. It invited participation afterward via a regular series of in-person and online sessions, and with a forum for online discussions. LWN covered one of the sessions, held at FOSDEM 2024, in February.
The current draft of the OSAID takes its definition of an AI system from the Organisation for Economic Co-operation and Development (OECD) Recommendation of the Council on Artificial Intelligence:
A machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.
This includes source code for training and running the system, model parameters "such as weights or other configuration settings", as well as "sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system".
Preferred form to make modifications
Those elements must all be available under OSI-approved licenses, according to the proposed definition, which seems perfectly in line with what we've come to expect when something is called "open source". There is an exception, though, for things like the data information and model parameters, which must be available under "OSI-approved terms". The definition of OSI-approved terms is not supplied yet.
There is no requirement to make the training data available. To be compliant with the current draft of the OSAID, an AI system need only provide "detailed information" about the data, but not the data itself.
The OSI published version 0.0.9 on August 22. It acknowledged then that "training data is one of the most hotly debated parts of the definition". However, the OSI was choosing not to require training data:
After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go.
Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.
As it stands, some feel that the OSAID falls short of allowing the four freedoms that it is supposed to ensure. For example, julia ferraioli wrote that, without including data, the only things that the OSAID guarantees are the ability to use and distribute an AI system. "They would be able to build on top of it, through methods such as transfer learning and fine-tuning, but that's it."
Tom Callaway has written at length on LinkedIn about why open data should be a requirement. He acknowledges that there are good reasons that distributors of an AI system may not want, or be able, to distribute training data. For example, the data itself may have a high monetary value on its own, and a vendor may be unwilling or unable to share it. Acme Corp might license a data set and have permission to create an AI system using it, but not the ability to distribute the data itself. The data might have legal issues, ranging from confidentiality (e.g., medical data sets) to a desire to avoid lawsuits from using copyrighted data.
All of those are understandable reasons for not distributing data with an AI system, he said, but they don't argue for crafting a definition that allows companies to call their system open:
If we let the Open Source AI definition contain a loophole that makes data optional, we devalue the meaning of "open source" in all other contexts. While there are lots of companies who would like to see open source mean less, I think it's critical that we not compromise here, even if it means there are less Open Source AI systems at first.
Objections to lack of training data are more than an attachment to the original meaning of open source. Giacomo Tesio posted a list of issues he considered unaddressed in the RC2 version of the OSAID, including a claim that there is inherent insecurity due to the ability to plant undetectable backdoors in machine-learning models.
Others weigh in
The Free Software Foundation (FSF) announced that it was working on "a statement of criteria for free machine learning applications", which would spell out what it takes to call something a free (or libre) machine-learning application. The FSF says that it is close to a definition, and is working on the exact text. However, it adds that "we believe that we cannot say a ML application 'is free' unless all its training data and the related scripts for processing it respect all users, following the four freedoms".
However, the FSF makes a distinction between non-free and unethical in this case:
It may be that some nonfree ML have valid moral reasons for not releasing training data, such as personal medical data. In that case, we would describe the application as a whole as nonfree. But using it could be ethically excusable if it helps you do a specialized job that is vital for society, such as diagnosing disease or injury.
The Software Freedom Conservancy has announced an "aspirational statement" about LLM-backed generative AI for programming called "Machine-Learning-Assisted Programming that Respects User Freedom". Unlike the OSAID, this target focuses solely on computer-assisted programming, and was developed in response to GitHub Copilot. The announcement did not directly name the OSI or the OSAID effort, but said "we have avoided any process that effectively auto-endorses the problematic practices of companies whose proprietary products are already widely deployed". It describes an ideal LLM system built only with FOSS, with all components available, and only for the creation of FOSS.
Response to criticisms
I emailed Maffulli about some of the criticisms of the current OSAID draft, and asked why OSI appears to be "lowering the bar" when the OSI has never budged on source availability and use restrictions. He replied:
I'll be blunt: you mention "source redistribution" in your question and that's what leads people like [Callaway] into a mental trap [...]
There are some groups believing that more components are required to guarantee more transparency. Other groups instead believe that model parameters and architecture are enough to modify AI. The Open Source AI Definition, developed publicly with a wide variety of stakeholders worldwide, with deep expertise on building AI (see the list of endorsers), found that while those approaches are legitimate, neither is optimal. The OSAID grants users the rights (with licenses) and the tools (with the list of required components) to meaningfully collaborate and innovate on (and fork, if required) AI systems. We have not compromised on our principles: we learned many new things from actual AI experts along the way.
Maffulli objected to the idea that the OSAID was weaker or making concessions, and said that the preferred form for modifying ML systems was what is in the OSAID: "it's not me nor OSI board saying that, it's in the list of endorsers and in [Carnegie Mellon University's] comment". He added that OSI had synthesized input from "AI builders, users, and deployers, content creators, unions, ethicists, lawyers, software developers from all over the world" to arrive at the definition. A "simple translation" of the OSD, he said, would not work.
Stephen O'Grady, founder of the RedMonk analyst firm, also makes the case that the OSD does not easily translate to AI projects. But he does not believe that the term open source "can or should be extended into the AI world", as he wrote in a blog post on October 22:
At its heart, the current deliberation around an open source definition for AI is an attempt to drag a term defined over two decades ago to describe a narrowly defined asset into the present to instead cover a brand new, far more complicated future set of artifacts.
O'Grady makes the case that the OSI has set out on a pragmatic path to define open-source AI, which requires nuance. Open source has succeeded, in part, because the OSD removes nuance. Does a license comply with the OSD or doesn't it? It's pretty easy to determine. Less so with the OSAID. The pragmatic path, he said:
Involves substantial compromise and, more problematically, requires explanation to be understood. And as the old political adage advises: "If you're explaining, you're losing."
It would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate. That seems unlikely now, he said, after two years of trying to "thread the needle between idealism and capitalism to arrive at an ideologically sound and yet commercially acceptable" definition.
Indeed, it seems likely that the OSI board will move forward with the current draft of the OSAID or something close to it. The impact that will have is much less certain.
An update on Apple M1/M2 GPU drivers
The kernel graphics driver for the Apple M1 and M2 GPUs is, rather famously, written in Rust, and it has achieved conformance with various graphics standards, which is also noteworthy. At the X.Org Developers Conference (XDC) 2024, Alyssa Rosenzweig gave an update on the status of the driver, along with some news about the kinds of games it can support (YouTube video, slides). There has been lots of progress since her talk at XDC last year (YouTube video), with, of course, still more to come.
It is something of an XDC tradition, since she began it in Montreal in 2019 (YouTube video), for Rosenzweig to give her presentations dressed like a witch. This year's edition was no exception, though this time she started her talk in French, which resulted in some nervous chuckles from attendees. After a few sentences, she switched to English, "I'm just kidding", and continued with her talk.
Updates and tessellation
Last year at XDC, she and Asahi Lina reported that the driver had reached OpenGL ES 3.1 conformance. They also talked about geometry shaders, because "that was the next step". Since then, the driver has become OpenGL 4.6 conformant. That meant she was going to turn to talking about tessellation shaders, "as I threatened to do at the end of last year's talk".
Tessellation, which is a technique that "allows detail to be dynamically added and subtracted" from a scene, is required for OpenGL 4.0, and there is a hardware tessellator on the Apple GPU—but, "we can't use it". The hardware is too limited to implement any of the standards; "it is missing features that are hard required for OpenGL, Vulkan, and Direct3D". That makes it "pretty much useless to anybody who is not implementing Metal". Apple supports OpenGL 4.1, though it is not conformant, but if you use any of the features that the hardware does not support, it simply falls back to software; "we are not going to do that".
![Alyssa Rosenzweig](https://static.lwn.net/images/2024/xdc-rosenzweig-sm.png)
As far as Rosenzweig is aware, the hardware lacks point mode, where points are used instead of the usual triangles; it also lacks isoline support, but those two things can be emulated. The real problem comes with transform feedback and geometry shaders, neither of which is supported by the hardware, but the driver emulates them with compute shaders. However, the hardware tessellator cannot be used at all when those are being emulated because minute differences in the tessellation algorithms used by the hardware and the emulation would result in invariance failures. She is not sure whether that is a problem in practice or not, "but the spec says not to do it", so she is hoping not to have to go that route.
Instead, "we use software
". In particular, Microsoft released a
reference tessellator a decade or more ago, which was meant to show
hardware vendors what they were supposed to implement when tessellation was
first introduced. It is "a giant pile of 2000 lines of C++
" that
she does not understand, despite trying multiple times; "it is
inscrutable
". The code will tessellate a single patch, which gave the
driver developers an idea: "if we can run that code, we can get the
tessellation outputs and then we can just draw the triangles or the lines
with this index buffer
".
There are some problems with that approach, however, starting with the fact that the developers are writing a GPU driver; "famously, GPUs do not like running 2000 lines of C++". But, she announced, "we have conformant OpenCL 3.0 support" thanks to Karol Herbst, though it has not yet been released. OpenCL C is "pretty much the same as regular CPU C", though it has a few limitations and some extensions for GPUs. So the idea would be to turn the C++ tessellation code into OpenCL C code; "we don't have to understand any of it, we just need to not break anything when we do the port".
That works, but "tessellator.cl is the most unhinged file of my career"; doing things that way was also the most unhinged thing she has done in her career, "and I'm standing up here in a witch hat for the fifth year in a row". The character debuted in the exact same room in 2019 when she was 17 years old, she recalled.
The CPU tessellator only operates on a single patch at a time, but a scene might have 10,000 patches—doing them all serially will be a real problem. GPUs are massively parallel, though, so having multiple threads each doing tessellation is "pretty easy to arrange". There is a problem with memory allocation; the CPU tessellator just allocates for each operation sequentially, but that will not work for parallel processing. Instead, the driver uses the GPU atomic instructions to manage the allocation of output buffers.
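As a rough illustration of that allocation scheme, here is a minimal C sketch (using C11 atomics in place of the OpenCL atomic built-ins, with invented names and sizes); it is not code from the Asahi driver or from tessellator.cl, just the bump-allocation pattern described above: each worker reserves a slice of a shared index buffer by atomically advancing a counter, so patches can be processed in parallel without per-patch locks.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical shared output buffer for all tessellated patches. */
struct output_heap {
    _Atomic uint32_t next;    /* next free slot in indices[] */
    uint32_t capacity;        /* total number of slots */
    uint32_t *indices;        /* the shared index buffer */
};

/* Reserve count slots; returns the starting offset, or UINT32_MAX if full. */
static uint32_t heap_alloc(struct output_heap *heap, uint32_t count)
{
    uint32_t start = atomic_fetch_add(&heap->next, count);

    if (start + count > heap->capacity)
        return UINT32_MAX;    /* out of space */
    return start;
}

/* One worker: tessellate a single patch and write its indices into the
 * slice of the shared buffer it just reserved.  On a GPU, one of these
 * would run per work-item rather than in a CPU-side loop. */
static void tessellate_patch(struct output_heap *heap, uint32_t patch_id)
{
    const uint32_t index_count = 6;    /* pretend each patch emits 6 indices */
    uint32_t base = heap_alloc(heap, index_count);

    if (base == UINT32_MAX)
        return;
    for (uint32_t i = 0; i < index_count; i++)
        heap->indices[base + i] = patch_id * index_count + i;
}

int main(void)
{
    struct output_heap heap = {
        .capacity = 1024,
        .indices = calloc(1024, sizeof(uint32_t)),
    };

    if (heap.indices == NULL)
        return 1;
    for (uint32_t patch = 0; patch < 100; patch++)
        tessellate_patch(&heap, patch);
    printf("used %u of %u slots\n", atomic_load(&heap.next), heap.capacity);
    free(heap.indices);
    return 0;
}
```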
In order to draw the output of the tessellators, though, there is a need to use draw instructions with packed data structures as specified by the GPU. That is normally done from the C driver code using functions that are generated by the GenXML tool. Since the tessellators are simply C code, "thanks to OpenCL", the generated functions can be included into the code that runs on the GPU. Rosenzweig went into more detail, which fills in the holes (and likely inaccuracies) of the above description; those interested in the details should look at the presentation video and her slides.
"Does it work? Yes, it does."
She showed an image of terrain tessellation from a Vulkan demo. It was run on an M2 Mac with "entirely OpenCL-based tessellation". There is also the question of "how is the performance of this abomination?" The answer is that it is "okay". On that system, software-only terrain tessellation runs at less than one frame per second (fps), which "is not very fun for playing games"; with OpenCL, it runs at 265fps, which is "pretty good" and is unlikely to be the bottleneck for real games. The hardware can do 820fps; "I did wire up the hardware tessellator just to get a number for this talk." There is still room for improvement on the driver's numbers, she said.
Vulkan and gaming
She also announced Vulkan 1.3 conformance for the Honeykrisp M1/M2 GPU driver. It started by copying the NVK Vulkan driver for NVIDIA GPUs, "smashed against the [Open]GL 4.6 [driver]", which started passing the conformance test suite "in about a month". That was six months ago and, since then, she has added geometry and tessellation shaders, transform feedback, and shader objects. The driver now supports every feature needed for multiple DirectX versions.
There are a lot of problems "if we want to run triple-A (AAA) games on this system", however. A target game runs on DirectX and Windows on an x86 CPU with 4KB pages, but "our target hardware is running literally none of those things". What is needed is to somehow translate DirectX to Vulkan, Windows to Linux, x86 to Arm64, and 4KB pages to 16KB pages. The first two have a well-known solution in the form of the DXVK driver and Wine, which are "generally packaged into Proton for Steam gaming".
Going from x86 to Arm64 also has off-the-shelf solutions: FEX-Emu or Box64. She has a bias toward FEX-Emu; "when I am not committing Mesa crimes, I am committing FEX-Emulation crimes". The big problem, though, is the page-size difference.
FEX-Emu requires 4KB pages; Box64 has a "hack to use 16KB pages, but it doesn't work for Wine, so it doesn't help us here". MacOS can use 4KB pages for the x86 emulation, but "this requires very invasive kernel support"; Asahi Linux already has around 1000 patches that are making their way toward the mainline kernel, but "every one of those thousand is a challenge". Making changes like "rewriting the Linux memory manager" is not a reasonable path.
It turns out that, even though Linux does not support heterogeneous page sizes between different processes, it does support them between different kernels; "what I mean by that is virtualization". A KVM guest kernel can have a different page size than the host kernel. So, "this entire mess", consisting of FEX-Emu, Wine, DXVK, Honeykrisp, Steam, and the game, "we are going to throw that into a virtual machine, which is running a 4KB guest kernel".
There is some overhead, of course, but it is hardware virtualization, so that should have low CPU overhead. The problem lies with the peripherals, she said. So, instead of having Honeykrisp in the host kernel, it runs in the guest using virtgpu native contexts; all of the work to create the final GPU command buffer is done in the guest and handed to the host, rather than making all of the Vulkan calls individually traverse the virtual-machine boundary. The VirGL renderer on the host then hands that to the GPU, which "is not 100% native speed, but definitely well above 90%", Rosenzweig said.
The good news is that the overheads for the CPU and GPU do not stack, since the two run in parallel. "So all the crap overhead we have in the CPU is actually crap that is running in parallel to the crap overhead on the GPU, so we only pay the cost once."
"'Does it work?' is the question you all want to know." It does, she said; it runs games like Portal and Portal 2. She also listed a number of others: Castle Crashers, The Witcher 3, Fallout 4, Control, Ghostrunner, and Cyberpunk 2077.
All of the different pieces that she mentioned were made available on October 10, the day of the talk. For those running the Fedora Asahi Remix distribution, she suggested immediately updating to pick up the pieces that she had described. Before taking questions, she launched Steam, which took some time to come up, in part because of the virtual machine and the x86 emulation. Once it came up, she launched Control, which ran at 45fps on an M1 Max system.
There was a question about resources from someone who has a Mac with 8GB of RAM. Rosenzweig said that the high-end gaming titles are only likely to work on systems with 16GB or more. She noted that she was playing Castle Crashers on an 8GB system during the conference, so some games will play; Portal will also work on that system. She hopes that the resources required will drop over time.
Another question was about ray-tracing support, since Control can use that feature. Rosenzweig suggested that patches were welcome but that she did not see that as a high priority ("frankly, I think ray tracing is a bit of a gimmick feature"). Apple hardware only supports it with the M3, and the current driver is for M1 and M2 GPUs, though she plans to start working on M3 before long. The session concluded soon after that, though Rosenzweig played Control, admittedly poorly, as time ran down.
[ I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Montreal for XDC. ]
The performance of the Rust compiler
Sparrow Li presented virtually at RustConf 2024 about the current state of and future plans for the Rust compiler's performance. The compiler is relatively slow to compile large programs, although it has been getting better over time. The next big performance improvement to come will be parallelizing the compiler's parsing, type-checking, and related operations, but even after that, the project has several avenues left to explore.
As the projects written using Rust get larger, compilation latency has become increasingly important. The Rust compiler's design causes slow compilation, Li said, when compared to languages with implementations suitable for fast prototyping such as Python or Go. This slowness does have a cause — the compiler performs many sophisticated analyses and transformations — but it would be easier to develop with Rust if the language could be compiled faster.
To give some idea of how badly Rust performs compared to other languages, Li presented a graph showing the compile time of programs with different numbers of functions across several languages. Rust's line was noticeably steeper. For small numbers of functions, it was no worse than other languages, but compile times increased much more as the Rust programs grew larger. Improving compile times has become an area of focus for the whole community, Li said.
This year, the speed of the compiler has more than doubled. Most of this is due to small, incremental changes by a number of developers. Part of this improvement is also from work on tools like rustc-perf, which shows developers how their pull requests to the Rust compiler affect compile times. The efforts of the parallel-rustc working group (in which Li participates) are now available in the nightly language builds, and end up providing a 30% performance improvement in aggregate. Those improvements should land in stable Rust by next year.
There are a handful of other big improvements that will be available soon. Profile-guided optimization provides (up to) a 20% speedup on Linux. BOLT, a post-link optimizer, (which was the topic of a talk that LWN covered recently) can speed things up by 5-10%, and link-time optimization can add another 5%, Li said. The biggest improvement at this point actually comes from switching to the lld linker, however, which can improve end-to-end compilation time by 40%.
The majority of the speedups have not been from big-ticket items, however. "It is the accumulation of small things" that most improves performance, Li said. These include introducing laziness to some areas, improved hashing and hash tables, changing the layout of the types used by the compiler, and more. Unfortunately, code optimizations like these require a good understanding of the compiler, and become rarer over time as people address the lower-hanging fruit.
So, to continue improving performance, at some point the community will need to address the overall design of the compiler, Li said. Parallelization is one promising avenue. Cargo, Rust's build tool, already builds different crates in parallel, but it is often the case that a build will get hung up waiting for one large, commonly-used crate to build, so parallelism inside the compiler is required as well. There is already some limited parallelism in the compiler back-end (the part that performs code generation), but the front-end (which handles parsing and type-checking) is still serial. The close connection between different front-end components makes parallelizing it difficult — but the community would still like to try.
Proposed in 2018 and established in 2019, the parallel-rustc working group has been planning how to tackle the problem of parallelizing the front-end of the compiler for some years. The group finally resolved some of the technical difficulties in 2022, and landed an implementation in the nightly version of the compiler in 2023. The main focus of its work has been the Rust compiler's query system, which is now capable of taking advantage of multiple threads. The main technical challenge to implementing that was a performance loss when running in a single-threaded environment. The solution the working group came up with was to create data structures that can switch between single-threaded and thread-safe implementations at run time. This reduced the single-threaded overhead to only 1-3%, which was judged to be worth it in exchange for the improvement to the multithreaded case.
The query system was not the only component that needed to adapt, however. The working group also needed to change the memory allocator — while the default allocator was thread-safe, having too many threads was causing contention. The group solved this by using separate allocation pools for each thread, but Li warned that this wasn't the only source of cross-thread contention. When the compiler caches the result of a query, that result can be accessed from any thread, so there is still a good deal of cross-thread memory traffic there. The working group was able to partially address that by sharding the cache, but there's still room for improvements.
While performance improves when using up to eight threads in testing, Li said, performance actually goes down between eight and sixteen threads; there are too many inefficiencies from synchronization. The project has considered switching to a work-stealing implementation so that threads do not spend as much time waiting for each other, but that is hard to do without introducing deadlocks. Finally, some front-end components are still single-threaded — particularly macro expansion. At some point, those will need to be parallelized as well.
In all, though, Li thought that Rust's compilation performance has "a shiny future". There are a handful of approaches that still need to be explored, such as improving the linker, using thin link-time optimization, and so on, but the community is working hard on implementing these. Rust's incremental compilation could also use some improvement, Li said. C++ does a better job there, because Rust's lexing, parsing, and macro expansion are not yet incremental.
Li finished the talk with a few areas that are not currently being looked into that might be fruitful to investigate: better inter-crate sharing of compilation tasks, cached binary crates, and refactoring the Rust compiler to reduce the amount of necessary global context. These "deserve a lot of deep thought", Li said. Ultimately, while the Rust compiler's performance is worse than that of many other languages, it's something that the project is aware of and actively working on. Performance is improving, and despite the easy changes becoming harder to find, it is likely to continue improving.
realloc() and the oversize importance of zero-size objects
Small objects can lead to large email threads. In this case, the GNU C Library (glibc) community has been having an extensive debate over the handling of zero-byte allocations. Specifically, what should happen when a program calls realloc() specifying a size of zero? This is, it seems, a topic about which some people, at least, have strong feelings.
The job of realloc() is to change the size of an existing allocation; its prototype is:
void *realloc(void *ptr, size_t size);
This call will replace the memory object at ptr with a new object of the given size and return a pointer to its replacement; that new object will contain the same data, up to the minimum of the old and new sizes. The old memory will be freed. Calls to realloc() are useful when an allocated block of memory needs to be resized on the fly.
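As a reminder of how the interface is normally used, here is a small, self-contained example (illustrative only, not code from the discussion) that grows an array with realloc() as elements are appended; the return value is checked before the old pointer is overwritten because, for a nonzero size, a failed realloc() leaves the original block intact.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t capacity = 4;
    size_t count = 10;
    int *values = malloc(capacity * sizeof(*values));

    if (values == NULL)
        return 1;

    /* Append count integers, doubling the buffer whenever it fills up. */
    for (size_t i = 0; i < count; i++) {
        if (i == capacity) {
            size_t new_capacity = capacity * 2;
            int *grown = realloc(values, new_capacity * sizeof(*grown));

            if (grown == NULL) {    /* the old block is still valid here */
                free(values);
                return 1;
            }
            values = grown;
            capacity = new_capacity;
        }
        values[i] = (int)i * 3;
    }

    for (size_t i = 0; i < count; i++)
        printf("%d ", values[i]);
    putchar('\n');
    free(values);
    return 0;
}
```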
An interesting question arises, though, when size is zero. If a program calls malloc() requesting a zero-byte object, the result is (in glibc) well defined: "If size is 0, then malloc() returns a unique pointer value that can later be successfully passed to free()." Needless to say, any attempt to store data using the returned pointer is unlikely to lead to joy, but the caller will get a pointer value back. This appears to be common behavior across malloc() implementations.
The situation is not quite as clear with realloc() in either the relevant standards or existing implementations; if it is called with a size of zero, it could do any of:
- Behave like malloc() and return a pointer to a zero-byte object.
- Behave like free(), release the object, and return NULL.
- Behave like a spoiled brat, refuse to do anything, store an error number in errno, and return NULL.
Alejandro Colomar recently brought this question to the glibc development list. His opinion, expressed in unambiguous terms, was that realloc() should behave in the same way as malloc() when passed a size of zero; it should, in other words, free the passed-in object and return a pointer to a zero-byte object. But that is not what glibc's realloc() does; instead, it takes option 2 above: it frees the object and returns NULL. This behavior is not new; it dates back to 1999, when the behavior was changed to align with the C99 standard (or, at least, one reading thereof).
This change, Colomar asserted, should be reverted. That would make glibc conformant to his interpretation of C99 and in line with what the BSD systems do. He copied the discussion widely and rallied some support to this cause. None other than Douglas McIlroy put in an appearance to describe glibc's behavior as "a diabolical invention". Paul Eggert agreed, saying that "one must always be able to shrink an allocation, even if the shrinkage is [to] zero".
In truth, there were few defenders for glibc's behavior, but there is still resistance to changing it. As Joseph Myers pointed out, there are almost certainly programs that rely on the current behavior:
Given that any change in implementation behavior would introduce memory leaks or double frees depending on what an implementation currently does that applications rely on, it's not reasonable to try now to define it more precisely.
Colomar, though, disagreed strongly. Any memory leaks, he said, would be of zero-byte objects and could thus be ignored; Siddhesh Poyarekar pointed out, though, that a zero-byte object still requires the equivalent of two size_t values for tracking, so real memory would be leaked. Colomar also doubted the double-free case; Myers explained how that could come about. While there is disagreement over how many programs would be affected by this sort of subtle change, it is hard to argue that the number would be zero. That is going to make the glibc developers reluctant to change realloc() in this way.
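To make the hazard concrete, here is a hedged C illustration (hypothetical code, not taken from any of the cited programs) of the two failure modes discussed above; which function is buggy depends entirely on which realloc(ptr, 0) behavior the C library provides.

```c
#include <stdlib.h>

/* Pattern 1: use realloc(p, 0) as a free() and discard the result.  That is
 * fine with glibc, which frees p and returns NULL, but under a malloc()-like
 * implementation the returned zero-size object (and its bookkeeping
 * overhead) is leaked. */
void shrink_to_zero_discard(char *p)
{
    (void)realloc(p, 0);
}

/* Pattern 2: treat a NULL return as an allocation failure and clean up the
 * old block.  That is the usual idiom for nonzero sizes, but with a size of
 * zero under glibc, p has already been freed by the time free() runs below,
 * so this is a double free.  Under a malloc()-like implementation, the
 * non-NULL return means this path is normally never taken. */
char *shrink_to_zero_check(char *p)
{
    char *q = realloc(p, 0);

    if (q == NULL) {
        free(p);    /* double free with glibc semantics */
        return NULL;
    }
    return q;
}

int main(void)
{
    /* Only the first pattern is exercised here; running the second one on
     * glibc would abort with a double-free error. */
    shrink_to_zero_discard(malloc(16));
    return 0;
}
```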
Poyarekar, instead, attempted to find a compromise with this patch adding a new tunable that would allow the behavior of realloc() to be changed while leaving the current behavior as the default. Florian Weimer disliked the patch, saying that any applications that would benefit from the tunable are simply buggy in their current form; he later added that distributions would have to do a lot of work to support this tunable properly. Andreas Schwab argued that a tunable is not appropriate, since a given program's needs will not change at run time; the behavior, if it changes at all, needs to be set at compile time.
Hanging over this whole discussion is another important detail: the C23 standard explicitly defines calling either malloc() or realloc() with a size of zero as undefined behavior. That is a change from C17, which called it implementation-defined. Colomar had mentioned that change in his initial message, but dismissed it, saying "let's ignore that mistake". The glibc project cannot entirely ignore the standards it implements, though. So it is not surprising that, among others, Poyarekar also suggested updating the GNU tools and sanitizers to enforce this standard in the long term.
The conversation wound down without any definitive conclusions. One might surmise, though, that a change to the well-documented behavior of realloc() for the last 25 years would be unlikely to happen, even without the C23 change. Instead, glibc will have a hard time resisting the push from the standards committee to eliminate the existence of zero-size objects entirely. At some point, attempting to allocate such an object may well become an error, though this behavior will surely be controlled by a feature-test macro, as usual, to avoid breaking existing programs. Meanwhile, though, chances are good that we will see further ado over the allocation of nothing.
Kernel optimization with BOLT
A pair of talks in the toolchains track at the 2024 Linux Plumbers Conference covered different tools that can be used to optimize the kernel. First up was Maksim Panchenko to describe the binary optimization and layout tool (BOLT) that Meta uses on its production kernels. It optimizes the kernel binary by rearranging it to improve its code locality for better performance. A subsequent article will cover the second talk, which looked at automatic feedback-directed optimization (AutoFDO) and other related techniques that are used to optimize Google's kernels.
Panchenko began with a slide showing a handful of companies and projects that use BOLT, which can be seen in his slides or the YouTube video of the talk. It was designed at first for large applications at Meta, but it turned out to also accelerate compilers. So, for example, it is used by Python since version 3.12; it is also used by LLVM, Rust, and others.
If you look back to ten years ago in the open-source world, getting the maximum performance from an application was a matter of using GCC or Clang with -O3, profile-guided optimization (PGO), and link-time optimization (LTO). But applying PGO was "kind of painful", he said, so only those who cared a lot about performance were using it. Applying PGO to the kernel was even more painful, so most companies just used -O2 or -O3 on their kernels.
The main Meta application is a virtual machine running PHP code; around 2015, developers there came up with the idea of a binary optimizer that would work with both GCC and Clang to speed up the generated code. BOLT turned out to exceed developers' expectations; it gave large gains and was able to accelerate the compilers themselves. It has been used at Meta since 2016; "most of the cycles are spent in binaries optimized by BOLT" throughout the Meta fleet. BOLT was released as open source, under the Apache 2.0 license, in 2018 and has been part of LLVM since 2022.
![Maksim Panchenko](https://static.lwn.net/images/2024/lpc-panchenko-sm.png)
Generally, developers think about how the data structures for their programs will be arranged in memory; it is less common for them to consider how the code is arranged. There are exceptions, including the Linux kernel developers, but most times the focus is on the data cache. The instruction cache is much smaller than the data cache and has not grown much over time, maybe doubling from 32KB to 64KB on Intel CPUs over the last 20 years. But, for large applications that do not spend most of their time in tight loops, the layout of code in memory matters a lot, so BOLT can make a major difference, he said.
Compilers do not have enough information to optimally place code, even when they have profiling information; it turns out that inlining functions changes the profile. The profile information may point to a function foo() that is called a lot, but when it is inlined, that information is not present. BOLT operates on the binary to observe the code that is being frequently executed so that all of it can be placed close together in memory. Once the BOLT paper was released in 2019, he said, other efforts, including Propeller, have come about; they are generally tied to a single toolchain, though, while BOLT can be used with GCC or Clang and with different linkers.
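A tiny, hypothetical C example may make the inlining problem clearer; the helper below is inlined into its caller, so samples that the profile once attributed to foo() end up inside bar()'s machine code, and a later compile has no reliable way to map them back. That is the gap a binary-level tool like BOLT fills by profiling and reordering the final machine code itself.

```c
#include <stdio.h>

/* Hypothetical hot helper; with optimization (or PGO) the compiler will
 * typically inline it into bar(). */
static inline int foo(int x)
{
    return x * 3 + 1;
}

/* After inlining, no call to foo() remains in the binary: the profiler's
 * samples for foo()'s instructions show up in the middle of bar(), so
 * function-level profile data about foo() loses its meaning. */
int bar(const int *data, int n)
{
    int sum = 0;

    for (int i = 0; i < n; i++)
        sum += foo(data[i]);
    return sum;
}

int main(void)
{
    int data[] = { 1, 2, 3, 4 };

    printf("%d\n", bar(data, 4));
    return 0;
}
```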
He showed a sort of heat map of the memory layout of the HHVM runtime, which is what is used by Meta for most of its workloads. One image showed hot code spread all over, while the post-BOLT image showed all of the hot code confined to the same small region of memory. That drastically reduces instruction-cache misses, translation-lookaside buffer (TLB) misses, and CPU time; for HHVM, BOLT produced a 7% performance increase, Panchenko said.
BOLT is a post-link optimizer that runs on ELF binaries, such as vmlinux. Even though it is part of the LLVM project, BOLT still supports GCC code; it also supports the "most popular" architectures: x86_64, Arm64, and RISC-V.
Applying BOLT to the kernel took some time, mostly in the form of ensuring that the resulting kernel would run and not crash. One of the big problems encountered was in finding a good benchmark to use to measure the impact of BOLT. There are lots of different micro-benchmarks available, but "I couldn't find any scalable, large-scale benchmark". In the end, he used the RocksDB db_bench fillseq benchmark, which showed a 2.5% improvement just by switching to a BOLT-optimized kernel. He and others at Meta ran one of the company's main services on a BOLT-optimized kernel that produced a 2% queries-per-second (QPS) improvement, "which was quite significant".
BOLT only changes branches and the location of the basic blocks in the binary to achieve its improvements. Most of the time, there is no need to recompile applications in order to apply BOLT; the exception is for programs built with split functions, which BOLT can do better than the compiler. The application does need to be relinked in order to produce relocation information. With that, and a profile of the kernel running a representative workload, BOLT will only take around four seconds to optimize vmlinux, he said.
The profile can be generated with a variety of mechanisms, including the last branch record (LBR) feature on Intel platforms and similar branch-sampling features on other architectures. If that is not available, the code can be instrumented to gather the needed information, but that has higher overhead than using LBR and the like. There are other options, but the profile quality, thus the BOLT optimizations, will not be as good.
BOLT needs an unstripped vmlinux binary because it uses the symbol names for code discovery, Panchenko said. BOLT can easily identify the boundaries of code and data in the ELF binary using the text and data segments that are defined. The segments are further divided into sections and BOLT uses the symbol-table information to identify the individual functions therein.
Then BOLT disassembles the functions, though, unlike objdump, it will "symbolize the operands of the instructions". Distinguishing between constants and addresses in the instructions can be a problem, however. The relocation information that is inserted by the linker helps with that; it can be used to "effectively do symbolic disassembly" of the code.
The resulting instruction stream can be optimized using normal techniques, such as peephole optimization, but "this is not very efficient". In order to do more optimizations on the code, an intermediate representation (IR) of some kind is needed. The IR used is "essentially a control-flow graph on top of MC instructions", Panchenko said; he was referring to the LLVM machine code (MC) project, which provides a number of tools and libraries that BOLT uses. The instructions look much like the assembly code, but some may be annotated or modified to identify tail calls versus other kinds of jumps, for example. "So if you look at BOLT disassembly, you will have a much better idea of what's happening in your application compared to regular objdump".
BOLT uses the profile information for the basic blocks in the control-flow graph in order to determine where the hot code resides. To make the best code-layout decisions, though, having weights on the edges, rather than just execution counts for the basic blocks, would be useful. The LBR profiling can provide that information, but BOLT can recover some information about edge weights even without it.
Then the graph is used to optimize the code; "The main optimization that gives us most of the gains is code reordering." The basic blocks can be grouped together to reduce the instruction-cache footprint of the code. That is done by breaking up the functions into fragments, some of which are hot code and others that are rarely or never executed (e.g. error-handling code). Compilers already do reordering, but on the function level; BOLT takes it one step further and reorders these function fragments.
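As a rough picture of what gets split, consider the following hypothetical C function (not from the kernel). Nothing in the source marks the fragments, but a profile will show that the error path is essentially never executed, so a tool like BOLT can keep the hot blocks of many such functions packed together in the binary and move the cold error-handling fragments out of the way; the GCC/Clang __builtin_expect() hint below merely stands in for the knowledge a profile would provide.

```c
#include <stdio.h>
#include <stdlib.h>

/* The bounds check almost never fails, so everything after the label is
 * cold; the return statement is the hot path executed on every call. */
int table_lookup(const int *table, size_t len, size_t index)
{
    if (__builtin_expect(index >= len, 0))
        goto out_of_range;

    return table[index];        /* hot: runs on every call */

out_of_range:                   /* cold: error handling */
    fprintf(stderr, "table_lookup: index %zu out of range (len %zu)\n",
            index, len);
    abort();
}

int main(void)
{
    static const int table[] = { 3, 1, 4, 1, 5, 9 };
    long sum = 0;

    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        sum += table_lookup(table, 6, i);
    printf("sum = %ld\n", sum);
    return 0;
}
```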
Once the new code is generated, there is a question about where to put it, he said. Unless there is a big concern about disk space, it is more efficient to simply create a new text segment that contains the hot code, which will generally be quite a bit smaller (for a binary that is huge, "hundreds of megabytes", 20MB or less of hot code would end up in the new segment). So it is not much overhead, in terms of disk space, and "you get a much faster application".
Adding another text segment to the kernel binary may not be viable, he said, so he turned to an alternative that is "much more feasible". BOLT will simply rewrite the existing functions in the binary; those functions are already ordered by the compiler based on its analysis, so BOLT effectively uses that. BOLT could do a bit better function ordering based on the profile, but that small gain can be sacrificed in order to easily get the basic-block reordering, he said. It also helps avoid over-specializing the code for only the workload measured by the profile; for other workloads that were not measured, some of the cold code will be executed, but it will still be located nearby due to the compiler choices.
The kernel provides other challenges, because its code is modified at boot time and also while it is running. He spent a good chunk of the talk going into the details of how that type of code is handled. Things like static calls, SMP locks (that are patched out on uniprocessor systems), static keys, and alternative instructions for different subarchitectures are handled with annotations in the disassembled code, which is part of the metadata that BOLT uses to do its job. For some, BOLT simply does not operate on them, while others have mechanisms that the optimizer can use on them. All of that metadata can be dumped using BOLT tools.
Panchenko said that he was skipping over some topics, such as continuous profiling, which can be applied to BOLT-optimized binaries as they run; a new version of a binary can be produced that will reflect changes in the code and the workload. He also did not cover any of the other optimizations that BOLT applies. He finished by showing some of the output that running BOLT produces, noting that a four-second demo of it operating would not be all that interesting.
[ I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Vienna for the Linux Plumbers Conference. ]
AutoFDO and Propeller
Rong Xu and Han Shen described the kernel-optimization techniques that Google uses in the toolchains track at the 2024 Linux Plumbers Conference. They talked about automatic feedback-directed optimization (AutoFDO), which can be used with the Propeller optimizer to produce kernels with better performance using profile information gathered from real workloads. There is a fair amount of overlap between these tools and the BOLT post-link optimizer, which was the subject of a talk that directly preceded this session.
AutoFDO
Xu started by saying that he would be covering AutoFDO, then Shen would talk about further performance enhancements using Propeller on top of AutoFDO. At Google, where Xu and Shen work, the amount of time spent executing in the kernel on its fleet was measured. Xu put up a slide showing the CPU-cycle percentage of the kernel, with idle time excluded, for a variety of workloads, which can be seen in the slides or YouTube video of the talk. Anywhere from 17-60% of the CPU cycles were spent in the kernel; the fleet-wide average is about 20%.
There have been lots of efforts toward optimizing the Linux kernel, he said; at Google, one of the main techniques is feedback-directed optimization (FDO), which uses run-time information to improve code generation. The core idea is to gather profile information on real workloads to allow the compiler to focus on the optimizations that have the most impact based on how the program runs.
FDO has proven to be effective and is used on nearly all of the applications at Google, for Android, ChromeOS, and in the data center, he said. It improves instruction-cache usage, reduces TLB misses, and improves branch performance. The overall improvement can be up to 20% for the run time, but there is another benefit; FDO usually also reduces code size.
![Rong Xu](https://static.lwn.net/images/2024/lpc-xu-sm.png)
There are two major types of FDO: instrumentation-based and sample-based. The instrumentation-based variant (also known as iFDO) is the original one; it uses an instrumented kernel to run tests that represent the real workload, which result in the profile to be used. The profile is then fed to the compiler to build an optimized kernel that can be deployed to production. The instrumented kernel provides accurate profiles, without depending on performance monitoring unit (PMU) hardware, but the resulting kernel is up to ten times slower, so it cannot be used in production. The load tests still need to be representative, however. Beyond that, another disadvantage is that iFDO requires kernel changes for the instrumentation.
Sample-based FDO (or AutoFDO) starts with a kernel built with debug symbols, then the perf tool is used to measure a representative workload. The raw perf data is converted to an AutoFDO profile that is fed to the compiler to build the optimized kernel. Because the perf-based data collection has such low overhead, the test kernel can be used in production with real workloads, which eliminates the concern about having truly representative test cases. But AutoFDO does not optimize quite as well as iFDO because the profile data is not as complete; "so if you use the same load test, you will get close performance, but not as good". In addition, AutoFDO requires support for both a hardware PMU and perf.
For fast-moving projects with incremental releases, the suggested model is to use an iterative process where the profile information gathered running the kernel in production for a particular release is fed to the compiler building the test kernel for the next release. That release is run in production and the profile gathered is fed in to build the following release—and so on. There are code mismatches, but AutoFDO uses line numbers relative to the start of functions; if there is no major refactoring in the release, "we find this works pretty well". The developers think this mode will work well for minor releases of the kernel, he said.
AutoFDO requires last branch record (LBR) or similar support in the processor; AMD branch sampling (BRS) and Arm embedded trace macrocell (ETM) can provide the same kind of information. With that information, AutoFDO can create a profile that allows multiple types of optimizations to be made. There are three that usually have the most impact: function inlining, increasing branch fall-through, and indirect-call promotion.
Inlining helps reduce both the size and run time of the code. Falling through branches is faster than taking a branch even if it is correctly predicted; in addition, that optimization groups hot basic blocks together for better instruction-cache utilization. Lastly, indirect calls can be promoted to direct calls, which allows them to be inlined. Other optimization passes can query the profile data to retrieve counts for basic blocks, branches taken, and so on, so those optimizations will be improved as well.
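Here is a hedged sketch, in C, of what indirect-call promotion looks like conceptually; the compiler performs this transformation itself using the profile, but writing it out by hand (with made-up function names) shows why the promoted form can then be inlined.

```c
#include <stdio.h>

int square(int x) { return x * x; }
int negate(int x) { return -x; }

/* Before: the call target is only known at run time, so the compiler cannot
 * inline it and the CPU must predict an indirect branch. */
int apply(int (*op)(int), int x)
{
    return op(x);
}

/* After (conceptually): the profile shows that op is almost always square,
 * so the hot target is tested for and called directly -- a call that can be
 * inlined -- while the indirect call remains as a fallback for cold targets. */
int apply_promoted(int (*op)(int), int x)
{
    if (op == square)
        return square(x);
    return op(x);
}

int main(void)
{
    printf("%d %d\n", apply(square, 7), apply_promoted(negate, 7));
    return 0;
}
```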
Xu gave a brief report of some performance numbers. In nearly all of the tests (a mixture of micro-benchmarks and workload tests), iFDO improved performance slightly more than AutoFDO, and most were fairly modest gains (2-10%) overall. In a workload test of a Google database application, AutoFDO performance improved by 2.6%, while iFDO saw 2.9%. He worked with some Meta engineers who reported a roughly 5% performance increase in one of their services using AutoFDO and around 6% with iFDO.
Overall, the path for using AutoFDO on the kernel has been fairly smooth. "The testing turns out to be the most challenging part of the work"; finding representative workloads and benchmarks is difficult. Along the way, the developers have learned that profiles from Intel machines work well on AMD machines. For even better performance, he recommended using link-time optimization (LTO) and ThinLTO.
Propeller
Shen then stepped up to describe using Propeller to build kernels. He started by describing what a post-link optimizer is, but noted that the BOLT talk just prior had covered it well. A post-link optimizer only requires the binary and a profile of it running on a representative workload, it does not usually need the source code. It uses a disassembler to convert the machine code into a compiler intermediate representation (IR), then does analysis and optimization, and finally converts back to machine code for the final output.
Conceptually, Propeller is a post-link optimizer, but it takes a different approach; it uses the compiler and linker to do the transformations. As with other post-link optimizers, it takes in a binary and a profile to identify a series of optimizations, but it does not directly apply them. Instead, it writes compiler and linker profiles, then redoes the build using the same source code. The compiler and linker do their usual jobs, but use the profile information to apply the optimizations as part of the process. The result is the same, an optimized binary, but the internal steps are different.
![Han Shen](https://static.lwn.net/images/2024/lpc-shen-sm.png)
There are a few reasons why this approach was used. While working on post-link optimization, the developers found that the disassembler does not work on all binaries. For example, it does not work on binaries with read-only data intertwined with code, or on binaries that must comply with standards such as FIPS. Beyond that, there are scalability problems: "the single-machine multithreaded design of the post-link optimizer leads to excessive processing time and memory usage". That led them to shift some of that burden to the compiler and linker, he said.
There are a few advantages to the Propeller mechanism. First, the optimization transformations are "done with the uttermost accuracy" because they are handled by the compiler and linker. It also "opens the door for parallelizing because each compilation job is independent". A distributed build cache can be used; if no optimization instructions are found for a particular source file, it does not need to be rebuilt and the cached object file can be used. It also means that Propeller itself does not need to deal with the kernel metadata that gets built; that is all handled by the existing build process. Propeller does require the source code, however, and the rebuild step, which are disadvantages of the approach.
There are currently two major optimizations that are made by Propeller: basic-block layout and path cloning. There is also active work on adding inter-procedure register allocation, Shen said.
Propeller can be combined with AutoFDO by building an AutoFDO profile that is used to build a kernel that is then profiled for Propeller. The profile used to build the AutoFDO kernel is also used to rebuild the kernel again with the Propeller profiles. Each kernel is run in production to generate the profiles needed. The final result is a kernel that has been optimized by both AutoFDO and Propeller.
Unlike other post-link optimizers, Propeller can optimize kernel modules. Post-link optimizers operate on executables and shared libraries, but kernel modules are relocatable objects. Propeller can resolve the relocation information in the module to link to the proper symbols.
As mentioned earlier, AutoFDO can tolerate minor source-code changes and small differences in the build options, but that is not true for Propeller. The source code and build options of the kernel used for generating the profile must be identical to those that Propeller uses. The developers "are working hard to try to mitigate this limitation", he said.
Propeller requires specific software and hardware support to do its job. It depends on a tool that lives outside of the LLVM source tree. The developers are currently working on migrating the functionality into LLVM. It also needs branch information from the CPU, which comes from features that vary across architectures (LBR for Intel, BRS or LBRv2 for AMD, and SPE or ETM for Arm). For internal applications, Propeller has been verified on each architecture (though not each branch-tracking feature); for the kernel, only Intel LBR has been validated at this point.
Shen showed some heat maps and graphs of PMU stats for no FDO, iFDO, AutoFDO, and the latter two with ThinLTO and Propeller. As might be guessed, there were improvements in nearly all of the categories for Propeller.
Currently, patches enabling AutoFDO and Propeller for the kernel have been sent upstream for review, Shen said. Internally, the developers are doing "large-scale production tests to measure the performance company-wide"; they are also looking at customer kernels based on specific workloads, with the idea of building different kernels for individual applications.
In summary, he said that FDO, using either iFDO or AutoFDO, "improves kernel performance significantly". AutoFDO integrates well with the kernel build and allows profiling in production, which is a huge benefit over iFDO. Their advice is to add Propeller on top to get the best possible performance.
During Q&A, Peter Zijlstra asked why there was a need for multiple profile runs; the AutoFDO profile should already have the needed information, he said. Shen said that the first-pass optimizations change the code such that the profile information is out of date, which is where the Propeller profiling picks up. Zijlstra noted that the code is transformed again at that point, so another profile would be needed; Shen agreed that it is an iterative process and that more profiling and rebuilding passes improve the results.
Zijlstra also complained that calling Propeller a post-link optimizer was not accurate. The optimization happens during the build process. Another audience member agreed, saying that the profile information was gathered post-link, but that it was simply another form of profile-guided optimization; Zijlstra suggested that Propeller should simply be called that to avoid the confusion.
There is, it seems, a lot of work going into producing optimized kernels using LLVM, though BOLT can also operate on kernels built with GCC. That this work is being done by companies with large fleets of cloud systems is no real surprise; a tiny increase in performance multiplied by the fleet size makes for a rather large impact on CPU usage, thus power requirements, number of systems needed, and so on—all of which boils down to spending less money, of course.
[ I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Vienna for the Linux Plumbers Conference. ]
A new approach to validating test suites
The first program that Martin Pool ever wrote, he said, had bugs; the ones he's writing now most likely have bugs too. The talk Pool gave at RustConf this year was about a way to try to write programs with fewer bugs. He has developed a tool called cargo-mutants that highlights gaps in test coverage by identifying functions that can be broken without causing any tests to fail. This can be a valuable complement to other testing techniques, he explained.
There are lots of ways to write programs with fewer bugs, Pool said. Many people at RustConf probably jump immediately to thinking of type systems or lifetime annotations. Those tools are great, and can eliminate entire classes of bugs, but they can't help with semantic bugs, where the programmer tells the computer to do the wrong thing, he said. The best way to catch those bugs is to write tests.
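As a contrived illustration (not an example from the talk), consider a function where the types are all correct but the logic is not; only a test can pin down the intended meaning:

```rust
// Compiles cleanly and keeps the borrow checker happy, but is still wrong:
// nothing in the types says that an area is a product rather than a sum.
fn area(width: u32, height: u32) -> u32 {
    width + height // semantic bug; should be width * height
}

#[cfg(test)]
mod tests {
    use super::*;

    // This test encodes the intended meaning and fails against the buggy code.
    #[test]
    fn area_of_a_3_by_4_rectangle() {
        assert_eq!(area(3, 4), 12);
    }
}
```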
![Martin Pool](https://static.lwn.net/images/2024/martin-pool-small.png)
Tests don't catch everything either — "I do write tests, and I still have bugs" — but they help. Sometimes, however, "you look at a bug and ask, how was this not caught?". The problem is that tests themselves can be buggy.
Sometimes they just don't cover some area of the code; sometimes they do
call the code, but don't verify the output strictly enough. Sometimes the tests
were comprehensive when they were written, but now the program has grown and the
tests no longer cover everything. A significant fraction of bugs could, in
principle, be caught by tests, Pool said. It would be nice if there were a tool
that could point out where tests are insufficient, so that we can avoid
introducing new bugs into existing software.
In order to explain how cargo-mutants does that, Pool first briefly covered the
process of writing tests: the programmer starts out knowing what the program
should do, and then tries to write tests that tie what it should do to what it
actually does. That tie should be robust — the tests should check all of the
properties the programmer cares about, and they should fail if those properties
don't hold. Good tests should also generalize, since exhaustively testing every
possible behavior is impossible. So one way to think about finding weaknesses in a set of tests is to ask "what behaviors does the program have that are not checked by any tests?"
That is exactly what cargo-mutants does — it uncovers code that is buggy, unnecessary, underspecified, or just hard to test, by finding gaps in the test suite. The core algorithm is to make a copy of a project's source code, deliberately insert a bug, run the test suite, and expect the tests to fail. If they don't, report it as a "missed mutant" — that is, a buggy version of the program that wouldn't be caught by the tests, which needs some human attention to figure out the best way to address it.
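That loop is simple enough to sketch. The following is a minimal, hypothetical rendering of the algorithm as described, not cargo-mutants' actual implementation; the scratch-directory name, the Mutant type, and the single hard-coded mutation are all stand-ins:

```rust
use std::fs;
use std::process::Command;

// Hypothetical description of one synthetic bug: which file to rewrite in the
// copied tree and what the mutated source should contain.
struct Mutant {
    file: &'static str,
    mutated_source: &'static str,
}

fn main() -> std::io::Result<()> {
    // Assumed: the project has already been copied into ./mutants-scratch and
    // a list of mutants has been generated (both steps elided here). Restoring
    // the original file between mutants is also left out for brevity.
    let scratch = "mutants-scratch";
    let mutants = [Mutant {
        file: "src/lib.rs",
        mutated_source: "pub fn is_symlink(_p: &std::path::Path) -> bool { true }",
    }];

    for mutant in &mutants {
        // Deliberately insert the bug into the copied tree...
        fs::write(format!("{scratch}/{}", mutant.file), mutant.mutated_source)?;

        // ...then run the test suite there and expect it to fail.
        let status = Command::new("cargo")
            .arg("test")
            .current_dir(scratch)
            .status()?;
        if status.success() {
            // The tests passed with a bug in place: a missed mutant.
            println!("MISSED: {} can be broken without failing the tests", mutant.file);
        }
    }
    Ok(())
}
```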
Pool gave a specific example of a bug that was found by cargo-mutants. In one of his programs, which expected to receive symbolic links and needed to handle them, he wrote an is_symlink() function. Cargo-mutants showed that if the body of the function were replaced with true, the tests did not catch it — or, in other words, the tests didn't cover the possibility of something not being a symbolic link. This is actually pretty common, Pool said, since humans often forget to write negative test cases. In any case, he added a test and immediately found that the original implementation was wrong as well.
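A hedged reconstruction of that situation might look like the following; the function name comes from the talk, but the body and the test are illustrative guesses rather than Pool's actual code:

```rust
use std::fs;
use std::path::Path;

// Illustrative version of the function in question: report whether a path is
// a symbolic link, without following it.
fn is_symlink(path: &Path) -> bool {
    fs::symlink_metadata(path)
        .map(|meta| meta.file_type().is_symlink())
        .unwrap_or(false)
}

// cargo-mutants tries replacing the whole body with plausible values such as
// `true` and `false`. If every existing test only ever feeds in symlinks, the
// `true` mutant survives; the missing negative case is something like this:
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn regular_file_is_not_reported_as_symlink() {
        assert!(!is_symlink(Path::new("Cargo.toml")));
    }
}
```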
For a somewhat larger example, he selected the semver crate, which provides functions for working with semantic version numbers. He isn't involved in the semver project, so it represents an unbiased test. He ran cargo-mutants on it, and found a handful of places where the test suite could be improved. This is typical of what is found among high-quality crates, Pool said. In semver's case, if the hashing code was broken, the tests would still pass. It's a good crate — not buggy — but there are still some gaps in the coverage that could make it easier to unknowingly introduce bugs.
That's not to say that every single missed mutant is necessarily a problem, Pool
explained. People have limited time; they should focus on the most important
things, instead of spending all of their time polishing their test suite. The
findings from cargo-mutants are just things to think about. "The computer is not the boss of you." What Pool recommends is skimming the list of missed mutants, looking for code that is important or that should be thoroughly tested. Then, one should consider whether the introduced mutation is really incorrect, whether the original code was actually correct, and what test should have caught it.
Ultimately, it's important not to stress too much about the mutants the tool finds, Pool said. Cargo-mutants generates synthetic bugs — but the goal is always to catch and prevent real bugs. Pool compared cargo-mutants to vaccines: deliberately introducing a weakened version of a problem in a controlled way, so that you can beef up the system that is supposed to catch the real problem.
With the audience hopefully convinced to at least give cargo-mutants a try, Pool switched to talking about how the tool works under the hood. The most interesting part, he said, is actually generating synthetic bugs. The generated mutants have a few requirements: they should be valid Rust code, they should point to real gaps, and they should not be too repetitive. They should also, ideally, be generated deterministically, because that makes them easier to reproduce. There's lots of existing research on how to do this, he said, but it turns out that small, local changes to the code are both easy to generate and close to what could really change in the code base.
So cargo-mutants uses cargo-metadata and syn to find and parse the Rust code in a project. For each Rust module, it pattern-matches on the syntax tree and applies two different kinds of changes: replacing a whole function with a stub, and changing arithmetic operations. In the future, the tool could add more, but just those two turn out to be "really powerful". When replacing a function, it generates likely return values in a type-driven way; even quite complicated nested types can be generated by first generating their constituent parts, so functions returning complex types can still be stubbed out.
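As a rough illustration of those two kinds of changes (the specific replacement values shown here are typical of this style of mutation, but this is a sketch, not an authoritative list of what the tool generates):

```rust
// Original function in the project under test.
fn total_price(unit: u32, quantity: u32) -> u32 {
    unit * quantity
}

// Kind 1: replace the whole body with a likely value for the return type,
// chosen in a type-driven way; for a u32, plausible stubs include:
//
//     fn total_price(unit: u32, quantity: u32) -> u32 { 0 }
//     fn total_price(unit: u32, quantity: u32) -> u32 { 1 }
//
// Kind 2: change an arithmetic operation, for example:
//
//     fn total_price(unit: u32, quantity: u32) -> u32 { unit + quantity }

#[cfg(test)]
mod tests {
    use super::*;

    // A test that only checked total_price(2, 2) == 4 would miss the `+`
    // mutant; this assertion distinguishes all of the variants above.
    #[test]
    fn multiplies_rather_than_adds() {
        assert_eq!(total_price(3, 4), 12);
    }
}
```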
Pool has been working on cargo-mutants for a few years. He runs it on his own crates, and also on other open-source crates. With that experience, he can say that getting to complete coverage of every mutant can be a bit of work, because the tool points out gaps that can be hard to address. It can be tempting to suppress a mutant, but sometimes they matter. Mostly, the tool finds gaps in the tests, but occasionally it finds a real bug as well, he said. The most difficult cases are when there is a test for something, but the test is flaky. When introducing cargo-mutants to a new project, he thinks it is easiest to turn it on for one file, module, or crate at a time, and gradually expand through the code base.
Cargo-mutants has 100% coverage of itself, naturally. Once you get to that level, Pool said, it's pretty easy to stay there, since the list of new mutants created by any given change is small. Getting to that point has revealed lots of interesting edge-cases, though.
One other difficulty has been parallelization. Cargo-mutants runs hundreds or thousands of incremental builds and tests, which eats a lot of CPU time. Since each mutant's build-and-test run is independent of the others, though, the tool can run them in parallel — even spread across multiple machines. To give an idea of the performance, Pool used numbers from cargo-mutants itself. The crate is 11,000 lines of code, including tests. Building it takes 12 seconds, while running the tests takes 55 seconds. Running cargo-mutants on itself generates 575 mutants (all of which are covered by the test suite, and so are not missed mutants), but because of incremental builds and parallelization, running the tests on all of them only takes about 15 minutes.
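That independence is what makes the parallelism straightforward. A minimal sketch of the idea (again, not the tool's actual implementation) might give each worker its own copy of the tree, assumed here to already exist as mutants-job-0, mutants-job-1, and so on:

```rust
use std::process::Command;
use std::thread;

fn main() {
    let jobs = 4;
    thread::scope(|scope| {
        for job in 0..jobs {
            scope.spawn(move || {
                // Each worker owns a separate copy of the (mutated) source
                // tree, so incremental builds and test runs never interfere.
                let dir = format!("mutants-job-{job}");
                let status = Command::new("cargo")
                    .arg("test")
                    .current_dir(&dir)
                    .status()
                    .expect("failed to run cargo");
                let outcome = if status.success() {
                    "passed (missed mutant)"
                } else {
                    "failed (mutant caught)"
                };
                println!("job {job}: tests {outcome}");
            });
        }
    });
}
```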
Pool finished by comparing cargo-mutants to other testing practices such as code
coverage and fuzzing. Code coverage checks that code was executed, he explained.
Cargo-mutants checks that the results of the test suite depend on running the
code, which is a stronger guarantee. Cargo-mutants also tells you exactly what
the uncaught bug is, which can be easier to handle. The downside is that
cargo-mutants is significantly slower than existing code-coverage tools. A
developer could easily run both, Pool said.
Another common point of comparison is fuzzing or property testing. Fuzzing is a good way to find problems by "wiggling the inputs"; cargo-mutants finds problems by "wiggling the code".
There are a lot of tools that purport to help programmers write more correct software. The choice of which ones to use is often a tradeoff between correctness and time. Still, nearly every project has at least some tests — so an easy-to-use tool like cargo-mutants that helps improve a project's test suite without too much additional effort seems like it could be a valuable option.
Page editor: Jonathan Corbet