
A new approach to validating test suites

By Daroc Alden
October 29, 2024

RustConf 2024

The first program that Martin Pool ever wrote, he said, had bugs; the ones he's writing now most likely have bugs too. The talk Pool gave at RustConf this year was about a way to try to write programs with fewer bugs. He has developed a tool called cargo-mutants that highlights gaps in test coverage by identifying functions that can be broken without causing any tests to fail. This can be a valuable complement to other testing techniques, he explained.

There are lots of ways to write programs with fewer bugs, Pool said. Many people at RustConf probably jump immediately to thinking of type systems or lifetime annotations. Those tools are great, and can eliminate entire classes of bugs, but they can't help with semantic bugs, where the programmer tells the computer to do the wrong thing, he said. The best way to catch those bugs is to write tests.

[Martin Pool]

Tests don't catch everything either — "I do write tests, and I still have bugs" — but they help. Sometimes, however, "you look at a bug and ask, how was this not caught?". The problem is that tests themselves can be buggy. Sometimes they just don't cover some area of the code; sometimes they do call the code, but don't verify the output strictly enough. Sometimes the tests were comprehensive when they were written, but now the program has grown and the tests no longer cover everything. A significant fraction of bugs could, in principle, be caught by tests, Pool said. It would be nice if there were a tool that could point out where tests are insufficient, so that we can avoid introducing new bugs into existing software.

In order to explain how cargo-mutants does that, Pool first briefly covered the process of writing tests: the programmer starts out knowing what the program should do, and then tries to write tests that tie what it should do to what it actually does. That tie should be robust — the tests should check all of the properties the programmer cares about, and they should fail if those properties don't hold. Good tests should also generalize, since exhaustively testing every possible behavior is impossible. So one way to think about finding weaknesses in a set of tests is to ask "what behaviors does the program have that are not checked by any tests?"

That is exactly what cargo-mutants does — it uncovers code that is buggy, unnecessary, underspecified, or just hard to test, by finding gaps in the test suite. The core algorithm is to make a copy of a project's source code, deliberately insert a bug, run the test suite, and expect the tests to fail. If they don't, report it as a "missed mutant" — that is, a buggy version of the program that wouldn't be caught by the tests, which needs some human attention to figure out the best way to address it.
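The real tool is considerably more elaborate, but the heart of that loop is small. A minimal sketch of checking one mutant might look like this — the function below is hypothetical, not cargo-mutants' actual code; it simply runs the test suite in a scratch copy of the tree that already has the synthetic bug applied:

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Run the test suite in a scratch copy of the source tree that already has a
/// synthetic bug applied. Illustrative sketch only; the real tool also handles
/// timeouts, logging, build errors, and much more.
fn mutant_was_missed(mutated_tree: &Path) -> io::Result<bool> {
    let status = Command::new("cargo")
        .arg("test")
        .current_dir(mutated_tree)
        .status()?;
    // If the tests still pass with the bug in place, the mutant was "missed"
    // and the surrounding code deserves a closer look.
    Ok(status.success())
}
```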

Pool gave a specific example of a bug that was found by cargo-mutants. One of his programs expected to encounter symbolic links and needed to handle them, so he wrote an is_symlink() function. Cargo-mutants showed that if the body of the function were replaced with true, the tests did not catch it — or, in other words, the tests didn't cover the possibility of something not being a symbolic link. This is actually pretty common, Pool said, since humans often forget to write negative test cases. In any case, he added a test and immediately found that the original implementation was wrong as well.
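Pool's actual code wasn't shown in detail; the following hypothetical reconstruction illustrates the pattern. Without the negative test at the bottom, a mutant that replaces the body of is_symlink() with true slips through unnoticed:

```rust
use std::fs;
use std::path::Path;

// Hypothetical reconstruction, not Pool's actual code.
fn is_symlink(path: &Path) -> bool {
    fs::symlink_metadata(path)
        .map(|meta| meta.file_type().is_symlink())
        .unwrap_or(false)
}

#[cfg(test)]
mod tests {
    use super::*;

    // The missing negative case: a regular file must not be reported as a
    // symbolic link. With this test in place, the "replace body with true"
    // mutant is caught.
    #[test]
    fn regular_file_is_not_a_symlink() {
        let file = std::env::temp_dir().join("cargo-mutants-example.txt");
        fs::write(&file, b"plain file").unwrap();
        assert!(!is_symlink(&file));
        fs::remove_file(&file).unwrap();
    }
}
```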

For a somewhat larger example, he selected the semver crate, which provides functions for working with semantic version numbers. He isn't involved in the semver project, so it represents an unbiased test. He ran cargo-mutants on it, and found a handful of places where the test suite could be improved. This is typical of what is found among high-quality crates, Pool said. In semver's case, if the hashing code was broken, the tests would still pass. It's a good crate — not buggy — but there are still some gaps in the coverage that could make it easier to unknowingly introduce bugs.
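A test along the following lines is the kind of thing that would close such a gap. It is only an illustration, not code from the semver project, and it assumes (as the missed mutant implies) that semver::Version implements Hash:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

#[test]
fn version_hashing_is_not_a_no_op() {
    let a = semver::Version::parse("1.2.3").unwrap();
    let b = semver::Version::parse("1.2.3").unwrap();
    let c = semver::Version::parse("1.2.4").unwrap();
    // Equal versions must hash equally...
    assert_eq!(hash_of(&a), hash_of(&b));
    // ...and the inequality is what actually catches a stubbed-out Hash
    // implementation: distinct hashes are not strictly guaranteed, but are
    // expected from the default hasher for different versions.
    assert_ne!(hash_of(&a), hash_of(&c));
}
```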

That's not to say that every single missed mutant is necessarily a problem, Pool explained. People have limited time; they should focus on the most important things, instead of spending all of their time polishing their test suite. The findings from cargo-mutants are just things to think about. "The computer is not the boss of you." So what Pool recommends is skimming the list of missed mutants, looking for code that is important, or that should be thoroughly tested. Then, one should consider whether the introduced mutation is really incorrect, whether the original code was actually correct, and what test should have caught this.

Ultimately, it's important not to stress too much about the mutants the tool finds, Pool said. Cargo-mutants generates synthetic bugs — but the goal is always to catch and prevent real bugs. Pool compared cargo-mutants to vaccines: deliberately introducing a weakened version of a problem in a controlled way, so that you can beef up the system that is supposed to catch the real problem.

With the audience hopefully convinced to at least give cargo-mutants a try, Pool switched to talking about how the tool works under the hood. The most interesting part, he said, is actually generating the synthetic bugs. There are a few requirements for them: they should be valid Rust code, they should point to real gaps, and they should not be too repetitive. They should also, ideally, be generated deterministically, because that makes them easier to reproduce. There's lots of existing research on how to do this, he said, but it turns out that small, local changes to the code are both easy to generate and close to what could really change in the codebase.

So cargo-mutants uses cargo-metadata and syn to find and parse the Rust code in a project. For each Rust module, it pattern-matches on the syntax tree and applies two different kinds of changes: replacing a whole function with a stub, and changing arithmetic operations. In the future, the tool could add more, but just those two turn out to be "really powerful". When replacing a function, it generates likely return values in a type-driven way; even functions returning complex, nested types can be stubbed out, by first generating values for their constituent parts.
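Concretely, the two kinds of changes look roughly like the following. These examples are illustrative only; the exact replacement values and operators that cargo-mutants picks may differ:

```rust
// Original function:
fn total_size(entries: &[u64]) -> u64 {
    entries.iter().sum()
}
// A function-stub mutant replaces the whole body with a plausible value of
// the return type, for example:
//     fn total_size(entries: &[u64]) -> u64 { 0 }

// Original function:
fn area(width: u32, height: u32) -> u32 {
    width * height
}
// An arithmetic-operator mutant swaps the operator, for example `*` -> `+`:
//     fn area(width: u32, height: u32) -> u32 { width + height }

// Type-driven stubbing composes: a function returning a nested type such as
// Result<Vec<String>, std::io::Error> could be stubbed with Ok(vec![]), built
// from generated values for each constituent type.
```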

Pool has been working on cargo-mutants for a few years. He runs it on his own crates, and also on other open-source crates. With that experience, he can say that getting to complete coverage of every mutant can be a bit of work, because it points out gaps that can be hard to address. It's tempting to suppress such mutants, but sometimes they matter. Mostly, the tool finds gaps in the tests, but occasionally it finds a real bug as well, he said. The most difficult cases are when there is a test for something, but the test is flaky. When introducing cargo-mutants to a new project, he thinks it is easiest to turn it on for one file, module, or crate at a time, and gradually expand through the code base.

Cargo-mutants has 100% coverage of itself, naturally. Once you get to that level, Pool said, it's pretty easy to stay there, since the list of new mutants created by any given change is small. Getting to that point has revealed lots of interesting edge-cases, though.

One other difficulty has been parallelization. Cargo-mutants runs hundreds or thousands of incremental builds and test runs, which eats a lot of CPU time. Since each mutant is independent of the others, the tool can test several of them in parallel — even spread across multiple machines. To give an idea of the performance, Pool used numbers from cargo-mutants itself. The crate is 11,000 lines of code, including tests. Building it takes 12 seconds, while running the tests takes 55 seconds. Running cargo-mutants on itself generates 575 mutants (all of which are caught by the test suite, and so are not missed mutants), but because of incremental builds and parallelization, testing all of them only takes about 15 minutes.

Pool finished by comparing cargo-mutants to other testing practices such as code coverage and fuzzing. Code coverage checks that code was executed, he explained. Cargo-mutants checks that the results of the test suite depend on running the code, which is a stronger guarantee. Cargo-mutants also tells you exactly what the uncaught bug is, which can be easier to handle. The downside is that cargo-mutants is significantly slower than existing code-coverage tools. A developer could easily run both, Pool said. Another common point of comparison is fuzzing or property testing. Fuzzing is a good way to find problems by "wiggling the inputs"; cargo-mutants finds problems by "wiggling the code".

There are a lot of tools that purport to help programmers write more correct software. The choice of which ones to use is often a tradeoff between correctness and time. Still, nearly every project has at least some tests — so an easy-to-use tool like cargo-mutants that helps improve a project's test suite without too much additional effort seems like it could be a valuable option.


Index entries for this article
Conference: RustConf/2024



Coverage for important strings

Posted Oct 29, 2024 18:45 UTC (Tue) by iabervon (subscriber, #722) [Link] (12 responses)

If you accidentally typed a character when your cursor was inside a string constant and then saved the file after making a different, correct, change, would your test suite detect the unintended change? Is that true for every string constant that ought to matter in your program? It's often the case that your test suite will run every line of code, but there will be some strings that get put into a table, and other strings in that table are used by tests to affect outcomes, but some of them just get stored as yet another simple HTML tag name with no special effects.

Coverage for important strings

Posted Oct 29, 2024 19:37 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (2 responses)

I've started running `cargo-mutants` on some of my crates and this thought crossed my mind. However, some strings just…don't matter. They might be the base name of a temporary file that just needs to agree in every location. In fact, if something like that *does* trip the test suite, it feels like a phantom invisible dependency that I'd rather move out to a single constant to share among the important places. But using `cargo mutants` to *find* these strings would be nice.

Other strings should certainly *not* be permuted into some test string though (e.g., changing `cp` to `rm` in something doing commands sounds…very risky). Alas, even in Rust, strings end up in many places (usually wrapped in some type-scented decoration for type checking during compilation), but mutation of them is probably something needing typed string literals of some kind.

Coverage for important strings

Posted Oct 31, 2024 16:48 UTC (Thu) by mbp (subscriber, #2737) [Link] (1 responses)

It doesn't mutate strings at the moment, but it's an interesting idea. At least some of them are probably important, although probably many are not. Perhaps for byte strings or C strings it is more likely to be critical that they stay exactly as they are.

> Other strings should certainly *not* be permuted into some test string though (e.g., changing `cp` to `rm` in something doing commands sounds…very risky).

Mutating the program is just unavoidably, intentionally, going to cause new behaviors. If you have a function that, for example, takes a path to delete and with "" deletes your home directory, then cargo mutants might cause that to execute, but so might any other bug. For programs containing code that might cause side effects beyond the program's own execution you probably want to run the tests in a container or VM: https://mutants.rs/limitations.html?highlight=delete#caut...

Coverage for important strings

Posted Nov 4, 2024 13:16 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

I think some way to mark "these strings are content" versus "these strings are control" would be useful. That way things like comment format strings could be mutated but things like `["git", "apply", …]` command arrays would not be. But I'm not sure how to do that without sprinkling attributes everywhere (and wanting expression attributes to boot).

Coverage for important strings

Posted Oct 29, 2024 20:46 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (8 responses)

I'm pretty sure that my tests would not catch a significant proportion of such changes, and that maybe half of those uncaught changes are significant (the others being cases where the string didn't matter, for example it's fine if my panic spells the word "available" wrong)

On the other hand, I always check what I'm about to commit, because I know I will read the changes sometimes, and if they're full of stray garbage that's a problem. Likewise if I turn out to have changed more than I realised, my log message should reflect the actual work, not just the last thing I did before I typed commit. So if I made a correct change but also mangled a string, that string is likely to stick out when examining the git diff.

Coverage for important strings

Posted Oct 29, 2024 21:27 UTC (Tue) by iabervon (subscriber, #722) [Link] (3 responses)

I've hit a key with the cursor in the wrong window while writing my commit message, so I look at the commits after making them, before pushing them anywhere. I'd say it's likely to get my attention, too, but then I worry about just how unlikely I am to miss it, particularly on a day when my typing is subpar.

Coverage for important strings

Posted Oct 30, 2024 12:15 UTC (Wed) by epa (subscriber, #39769) [Link] (2 responses)

That's one thing I would like AI to handle for me: checking the code changes match the commit message. In some cases you don't even need AI; if the commit message is "Whitespace." then the diff should be empty modulo whitespace, and for "Comment." the old and new should be identical after stripping comments.

Coverage for important strings

Posted Oct 31, 2024 9:50 UTC (Thu) by taladar (subscriber, #68407) [Link] (1 responses)

Why would you use AI (presumably a LLM) for something that a few lines of shell script in a pre-commit hook can do for you?

Coverage for important strings

Posted Oct 31, 2024 13:08 UTC (Thu) by epa (subscriber, #39769) [Link]

"Whitespace" and "Comment" yes a shell script can check. More complex but still trivial changes like "Renamed variable" or "Moved code into new subroutine" can be checked automatically but only with some knowledge of the language's syntax and semantics. At this point, rather than writing a parser for each language you might reach for an 80% solution by asking a LLM to work out roughly what's going on and whether there were any accidental code changes. If I had the option to get my commits and commit messages reviewed by AI, I'd certainly try it, although I am reluctant to have the AI write the code itself.

Coverage for important strings

Posted Oct 29, 2024 23:30 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

"Every developer should be as rigorous as me with such things." Alas, a team of size one does not typically scale so well.

I'd be interested to also drive mutation two ways: CI first smoke tests those that have been discovered to be sensitive to the code in question (you still want the full test suite eventually). Also, run the tests under mutation of the patched code to find issues. I think it's a tool in the toolbox and, with some plumbing between git and CI, could be very powerful to help find issues before maintainer time is spent looking at diffs.

Coverage for important strings

Posted Oct 31, 2024 17:00 UTC (Thu) by mbp (subscriber, #2737) [Link]

There are some features for running under CI, including focusing on the changed code: https://mutants.rs/ci.html

It does not yet comment on your PRs, but I'd like to add that.

Coverage for important strings

Posted Oct 30, 2024 8:56 UTC (Wed) by taladar (subscriber, #68407) [Link]

> for example it's fine if my panic spells the word "available" wrong

There is a tool called https://crates.io/crates/typos-cli that you can put into your pre-commit hook to avoid committing mistakes like that. The main issue with it is that it only supports English so if you have other languages in your source code you might get false positives (but there are ways to add to the list of words accepted or to exclude files).

Coverage for important strings

Posted Oct 31, 2024 23:09 UTC (Thu) by himi (subscriber, #340) [Link]

`git commit -v` is very useful there, though I find it's easier to skip re-checking the diff at that point if it's too long and involved . . . Which has the side-effect of encouraging me to make my commits smaller, so I don't have that niggling sense of discomfort from seeing the diff and going "naah, not today - tl;dr"

I've also found gitui a useful tool for giving me a different view of what I've changed - I rarely end up using it to drive the actual commit process (I use it more often as a friendly way of doing `git add -i`), but it gives a different perspective which can help get past that tl;dr issue.

For the specific case of mutating strings in the code, this seems like a situation where fuzzing would be useful - it'd need to be aware of things like format string mini-languages as well as potentially the distinction between a "plain" string and something like a path (in Python you could target mutations at pathlib.Path objects, for example), but just throwing mutations at any constant/hard-coded string value would be a useful starting point. Though you'd need to run your tests in a sandboxed environment, since the semantics of `subprocess.run(["nm", objfile])` and `subprocess.run(["rm", objfile])` are kind of different . . .

podcast interview about mutation testing at Google for a variety of languages

Posted Oct 30, 2024 4:38 UTC (Wed) by alison (subscriber, #63752) [Link] (1 responses)

Software Engineering recently posted a fine article about automated mutation-testing tools:

https://se-radio.net/2024/09/se-radio-632-goran-petrovic-on-mutation-testing-at-google/

podcast interview about mutation testing at Google for a variety of languages

Posted Nov 1, 2024 10:22 UTC (Fri) by k3ninho (subscriber, #50375) [Link]

Thank you for this, this is a strategy I'd not considered feasible against the explosion of test cases that an all-branches coverage report can cause (even if you pack together tests so that there's orthogonal coverage of two or more areas in one test case without an early failure clouding the validity of a later negative test result).

I think this conversation hides an elephant, too. There's a difference in approach between the London Style and the Detroit Style for an integrated test suite[1]. The London 'Extreme Tuesday Club' is an ongoing meetup founded in the 2000s to discuss Extreme Programming, while the Detroit style comes from the wiki.c2.com community that sprang up around Kent Beck's development of Extreme Programming (and Agile) while working on a project for Chrysler in Detroit.

London Style breaks the test scope into everything that has an interface -- and assumes a stable code base, possibly a practice where legacy interfaces don't change but implementation details might -- so there's scope to test all the viable code paths. (Plus you get to reject bad data when it's an unexpected use of the existing code.) There can be a reluctance to rework interfaces because you also have to keep a test suite 'always green' for no feature gain -- the stability of the code can ossify the project.

Detroit Style is a scaffold to a changing code base, or changing interfaces, so you have only relevant test code for the things you want to protect from regressions. Investing in mutation cases for a fluid code-base might not be worth the time, but the fluidity of the Linux code-base shows that change is ongoing and we need to run test cases to protect against regressions.

These approaches probably should overlap, but 'should is considered harmful' for the number of unchecked assumptions it brings to the conversation.

1: https://zone84.tech/architecture/london-and-detroit-schoo...

K3n.

Experience from Google

Posted Oct 30, 2024 13:24 UTC (Wed) by bjackman (subscriber, #109548) [Link] (1 responses)

I had a pretty positive experience with the mutation testing capabilities that Google's internal Monorepo has for Go code. I think it's a pretty useful technology.

The main weakness I found with it is that early on, it would produce a lot of mutants that were pretty uninteresting. For example, it would say "here you're handling an error from this IO operation, but if you were to ignore the error no tests would fail". It gave me the impression that getting really total test coverage would come with a quite unfavourable cost/benefit tradeoff: maybe you need really fiddly error injection logic to get tests that would detect those bugs, and maybe those bugs are just really unlikely because it's quite hard to forget to handle IO errors if you have a reasonable coding style.

I think the solution Google adopted for that problem was just to have heuristics about "interesting" mutations, because I stopped having that problem a while after the feature was introduced.

Experience from Google

Posted Oct 31, 2024 16:57 UTC (Thu) by mbp (subscriber, #2737) [Link]

I also enjoyed using that system at Google and it inspired me to work on cargo-mutants.

There is some challenge in finding interesting mutants and avoiding uninteresting results. I wrote a bit about that here: https://mutants.rs/goals.html

Untested error handling code is interesting. The specifics of error handling and the defaults if you do nothing will be quite different between Go and Rust, of course. But, more broadly, it's not uncommon to write some error handling code but find it hard to test.

I do really sympathize that it can be annoying to test, and you might have to build or hook up some whole error-injection framework to make the code reachable. (https://docs.rs/failpoints/ in Rust looks good, for example.)

On the other hand, if you don't have any tests, how do you know the error handling code is correct? (I'm talking about code more complex than just "log a message and return the error.") I've found it's actually surprisingly easy for it to be wrong, and a bit harder to reason about because it's departing from the normal path of the program. And this is not just a subjective impression: https://www.eecg.utoronto.ca/~yuan/papers/failure_analysi... found that a lot of critical failures in distributed systems were due to bugs in error handling code. If it's not tested, why not just panic or return the error?

This is automating how developers should write tests in the first place

Posted Oct 31, 2024 16:15 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> The problem is that tests themselves can be buggy. Sometimes they just don't cover some area of the code; sometimes they do call the code, but don't verify the output strictly enough.

Even before looking at automated mutations (which looks awesome!), how do you "test a (new) test"? By simply, manually and temporarily... breaking / "mutating" the production code it is supposed to test. That makes your test fail, and now you know your new test does at least one useful thing and does not just waste time. Yet I've seen vast numbers of pointless, untested tests that never, ever failed in their entire life! If it's not tested then it does not work: also true with test code itself.

I understand the "laziness" of not writing enough tests, but why would anyone waste both human and computer time writing untested tests? Because in many companies/projects, validation code is not considered as "noble" as production code and delegated to less experienced and less valued contractors/interns/etc. who don't care much and are not compensated based on proper quality metrics but based on some checkboxes instead. Spoiler alert: these projects have bigger, non-technical quality issues and won't be using cargo-mutants or anything like it any time soon.

Conversely: keeping good test suites and test infrastructure closed-source can be an attempt to stay ahead in a very competitive, open-source market.

This is automating how developers should write tests in the first place

Posted Oct 31, 2024 18:51 UTC (Thu) by mbp (subscriber, #2737) [Link]

> By simply, manually and temporarily... breaking / "mutating" the production code it is supposed to test. That makes your test fail, and now you know your new test does at least one useful thing and does not just waste time. Yet I've seen vast numbers of pointless, untested tests that never, ever failed in their entire life! If it's not tested then it does not work: also true with test code itself.

Yes, exactly! Doing this fairly often finds the test doesn't really check what you think it should.

Even if you do this manually when adding new tests, you probably don't do it to existing tests every time you change the code or the tests. Automating it lets you continuously test this property, although admittedly with less human intelligence about how to break the code -- but on the other hand without carrying over unreliable assumptions about the "right" bug to introduce.

It's like shifting from occasional manual ad-hoc testing to automated unit tests.

Dangerous mutations

Posted Oct 31, 2024 18:52 UTC (Thu) by jrtc27 (subscriber, #107748) [Link] (1 responses)

I was going to ask how this would handle testing mutating functions that are inherently dangerous if you get them wrong, such as those that implement, or call functions that implement, deleting things on disk, but I see that the linked-to documentation already calls this out (https://mutants.rs/limitations.html#caution-on-side-effects). I wonder though if this should be presented up-front rather than tacked on as the final point in a Limitations section near the end of the documentation which most people thus won't bother reading?

Dangerous mutations

Posted Nov 4, 2024 19:02 UTC (Mon) by mbp (subscriber, #2737) [Link]

Yes, I think I'll move that closer to the start of the doc.

However, I also think people might focus on this too much in mutation testing, when all the same bugs could occur organically in your codebase. If your codebase does inherently dangerous stuff you should run it on a VM or at least a machine of which you have very strong backups. Don't run tests with credentials that could delete the production database!


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds