A new approach to validating test suites
The first program that Martin Pool ever wrote, he said, had bugs; the ones he's writing now most likely have bugs too. The talk Pool gave at RustConf this year was about a way to try to write programs with fewer bugs. He has developed a tool called cargo-mutants that highlights gaps in test coverage by identifying functions that can be broken without causing any tests to fail. This can be a valuable complement to other testing techniques, he explained.
There are lots of ways to write programs with fewer bugs, Pool said. Many people at RustConf probably jump immediately to thinking of type systems or lifetime annotations. Those tools are great, and can eliminate entire classes of bugs, but they can't help with semantic bugs, where the programmer tells the computer to do the wrong thing, he said. The best way to catch those bugs is to write tests.
Tests don't catch everything either — "I do write tests, and I still have bugs" — but they help. Sometimes, however, "you look at a bug and ask, how was this not caught?". The problem is that tests themselves can be buggy.
Sometimes they just don't cover some area of the code; sometimes they do
call the code, but don't verify the output strictly enough. Sometimes the tests
were comprehensive when they were written, but now the program has grown and the
tests no longer cover everything. A significant fraction of bugs could, in
principle, be caught by tests, Pool said. It would be nice if there were a tool
that could point out where tests are insufficient, so that we can avoid
introducing new bugs into existing software.
In order to explain how cargo-mutants does that, Pool first briefly covered the
process of writing tests: the programmer starts out knowing what the program
should do, and then tries to write tests that tie what it should do to what it
actually does. That tie should be robust — the tests should check all of the
properties the programmer cares about, and they should fail if those properties
don't hold. Good tests should also generalize, since exhaustively testing every
possible behavior is impossible. So one way to think about finding weaknesses in a set of tests is to ask "what behaviors does the program have that are not checked by any tests?"
That is exactly what cargo-mutants does — it uncovers code that is buggy, unnecessary, underspecified, or just hard to test, by finding gaps in the test suite. The core algorithm is to make a copy of a project's source code, deliberately insert a bug, run the test suite, and expect the tests to fail. If they don't, the tool reports a "missed mutant" — that is, a buggy version of the program that the tests would not catch, and that needs some human attention to figure out the best way to address it.
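That loop can be sketched in a few lines. In this toy version (illustrative only, not the real cargo-mutants implementation), the "program" under test is a binary function, the mutants are hand-written variants, and the test suite is deliberately weak:

```rust
// A function-pointer type standing in for "the function under test".
type BinOp = fn(u32, u32) -> u32;

// A weak test suite: a single input pair that cannot tell `+` from `*`.
fn suite_passes(f: BinOp) -> bool {
    f(2, 2) == 4
}

// Run the suite against each mutant; if the tests still pass, the
// mutant was "missed" and deserves human attention.
fn missed_mutants(mutants: &[(&'static str, BinOp)]) -> Vec<&'static str> {
    mutants
        .iter()
        .filter(|(_, m)| suite_passes(*m))
        .map(|(desc, _)| *desc)
        .collect()
}
```

With `a + b` as the original function, the mutant `a * b` is missed, since the suite's only check, `f(2, 2) == 4`, cannot distinguish the two, while mutants like `a - b` or a constant `0` body are caught.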
Pool gave a specific example of a bug that was found by cargo-mutants. In one of his programs, which expected to receive symbolic links and needed to handle them, he wrote an is_symlink() function. Cargo-mutants showed that if the body of the function were replaced with true, the tests did not catch it — or, in other words, the tests didn't cover the possibility of something not being a symbolic link. This is actually pretty common, Pool said, since humans often forget to write negative test cases. In any case, he added a test and immediately found that the original implementation was wrong as well.
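Pool didn't show his code, but the shape of the gap is easy to reproduce. Here is a minimal, self-contained stand-in (an is_even() predicate rather than is_symlink(), to avoid needing a filesystem), with a suite that only checks positive cases:

```rust
// Stand-in for Pool's is_symlink() example: a predicate whose test
// suite only exercises inputs where the answer is `true`.
fn is_even(n: u32) -> bool {
    n % 2 == 0
}

// This suite passes for the real body, but also for the mutant that
// cargo-mutants would try: `fn is_even(_: u32) -> bool { true }`.
fn positive_cases_only() {
    assert!(is_even(4));
    assert!(is_even(10));
}

// One negative case is enough to make the `true` mutant fail.
fn negative_case() {
    assert!(!is_even(3));
}
```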
For a somewhat larger example, he selected the semver crate, which provides functions for working with semantic version numbers. He isn't involved in the semver project, so it represents an unbiased test. He ran cargo-mutants on it, and found a handful of places where the test suite could be improved. This is typical of what is found among high-quality crates, Pool said. In semver's case, if the hashing code was broken, the tests would still pass. It's a good crate — not buggy — but there are still some gaps in the coverage that could make it easier to unknowingly introduce bugs.
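The hashing gap is easy to picture in miniature: a type that implements Hash, but a suite that never actually hashes a value. A test along the following lines (purely illustrative, not semver's actual types or tests) would fail if a mutation gutted the hashing code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative version type with a derived Hash implementation.
#[derive(Hash, PartialEq, Eq, Debug)]
struct Version {
    major: u64,
    minor: u64,
    patch: u64,
}

// Hash a value with the standard library's default hasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Asserting that equal versions hash equally, and that versions differing in one field (almost certainly) don't, exercises the Hash implementation directly, so a mutant that stubs it out no longer passes unnoticed.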
That's not to say that every single missed mutant is necessarily a problem, Pool
explained. People have limited time; they should focus on the most important
things, instead of spending all of their time polishing their test suite. The
findings from cargo-mutants are just things to think about. "The computer is not the boss of you." So what Pool recommends is skimming the list of missed mutants, looking for code that is important, or that should be thoroughly tested. Then, one should consider whether the introduced mutation is really incorrect, whether the original code was actually correct, and what test should have caught it.
Ultimately, it's important not to stress too much about the mutants the tool finds, Pool said. Cargo-mutants generates synthetic bugs — but the goal is always to catch and prevent real bugs. Pool compared cargo-mutants to vaccines: deliberately introducing a weakened version of a problem in a controlled way, so that you can beef up the system that is supposed to catch the real problem.
With the audience hopefully convinced to at least give cargo-mutants a try, Pool switched to talking about how the tool works under the hood. The most interesting part, he said, is actually generating synthetic bugs. The generated bugs have a few requirements: they should be valid Rust code, they should point to real gaps, and they shouldn't be too repetitive. They should also, ideally, be generated deterministically, because that makes it easier to reproduce them. There's lots of existing research on how to do this, he said, but it turns out that small, local changes to the code are both easy to generate and close to what could really change in the codebase.
So cargo-mutants uses cargo-metadata and Syn to find and parse the Rust code in a project. For each Rust module, it tries pattern matching on the syntax tree and applying two different kinds of changes: replacing a whole function with a stub, and changing arithmetic operations. In the future, the tool could add more, but just those two turn out to be "really powerful". When replacing a function, it generates likely return values in a type-driven way — even functions returning complex, nested types can be stubbed out, because complicated values can be built up from their constituent parts.
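As a sketch of what those two kinds of change look like (illustrative, not literal cargo-mutants output), consider a small function with both a non-trivial return type and some arithmetic:

```rust
// A function that both mutation kinds could target.
fn mean(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    Some(values.iter().sum::<f64>() / values.len() as f64)
}

// A stub mutation replaces the whole body with a value built from the
// return type, e.g.:
//     fn mean(values: &[f64]) -> Option<f64> { None }
//     fn mean(values: &[f64]) -> Option<f64> { Some(0.0) }
// Option<f64> is generated from its parts: candidate f64 values,
// wrapped in Some, plus the None variant. An arithmetic mutation
// instead changes one operator in place, e.g. the `/` above to `*`.
```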
Pool has been working on cargo-mutants for a few years. He runs it on his own crates, and also on other open-source crates. With that experience, he can say that getting to complete coverage of every mutant can be a bit of work, because it points out gaps that can be hard to address. It's tempting to suppress a mutant, but sometimes they matter. Mostly, the tool finds gaps in the tests, but occasionally it finds a real bug as well, he said. The most difficult cases are when there is a test for something, but the test is flaky. When introducing cargo-mutants to a new project, he thinks it is easiest to turn it on for one file, module, or crate at a time, and gradually expand through the code base.
Cargo-mutants has 100% coverage of itself, naturally. Once you get to that level, Pool said, it's pretty easy to stay there, since the list of new mutants created by any given change is small. Getting to that point has revealed lots of interesting edge-cases, though.
One other difficulty has been parallelization. Cargo-mutants runs hundreds or thousands of incremental builds and tests, which eats a lot of CPU time. Since the tests of each mutant are independent, the tool can run multiple tests in parallel — even spread across multiple machines. To give an idea of the performance, Pool used numbers from cargo-mutants itself. The crate is 11,000 lines of code, including tests. Building it takes 12 seconds, while running the tests takes 55 seconds. Running cargo-mutants on itself generates 575 mutants (all of which are covered by the test suite, and so are not missed mutants), but because of incremental builds and parallelization, running the tests on all of them only takes about 15 minutes.
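Because the per-mutant runs share no state, the scheduling is embarrassingly parallel. A minimal sketch with standard-library threads (a stand-in for the real scheduler, which runs whole builds in separate directories):

```rust
use std::thread;

// Check each mutant's test suite on its own thread. A mutant whose
// suite still passes is a missed mutant.
fn missed_in_parallel(
    mutants: Vec<(&'static str, fn() -> bool)>,
) -> Vec<&'static str> {
    let handles: Vec<_> = mutants
        .into_iter()
        .map(|(name, suite_passes)| thread::spawn(move || (name, suite_passes())))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("test thread panicked"))
        .filter(|&(_, passed)| passed)
        .map(|(name, _)| name)
        .collect()
}
```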
Pool finished by comparing cargo-mutants to other testing practices such as code
coverage and fuzzing. Code coverage checks that code was executed, he explained.
Cargo-mutants checks that the results of the test suite depend on running the
code, which is a stronger guarantee. Cargo-mutants also tells you exactly what
the uncaught bug is, which can be easier to handle. The downside is that
cargo-mutants is significantly slower than existing code-coverage tools. A
developer could easily run both, Pool said.
Another common point of comparison is fuzzing or property testing. Fuzzing is a good way to find problems by "wiggling the inputs"; cargo-mutants finds problems by "wiggling the code".
There are a lot of tools that purport to help programmers write more correct software. The choice of which ones to use is often a tradeoff between correctness and time. Still, nearly every project has at least some tests — so an easy-to-use tool like cargo-mutants that helps improve a project's test suite without too much additional effort seems like it could be a valuable option.
| Index entries for this article | |
|---|---|
| Conference | RustConf/2024 |
