A new approach to validating test suites
The first program that Martin Pool ever wrote, he said, had bugs; the ones he's writing now most likely have bugs too. The talk Pool gave at RustConf this year was about a way to try to write programs with fewer bugs. He has developed a tool called cargo-mutants that highlights gaps in test coverage by identifying functions that can be broken without causing any tests to fail. This can be a valuable complement to other testing techniques, he explained.
There are lots of ways to write programs with fewer bugs, Pool said. Many people at RustConf probably jump immediately to thinking of type systems or lifetime annotations. Those tools are great, and can eliminate entire classes of bugs, but they can't help with semantic bugs, where the programmer tells the computer to do the wrong thing, he said. The best way to catch those bugs is to write tests.
Tests don't catch everything either — "I do write tests, and I still have bugs" — but they help. Sometimes, however, "you look at a bug and ask, how was this not caught?". The problem is that tests themselves can be buggy.
Sometimes they just don't cover some area of the code; sometimes they do
call the code, but don't verify the output strictly enough. Sometimes the tests
were comprehensive when they were written, but now the program has grown and the
tests no longer cover everything. A significant fraction of bugs could, in
principle, be caught by tests, Pool said. It would be nice if there were a tool
that could point out where tests are insufficient, so that we can avoid
introducing new bugs into existing software.
In order to explain how cargo-mutants does that, Pool first briefly covered the
process of writing tests: the programmer starts out knowing what the program
should do, and then tries to write tests that tie what it should do to what it
actually does. That tie should be robust — the tests should check all of the
properties the programmer cares about, and they should fail if those properties
don't hold. Good tests should also generalize, since exhaustively testing every
possible behavior is impossible. So one way to think about finding weaknesses in
a set of tests is to ask "what behaviors does the program have that are not checked by any tests?"
That is exactly what cargo-mutants does — it uncovers code that is buggy, unnecessary, underspecified, or just hard to test, by finding gaps in the test suite. The core algorithm is to make a copy of a project's source code, deliberately insert a bug, run the test suite, and expect the tests to fail. If they don't, the tool reports a "missed mutant": a buggy version of the program that the tests would not catch, and that needs some human attention to figure out the best way to address it.
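In Rust-flavored pseudocode, that loop might look something like the following. This is a simplified sketch for illustration only, not cargo-mutants' actual implementation: the work of applying a mutation, rebuilding, and running the tests is collapsed into a caller-supplied closure.

```rust
// Simplified sketch of the mutation-testing loop. In the real tool, the
// "does the test suite catch this mutant?" step copies the source tree,
// applies the mutation with syn, and shells out to `cargo build`/`cargo test`.
struct Mutant {
    description: String, // e.g. "replace is_symlink -> bool with true"
}

fn find_missed_mutants<F>(mutants: Vec<Mutant>, mut tests_catch: F) -> Vec<Mutant>
where
    F: FnMut(&Mutant) -> bool, // true if the suite fails with the mutant applied
{
    let mut missed = Vec::new();
    for mutant in mutants {
        if !tests_catch(&mutant) {
            // The tests still pass with a deliberate bug in place,
            // so this mutant needs human attention.
            missed.push(mutant);
        }
    }
    missed
}

fn main() {
    let mutants = vec![Mutant {
        description: "replace answer -> u64 with 0".into(),
    }];
    // Pretend the suite catches nothing, so every mutant is reported as missed.
    for m in find_missed_mutants(mutants, |_m| false) {
        println!("missed mutant: {}", m.description);
    }
}
```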
Pool gave a specific example of a bug that was found by cargo-mutants. In one of his programs, which expected to receive symbolic links and needed to handle them, he wrote an is_symlink() function. Cargo-mutants showed that if the body of the function were replaced with true, the tests did not catch it — or, in other words, the tests didn't cover the possibility of something not being a symbolic link. This is actually pretty common, Pool said, since humans often forget to write negative test cases. In any case, he added a test and immediately found that the original implementation was wrong as well.
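Pool's actual code was not shown, but the shape of the problem is easy to reconstruct. The sketch below is illustrative only: the function name comes from the talk, while the body and the negative test are invented.

```rust
use std::fs;
use std::path::Path;

// Illustration only: a helper of the kind described in the talk.
fn is_symlink(path: &Path) -> bool {
    fs::symlink_metadata(path)
        .map(|meta| meta.file_type().is_symlink())
        .unwrap_or(false)
}

// cargo-mutants would try replacing the body with `true` (and with `false`).
// A suite that only ever feeds the function symlinks cannot tell the
// difference; the negative case below is what catches the `true` mutant.
#[cfg(test)]
mod tests {
    use super::is_symlink;
    use std::path::Path;

    #[test]
    fn regular_file_is_not_a_symlink() {
        // Cargo.toml exists in any Cargo project and is a regular file.
        assert!(!is_symlink(Path::new("Cargo.toml")));
    }
}
```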
For a somewhat larger example, he selected the semver crate, which provides functions for working with semantic version numbers. He isn't involved in the semver project, so it represents an unbiased test. He ran cargo-mutants on it, and found a handful of places where the test suite could be improved. This is typical of what is found among high-quality crates, Pool said. In semver's case, if the hashing code was broken, the tests would still pass. It's a good crate — not buggy — but there are still some gaps in the coverage that could make it easier to unknowingly introduce bugs.
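The talk did not show exactly which assertions semver was missing, but a test of the kind that would close such a gap might look like the following sketch. It assumes, as the finding implies, that semver's Version type implements the standard Hash trait; it is not taken from the semver test suite.

```rust
// Hedged sketch: a test that would notice a broken (e.g. stubbed-out) Hash
// implementation for semver::Version. Requires the `semver` crate.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use semver::Version;

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

#[test]
fn equal_versions_hash_equally() {
    let a = Version::parse("1.2.3").unwrap();
    let b = Version::parse("1.2.3").unwrap();
    assert_eq!(a, b);
    assert_eq!(hash_of(&a), hash_of(&b));
    // Distinct versions should almost certainly hash differently, so a
    // constant-value Hash mutant would be caught here (ignoring the
    // astronomically unlikely collision).
    assert_ne!(hash_of(&a), hash_of(&Version::parse("9.9.9").unwrap()));
}
```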
That's not to say that every single missed mutant is necessarily a problem, Pool
explained. People have limited time; they should focus on the most important
things, instead of spending all of their time polishing their test suite. The
findings from cargo-mutants are just things to think about. "The computer is not the boss of you." So what Pool recommends is skimming the list of missed
mutants, looking for code that is important, or that should be
thoroughly tested. Then, one should consider whether the introduced mutation is
really incorrect, whether the original code was actually correct, and what test
should have caught this.
Ultimately, it's important not to stress too much about the mutants the tool finds, Pool said. Cargo-mutants generates synthetic bugs — but the goal is always to catch and prevent real bugs. Pool compared cargo-mutants to vaccines: deliberately introducing a weakened version of a problem in a controlled way, so that you can beef up the system that is supposed to catch the real problem.
With the audience hopefully convinced to at least give cargo-mutants a try, Pool switched to talking about how the tool works under the hood. The most interesting part, he said, is actually generating the synthetic bugs. They have to meet a few requirements: they should be valid Rust code, they should point to real gaps, and they should not be too repetitive. They should also, ideally, be generated deterministically, because that makes them easier to reproduce. There is lots of existing research on how to do this, he said, but it turns out that small, local changes to the code are both easy to generate and close to what could really change in the codebase.
So cargo-mutants uses cargo-metadata and syn to find and parse the Rust code in a project. For each Rust module, it pattern-matches on the syntax tree and applies two different kinds of changes: replacing a whole function with a stub, and changing arithmetic operations. More could be added in the future, but just those two turn out to be "really powerful". When replacing a function, the tool generates likely return values in a type-driven way; even quite complicated nested types can be generated by first building their constituent parts, so functions returning complex types can still be stubbed out.
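As a rough illustration of the syn-based approach, the sketch below parses a source file, finds each free function, and swaps its body for a stub. This is not cargo-mutants' own code: it only handles the function-stub case, uses a single catch-all stub value instead of type-driven generation, and assumes syn with its "full" feature plus the quote crate in Cargo.toml.

```rust
use quote::quote;
use syn::{parse_quote, Block, File, Item};

// Parse Rust source and replace the body of every top-level function.
fn stub_all_functions(source: &str) -> syn::Result<File> {
    let mut file: File = syn::parse_file(source)?;
    for item in &mut file.items {
        if let Item::Fn(func) = item {
            // A real tool would pick likely values (true/false, 0, vec![],
            // Ok(..), and so on) based on the function's return type rather
            // than this one-size-fits-all stub.
            let stub: Block = parse_quote!({ Default::default() });
            *func.block = stub;
        }
    }
    Ok(file)
}

fn main() -> syn::Result<()> {
    let mutated = stub_all_functions("fn answer() -> u64 { 6 * 7 }")?;
    // Print the mutated code as a token stream.
    println!("{}", quote!(#mutated));
    Ok(())
}
```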
Pool has been working on cargo-mutants for a few years. He runs it on his own crates, and also on other open-source crates. With that experience, he can say that getting to complete coverage of every mutant can be a bit of work, because the tool points out gaps that can be hard to address. It's tempting to suppress such mutants, but sometimes they matter. Mostly, the tool finds gaps in the tests, but occasionally it finds a real bug as well, he said. The most difficult cases are when there is a test for something, but the test is flaky. When introducing cargo-mutants to a new project, he thinks it is easiest to turn it on for one file, module, or crate at a time, and gradually expand through the code base.
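For functions that genuinely should not be mutated (say, code whose only job is to log a message), the cargo-mutants documentation describes an opt-out attribute provided by a small companion mutants crate. A hedged sketch, in case the exact spelling has changed between versions:

```rust
// Hedged sketch, based on the cargo-mutants book: marking a function so the
// tool skips it. Requires a dependency on the `mutants` helper crate.
#[mutants::skip]
fn print_startup_banner() {
    eprintln!("starting up");
}
```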
Cargo-mutants has 100% coverage of itself, naturally. Once you get to that level, Pool said, it's pretty easy to stay there, since the list of new mutants created by any given change is small. Getting to that point has revealed lots of interesting edge-cases, though.
One other difficulty has been parallelization. Cargo-mutants runs hundreds or thousands of incremental builds and test runs, which eats a lot of CPU time. Since each mutant's tests are independent of the others, the tool can run them in parallel — even spread across multiple machines. To give an idea of the performance, Pool used numbers from cargo-mutants itself. The crate is 11,000 lines of code, including tests. Building it takes 12 seconds, while running the tests takes 55 seconds. Running cargo-mutants on itself generates 575 mutants (all of which are covered by the test suite, and so are not missed mutants), but because of incremental builds and parallelization, testing all of them only takes about 15 minutes.
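Because each mutant is tested on its own, the scheduling itself is simple. A minimal sketch (again, not the tool's actual code) of farming mutants out to a fixed number of worker threads:

```rust
// Hedged sketch: a fixed pool of worker threads, each repeatedly taking the
// next mutant off a shared counter and testing it in isolation.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn test_mutants_in_parallel(mutants: &[String], jobs: usize) {
    let next = AtomicUsize::new(0);
    thread::scope(|scope| {
        for _ in 0..jobs {
            scope.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(mutant) = mutants.get(i) else { break };
                // In the real tool this step copies the tree, applies the
                // mutation, and runs `cargo build` and `cargo test`.
                println!("testing {mutant}");
            });
        }
    });
}

fn main() {
    let mutants: Vec<String> = (0..8).map(|n| format!("mutant #{n}")).collect();
    test_mutants_in_parallel(&mutants, 4);
}
```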
Pool finished by comparing cargo-mutants to other testing practices such as code
coverage and fuzzing. Code coverage checks that code was executed, he explained.
Cargo-mutants checks that the results of the test suite depend on running the
code, which is a stronger guarantee. Cargo-mutants also tells you exactly what
the uncaught bug is, which can be easier to handle. The downside is that
cargo-mutants is significantly slower than existing code-coverage tools. A
developer could easily run both, Pool said.
Another common point of comparison is fuzzing or property testing.
Fuzzing is a good way to find problems by "wiggling the inputs"; cargo-mutants finds problems by "wiggling the code".
There are a lot of tools that purport to help programmers write more correct software. The choice of which ones to use is often a tradeoff between correctness and time. Still, nearly every project has at least some tests — so an easy-to-use tool like cargo-mutants that helps improve a project's test suite without too much additional effort seems like it could be a valuable option.
| Index entries for this article | |
|---|---|
| Conference | RustConf/2024 |
Coverage for important strings
Posted Oct 29, 2024 18:45 UTC (Tue)
by iabervon (subscriber, #722)
[Link] (12 responses)
Posted Oct 29, 2024 19:37 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
Other strings should certainly *not* be permuted into some test string though (e.g., changing `cp` to `rm` in something doing commands sounds…very risky). Alas, even in Rust, strings end up in many places (usually wrapped in some type-scented decoration for type checking during compilation), but mutation of them is probably something needing typed string literals of some kind.
Posted Oct 31, 2024 16:48 UTC (Thu)
by mbp (subscriber, #2737)
[Link] (1 responses)
> Other strings should certainly *not* be permuted into some test string though (e.g., changing `cp` to `rm` in something doing commands sounds…very risky).
Mutating the program is just unavoidably, intentionally, going to cause new behaviors. If you have a function that, for example, takes a path to delete and with "" deletes your home directory, then cargo mutants might cause that to execute, but so might any other bug. For programs containing code that might cause side effects beyond the program's own execution you probably want to run the tests in a container or VM: https://mutants.rs/limitations.html?highlight=delete#caut...
Posted Nov 4, 2024 13:16 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
Posted Oct 29, 2024 20:46 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link] (8 responses)
On the other hand, I always check what I'm about to commit, because I know I will read the changes sometimes, and if they're full of stray garbage that's a problem. Likewise if I turn out to have changed more than I realised, my log message should reflect the actual work, not just the last thing I did before I typed commit. So if I made a correct change but also mangled a string, that string is likely to stick out when examining the git diff.
Posted Oct 29, 2024 21:27 UTC (Tue)
by iabervon (subscriber, #722)
[Link] (3 responses)
Posted Oct 30, 2024 12:15 UTC (Wed)
by epa (subscriber, #39769)
[Link] (2 responses)
Posted Oct 31, 2024 9:50 UTC (Thu)
by taladar (subscriber, #68407)
[Link] (1 responses)
Posted Oct 31, 2024 13:08 UTC (Thu)
by epa (subscriber, #39769)
[Link]
Posted Oct 29, 2024 23:30 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
I'd be interested to also drive mutation two ways: CI first smoke tests those that have been discovered to be sensitive to the code in question (you still want the full test suite eventually). Also, run the tests under mutation of the patched code to find issues. I think it's a tool in the toolbox and, with some plumbing between git and CI, could be very powerful to help find issues before maintainer time is spent looking at diffs.
Posted Oct 31, 2024 17:00 UTC (Thu)
by mbp (subscriber, #2737)
[Link]
It does not yet comment on your PRs, but I'd like to add that.
Posted Oct 30, 2024 8:56 UTC (Wed)
by taladar (subscriber, #68407)
[Link]
There is a tool called https://crates.io/crates/typos-cli that you can put into your pre-commit hook to avoid committing mistakes like that. The main issue with it is that it only supports English so if you have other languages in your source code you might get false positives (but there are ways to add to the list of words accepted or to exclude files).
Posted Oct 31, 2024 23:09 UTC (Thu)
by himi (subscriber, #340)
[Link]
I've also found gitui a useful tool for giving me a different view of what I've changed - I rarely end up using it to drive the actual commit process (I use it more often as a friendly way of doing `git add -i`), but it gives a different perspective which can help get past that tl;dr issue.
For the specific case of mutating strings in the code, this seems like a situation where fuzzing would be useful - it'd need to be aware of things like format string mini-languages as well as potentially the distinction between a "plain" string and something like a path (in Python you could target mutations at pathlib.Path objects, for example), but just throwing mutations at any constant/hard-coded string value would be a useful starting point. Though you'd need to run your tests in a sandboxed environment, since the semantics of `subprocess.run(["nm", objfile])` and `subprocess.run(["rm", objfile])` are kind of different . . .
podcast interview about mutation testing at Google for a variety of languages
Posted Oct 30, 2024 4:38 UTC (Wed)
by alison (subscriber, #63752)
[Link] (1 responses)
https://se-radio.net/2024/09/se-radio-632-goran-petrovic-on-mutation-testing-at-google/
Posted Nov 1, 2024 10:22 UTC (Fri)
by k3ninho (subscriber, #50375)
[Link]
I think this conversation hides an elephant, too. There's a difference in approach between the London Style and the Detroit Style for an integrated test suite[1]. The London 'Extreme Tuesday Club' is an ongoing meetup founded in the 2000's to discuss Extreme Programming, while the wiki.c2.com community that sprung up around Kent Beck's development of Extreme Programming (and Agile) while working a project for Chrysler in Detroit.
London Style breaks the test scope into everything that has an interface -- and assumes a stable code base, possibly a practice where legacy interfaces don't change but implementation details might -- so there's scope to test all the viable code paths. (Plus you get to reject bad data when it's an unexpected use of the existing code.) There can be a reluctance to rework interfaces because you also have to keep a test suite 'always green' for no feature gain -- the stability of the code can ossify the project.
Detroit Style is a scaffold to a changing code base, or changing interfaces, so you have only relevant test code for the things you want to protect from regressions. Investing in mutation cases for a fluid code-base might not be worth the time, but the fluidity of the Linux code-base shows that change is ongoing and we need to run test cases to protect against regressions.
These approaches probably should overlap, but 'should is considered harmful' for the number of unchecked assumptions it brings to the conversation.
1: https://zone84.tech/architecture/london-and-detroit-schoo...
K3n.
Experience from Google
Posted Oct 30, 2024 13:24 UTC (Wed)
by bjackman (subscriber, #109548)
[Link] (1 responses)
The main weakness I found with it is that early on, it would produce a lot of mutants that were pretty un-interesting. For example, it would say "here you're handling an error from this IO operation, but if you were to ignore the error no tests would fail". It gave me the impression to get really total test coverage you'd have a quite unfavourable cost/benefit tradeoff: maybe you need really fiddly error injection logic to get tests that would detect those bugs, and maybe those bugs are just really unlikely because it's quite hard to forget to handle IO errors if you have a reasonable coding style.
I think the solution Google adopted for that problem was just to have heuristics about "interesting" mutations, because I stopped having that problem a while after the feature was introduced.
Posted Oct 31, 2024 16:57 UTC (Thu)
by mbp (subscriber, #2737)
[Link]
There is some challenge in finding interesting mutants and avoiding uninteresting results. I wrote a bit about that here: https://mutants.rs/goals.html
Untested error handling code is interesting. The specifics of error handling and the defaults if you do nothing will be quite different between Go and Rust, of course. But, more broadly, it's not uncommon to write some error handling code but find it hard to test.
I do really sympathize that it can be annoying to test, and you might have to build or hook up some whole error-injection framework to make the code reachable. (https://docs.rs/failpoints/ in Rust looks good, for example.)
On the other hand, if you don't have any tests, how do you know the error handling code is correct? (I'm talking about code more complex than just "log a message and return the error.") I've found it's actually surprisingly easy for it to be wrong, and a bit harder to reason about because it's departing from the normal path of the program. And this is not just a subjective impression: https://www.eecg.utoronto.ca/~yuan/papers/failure_analysi... found that a lot of critical failures in distributed systems were due to bugs in error handling code. If it's not tested, why not just panic or return the error?
This is automating how developers should write tests in the first place
Posted Oct 31, 2024 16:15 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (1 responses)
Even before looking at automated mutations (which looks awesome!), how do you "test a (new) test"? By simply, manually and temporarily... breaking / "mutating" the production code it is supposed to test. That makes your test fail, and now you know your new test does at least one useful thing and does not just waste time. Yet I've seen vast numbers of pointless, untested tests that never, ever failed in their entire life! If it's not tested then it does not work: also true with test code itself.
I understand the "laziness" of not writing enough tests, but why would anyone waste both human and computer time writing untested tests? Because in many companies/projects, validation code is not considered as "noble" as production code and delegated to less experienced and less valued contractors/interns/etc. who don't care much and are not compensated based on proper quality metrics but based on some checkboxes instead. Spoiler alert: these projects have bigger, non-technical quality issues and won't be using cargo-mutants or anything like it any time soon.
Conversely: keeping good test suites and test infrastructure closed-source can be an attempt to stay ahead in a very competitive, open-source market.
Posted Oct 31, 2024 18:51 UTC (Thu)
by mbp (subscriber, #2737)
[Link]
Yes, exactly! Doing this fairly often finds the test doesn't really check what you think it should.
Even if you do this manually when adding new tests, you probably don't do it to existing tests every time you change the code or the tests. Automating it lets you continuously test this property, although admittedly with less human intelligence about how to break the code -- but on the other hand without carrying over unreliable assumptions about the "right" bug to introduce.
It's like shifting from occasional manual ad-hoc testing to automated unit tests.
Dangerous mutations
Posted Oct 31, 2024 18:52 UTC (Thu)
by jrtc27 (subscriber, #107748)
[Link] (1 responses)
Posted Nov 4, 2024 19:02 UTC (Mon)
by mbp (subscriber, #2737)
[Link]
However, I also think people might focus on this too much in mutation testing, when all the same bugs could occur organically in your codebase. If your codebase does inherently dangerous stuff you should run it on a VM or at least a machine of which you have very strong backups. Don't run tests with credentials that could delete the production database!