KS2012: Regression testing
The "regression testing" slot on day 1 of the 2012 Kernel Summit consisted
of presentations from Dave Jones and Mel Gorman. Dave's presentation
described his new fuzz testing tool, while Mel's was concerned with some
steps to improve benchmarking for detecting regressions.
Trinity: intelligent fuzz testing
Dave Jones talked about a testing tool that he has been working on for the last 18 months. That tool, Trinity, is a type of system call fuzz tester. Dave noted that fuzz testing is nothing new, and that the Linux community has had fuzz testing projects for around a decade. The problem is that past fuzz testers took a fairly simplistic approach, passing random bit patterns in the system call arguments. This suffices to find the really simple bugs, for example, a system call that fails to check whether a numeric value passed as a file descriptor argument corresponds to a valid open file descriptor. However, once these simple bugs are fixed, fuzz testers tend simply to encounter the error codes (EINVAL, EBADF, and so on) that system calls (correctly) return when they are given bad arguments.
What distinguishes Trinity is the addition of some domain-specific intelligence. The tool includes annotations that describe the arguments expected by each system call. For example, if a system call expects a file descriptor argument, then rather than passing a random number, Trinity opens a range of different types of files, and passes the resulting descriptors to the system call. This allows fuzz testing to get past the simplest checks performed on system call arguments, and find deeper bugs. Annotations are available to indicate a range of argument types, including memory addresses, pathnames, PIDs, lengths, and so on. Using these annotations, Trinity can generate tests that are better targeted at the argument type (for example, the Trinity web site notes that powers of two plus or minus one are often effective for triggering bugs associated with "length" arguments). The resulting tests are consequently more sophisticated than those performed by traditional fuzz testers, and they find new types of errors in system calls.
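To make the annotation idea concrete, the sketch below shows how a type-aware fuzzer might choose arguments for a single system call instead of passing raw random bits. The structure layout, enum names, annotation table, and the choice of read() are invented for illustration only; they are not Trinity's actual data structures or source code.

```c
/*
 * Minimal sketch of annotation-driven system call fuzzing, in the
 * spirit of Trinity.  All names and the table below are hypothetical.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

enum arg_type { ARG_NONE, ARG_FD, ARG_ADDRESS, ARG_LEN };

struct syscall_entry {
	const char *name;
	long nr;
	enum arg_type args[3];
};

/* Hypothetical annotation table entry: read(fd, buf, count). */
static const struct syscall_entry table[] = {
	{ "read", SYS_read, { ARG_FD, ARG_ADDRESS, ARG_LEN } },
};

/* Produce a type-aware value rather than a random bit pattern. */
static long gen_arg(enum arg_type type)
{
	switch (type) {
	case ARG_FD:
		/* A real open descriptor gets past the EBADF check. */
		return open("/dev/zero", O_RDONLY);
	case ARG_ADDRESS:
		/* Sometimes a valid buffer, sometimes a bogus pointer. */
		return (rand() & 1) ? (long)malloc(4096) : (long)0xdeadbeef;
	case ARG_LEN:
		/* Powers of two plus or minus one often trigger length bugs. */
		return (1L << (rand() % 20)) + ((rand() % 3) - 1);
	default:
		return 0;
	}
}

int main(void)
{
	const struct syscall_entry *sc = &table[0];
	long a[3];

	for (int i = 0; i < 3; i++)
		a[i] = gen_arg(sc->args[i]);

	long ret = syscall(sc->nr, a[0], a[1], a[2]);
	printf("%s(%ld, %#lx, %ld) = %ld (%s)\n", sc->name, a[0],
	       (unsigned long)a[1], a[2], ret,
	       ret < 0 ? strerror(errno) : "ok");
	return 0;
}
```

A real tool would, of course, cover hundreds of system calls, randomize which call it makes, and keep a pool of open descriptors of different types; the point of the sketch is only how per-argument annotations let generated values survive the kernel's initial sanity checks.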
Ted Ts'o asked whether it's possible to bias the tests performed by Trinity in favor of particular kernel subsystems. In response, Dave noted that Trinity can be directed to open the file descriptors that it uses for testing from a particular filesystem (for example, an ext4 partition).
Dave stated that Trinity is run regularly against the linux-next tree as well as against Linus's tree. He noted that Trinity has found bugs in the networking code, filesystem code, and many other parts of the kernel. One of the goals of his talk was simply to encourage other developers to start employing Trinity to test their subsystems and architectures. Trinity currently supports the x86, ia64, powerpc, and sparc architectures.
Benchmarking for regressions
Mel Gorman's talk slot was mainly concerned with improving the discovery of performance regressions. He noted that, in the past, "we talked about benchmarking for patches when they get merged. But there's been much inconsistency over time." In particular, he called out the practice of writing commit changelog entries that simply give benchmark statistics from running a particular benchmarking tool as being nearly useless for detecting regressions.
Mel would like to see more commit changelogs that provide enough information to perform reproducible benchmarks. Leading by example, Mel uses his own benchmarking framework, MMTests, and he has posted historical results from kernels 2.6.32 through to 3.4. What he would like to see is changelog entries that, in addition to giving benchmark results, identify the benchmark framework they use and include (pointers to) the specific configuration used with the framework. (The configuration could be in the changelog, or if too large, it could be stored in some reasonably stable location such as the kernel Bugzilla.)
H. Peter Anvin responded that "I hope you know how hard it is for submitters to give us real numbers at all." But this didn't deter Mel from reiterating his desire for sufficient information to reproduce benchmarking tests; he noted that many regressions take a long time to be discovered, which increases the importance of being able to reproduce past tests.
Ted Ts'o observed that there seemed to be a need for a per-subsystem approach to benchmarking. He then asked whether individual subsystems would even be able to come to consensus on what would be a reasonable set of metrics, and noted that those metrics should not take too long to run (since metrics that take a long time to execute are unlikely to be run in practice). Mel offered that, if necessary, he would volunteer to help write configuration scripts for kernel subsystems. From there, discussion moved into a few other related topics, without reaching any firm resolutions. However, performance regressions are a subject of great concern to kernel developers, and the topic of reproducible benchmarking is one that will likely be revisited soon.
Index entries for this article
Kernel: Development tools/Trinity
Kernel: Regression testing
Kernel: User-space API/Testing