The "regression testing" slot on day 1 of the 2012 Kernel Summit consisted
of presentations from Dave Jones and Mel Gorman. Dave's presentation
described his new fuzz testing tool, while Mel's was concerned with some
steps to improve benchmarking for detecting regressions.
Trinity: intelligent fuzz testing
Dave Jones talked about a testing tool that he has been working on for
the last 18 months. That tool, Trinity, is a type of
system call fuzz
tester. Dave noted that fuzz testing is nothing new, and that the Linux
community has had fuzz testing projects for around a decade. The problem is
that past fuzz testers take a fairly simplistic approach, passing random
bit patterns in the system call arguments. This suffices to find the really
simple bugs, for example, a system call that fails to check whether the
numeric value passed as a file descriptor argument corresponds to a valid
open file descriptor.
However, once these simple bugs are fixed, fuzz testers tend to simply
encounter the error codes (EINVAL, EBADF, and so on) that
system calls (correctly) return when they are given bad arguments.
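As a rough illustration of the problem (this is not code from any particular
fuzzer), a naive tester that throws arbitrary values at a system call mostly
just exercises the kernel's argument validation:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[16];
        /* Naive fuzzing: pass an arbitrary value as a file descriptor.
           Almost every attempt fails the kernel's first sanity check. */
        int fd = rand();

        if (read(fd, buf, sizeof(buf)) == -1)
            printf("read(%d, ...): %s\n", fd, strerror(errno));
        /* Typically prints "Bad file descriptor" (EBADF); the call
           never reaches the more interesting code inside the kernel. */
        return 0;
    }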
What distinguishes Trinity is the addition of some domain-specific
intelligence. The tool includes annotations that describe the
arguments expected by each system call. For example, if a system call
expects a file descriptor argument, then rather than passing a random
number, Trinity opens a range of different types of files, and passes the
resulting descriptors to the system call. This allows fuzz testing to get
past the simplest checks performed on system call arguments, and find
deeper bugs. Annotations are available to indicate a range of argument
types, including memory addresses, pathnames, PIDs,
lengths, and so on. Using these annotations, Trinity can generate tests
that are better targeted at the argument type (for example, the Trinity web
site notes that powers of two plus or minus one are often effective for
triggering bugs associated with "length" arguments). The resulting tests
performed by Trinity are consequently more sophisticated than traditional
fuzz testers, and find new types of errors in system calls.
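As a sketch of how such annotations might be expressed (the structure, field,
and constant names below are illustrative rather than taken from Trinity's
source), a per-system-call table in C could look like this:

    /* Illustrative argument-type annotations; Trinity's real
       definitions may differ in naming and detail. */
    enum argtype {
        ARG_FD,        /* a descriptor from a pool of opened files */
        ARG_ADDRESS,   /* a valid (or deliberately borderline) mapping */
        ARG_PATHNAME,  /* a path that actually exists */
        ARG_PID,       /* a real process ID */
        ARG_LEN,       /* "interesting" lengths, e.g. powers of two +/- 1 */
    };

    struct syscall_annotation {
        const char *name;
        unsigned int num_args;
        enum argtype arg_types[6];
    };

    /* Example entry for read(int fd, void *buf, size_t count) */
    static const struct syscall_annotation annotation_read = {
        .name      = "read",
        .num_args  = 3,
        .arg_types = { ARG_FD, ARG_ADDRESS, ARG_LEN },
    };

Driven from a table like this, the generator can supply a descriptor from its
pool of opened files for ARG_FD, or a value such as (1 << 12) + 1 for ARG_LEN,
and thereby get past the kernel's initial argument checks to exercise deeper
code paths.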
Ted Ts'o asked whether it's possible to bias the tests performed by Trinity
in favor of particular kernel subsystems. In response, Dave noted that
Trinity can be directed to open the file descriptors that it uses for
testing off a particular filesystem (for example, an ext4 partition).
Dave stated that Trinity is run regularly against the
linux-next tree as well as against Linus's tree. He noted that
Trinity has found
bugs in the networking code, filesystem code, and many other parts of
the kernel. One of the goals of his talk was simply to encourage other
developers to start employing Trinity to test their subsystems and
architectures. Trinity currently supports the x86, ia64, powerpc, and sparc
architectures.
Benchmarking for regressions
Mel Gorman's talk slot was mainly concerned with improving the
discovery of performance regressions. He
noted that, in the past, "we talked
about benchmarking for patches when they get merged. But there's been much
inconsistency over time." In particular, he called out the practice
of writing commit changelog entries that simply give benchmark statistics
from running a particular benchmarking tool as being nearly useless for
detecting regressions.
Mel would like to see more commit changelogs that provide enough
information to perform reproducible benchmarks. Leading by example, Mel
uses his own benchmarking framework, MMTests, and he has posted
historical results from kernels 2.6.32 through to 3.4. What he would like to
see is changelog entries that, in addition to giving benchmark results,
identify the benchmark framework they use and include (pointers to) the
specific configuration used with the framework. (The configuration could be
in the changelog, or if too large, it could be stored in some reasonably
stable location such as the kernel Bugzilla.)
H. Peter Anvin responded that "I hope you know how hard it
is for submitters to give us real numbers at all." But this didn't
deter Mel from reiterating his desire for sufficient information to
reproduce benchmarking tests; he noted that many regressions take a long
time to be discovered, which increases the importance of being able to
reproduce past tests.
Ted Ts'o observed that there seemed to be a need for a per-subsystem
approach to benchmarking. He then asked whether individual subsystems would
even be able to come to consensus on what would be a reasonable set of
metrics, and noted that those metrics should not take too long to run
(since metrics that take a long time to execute are unlikely to be run
in practice). Mel offered that, if necessary, he would volunteer to help
write configuration scripts for kernel subsystems. From there, discussion
moved into a few other related topics, without reaching any firm
resolutions. However, performance regressions are a subject of great
concern to kernel developers, and the topic of reproducible benchmarking is
one that will likely be revisited soon.