
Various topics related to kernel quality

By Jonathan Corbet
November 14, 2007
Discussions of kernel quality are not a new phenomenon on linux-kernel. It is, indeed, a topic which comes up with a certain regularity, more so than with many other free software projects. The size of the kernel, the rate at which its code changes, and the wide range of environments in which the kernel runs all lead to unique challenges; add in the fact that kernel bugs can lead to catastrophic system failures and you have the material for no end of debate.

The latest round began when Natalie Protasevich, a Google developer who spends some time helping Andrew Morton track bugs, posted this list of a few dozen open bugs which seemed worthy of further attention. Andrew responded with his view of what was happening with those bug reports; that view was "no response from developers" in most cases:

So I count around seven reports which people are doing something with and twenty seven which have been just ignored.

A number of developers came back saying, in essence, that Andrew was employing an overly heavy hand and that his assertions were not always correct. Regardless of whether his claims are correct, Andrew has clearly touched a nerve.

He defended his posting by raising his often-expressed fear that the quality of the kernel is in decline. This is, he says, something which requires attention now:

If the kernel _is_ slowly deteriorating then this won't become readily apparent until it has been happening for a number of years. By that stage there will be so much work to do to get us back to an acceptable level that it will take a huge effort. And it will take a long time after that for the kernel to get its reputation back.

But is the kernel deteriorating? That is a very hard question to answer for a number of reasons. There is no objective standard by which the quality of the kernel can be judged. Certain kinds of problems can be found by automated testing, but, in the kernel space, many bugs can only be found by running the kernel with specific workloads on specific combinations of hardware. A rising number of bug reports does not necessarily indicate decreasing quality when both the number of users and the size of the code base are increasing.

Along the same lines, as Ingo Molnar pointed out, a decreasing number of bug reports does not necessarily mean that quality is improving. It could, instead, indicate that testers are simply getting frustrated and dropping out of the development process - a worsening kernel could actually cause the reporting of fewer bugs. So Ingo says we need to treat our testers better, but we also need to work harder at actually measuring the quality of the kernel:

I tried to make the point that the only good approach is to remove our current subjective bias from quality metrics and to at least realize what a cavalier attitude we still have to QA. The moment we are able to _measure_ how bad we are, kernel developers will adopt in a second and will improve those metrics. Lets use more debug tools, both static and dynamic ones. Lets measure tester base and we need to measure _lost_ early adopters and the reasons why they are lost.

It is generally true that problems which can be measured and quantified tend to be addressed more quickly and effectively. The classic example is PowerTop, which makes power management problems obvious. Once developers could see where the trouble was and, more to the point, could see just how much their fixes improved the situation, vast numbers of problems went away over a short period of time. At the moment, the kernel developers can adopt any of a number of approaches to improving kernel quality, but they will not have any way of really knowing if that effort is helping the situation or not. In the absence of objective measurements, developers trying to improve kernel quality are really just groping in the dark.

As an example, consider the discussion of the "git bisect" feature. If one is trying to find a regression which happened between 2.6.23 and 2.6.24-rc1, one must conceivably look at several thousand patches to find the one which caused the problem - a task which most people tend to find just a little intimidating. Bisection helps the tester perform a binary search over a range of patches, eliminating half of them in each compile-and-boot cycle. Using bisect, a regression can be tracked down in a relatively automatic way with "only" a dozen or so kernel builds and reboots. At the end of the process, the guilty patch will have been identified in an unambiguous way.
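
For those who have not tried it, a bisection session looks roughly like the sketch below. The version tags follow the example above; the build-and-boot step is only a placeholder for whatever procedure the tester normally uses.

    git bisect start
    git bisect bad v2.6.24-rc1    # newest revision known to show the problem
    git bisect good v2.6.23       # most recent revision known to work

    # git checks out a commit roughly halfway through the suspect range;
    # build it, boot it, and test for the regression.
    make && make modules_install install    # placeholder build/boot step
    git bisect good               # or "git bisect bad" if the problem is still there

    # Repeat the build-test-report cycle; after a dozen or so rounds git
    # names the first bad commit. Then return to the original branch:
    git bisect reset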

Bisection works so well that developers will often ask a tester to use it to track down a problem they are reporting. Some people see this practice as a way for lazy kernel developers to dump the work of tracking down their bugs on the users who are bitten by those bugs. Building and testing a dozen kernels is, they say, too much to ask of a tester. Mark Lord, for example, asserts that most bugs are relatively easy to find when a developer actually looks at the code; the whole bisect process is often unnecessary:

I'm just asking that developers here do more like our Top Penguin does, and actually look at problems and try to understand them and suggest fixes to try. And not rely solely on the git-bisect crutch. It's a good crutch, provided the reporter is a kernel developer, or has a lot of time on their hands. But we debugged Linux here for a long time without it.

On the other hand, some developers see bisection as a powerful tool which has made it easier for testers to actively help the process. David Miller says:

Like the internet, this time spent is beneficial because it's pushing the work out to the end nodes. In fact git bisect is an awesome example of the end node principle in action for software development and QA. For the end-user wanting their bug fixed and the developer it's a win win situation because the reporter is actually able to do something proactive which will help get the bug they want fixed faster.

Returning to the original bug list: another issue which came up was the use of mailing lists other than linux-kernel. Some of the bugs had not been addressed because they had never been reported to the mailing list dedicated to the affected subsystem. Other bugs, marked by Andrew as having had no response, had, in fact, been discussed (and sometimes fixed) on subsystem-specific lists. In both situations, the problem is a lack of communication between subsystem lists and the larger community.

In response, some developers have, once again, called for a reduction in the use of subsystem-specific lists. We are, they say, all working on a single kernel, and we are all interested in what happens with that kernel. Discussing kernel subsystems in isolation is likely to result in a lower-quality kernel. Ingo Molnar expresses it this way:

We lose much more by forced isolation of discussion than what we win by having less traffic! It's _MUCH_ easier to narrow down information (by filter by threads, by topics, by people, etc.) than it is to gobble information together from various fractured sources. We learned it _again and again_ that isolation of kernel discussions causes bad things.

Moving discussions back onto linux-kernel seems like a very hard sell, though. Most subsystem-specific lists feature much lower traffic, a friendlier atmosphere, and more focused conversation. Many subscribers to such lists are unlikely to feel that moving back to linux-kernel would improve their lives. So, perhaps, the best that can be hoped for is that more developers would subscribe to both lists and make a point of ensuring that relevant information flows in both directions.

David Miller pointed out another reason why some bug reports don't see a lot of responses: developers have to choose which bugs to try to address. Problems which affect a lot of users, and which can be readily reproduced, have a much higher chance of getting quick developer attention. Bug reports which end up at the bottom of the prioritized list ("chaff"), instead, tend to languish. The system, says David, tends to work reasonably well:

Luckily if the report being ignored isn't chaff, it will show up again (and again and again) and this triggers a reprioritization because not only is the bug no longer chaff, it also now got a lot of information tagged to it so it's a double worthwhile investment to work on the problem.

Given that there are unlikely to ever be enough developers to respond to every single kernel bug report, the real problem comes down to prioritization. Andrew Morton has a clear idea of which reports should be handled first: regressions from previous releases.

If we're really active in chasing down the regressions then I think we can be confident that the kernel isn't deteriorating. Probably it will be improving as we also fix some always-been-there bugs.

Attention to regressions has improved significantly over the last couple of years or so. They tend to be much more actively tracked, and the list of known regressions is consulted before kernel releases are made. The real problem, according to Andrew, is that any regressions which are still there after a release tend to fall off the list. Better attention to those problems would help to ensure that the quality of the kernel improved over time.


Regressions

Posted Nov 15, 2007 9:20 UTC (Thu) by jhellan (guest, #17103)

Regressions should be a priority. But they cannot be the only priority. A locomotive builder
who adopted that strategy in 1930 would be building the world's best locomotives by the time
they went out of business.

Email is the problem

Posted Nov 15, 2007 13:14 UTC (Thu) by Cato (guest, #7643) (3 responses)

Surely the problem is that email is a relatively blunt tool to track bugs and fixes?
Something web-based that lets you track the bug over time, relate it to duplicates, attach
patches, make comments, etc, would seem a better idea - by having subsystem tags such as SCSI
you could also enable smaller groups of developers to focus on their area.  Also, maybe the
kernel needs bug triage people who simply prioritise and tag bugs so they find their way to
the right developer community.


Email is the problem

Posted Nov 15, 2007 17:27 UTC (Thu) by iabervon (subscriber, #722) (2 responses)

The original email in this thread was a list of bugs listed in the kernel bugzilla that had
gotten "no attention". As it turned out, many of the bugs had gotten attention in email from
developers, and some of them had been fixed. Of course, none of this was clear from the web
site, which isn't sufficiently useful for most people who actually fix bugs to bother using
for serious discussion.

The issue with using anything other than email is that people will pretty much never remember
to do any particular thing other than reading their email, and they only really look at the
main content of the email. They'll probably ignore links and attachments. And what they'd do
is a bunch of research and local work, and then reply to the email. Anything that depends on
some other way of getting people's attention or detecting their responses isn't going to work.

Email is the problem

Posted Nov 16, 2007 20:25 UTC (Fri) by oak (guest, #2786) (1 response)

Bugzilla also works through email, at least to an extent.
It could also be set to send mail/CC to LKML.

Email is the problem

Posted Nov 27, 2007 0:54 UTC (Tue) by Tara_Li (guest, #26706)

Perhaps the answer is a weekly summary/digest kind of report from each of the sub-lists, that
gets posted to the main list and any related lists (libata goes to the filesystems group, and
vice-versa), and a similar list from the main list going to the sublists...

Quality measurement suggestion

Posted Nov 15, 2007 13:19 UTC (Thu) by walles (guest, #954) (2 responses)

Measure how many bugs in the bugzilla get closed every month / week / whatever and make a
graph on the front page of http://bugzilla.kernel.org.

The number of closed bugs will go up if:
* The number of users / testers goes up.
* The percentage of users / testers who report bugs goes up.
* The quality of the bug reports goes up.
* The percentage of bugs that actually get resolved goes up.

This is easy to measure, it's related to the kernel quality, and it should be easy (for
somebody with access to http://bugzilla.kernel.org) to visualize.

Quality measurement suggestion

Posted Nov 15, 2007 14:38 UTC (Thu) by Lev (guest, #41433)

You missed that the number of closed bugs will also go up if:
* The number of bugs inserted by kernel developers goes up.

So, number-of-closed-bugs is certainly related to kernel quality, but not in a straightforward
manner.

Quality measurement suggestion

Posted Nov 15, 2007 17:44 UTC (Thu) by iabervon (subscriber, #722)

... and people find and close bugs on Bugzilla that are actually fixed. Some large portion of
the open bugs on Bugzilla at any point (at least historically; not sure if Andrew has improved
this) are issues that were fixed in response to something else and then forgotten by the
reporter. An illustrative anecdote is that somebody has a problem, and submits a
slightly-inaccurate bug report to Bugzilla. This gets misdirected due to the error and nobody
has anything to say about it. The submitter pokes at it further, comes up with a better
characterization of the bug, ignores Bugzilla (since that didn't generate a response before), and
posts to an applicable mailing list. People on the list work with the user and fix the issue.
The reporter is satisfied and has forgotten about Bugzilla entirely. Somebody trawling
Bugzilla digs up the entry, and says that it's gotten no attention. The people who fixed the
actual issue say that the bug report is inaccurate and the issue is probably the one that's
been fixed for ages (assuming they connect the dots at this point and still remember the
issue).

Various topics related to kernel quality

Posted Nov 15, 2007 15:30 UTC (Thu) by nim-nim (subscriber, #34454)

It's quite easy to check kernel codebase quality. Just look at the frequency of -mm releases.

To be released, a -mm kernel must somehow build and boot on at least a few systems. That it
took a month before a new -mm kernel was released, and that it exploded at once on lots of
systems, tells a lot about the current state of the codebase.

Using LKML

Posted Nov 16, 2007 2:02 UTC (Fri) by nevets (subscriber, #11875)

Being one of the developers for the -rt patch, we have always maintained that we want our
development on LKML. We've had a few developers ask us to go away and start our own list.
Because a lot of the problems we find with the -rt patch end up being a bad design or a
hard-to-hit bug in mainline, we have always pushed back to keep our development on LKML.

We now have a linux-rt-users mailing list. Although it is supposed to be for rt users, we also
have our development emails go back and forth there too. But one thing that I (and others) try
to do, is to CC both that list and LKML on anything related to the kernel development of the
-rt patch. This keeps those only on LKML informed, as well as those that follow the much
lower-traffic linux-rt-users list (which I can actually keep up with).

Perhaps other subsystems should do the same, that is, CC LKML on their development
discussions. Who knows, they may get some good ideas from those outside of their development
circles. I know the -rt patch development certainly has!

-- Steve


Copyright © 2007, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds