Various topics related to kernel quality
The latest round began when Natalie Protasevich, a Google developer who spends some time helping Andrew Morton track bugs, posted this list of a few dozen open bugs which seemed worthy of further attention. Andrew responded with his view of what was happening with those bug reports; that view was "no response from developers" in most cases:
A number of developers came back saying, in essence, that Andrew was employing an overly heavy hand and that his assertions were not always correct. Regardless of whether his claims are correct, Andrew has clearly touched a nerve.
He defended his posting by raising his often-expressed fear that the quality of the kernel is in decline. This is, he says, something which requires attention now:
But is the kernel deteriorating? That is a very hard question to answer for a number of reasons. There is no objective standard by which the quality of the kernel can be judged. Certain kinds of problems can be found by automated testing, but, in the kernel space, many bugs can only be found by running the kernel with specific workloads on specific combinations of hardware. A rising number of bug reports does not necessarily indicate decreasing quality when both the number of users and the size of the code base are increasing.
Along the same lines, as Ingo Molnar pointed out, a decreasing number of bug reports does not necessarily mean that quality is improving. It could, instead, indicate that testers are simply getting frustrated and dropping out of the development process - a worsening kernel could actually cause the reporting of fewer bugs. So Ingo says we need to treat our testers better, but we also need to work harder at actually measuring the quality of the kernel:
It is generally true that problems which can be measured and quantified tend to be addressed more quickly and effectively. The classic example is PowerTop, which makes power management problems obvious. Once developers could see where the trouble was and, more to the point, could see just how much their fixes improved the situation, vast numbers of problems went away over a short period of time. At the moment, the kernel developers can adopt any of a number of approaches to improving kernel quality, but they will not have any way of really knowing if that effort is helping the situation or not. In the absence of objective measurements, developers trying to improve kernel quality are really just groping in the dark.
As an example, consider the discussion of the "git bisect" feature. If one is trying to find a regression which happened between 2.6.23 and 2.6.24-rc1, one must conceivably look at several thousand patches to find the one which caused the problem - a task which most people tend to find just a little intimidating. Bisection helps the tester perform a binary search over a range of patches, eliminating half of them in each compile-and-boot cycle. Using bisect, a regression can be tracked down in a relatively automatic way with "only" a dozen or so kernel builds and reboots. At the end of the process, the guilty patch will have been identified in an unambiguous way.
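The arithmetic behind that "dozen or so" figure can be illustrated with a small sketch. This is not how git implements bisection (git works over a commit graph, not a flat list); it is a plain binary search over a linear patch series, and the names `find_first_bad`, `is_bad`, and the patch labels are hypothetical. It assumes the kernel before the first patch is good, the kernel after the last patch is bad, and that the regression persists once introduced.

```python
from typing import Callable, Sequence, Tuple


def find_first_bad(patches: Sequence[str],
                   is_bad: Callable[[int], bool]) -> Tuple[str, int]:
    """Return (first bad patch, number of build-and-boot tests performed).

    is_bad(i) stands in for one compile-and-boot cycle: it reports whether
    the kernel with patches[0..i] applied shows the regression.
    """
    lo, hi = 0, len(patches) - 1  # invariant: the first bad patch is in [lo, hi]
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(mid):
            hi = mid       # regression already present: culprit is at mid or earlier
        else:
            lo = mid + 1   # still good: culprit is after mid
    return patches[lo], tests


# Hypothetical example: 5000 patches, with the regression introduced by patch 3141.
patches = [f"patch-{i:04d}" for i in range(5000)]
culprit, tests = find_first_bad(patches, lambda i: i >= 3141)
print(culprit, tests)
```

Each test halves the remaining range, so a series of 5000 patches needs at most 13 build-and-boot cycles, which matches the scale described above.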
Bisection works so well that developers will often ask a tester to use it to track down a problem they are reporting. Some people see this practice as a way for lazy kernel developers to dump the work of tracking down their bugs on the users who are bitten by those bugs. Building and testing a dozen kernels is, they say, too much to ask of a tester. Mark Lord, for example, asserts that most bugs are relatively easy to find when a developer actually looks at the code; the whole bisect process is often unnecessary:
On the other hand, some developers see bisection as a powerful tool which has made it easier for testers to actively help the process. David Miller says:
Returning to the original bug list: another issue which came up was the use of mailing lists other than linux-kernel. Some of the bugs had not been addressed because they had never been reported to the mailing list dedicated to the affected subsystem. Other bugs, marked by Andrew as having had no response, had, in fact, been discussed (and sometimes fixed) on subsystem-specific lists. In both situations, the problem is a lack of communication between subsystem lists and the larger community.
In response, some developers have, once again, called for a reduction in the use of subsystem-specific lists. We are, they say, all working on a single kernel, and we are all interested in what happens with that kernel. Discussing kernel subsystems in isolation is likely to result in a lower-quality kernel. Ingo Molnar expresses it this way:
Moving discussions back onto linux-kernel seems like a very hard sell, though. Most subsystem-specific lists feature much lower traffic, a friendlier atmosphere, and more focused conversation. Many subscribers of such lists are unlikely to feel that moving back to linux-kernel would improve their lives. So, perhaps, the best that can be hoped for is that more developers would subscribe to both lists and make a point of ensuring that relevant information flows in both directions.
David Miller pointed out another reason why some bug reports don't see a lot of responses: developers have to choose which bugs to try to address. Problems which affect a lot of users, and which can be readily reproduced, have a much higher chance of getting quick developer attention. Bug reports which end up at the bottom of the prioritized list ("chaff"), instead, tend to languish. The system, says David, tends to work reasonably well:
Given that there are unlikely to ever be enough developers to respond to every single kernel bug report, the real problem comes down to prioritization. Andrew Morton has a clear idea of which reports should be handled first: regressions from previous releases.
Attention to regressions has improved significantly over the last couple of years or so. They tend to be much more actively tracked, and the list of known regressions is consulted before kernel releases are made. The real problem, according to Andrew, is that any regressions which are still there after a release tend to fall off the list. Better attention to those problems would help to ensure that the quality of the kernel improved over time.
Posted Nov 15, 2007 9:20 UTC (Thu)
by jhellan (guest, #17103)
[Link]
Posted Nov 15, 2007 13:14 UTC (Thu)
by Cato (guest, #7643)
[Link] (3 responses)
Posted Nov 15, 2007 17:27 UTC (Thu)
by iabervon (subscriber, #722)
[Link] (2 responses)
Posted Nov 16, 2007 20:25 UTC (Fri)
by oak (guest, #2786)
[Link] (1 responses)
Posted Nov 27, 2007 0:54 UTC (Tue)
by Tara_Li (guest, #26706)
[Link]
Posted Nov 15, 2007 13:19 UTC (Thu)
by walles (guest, #954)
[Link] (2 responses)
Posted Nov 15, 2007 14:38 UTC (Thu)
by Lev (guest, #41433)
[Link]
Posted Nov 15, 2007 17:44 UTC (Thu)
by iabervon (subscriber, #722)
[Link]
Posted Nov 15, 2007 15:30 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link]
Posted Nov 16, 2007 2:02 UTC (Fri)
by nevets (subscriber, #11875)
[Link]
Regressions
Regressions should be a priority. But they cannot be the only priority. A locomotive builder
who adopted that strategy in 1930 would be building the world's best locomotives by the time
they went out of business.
Email is the problem
Surely the problem is that email is a relatively blunt tool to track bugs and fixes?
Something web-based that lets you track the bug over time, relate it to duplicates, attach
patches, make comments, etc, would seem a better idea - by having subsystem tags such as SCSI
you could also enable smaller groups of developers to focus on their area. Also, maybe the
kernel needs bug triage people who simply prioritise and tag bugs so they find their way to
the right developer community.
Email is the problem
The original email in this thread was a list of bugs listed in the kernel bugzilla that had
gotten "no attention". As it turned out, many of the bugs had gotten attention in email from
developers, and some of them had been fixed. Of course, none of this was clear from the web
site, which isn't sufficiently useful for most people who actually fix bugs to bother using
for serious discussion.
The issue with using anything other than email is that people will pretty much never remember
to do any particular thing other than reading their email, and they only really look at the
main content of the email. They'll probably ignore links and attachments. And what they'd do
is a bunch of research and local work, and then reply to the email. Anything that depends on
some other way of getting people's attention or detecting their responses isn't going to work.
Email is the problem
Bugzilla also works through email, at least to an extent.
It could also be set to send mail/CC to LKML.
Email is the problem
Perhaps the answer is a weekly summary/digest kind of report from each of the sub-lists, that
gets posted to the main list and any related lists (libata goes to the filesystems group, and
vice-versa), and a similar list from the main list going to the sublists...
Quality measurement suggestion
Measure how many bugs in the bugzilla get closed every month / week / whatever and make a
graph on the front page of http://bugzilla.kernel.org.
The number of closed bugs will go up if:
* The number of users / testers goes up.
* The percentage of users / testers who report bugs goes up.
* The quality of the bug reports goes up.
* The percentage of bugs that actually get resolved goes up.
This is easy to measure, it's related to the kernel quality, and it should be easy (for
somebody with access to http://bugzilla.kernel.org) to visualize.
Quality measurement suggestion
You missed that the number of closed bugs will also go up if:
* The number of bugs inserted by kernel developers goes up.
So, number-of-closed-bugs is certainly related to kernel quality, but not in a straightforward
manner.
Quality measurement suggestion
... and people find and close bugs on Bugzilla that are actually fixed. Some large portion of
the open bugs on Bugzilla at any point (at least historically; not sure if Andrew has improved
this) are issues that were fixed in response to something else and then forgotten by the
reporter. An illustrative anecdote is that somebody has a problem, and submits a
slightly-inaccurate bug report to Bugzilla. This gets misdirected due to the error and nobody
has anything to say about it. The submitter pokes at it further, comes up with a better
characterization of the bug, ignores Bugzilla (since that didn't generate a response before) and
posts to an applicable mailing list. People on the list work with the user and fix the issue.
The reporter is satisfied and has forgotten about Bugzilla entirely. Somebody trawling
Bugzilla digs up the entry, and says that it's gotten no attention. The people who fixed the
actual issue say that the bug report is inaccurate and the issue is probably the one that's
been fixed for ages (assuming they connect the dots at this point and still remember the
issue).
Various topics related to kernel quality
It's quite easy to check kernel codebase quality. Just look at the frequency of -mm releases.
To be released, a -mm kernel must somehow build and boot on at least a few systems. That it
took a month before a new -mm kernel was released, and that it exploded at once on lots of
systems, tells a lot about the codebase's current state.
Using LKML
Being one of the developers for the -rt patch, we have always maintained that we want our
development on LKML. We've had a few developers ask us to go away and start our own list.
Because a lot of problems we find with the -rt patch end up being a bad design or hard-to-hit
bug in mainline, we always pushed back to keep our development on LKML.
We now have a linux-rt-users mailing list. Although it is supposed to be for rt users, we also
have our development emails go back and forth there. But one thing that I (and others) try
to do is to CC both that list and LKML on anything related to the kernel development of the
-rt patch. This keeps those only on LKML informed, as well as those who follow the much
lower-traffic linux-rt-users list (which I can actually keep up with).
Perhaps other subsystems should do the same: CC LKML on their development discussions.
Who knows, they may get some good ideas from those outside of their development
circles. I know the -rt patch development certainly has!
-- Steve