The last couple of years have seen a renewed push within the kernel community to avoid regressions. When a patch is found to have broken something that used to work, a fix must be merged or the offending patch will be removed from the kernel. It's a straightforward and logical idea, but there's one little problem: when a kernel series includes over 12,000 changesets (as 2.6.25 does), how does one find the patch which caused the problem? Sometimes it will be obvious, but, for other problems, there are literally thousands of patches which could be the source of the regression. Digging through all of those patches in search of a bug can be a needle-in-the-haystack sort of proposition.
One of the many nice tools offered by the git source code management system is called "bisect." The bisect feature helps the user perform a binary search through a range of patches until the one containing the bug is found. All that is needed is to specify the most recent kernel which is known to work (2.6.24, say), and the oldest kernel which is broken (2.6.25-rc9, perhaps), and the bisect feature will check out a version of the kernel at the midpoint between those two. Finding that midpoint is non-trivial, since, in git, the stream of patches is not a simple line. But that's the sort of task we keep computers around for. Once the midpoint kernel has been generated, the person chasing the bug can build and test it, then tell git whether it exhibits the bug or not. A kernel at the new midpoint will be produced, and the process continues. With bisect, the problematic patch can be found in a maximum of a dozen or so compile-boot-test cycles.
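The "dozen or so" figure follows from the arithmetic of binary search. The sketch below is only an illustration under a simplifying assumption: it treats the history as a straight line of numbered changesets, which real git history is not. The names `bisect_first_bad`, `history`, and `culprit` are invented for the example, and `is_bad` stands in for one compile-boot-test cycle.

```python
import math

def bisect_first_bad(commits, is_bad):
    """Return (first_bad_commit, tests_run).

    commits is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. is_bad(commit) stands in for one
    compile-boot-test cycle performed by the person chasing the bug.
    """
    good, bad = 0, len(commits) - 1   # indexes of newest good / oldest bad
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2       # the "midpoint kernel"
        tests += 1
        if is_bad(commits[mid]):
            bad = mid                 # bug already present: look earlier
        else:
            good = mid                # bug not yet present: look later
    return commits[bad], tests

# Hypothetical linear history: commit 0 is the last good release and
# commits 1..12000 are the changesets merged since then.
history = list(range(12001))
culprit = 8675                        # made-up position of the regression
found, tests = bisect_first_bad(history, lambda c: c >= culprit)
print(found, tests)
print(math.ceil(math.log2(12000)))    # worst-case number of tests: 14
```

Each test halves the remaining range, so 12,000 changesets need at most 14 cycles in this simplified model.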
Bisect is not a perfect tool. If patch submitters are not careful, bisect can create a broken kernel when it splits a patch series. The patch which causes a bug to manifest itself may not be the one which introduced the bug. In the worst case, a developer may merge a long series of patches, finishing with one brief change which enables all the code added previously; in this case, bisect will find the final patch, which will only be marginally useful. If the person reporting the bug is running a distributor's kernel, it may be hard to get that kernel in a form which is amenable to the bisection process. Bisection might require unacceptable downtime on the only (production) system which is affected by the bug. And, of course, the process of checking out, building, booting, and testing a dozen kernels is not something which one fits into a coffee break. It requires a certain determination on the part of the tester and quite a bit of time.
All of the points above would suggest that requesting a bisection from a user reporting a bug should be done as a last resort. In that context, it is worth looking at the story of a recent bug report which suggests that some observers, at least, think that kernel developers are relying a little too heavily on this tool. On April 9, Mark Lord reported a regression in the networking stack; after making a couple of guesses, the network developers suggested that the problem be bisected.
Mark replied that he did not have the time to go through a full bisection, and that he would much rather be provided a list of commits which might be at fault. That list was not forthcoming, though; there were no developers who had an idea of where the problem might be and, as it turns out, the developer who introduced the bug lives in a time zone which caused him to miss the discussion. Mark's response was strong:
Andrew Morton also worries that developers resort too quickly to a bisection request rather than working with users as was once done. Either that, he says, or developers just ignore the report from the beginning.
Other developers have answers to these worries, of course. Kernel developers often are not in a position to reproduce a reported bug; it may depend on the specifics of the user's hardware or workload. So they must depend on the user to try things and inform them when a change fixes the problem. Here's David Miller's view on how things used to work:
We're able to automate this now and it's not a bad thing.
The other answer that one hears is that the situation now is much different, with far more users, much more code, and more problems to deal with. The old "back and forth" mode was better suited to smaller user and developer communities; in the current world, things must be done differently. David Miller again:
There is another aspect of the problem which is spoken about a bit less frequently: developers must prioritize bug reports and decide which ones to work on. Unlike some projects, the kernel does not have anybody serving in any sort of bug triage role, so, in the absence of a disgruntled and paying customer, most developers make their own decisions on which problems to try to solve. It should not be surprising that problems with the most complete information are the ones which are most likely to be addressed first.
A bug report with a bisection that fingers a specific commit is a report with very good information, one which is generally easy to resolve. As an example, consider Mark Lord's report again; he did eventually take the time (five hours, apparently) to bisect the problem and report the results; the bug was found and fixed almost immediately thereafter - despite the fact that the responsible developer was still sleeping on the other side of the planet.
Even less spoken about is the fact that quite a few problems are one-off occurrences. Somewhere out there in the world, there is a single user who, due to a highly uncommon mixture of hardware and software, experiences a problem which affects (almost) nobody else. Marginal hardware, out-of-tree patches, and overclocking only make the problem worse. Arjan van de Ven's kernel oops summaries are illustrative in this regard; the statistics for the 2.6.25-rc kernels show that a half-dozen problems account for over half of the reports, while the vast majority of oopses have only a single occurrence.
Kernel developers have learned that this kind of problem report tends to go away by itself; the affected user finds a way around the issue (or just gives up) and nobody else ever complains. One can well argue that trying to chase down this kind of problem is not a good use of a kernel developer's time. The hard part is figuring out which reports are of this variety. One relatively straightforward way is to wait until reports from other users confirm the problem - or until a sufficiently determined user bisects the problem and provides a commit ID. In this sense, bisection serves as a sort of triage mechanism which requires users to perform enough work to show that the problem is real.
So the developers do have very good reasons for requesting bisections from users. That said, there is reason to worry that many users will simply stop sending in bug reports. If the only response they can expect is a bisection request (which they may be in no position to answer), they may see no point in reporting bugs at all. Fewer bug reports is not the path toward more solid kernel releases. So, as useful as it is, bisection will have to be a tool of last resort in most cases. The good news is that the development community does seem to understand that; bisection remains just one of the many tools we have for the isolation and solution of problems.
The not-quite-so-good news is that, as Al Viro and James Morris have pointed out, the real problem is in the review of code so that fewer bugs are created in the first place. That is not a problem which can be solved with bisection.
Bisection divides users and developers
Posted Apr 15, 2008 20:31 UTC (Tue) by jwb (guest, #15467) [Link]
I think that, in general, developers these days expect far too much work on the part of the user. I reported a bug against the intel xorg driver package in Ubuntu. They had imported some changes from upstream which broke any laptop with a GM965 graphics chip. I narrowed the result down to two candidate changes, and the package maintainer still marked my bug as "incomplete" because, I guess, I didn't narrow it down to _one_ patch. In other experiences I have come across projects that expect you to test and report against the tip of the source tree, even if there's no reason to believe that anything in the tip addresses the problem you are reporting. These types of actions are understandable defensive moves on the part of the developers, but to the user they are off-putting and onerous.
Bisection divides users and developers
Posted Apr 15, 2008 21:06 UTC (Tue) by arjan (subscriber, #36785) [Link]
If you're unhappy with how your distro provides support, that realistically is between you and your distro. The upstream project has to draw a line somewhere. I totally agree that blindly asking for "please test the tip" isn't the right thing, that's just pushing people away for now. At the same time, if someone, say, reports a bug in the 2.6.9 kernel, it's also not realistic for kernel developers to work on that.

I consider it a reasonable request to the user to at least use the last or last-but-one released versions; if you're using something earlier it can mean pretty much two things:

1) you rolled your own - you should be able to roll a more recent version
2) you're using a distro package and don't know how to use a newer version - you should see if the distro support can help you

Most healthy projects move so fast that a 2 year old version is no longer useful for the developers to spend time on. This is part of the prioritization thing the article mentioned: as developer you end up spending your debug time on those reports which have the highest value for the time invested. That is a combination of

1) a sufficiently diagnosed bug
2) a bug that hits many people
3) a bug that has a high probability of being unfixed still (and the fix being applicable to your development codebase)
4) a bug that can easily be reproduced

The more vectors a bug scores on, the more likely a developer will spend time on the bug. And that's ok in my view...
Bisection divides users and developers
Posted Apr 15, 2008 21:52 UTC (Tue) by epa (subscriber, #39769) [Link]
I guess there's a difference between a bug report and a support request. Clearly if a bug has been found, a bug report explaining how to reproduce it is not incomplete. All you need is instructions on how to reproduce the behaviour, and evidence from documentation (or from wise people) that it is indeed a bug. However if you expect something to be done to fix the bug, you have to rely on someone being motivated to fix it. That could be the project maintainer as a labour of love, or it could be someone you pay for support. Or if you are not paying cash, you may be expected to do some of the work yourself, for example running git bisect. Similarly, if the ancient foo-1.2 release is still being 'maintained', then any bug report against that version is valid. But to get support you may be expected to put in some work yourself checking out the very latest code. I agree that this can be offputting and some projects are surely losing out on help they might get from users, by making the users jump through too many hoops.
Bisection divides users and developers
Posted Apr 24, 2008 7:11 UTC (Thu) by jmspeex (subscriber, #51639) [Link]
I've had similar experience even dealing directly with vanilla kernels (full story at http://kerneltrap.org/Quote/Quality_of_the_Bug_Report ). Long story short, despite working pretty hard to pinpoint a regression that took days to reproduce, no developer even bothered to have a look at what could have been broken.
How about a distro-provided bisection facility?
Posted Apr 15, 2008 20:46 UTC (Tue) by JoeBuck (guest, #2330) [Link]
Let's say you're a distro, and a user complains that your shiny, newly released kernel has a major regression. Why couldn't the distro itself provide a bisection-generation facility? This could be some combination of pre-built bisections (maybe for the first 2-3 cuts) and nice packaging to automate bisection generation. Ideally the new kernels could be tested in the context of a live CD distribution, to minimize the risks from running unstable kernels. One could even conceive of a special kernel-testing distro that would run off of a live CD and automate the whole process. The CD (which might be on a USB flash device instead) would just iterate the following process, and the user would only need to wait for the compiles and test for the bug when prompted:
    newest_good_kernel = what_I_was_running_before;
    oldest_bad_kernel = what_you_shipped_me;
    while (more_than_one_rev_between(newest_good_kernel,oldest_bad_kernel)) {
        midpoint_kernel = git_bisect(newest_good_kernel,oldest_bad_kernel);
        if (big_pipe && someone_has_built(midpoint_kernel))
            download(midpoint_kernel);
        else
            compile(midpoint_kernel);
        reboot_and_fire_up(midpoint_kernel);
        tell_user_to_test;
        if (user_says_the_bug_is_still_there)
            oldest_bad_kernel = midpoint_kernel;
        else
            newest_good_kernel = midpoint_kernel;
        reboot_and_fire_up(known_good_kernel); // for the next build
    }
    bad_patch = compute_diff(newest_good_kernel, oldest_bad_kernel);
    send_report(bad_patch, user_comments);
Some types of bugs, such as file system corruption showing up after a while, would be trickier to test for, and the live CD would have to be able to ask for scratch media, be able to reset it to a known state, etc. But if testing can be made easier for interested users, we'll get more testing.
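To get a feel for how many kernels "pre-built bisections for the first 2-3 cuts" would mean, here is a rough sketch. It assumes a linear history indexed from 0 (the previous release) to 12000 (the shipped kernel), which is a simplification of what git actually does, and the function name `prebuild_targets` is made up for the illustration.

```python
def prebuild_targets(good, bad, cuts):
    """Return the indexes of every kernel that could be requested during
    the first `cuts` bisection steps between index `good` and index `bad`."""
    targets = []
    frontier = [(good, bad)]          # ranges still possible at this depth
    for _ in range(cuts):
        next_frontier = []
        for lo, hi in frontier:
            if hi - lo <= 1:
                continue              # nothing left to test in this range
            mid = (lo + hi) // 2
            targets.append(mid)
            # The user's answer sends the search into one of two halves,
            # so both halves have to be covered in advance.
            next_frontier.extend([(lo, mid), (mid, hi)])
        frontier = next_frontier
    return sorted(targets)

targets = prebuild_targets(0, 12000, 3)
print(targets)  # [1500, 3000, 4500, 6000, 7500, 9000, 10500]
```

Covering the first three cuts takes seven builds per configuration, and each further cut doubles the number of new kernels needed, which is presumably why only the first few would be worth building ahead of time.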
How about a distro-provided bisection facility?
Posted Apr 15, 2008 21:58 UTC (Tue) by jd (guest, #26381) [Link]
Some distros (Red Hat and SuSE spring to mind) are big enough that some (but not all) bisecting could actually be done automagically on a server at the distro's HQ. I'm picturing something like this:
This method has several advantages. Firstly, if the bug can be easily repeated, it moves the heavy lifting from users to people who (usually) have more powerful hardware at their disposal. Secondly, by distinguishing hardware-specific and hardware-agnostic bugs, there is automatically more information available for debugging. Thirdly, you really want to get to the final destination of having a way of reporting and filtering bug reports that maximizes both the quantity and quality of what kernel developers get, which means the manual parts have to be minimal and reducible by automation.
It also has several disadvantages. More users can bisect than can produce an automatable test plan. It's far harder for an automated system to eliminate non-identical reports that are of the same bug and carry no additional information. Too many automated bug reports may lead to developers ignoring them - and a bug in the bug reporter itself certainly would. So few distros can afford the hardware that would be required to do this well that it would have limited benefit. By necessarily using such high-end hardware, as opposed to what users are likely to have, a lot of hardware-related (and almost all hardware-specific) bugs - which, between the two, will account for a sizeable fraction of all bugs - cannot be automatically bisected on a remote machine. Automated reporting systems cannot answer additional kernel developer questions or carry out additional testing on the developers' behalf.
Ultimately, the question becomes one of how to get the most results from the most testing, given that testing is something programmers generally avoid if possible and the users most likely to do something funky enough to cause a crash are the ones who don't know what they're doing. The semi-automated method above won't solve that last one, though.
How about a distro-provided bisection facility?
Posted Apr 15, 2008 23:05 UTC (Tue) by JoeBuck (guest, #2330) [Link]
Certainly this is possible (to have people at the distros do the bisection), and in fact it's already being done, but many kernel bugs that escape into released kernels aren't noticed because they only affect users with very specific hardware. And these, I think, are exactly the cases where kernel developers ask the end user to bisect it, because they have no way on their end to make progress.
How about a distro-provided bisection facility?
Posted Apr 16, 2008 11:29 UTC (Wed) by michaeljt (subscriber, #39183) [Link]
Would the kernel revisions really have to be distributed as source? Perhaps they could be distributed as pre-compiled object files. This would be much quicker for testing, might do away with the need for scratch space (if there was enough RAM available for linking), and space could still be saved by only including the object files which had actually changed since the last revision in a particular revision on the live CD. kexec could be used to load the newly linked kernel.
How about a distro-provided bisection facility?
Posted Apr 16, 2008 11:47 UTC (Wed) by michaeljt (subscriber, #39183) [Link]
Or perhaps I am thinking too complicated - some sort of binary diffs should do the trick just as well or better.
How about a distro-provided bisection facility?
Posted Apr 16, 2008 14:05 UTC (Wed) by nix (subscriber, #2304) [Link]
Both of these have the problem that kernel configurations are wildly variable and capable of enormous variation. This would only be practical for a limited set of distro-compiled kernels (thus with known .configs), but it might work for them, and that would still be quite useful (I guess it's more likely that someone who can build their own kernel can bisect it for bugs as well).
How about a distro-provided bisection facility?
Posted Apr 16, 2008 14:10 UTC (Wed) by michaeljt (subscriber, #39183) [Link]
I think that the original poster was indeed talking about distribution kernels.
Top issues for 2.6.25-rc with annotations
Posted Apr 15, 2008 20:51 UTC (Tue) by arjan (subscriber, #36785) [Link]
at http://www.kerneloops.org/twentyfive.html I'm trying to add annotation to the various issues to show fixed/unfixed/external patch etc, as well as a very short description of what the issue is. We are fixing the big ones at least.... that to me means we're at least doing something right in terms of quality.
Top issues for 2.6.25-rc with annotations
Posted Apr 15, 2008 21:33 UTC (Tue) by rahulsundaram (subscriber, #21946) [Link]
Fedora 9 will install the kerneloops package by default, which should give a nice (or not quite nice, depending on your viewpoint) boost to the stats. It is now in most mainstream distros.

# yum install kerneloops

Have fun.
Top issues for 2.6.25-rc with annotations
Posted Apr 16, 2008 14:39 UTC (Wed) by willy (subscriber, #9762) [Link]
We've just been discussing installing kerneloops in Debian by default. See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=475398 (looks like it'll happen for Lenny)
Mark's response was strong
Posted Apr 16, 2008 0:28 UTC (Wed) by clugstj (subscriber, #4020) [Link]
Mark's response was hyperbole ("This 100% reliance on git-bisect"). The fact is, he was
running an unreleased kernel and expected unpaid volunteers to solve his problem.
Any tool that allows users to narrow down a bug is a good thing - whether or not all/some of
them will think it is worth their while to use the tool.
Mark's response was strong
Posted Apr 16, 2008 13:44 UTC (Wed) by kirkengaard (guest, #15022) [Link]
He, too, is an unpaid volunteer. Last I checked, linux-kernel wasn't a client of Real-Time Remedies Inc, and this seems to have occurred while hobby-hacking on his empeg devices. Once upon a time, that was an unthinkable distinction -- we were all unpaid volunteers. In this case, it would be preferable for you to blow that suggestion out of some other orifice. This is not a corporate "do our work for us" request, this is hacker-to-hacker. Your following comment is dangerously akin to suggesting that he got what he deserved for running an unreleased kernel, and that his laziness is the root of the problem. In the thread process, it is quite obvious that he did test a wide range of kernels (i.e. 2.6.11-2.6.24), and that he did observe the mass number of commit changes around the relevant close() code in the networking stack. He also provided excellent troubleshooting of the problem, tracing the error down to exactly what happened (i.e. premature reset of the connection on close()). Once bisect was suggested, <http://thread.gmane.org/gmane.linux.kernel/663422>, it became the solution. Nobody had an answer off the top of their heads -- or informed Mark that the relevant developer was asleep -- and it became "here, find the rest of the information and give it to us." Thus the argument about bug reporting being a two-way street, and the suggestion that Mark expected it to be a one-way street -- ignoring the work he had done already to report the bug in a very thorough manner. From here, the flamewar threshold was crossed in short order. Having the time is the issue. Assuming the timestamps are valid for estimation purpose, the report was filed at 6:56, and his "If I had the time right now, maybe." comment was at 21:05. Between, he posted four times, each with more information from his bug-tracking work. That's a lot of work product. Be careful about your assumptions when you make off-the-cuff remarks like that. Mark's response was strong, but not unjustified. 
This is the way the community has worked in the past, and the impression he got of "(shrug) Dunno, go bisect." is not hard to see.
Mark's response was strong
Posted Apr 16, 2008 14:59 UTC (Wed) by mb (subscriber, #50428) [Link]
> Having the time is the issue. Assuming the timestamps are valid for estimation purpose, the > report was filed at 6:56, and his "If I had the time right now, maybe." comment was at 21:05. > Between, he posted four times, each with more information from his bug-tracking work. That's a > lot of work product. In that time he could easily have done a complete bisect instead. bisect saves time for developers _and_ users.
Mark's response was strong
Posted Apr 16, 2008 17:39 UTC (Wed) by bronson (subscriber, #4806) [Link]
You're ignoring Mark's point. I think he was right to push back a little. If the automatic first response of developers is "go bisect it!" then that doesn't save time for anybody. Most bugs don't need a full bisection and many bugs won't bisect well (as noted by the article). Both parties in this discussion had excellent points. Ideally devs will have to compromise a little by considering the bug report for 30 sec to reduce wild goose chases and avoid making users feel like they're getting the runaround. Users will have to compromise a little more because they scale. In an ideal world. :)
Mark's response was strong
Posted Apr 16, 2008 17:58 UTC (Wed) by mb (subscriber, #50428) [Link]
> If the automatic first response of developers is "go bisect it!" then that doesn't save time > for anybody. Most bugs don't need a full bisection and many bugs won't bisect well well (as > noted by the article). Ok, point accepted. :)
Mark's response was strong
Posted Apr 17, 2008 0:30 UTC (Thu) by clugstj (subscriber, #4020) [Link]
I wasn't being off-the-cuff. After 14 hours (by your estimation) his bug wasn't fixed and he goes on a rant? Seems a bit excessive to me.
Mark's response was strong
Posted Apr 17, 2008 5:18 UTC (Thu) by dlang (subscriber, #313) [Link]
he wasn't complaining that the bug didn't get fixed in 14 hours, he was complaining that the attitude of the developers seemed to be "we won't look at the problem until you bisect it"
Bisection divides users and developers
Posted Apr 16, 2008 4:24 UTC (Wed) by imcdnzl (subscriber, #28899) [Link]
A point was made in the article about how the bisection might make an unworkable or uncompilable kernel - something I have had personally a few times. In one of the latest versions of git (can't remember version sorry) you can mark a kernel as unusable after a bisect and it will then go and get another bisection point.
Bisection divides users and developers
Posted Apr 17, 2008 9:05 UTC (Thu) by dmk (subscriber, #50141) [Link]
A little bit off-topic, but also mentioned in the thread (in the "the real problem is not this, but:" way of thread hijacking) was the lack of reviewers, and Al Viro suggested some kind of independent "per-subsystem reviews". I think an excellent idea would be some kind of "this month is the big 'we all review this-and-that area of the kernel' month!" new-wave PR thingy, maybe hosted by the kernelnewbies or janitors. This could be specially mentored by the responsible developers. I think the Linux kernel could benefit from something like that.
Bisection divides users and developers
Posted Apr 17, 2008 18:08 UTC (Thu) by appie (guest, #34002) [Link]
All I can think of: how can any developer require a user to grok using git at all? The number of people savvy enough to do a bisect will be very very small. Having someone actually reporting a bug is probably the tip of the iceberg. Developers may be volunteering and working in their own time, but if one contributes buggy code, one should either help in debugging and fixing it or not submit it in the first place. It's in everyone's interest not to piss off or push away participation from the non-(kernel)-hackers section of the FOSS community. I'm not quite sure if it would be feasible, but having a repository of installable kernels in various stages (i.e. with patches applied) of the process would help.
Bisection has other problems
Posted Apr 30, 2008 15:11 UTC (Wed) by eliezert (subscriber, #35757) [Link]
The requirement that patch sets don't break bisection can make things very hard on driver maintainers. Let's say I have replaced 30% of my driver code with newer, better code. It's extremely hard to break the changes into separate patches that are logically separate, so that they can be reviewed on one hand, and on the other, none of them break anything, so bisection works. Maybe the way to solve this is to have bisection by patch-sets rather than by individual patches.
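A rough sketch of what bisecting by patch set could look like, under the assumption that history is a linear sequence of series and that only the tree at the end of each series needs to build and work. The function `bisect_by_series` and the commit names are hypothetical.

```python
def bisect_by_series(series, is_bad):
    """series: list of patch sets, oldest first, each a list of commit ids.

    is_bad(commit) tests the tree as of that commit. Only the final
    commit of each series is ever tested, so intermediate states inside
    a series never have to build. The tree before series[0] is assumed
    good and the tree after series[-1] is assumed bad.
    Returns the index of the first series whose end state is bad.
    """
    lo, hi = -1, len(series) - 1      # lo: last good series, hi: first bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(series[mid][-1]):
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical history of four series; the bug arrives with commit "b2".
series = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"], ["d1", "d2"]]
bad_commits = {"b2", "b3", "c1", "d1", "d2"}
tested = []

def is_bad(commit):
    tested.append(commit)
    return commit in bad_commits

result = bisect_by_series(series, is_bad)
print(result, tested)  # 1 ['b3', 'a2']
```

The trade-off is precision: the search stops at the series, so whoever fixes the bug still has to find the offending patch within it.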
Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds