Kernel bugs: out of control?

[Posted May 10, 2006 by corbet]

As has been widely reported, Andrew Morton recently told an audience at LinuxTag about his fears that the Linux kernel is getting buggier over time. That worry resonates with a number of users and developers, many of whom have never gotten entirely used to the 2.6 development model. The result of this discussion may be a long look at how the kernel is developed, culminating in a discussion at the annual Kernel Summit in Ottawa this July. Easy answers may be difficult to come by, however.

Even the core question - are more bugs being added to the kernel than are being fixed? - is not straightforward. Many developers have a sort of gut sense that the answer is "yes," but the issue is hard to quantify. There is no mechanism in place to track the number of kernel users, the number of known bugs, and when those bugs are fixed. Some information can be found in the kernel bug tracker run by OSDL, but acceptance of this tracker by kernel developers is far from universal, and only a subset of bugs are reported there. Distributors have their own bug trackers, but there is little flow of information between those trackers and the OSDL one; distributor trackers will also reflect problems (and fixes) in distributor patches which are not in the mainline kernel.

Dave Jones publishes statistics from the Fedora tracker, but it is hard to know what to make of them.

Part of the problem is that an increasing bug count does not, in itself, indicate that the kernel is getting worse. A kernel which is larger and more complex may have more bugs, even if the density of those bugs is going down - and the 2.6 kernel is growing quickly. Increased scrutiny will result in a higher level of reported bugs, but a lot of those bugs could be quite old. The recent Coverity scans, for example, revealed some longstanding bugs. If the user base is growing and becoming more diverse, more bugs will be reported in the same code, even if that code has not changed.

Dustin Kirkland has taken a different approach. For each 2.6 kernel version, he performed a search for "linux 2.6.x", followed by searches for strings like "linux 2.6.x panic". The trouble reports were then normalized by the total number of results, and the graph shown on the right was produced (click on it for the full-resolution version). Dustin's results show a relatively stable level of problem reports, with the number of problems dropping for the most recent kernel releases.

Clearly, there are limits to the conclusions which can be drawn from these sorts of statistics. The results which show up in Google may not be representative of the real troubles afflicting Linux users, and the lower levels for recent kernels may simply reflect the fact that fewer people are using those kernels. But the fact that these results are as good as anything else available shows how little hard information is available.

Some other efforts are in the works to attempt to quantify the problem - stay tuned to LWN for information as it becomes available. In a way, however, whether the problem is getting worse is an irrelevant question. The simple fact is that there are more kernel bugs than anybody would like to see, and, importantly, many of these bugs are remaining unfixed for very long periods of time. So, regardless of whether the situation is getting worse, it seems worth asking (1) where the bugs are coming from, and (2) why are they not getting fixed?

The first question has no easy answer. It would be nice if somebody would look at bug fixes entering the kernel with an eye toward figuring out when the fixed bug was first introduced - and whether similar bugs might exist elsewhere. That would be a long and labor-intensive task, however, and nobody is doing it. In general, the kernel lacks a person whose time is dedicated to tracking (and understanding) bugs. At the 2005 Kernel Summit, Andrew Morton indicated that he would like to have a full-time bugmaster, but this person does not yet exist. If, somehow, such a position could be funded (it is hard to see as a long-term volunteer job), it could help with the tracking and understanding of bugs - and with ensuring that those bugs get fixed.

Why bugs do not get fixed might be a little easier to understand. Certainly part of the problem must be that it is more fun to develop cool new features than to track down obscure problems. The older development process - where, at times, new features would not even be merged into a development kernel for a year at a time - might have provided more motivation for bug fixing than the 2.6 process, where the merge window opens every month or two. But feature development cannot be the entire problem; most developers have enough pride and care about their work to want their code to work properly.

The kernel is a highly modular body of code with a large development community. Many (or even most) developers only understand a relatively small part of it. So it is easy for kernel developers to feel that the bulk of the outstanding bugs are "not their department" - somebody else's problem. But the person nominally responsible for a particular part of the code may be overwhelmed with other issues, unresponsive and difficult to deal with, or missing in action. Many parts of the kernel have no active maintainer at all. So problems in many kernel subsystems tend to get fixed slowly, if at all - especially in the absence of an irate and paying customer. For this reason, Andrew has encouraged kernel developers to branch out and address bugs outside of their normal areas. That is a hard sell, however.

Kernel bugs can be seriously hard to find and fix. The kernel must operate - on very intimate terms - with an unbelievable variety of hardware and software configurations. Many users stumble across problems that no developer or tester has ever encountered. Reproducing these problems can be impossible, especially if nobody with an interest in the area has the affected hardware. Tracking down many of these bugs can require long conversations where the developer asks the reporter to try different things and come back with the results. Developers often lack the patience for these exchanges, but, crucially, users often do as well. So a lot of these problems just fall by the wayside and are not fixed for a long time, if ever.

Bug prevention is an area with ongoing promise. Many of the most error-prone kernel interfaces have been fixed over the years, eliminating whole classes of problems, but more can be done. More formal regression tests could be a good thing, but (1) the kernel developers have, so far, not found a huge amount of value in the results from efforts like the Linux Test Project, and (2) no amount of regression testing can realistically be expected to find the hardware-related problems which are the root of so many kernel bugs. Static analysis offers a great deal of promise, but free tools like sparse need quite a bit of work, yet, to realize that promise.

The end result is that, while there are ways in which the kernel process can be improved, there is a distinct lack of quick fixes in sight. Fixing kernel bugs is hard work, and the kernel maintainers lack the ability to order anybody to do that work. So, while the kernel community can be expected to come to grips with the problem - to the extent that there is a problem - the process of getting to a higher-quality kernel could take some time.

Index entries for this article
Kernel	Development model/Kernel quality

Kernel bugs: out of control?

Posted May 11, 2006 7:56 UTC (Thu) by malor (guest, #2973) [Link] (44 responses)

I don't understand why people don't stick with the most basic truth of software development: the way to make stable code is to stop adding features, and start fixing bugs.

It's really that simple. 2.4 was a mess until Linux branched to 2.5. Marcelo took it over from there, and while 2.4 isn't perfect. it's quite stable. He's done a fantastic job.

2.6, in comparison, is terrible, and getting worse, for precisely the same reason that 2.4 was bad: they won't let it stabilize. They're adding features too fast to debug, much too fast for anyone to deal with. It's a part-time job just keeping up with the new features they keep adding.

NOBODY understands the security implications of moving as fast as they do. As of this writing, there have been SIXTEEN PATCHES to 2.6.16, a theoretically 'stable' kernel, in about six weeks. That's more than a patch every three days.

They kernel devs are fundamentally waving their hands in the air, expecting 'the distros' to make their code work. Quality is *not something that can be retrofit*. The pay vendors, like Red Hat, seem to be doing an okay job, but Debian, at least, is really having trouble. They mostly publish Linus trees with just a few patches. 2.6.16 has been a *disaster*.

This model of kernel development works well only for developers. For the rest of the world, it is a trainwreck. Stable code needs to be stable and supported for a couple of years, not dropped like a hot rock after 2 months. Linus et al are seriously risking Linux's ownership of the word 'stable' in the market. Once that's lost, it won't be easily regained. (Just ask Microsoft!)

2.4 has been out about five and a half years: THAT is stable software. 2.6.16.16 has been about one hour as of this writing: the lifetime of 2.6.16.15 was two entire days. A 'stable' piece of software that requires sixteen patches in six weeks is NOT STABLE. 2.4 has had, as of this writing, 32 patches in its ENTIRE LIFESPAN of 5.5 years.

Linus NEEDS to appoint a 2.6 maintainer, and go play in the 2.7 sandbox, instead of putting the entire world through this alpha-quality crap.

Hear! hear!

Posted May 11, 2006 8:27 UTC (Thu) by dion (guest, #2764) [Link]

If only Linus would see it that way...

Maybe a bugmaster is really just another name for "maintainer of the stable release"?

Kernel bugs: out of control?

Posted May 11, 2006 8:51 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (9 responses)

That's definitely one side of the coin, and if its the only side that resonates with what you do (e.g. if you have a pile of 3-5 year old web servers running a patched Linux) then I'm glad you're able to enjoy 2.4 series kernels and that you feel so strongly about quality.

However, the OTHER side of the coin is that a lot of people own hardware that wasn't supported five micro versions ago (2.6.11) and didn't work well enough to be really acceptable as a main desktop/ laptop system until the last two or three versions, and there are people who are waiting for things in 2.6.17 or so before their hardware will work.

Under the old regime, my laptop would have stopped at "boots, but most things don't work properly" pending the transformation from 2.7.x to 2.8 stable, presumably some time in 2007. Obviously that means I wouldn't have bought this laptop, but instead an older, hard to obtain model with much worse specifications. Most people would have just junked Linux and gone to Windows or OS X.

Secondly it's unclear whether the bugs being fixed in new stable patches are just bugs that would previously have sat quietly in a queue for weeks, or (in the stable kernel series) for months, waiting for a new major release. We're quick to shout "foul" when Microsoft says that dozens of security fixes wrapped up as a Service Pack counts as one patch versus hundreds of separate package updates in a Linux distro, so we should be cautious before assuming that e.g. this week's fixes to SCTP would have been unnecessary in the "stable 2.6 series" world, rather than simply being delayed 12 weeks and then included without fanfare.

Nothing whatsoever prevents you from getting a gang of people together to maintain say 2.6.16-stable indefinitely. If you do it for more than a month or two and look serious about it, Linus will probably even bless it.

Kernel bugs: out of control?

Posted May 11, 2006 9:04 UTC (Thu) by kleptog (subscriber, #1183) [Link]

I think there can be a middle ground. Your argument is based around the fact that an unstable series would take years before releasing a stable 2.8. So let pick something easier.

Simply annouce that no new features will be accepted for two or three months, while a concerted effort is made fix bugs and regressions. A few months is nothing in the timescale of hardware development and would (hopefully) produce something that people can trust.

Whether it works is another question. If people don't like fixing bugs then no amount of time will help. But I think it's something worth trying.

Kernel bugs: out of control?

Posted May 11, 2006 9:38 UTC (Thu) by malor (guest, #2973) [Link] (7 responses)

Well, other than skill, anyway. I can fix makefiles and the occasional include, but that's about as far as I go with C. Not exactly stable-kernel-maintainer quality. :) I know the Debian kernel maintainers are clueful, and they're having a hell of a time, so thinking that I'd do any better is rather silly.

The lack of hardware support might not have been a problem... they backported many drivers to 2.4. Adding drivers doesn't generally interfere with the functionality of the kernel. As is, to get security patches or new drivers, you're forced to take new features too, maybe features you don't particularly want. With a stable kernel, you could probably have both a nice lack of kernel panics AND hardware support for your new devices. It certainly used to happen during 2.4/2.5.

The problem with the backporting is that it's boring work, and the kernel devs don't like to do it. Stability is also boring work, with a similar outcome. The new kernel development model is for THEM, not for the millions of people who built companies and careers based on the original social contract: "This code is as good as we know how to make it. If it breaks, you keep the pieces, but we'll do our damndest to make sure it doesn't break."

Linux 2.2 was incredibly stable; it NEVER fell over. (well, at least with the loads at the time, which were pretty minimal :) ) Best piece of software I've ever run. Every once in awhile, someone would mention a kernel panic on IRC or Slashdot, and the result would usually be mild incredulity... "oh wow, it crashed?! whoah. Dude. Go buy a lottery ticket!"

The first time this thread came up on Slashdot and I posted this rant, more or less, very few people agreed with me. (even though I'd been struggling with bugs for months.) It came up a couple of months ago, and there were a number of people this time chiming in with similar stories. It came up again a week or so ago, and this time the thread was FULL of people complaining. The haters are being marginalized ("you idiot! go run Windows!")... because people are cluing in that this high-speed development cycle is death for stability.

The single best comment I saw on Slashdot said this, roughly: with the kernel in such a state of flux, different parts of it will be stable at different times, and it will never all be stable at the _same_ time.

Kernel bugs: out of control?

Posted May 11, 2006 16:22 UTC (Thu) by smoogen (subscriber, #97) [Link] (2 responses)

Well, I have an idea.. buy yourself a kernel developer to do this work for you. The tenor and tone of your posts seem to be expecting them to do this work for free for you.

Kernel bugs: out of control?

Posted May 11, 2006 18:55 UTC (Thu) by richardr (guest, #14799) [Link]

Well, curiously enough, that is pretty much what I do expect. I'm an end user when it comes to linux
kernels, and what I want to hear is that "it just works". However, I have happily spent money on
distributions I could download for free in the expectation that the money would go to developers to
improve performance and *squash bugs*, and to that extent I have bought (a very small part of) a
kernel developer.

Kernel bugs: out of control?

Posted May 11, 2006 23:13 UTC (Thu) by malor (guest, #2973) [Link]

As the other poster said... I have put a great deal of money (thousands of dollars) into Linux and ancillary products over the years. Far, far more than I've spent on Microsoft products. Part of that money has gone to pay kernel devs at places like Red Hat, and likely has indirectly resulted in the creation of Linux-related jobs. Some of that money goes to LWN.

What I'm asking for here benefits all of us... you, me, AND the kernel devs. Stability and security are what got Linux to this point, to where Linux experience is a good thing to have on a resume, to where you can get good jobs knowing only Linux.

That will not remain true if the fundamental strengths of Linux are lost in a chase for 'development speed', which benefits primarily the developers, and not so much the Rest of World.

Kernel bugs: out of control?

Posted May 11, 2006 19:16 UTC (Thu) by oak (guest, #2786) [Link] (3 responses)

> The problem with the backporting is that it's boring work, and
> the kernel devs don't like to do it. Stability is also boring work,
> with a similar outcome. The new kernel development model is for THEM

Are you argumenting that if the kernel development tools and processes
were more cumbersome for the kernel developers, the code quality would
improve? Let me doubt that...

> Linux 2.2 was incredibly stable; it NEVER fell over.

It also supported a lot less hardware. Note that while the number of
components grows linearly, the possible interactions between them grow
exponentially.

> even though I'd been struggling with bugs for months

How good bug reports you made of them? Bugs cannot be fixed
if they are not known...

Kernel bugs: out of control?

Posted May 11, 2006 23:52 UTC (Thu) by malor (guest, #2973) [Link] (2 responses)

Yes, I think exactly that. Development speed is not the same as code quality. The new process is tuned to let them do more of the 'fun' stuff (writing new code), and force them to do less of the 'unfun' stuff, like making sure things actually work. It's also to force more testers; they've explicitly said one of the reasons they're doing it this way is to force people to test new code.

I don't mind testing code when there's a call for testers (and if I can slot in some time). I do mind being forced to test beta-quality code by them calling it 'stable' and refusing to support code that's more than two months old.

As far as exponentiation goes, you're exactly right... I'm not sure if I hit this idea yet in this thread. What that means is that as the kernel grows, development needs to slow down, to cover all the various interactions. Instead, they're _speeding up_, not testing, and expecting the Rest of World to fix their problems.

I tried to report the APIC bug on that VIA board. I first emailed the ACPI author (got my acronyms confused :) ), who very promptly replied, and politely told me I was talking to the wrong person. Then I tried mailing the APIC maintainers twice, but didn't get a reply. I dropped it after that... probably should have sent it to the catchall address, but forgot it. And now I don't have the board anymore, so a bug report won't be very useful.

The 865 bugs I can't diagnose, because it's all remote, so I haven't even tried to report it. Those machines are production, and I can't afford to take them down for testing. So my bug report wouldn't be very useful. And 2.6.16 has worked well so far, although the unending reboots are painful.

And 2.6.0 through 2.6.8 or so worked great on that same board.... so I really shouldn't have HAD to file bug reports. I was, after all, tracking a 'stable' kernel. Stuff that worked in 2.6.0 should work in 2.6.16.

Kernel bugs: out of control?

Posted May 12, 2006 0:02 UTC (Fri) by malor (guest, #2973) [Link]

Oops, I inserted a paragraph in the wrong place. If you swap the last two paragraphs, it'll be more readable, although the concluding note will be in the wrong place. :)

Kernel bugs: out of control?

Posted May 16, 2006 17:49 UTC (Tue) by oak (guest, #2786) [Link]

> And 2.6.0 through 2.6.8 or so worked great on that same board....
> so I really shouldn't have HAD to file bug reports. I was, after all,
> tracking a 'stable' kernel. Stuff that worked in 2.6.0 should work
> in 2.6.16.

You cannot really expect that unless you know that there's
a regularly executed test setup:
- with the same HW as yours
- with the similar software and same kind of load as yours

For example fixing a bug (for a setup developer has) might make
(e.g. an already existing) bug somewhere else in the code happen
more likely in your setup.

Only testing and error detection inside & outside kernel can
help in catching those. The testing has to be automated,
it should not produce (too many) false positives, and it has
to pinpoint fairly well where the problem happens so that
the bugs can be fixed. Otherwise only alternative developer
has is to resolve the bugs as WORKSFORME.

It would be nice if kernel developers would provide an automated
test-set for people who "live on the bleeding edge" which they could
run on their test setups before deploying the kernel on production
machine. If the test-set outputs an error, you could just forward
it to kernel.org and the automatically produced bug report would
have all the relevant info; your kernel config, HW info, OOPS etc...

If the automated test-set would go through, then you could do your
own tests on the kernel before putting it into real use. And if
those fail, you could propose tests to be added to the automated
test-set so that those kind of problems are caught earlier.

Kernel bugs: out of control?

Posted May 11, 2006 10:08 UTC (Thu) by nix (subscriber, #2304) [Link] (28 responses)

You contradict yourself. First you say that the way to make code stable is to stop adding features (which is, to a degree, true, especially of poorly-structured codebases which don't do interfaces fixes to eliminate many bugs at once); and then you say that quality isn't something that can be retrofitted.

Of course, the process of stopping adding features and starting fixing bugs *is* precisely retrofitting quality (in the extremely crude sense of 'lack of bugs' that you seem to be using).

I've been running vanilla 2.6 on numerous production and testing systems, some under extreme load. I've had a few problems, but all were before 2.6.14, and all got patches from the l-k list with amazing speed.

It has *always* been true that if you have weird, rare, or old hardware that no developer has got, you'll have to be willing to partially maintain the driver yourself, or use a *supported* distro kernel which ships with that driver enabled (and thus they commit to maintain it), or watch it rot: the only difference between the old 2.4 world and the 2.6 one is that it rots faster, because development is faster than it was.

2.4 is stable, yes: it's also far less capable than 2.6. I wouldn't even run it on a firewall anymore, myself; but nothing stops you running it for as long as you like.

The 'oh dear there are too many -stable kernels' stuff is a canard which has been repeatedly demolished: would you rather security fixes were not made, or were delayed for days or weeks after being made? Most of those SIXTEEN PATCHES were only a few lines long, and trivially reviewable by eye. There have been, IIRC, *three* patches addressing non-security-hole-related stability bugs; that's a patch every couple of weeks.

(And as for 'owning the word stable in the market', well, *pfui*. I wasn't aware that owning words was any kernel dev's responsibility, or particularly interesting.)

Kernel bugs: out of control?

Posted May 11, 2006 10:35 UTC (Thu) by malor (guest, #2973) [Link] (27 responses)

The present system is 'we release the code, the distros have to make it work'. Quality can't be retrofit; if it wasn't there to begin with, it can't be added later, especially not by other people. Having a stable kernel will not, in and of itself, make the code better, but one would hope the new emphasis on stability just might.

As far as vanilla 2.6 goes.... 2.6.14 broke _traceroute_. I mean, come on.

2.6.15 as distributed by Debian is completely unusable on the Intel i865 machines I've tested it on; it crashes randomly, within an hour, light or heavy load. Every time, without fail. And because the machines are remote, I can't easily troubleshoot. And it's not like the 865 chipset is, you know, rare.

All versions of 2.6 since 2.6.9 or so have been unstable on my (one, personal) KT333-based board... this particular error cost me several hundred dollars, as I replaced a drive that didn't need replacement. It was actually APIC errors that were introduced around that time. I ended up just replacing the motherboard. (I may end up replacing the OS, too.)

The fact that the patches to 2.6.16 are so trivial just means that the code wasn't properly reviewed before released as 'stable'. And it does not change the fact that I've had to reboot my Debian servers ten times or so in the last month. (Before you start laying into me for using 'unstable'....I can't use the stable kernel, because it doesn't support all the hardware properly, and 2.6.15 in testing crashes after an hour, so the unstable kernel is all I can use.)

Owning words may not be their responsibility, but it's certainly in their self-interest. The existence of OSDL and Linus' present job are a direct outcome of that word ownership. If they lose it, then there will be fewer paid kernel dev jobs created in the future. The more central Linux becomes, the more people trust it and need it in their daily lives, the more jobs to work on it there will be. So it's rather foolish of them _not_ to think about it.

Kernel bugs: out of control?

Posted May 11, 2006 12:26 UTC (Thu) by nix (subscriber, #2304) [Link] (15 responses)

Well, the 2.4 kernels, which you applaud as being so terribly stable: the right word for them would be 'stale'. No devs to speak of ran them, so many old bugs accumulated and were never fixed, and because it diverged so much from the dev tree, many fixes accumulated in the 2.4 tree which were never forward ported!

Simply making a stable tree with no dev tree won't help: the developers will simply use their own private trees, or some other dev's git repository.

Your real complaint appears to be 'it doesn't work on my hardware and I'm not willing to go to the effort needed to track down the problem and get it fixed'. (There are numerous ways to troubleshoot remote machines; look up the network kernel syslog option, for instance. Even panics get dumped out of that.)

And as for the 'trivial patches imply improper review', well, anyone with any experience of software maintenance would know this is untrue: many of the most subtle bugs need very small patches to fix them once you've finally worked out the cause (e.g., one extra lock around some data structure on which multiple concurrent accesses are racing: two or three lines, a sod to find and very hard to reproduce.)

Kernel bugs: out of control?

Posted May 11, 2006 12:48 UTC (Thu) by malor (guest, #2973) [Link] (14 responses)

Well, at this point it's stale, sure. But it wasn't when 2.6 was released. If 2.6 had a maintainer and Linus was off in 2.7, like they've always done before, things would be just fine, as they were in 2.4... after Linus quit messing with it, anyway.

My real complaint is that it DID work on my hardware and then STOPPED working for no discernible reason, and I don't generally have time to do much troubleshooting. Things ceasing to work in a STABLE kernel series is very bad. I'm willing to test beta kernels when there's a general call for that, but I don't do it routinely. At least, I didn't, until the new dev system forced me to. I would have no problem with device breakage after switching stable versions, but if something works in 2.6.0, it should still work in 2.6.16, as far as I'm concerned.

As far as patches go... ok, I'll accept your 'short patches don't mean it wasn't hard to figure out' argument... you're correct there.

It would probably be better to say that 'trivial patches imply lack of QA/testing', rather than lack of review. A trivial patch means the design was right, but the implementation was wrong, and that's the sort of thing that should be caught in testing.

It still doesn't change the fact that Debian users on 2.6.16 have seen one hell of a lot of downtime lately.

Kernel bugs: out of control?

Posted May 11, 2006 13:57 UTC (Thu) by k8to (guest, #15413) [Link] (12 responses)

I am curious about this downtime. Is it related to specific hardware? I
am a Debian user running 2.6.16 with no perceived problems. Well, other
than the swapping caused by firefox ;-)

Kernel bugs: out of control?

Posted May 11, 2006 14:42 UTC (Thu) by malor (guest, #2973) [Link] (11 responses)

Debian has updated the 2.6.16 kernel many times in the past month, requiring a lot of rebooting(ie, downtime). If you haven't been updating, you haven't had downtime, but you've also left unfixed a number of security and DOS issues. Not too important for most home users, but quite a different thing for servers.

Kernel bugs: out of control?

Posted May 11, 2006 15:33 UTC (Thu) by vmole (guest, #111) [Link] (10 responses)

Why are you running debian testing/unstable on production servers?

And you're complaining about security patches? You can't have it both ways: either they remain unpatched, or you have downtime. Those bugs existed long before 2.6.16, that just happens to be the one they're looking at and patching. If they'd called (say) 2.6.8 stable, there would still be those patches. Sure, maybe not the exact same ones, but many the same, and others that have been fixed between 2.6.8 and 2.6.16.

Kernel bugs: out of control?

Posted May 11, 2006 23:59 UTC (Thu) by malor (guest, #2973) [Link] (9 responses)

I'm running testing on production servers because testing is current enough to be useful and essentially never breaks things. I'm just using the kernel from unstable, becuase that's the only one that works in all cases. I actually had to use a Ubuntu kernel for awhile when things were really bad with 2.6.15.

Linux 2.4 always pushed security fixes out right away... I don't remember Marcelo sitting on security patches. He'd accumulate a bunch of non-security stuff and roll it out all at once, but security patches were immediate release. And in the 5.5 years of 2.4's existence, it's had 32 total releases... and 10 of those were when Linus was still tinkering with it. So 22 is more accurate.

22 patches in 5 years, I can handle, particularly since many of them were optional... just new drivers, not security fixes, which meant they could be deployed whenever there was time.

16 patches in five weeks, nearly all of them immediate must-install security fixes... that's not so good.

Kernel bugs: out of control?

Posted May 12, 2006 13:51 UTC (Fri) by vmole (guest, #111) [Link] (7 responses)

I want a kernel that supports the latest hardware
I don't want the kernel to change, or have bugs

Pick one.

Kernel bugs: out of control?

Posted May 16, 2006 16:23 UTC (Tue) by hazelsct (guest, #3659) [Link] (6 responses)

No. It's more like, don't call it "stable" unless/until it is.

Kernel bugs: out of control?

Posted May 18, 2006 6:48 UTC (Thu) by gowen (guest, #23914) [Link] (5 responses)

He's running Debian unstable. And he's complaining that it's unstable. Linus is not the only one struggling with nomenclature.

Kernel 2.6 has a stable branch. The stable branch of 2.6.x is called 2.6.x.y, for large values of y.

Kernel bugs: out of control?

Posted May 18, 2006 9:16 UTC (Thu) by malor (guest, #2973) [Link] (1 responses)

You're not reading what I'm saying. I'm using THE KERNEL from Debian unstable, because the kernel from testing doesn't work at all (2.6.15 crashes within an hour in my 865 machines), and the kernel from stable doesn't support all my hardware. I use nothing else from unstable on production servers. I have exactly one machine running the actual unstable distribution in its entirety, because that one clues me in when there's (yet another) kernel patch.

Debian's kernel is pretty much vanilla 2.6.16. Linus et al call Linux 2.6.16 'stable'.

The kernel devs' expectation that 'the distros' will magically fix all their bugs amounts to simple handwaving, shirking of their fundamental responsibility: when they call it stable, it should BE STABLE.

Software that's supported for only two months is not, pretty much by definition, 'stable'.

Kernel bugs: out of control?

Posted May 21, 2006 17:00 UTC (Sun) by nix (subscriber, #2304) [Link]

Actually, stable means `we think it will work'. Length-of-support has nothing to do with it.

Kernel bugs: out of control?

Posted May 18, 2006 9:17 UTC (Thu) by malor (guest, #2973) [Link] (2 responses)

And which 'stable' 2.6 kernel do I choose? And define 'large value of Y'. Bu your definition, 2.6.16.15 should be 'stable', but it lasted THREE DAYS.

Kernel bugs: out of control?

Posted May 18, 2006 9:46 UTC (Thu) by arcticwolf (guest, #8341) [Link] (1 responses)

You're confusing (unintentionally, I assume) two distinct meanings of the word "stable" here. Stable can mean either:

1) Bug-free enough to not crash on most systems encountered in the wild (i.e., "stable" in the sense of "production-ready");
2) Not undergoing changes.

It's important to keep in mind that these are not related to each other. When you say "it lasted THREE DAYS", you apparently mean that it was replaced with a newer patch (.16) three days later - that's the second definition of stable. So, yes, in that sense, 2.6.16.x isn't stable, but that's just because the developers are actually fixing security issues that are found and releasing patches immediately.

Would you rather have them sit on those patches for weeks or months? Well, if you do, you can still have that; nobody's forcing you to apply those new patches.

But in any case, what Andrew Morton talked about was stability in the first sense, and that's a different beast. How long would it have taken for 2.6.16.15 to crash on your boxen? It's hard to say, but I'd guess that unless you'd have been rather unlucky, it would've been more than three days.

So, the answer to your question is: you choose the latest one that's available. Whether you continue to apply newer patches as they come out is your choice, not ours, and complaining that you have downtime when patching security issues in the *kernel* is pretty silly. That's how things are in the real world. (And it's still true that nobody's forcing you to apply anything, so if you'd rather avoid downtime than patch newly-found issues, just don't apply them.)

Kernel bugs: out of control?

Posted May 18, 2006 10:05 UTC (Thu) by malor (guest, #2973) [Link]

It feels like you've read about three sentences of what I've written here, and you're reacting just to that. Most other replies I've put in this thread address these issues. I'd suggest reading them... I'm not going to repeat all of them here.

The strongest objection I have to the current model is that we are forced to take new features with our bugfixes, because they will not support kernels for more than two months. New features = new bugs. New bugs = new patches. New patches = new features. New features = new bugs. And so on.

'Stability', as defined from the point of view of the Linux kernel, should mean:

1) It's maintained with security patches;
2) No fundamental new features are added;
3) Drivers are added, if possible, without violating #2.

In other words.... do it like 2.4 did it, after Marcelo took over. If a new network card comes out, of course you can add the driver to the source tree... it's not going to affect anyone else. If that new driver requires an update to the memory management model of the kernel, then you don't include it in the stable branch, but rather in the dev tree.

I think they might have retrofit the USB system in 2.4... it's been awhile, and I wasn't following it closely, because I didn't need to. I do know that their backports from 2.5 were done without large-scale overhauls of kernel subsystems; they kept the changes focused and very limited. And, by and large, the 2.4 kernel was very stable. It wasn't as solid as 2.2, but it was quite acceptable.

Basically, the kernel devs had the model NAILED during 2.4. This high-speed 2.6 development, on the other hand, is an absolute disaster. These guys are some of the smartest in the business, but they are still human, and they are running into the limitations of their own intelligence. The code has become too complex for them to maintain... it's hard and nasty and difficult work now, and instead of slowing down development, they're ignoring the bugs and SPEEDING UP instead, apparently because that's more fun.

It's significantly less fun for people trying to keep production machines running.

Andrew Morton is most unhappy about the quality of the kernel. That should tell you something.

Kernel bugs: out of control?

Posted May 21, 2006 16:58 UTC (Sun) by nix (subscriber, #2304) [Link]

Most of the security fixes weren't `immediate must-install'. A goodly number related to single drivers or obscure new protocols: SMB/CIFS, SCTP...

... I mean, the SCTP code is, what, a kernel release old? Thus, it has bugs; some of which may be remotely exploitable (by the nature of network protocol code). How terribly shocking.

Kernel bugs: out of control?

Posted May 21, 2006 16:56 UTC (Sun) by nix (subscriber, #2304) [Link]

Actually, 2.4 was classifiable as `stale' as soon as most devs weren't running it. 2.5 was unusual becuse the abortive IDE rework made it so unstable that the devs stuck with 2.4; but after that was reverted, 2.4 staled out (to coin a horrible neologism) really rather fast.

Kernel bugs: out of control?

Posted May 11, 2006 13:54 UTC (Thu) by k8to (guest, #15413) [Link] (9 responses)

My impression is that of someone who feels that they know how to engineer
software quality more successfully than the linux kernel developers, but
yet does not know how to engineer software. Perhaps I have made a
mistake in my readings.

In complex software, security problems crop up, patches are released, and
installing them is a certain level of annoyance. If the approach of
continuing to re-engineer interfaces and systems to eliminate categories
of problems offends you, then the linux kernel in general should offend
you, since this has been the mode of operation since day one. There _are_
other free unixes which have a much more conservative approach. They are
not horrible.

Kernel bugs: out of control?

Posted May 11, 2006 14:46 UTC (Thu) by malor (guest, #2973) [Link] (3 responses)

Linux used to be stable while also doing those things in a development branch. It no longer is: the development branch and the mainline kernel are one and the same thing, forcing us all into alpha testing.

No, I'm not a developer, but I have been using Linux a long, LONG time. (in around kernel 0.8 or 0.9). So I'm certainly qualified to comment on the way it used to be (stable) and the way it is now (unstable). The development process and lack of focus on quality would appear to be the cause.

Do you have an alternate explanation?

Kernel bugs: out of control?

Posted May 11, 2006 15:03 UTC (Thu) by k8to (guest, #15413) [Link] (2 responses)

I agree that the kernel development is being handled differently now,
which is resulting in a larger number of releases in the stable line. I
do not agree that this indicates a lower level of quality. I think it is
simply factual that the one does not necessarily imply the other.

Kernel releases with corrected functionality are being created faster
than in the past, enabling users to get these fixes sooner. If users
feel the need to reboot for every one of these updates, then the fixes
may be seen as something of a nuisance. However, the alternative is to
not apply the fixes. In the past, the choice was not available, fixes
were provided less frequently and less rapidly, and so there was a longer
window of vulnerability and no possibility for frequent reboots. You can
simulate the old situation by installing fewer kernels.

Kernel bugs: out of control?

Posted May 13, 2006 20:19 UTC (Sat) by Baylink (guest, #755) [Link] (1 responses)

This sub-thread speaks to a topic near and dear to my heart: what does a version number *mean*?

Let me quote here my contribution to the Wikipedia page on the topic, based on my 20 years of observation of various software packages:

A different approach is to use the major and minor numbers, along with an alphanumeric string denoting the release type, i.e. 'alpha', 'beta' or 'release candidate'. A release train using this approach might look like 0.5, 0.6, 0.7, 0.8, 0.9 == 1.0b1, 1.0b2 (with some fixes), 1.0b3 (with more fixes) == 1.0rc1 (which, if it's stable enough) == 1.0. If 1.0rc1 turns out to have bugs which must be fixed, it turns into 1.0rc2, and so on. The important characteristic of this approach is that the first version of a given level (beta, RC, production) must be identical to the last version of the release below it: you cannot make any changes at all from the last beta to the first RC, or from the last RC to production. If you do, you must roll out another release at that lower level.
The purpose of this is to permit users (or potential adopters) to evaluate how much real-world testing a given build of code has actually undergone. If changes are made between, say, 1.3rc4 and the production release of 1.3, then that release, which asserts that it has had a production-grade level of testing in the real world, in fact contains changes which have not necessarily been tested in the real world at all.

The assertion here seems to be that an even higher level of overloading on version numbering ("even revision kernels are stable") and it's associated 'social contract' are no longer being upheld by the kernel development team.

If that's, in fact, a reasonable interpretation of what's going on, then indeed, it's probably not the best thing. I'm not close enough to kernel development to know the facts, but I do feel equipped to comment on the 'law'.

Kernel bugs: out of control?

Posted May 17, 2006 23:17 UTC (Wed) by k8to (guest, #15413) [Link]

I think your comments on versioning are not far from the mark. The fact
of these "minor" stable relases, eg. 2.6.X.Y, is that they are _smaller_
changes than have ever occurred in the stable series before. It is true
that these smaller changes do not receive widespread real-world
production evaluation, but no non-stable release kernel (rc versions
included) ever receives enough attention to catch even some showstopper
bugs.

So I think you are right to question this change, but the balancing facts
are that the release candidate process for the Linux kenel doesn't seem
very effective, and the changes made in the revision series are
_strongly_ conservative.

It is important to remember that in this particular (highly visible,
highly open) development process, there is very little pressure to
deviate from the conservative perspective in these updates.

Kernel bugs: out of control?

Posted May 11, 2006 19:04 UTC (Thu) by oak (guest, #2786) [Link] (4 responses)

> If the approach of continuing to re-engineer interfaces and
> systems to eliminate categories of problems offends you,
> then the linux kernel in general should offend you, since
> this has been the mode of operation since day one.

This reminds me of the recent change in Glibc, they now
abort programs which do double frees.

Yes, more programs may now be "appear unstable", but I personally
prefer application rather being terminated than silently corrupting
my data when they hobble forward with inconsistent state.
Broken apps should be shot down as soon as possible so that
people know to fix them, this is the Unix way.

If you don't force quality, you don't get it.
You end up with an unmaintainable mess instead.

> There _are_ other free unixes which have a much more
> conservative approach. They are not horrible.

I'm sure the person complaining here would then
complain about the lack of features and HW support...

Kernel bugs: out of control?

Posted May 11, 2006 23:24 UTC (Thu) by malor (guest, #2973) [Link] (3 responses)

I agree with you about forcing quality... that's a great idea. If I thought the new development process would actually DO that, I'd be enthusiastically behind it. Instead, it's just about speed, speed, speed... and avoiding the stuff that's no fun to do, like bugfixing and testing.

Waving your hands in the air and expecting other people to fix your programs is not, in my long experience supporting developers, the way to get it fixed, particularly not properly.

As far as switching OSes goes, I've already stopped using Linux on my firewalls because of the unending stream of security reboots. Netfilter is faster and more featureful than OpenBSD's pf, and its language is more amenable to shell scripting, but the first mission of a firewall is to stay up. I can throw OpenBSD on a firewall and not have to update it again for a couple of years. This means no downtime, which means happy users. I've never seen any Linux kernel that lasted that long without security holes.

FreeBSD is looking better all the time... I've been talking about switching over, but haven't yet. If matters continue as they have, maybe I will. And you'll have one less complaining user, which, from your tone, you may prefer.

Kernel bugs: out of control?

Posted May 15, 2006 4:27 UTC (Mon) by ChristopheC (guest, #28570) [Link] (2 responses)

I think it is unfair to say the kernel developer do not test their patches. However, they can only test them on the few combinations of hardware they have access to.

To discover the bugs, the kernel needs wide-spread testing. But few people are willing to test the development releases (-rc) - the problem has been mentioned countless times on lkml and here on lwn. So they have to release often toge tthe needed coverage. (This is a somewhat simplified explanation, of course)

Kernel bugs: out of control?

Posted May 15, 2006 5:48 UTC (Mon) by malor (guest, #2973) [Link] (1 responses)

2.6.14 broke *traceroute*. Give me a break.

Kernel bugs: out of control?

Posted May 21, 2006 17:05 UTC (Sun) by nix (subscriber, #2304) [Link]

Er, how often do you *run* traceroute? I don't run it so often myself that I'd notice immediately if it broke. It could easily be a week or so between runs...

Kernel bugs: out of control?

Posted May 12, 2006 19:53 UTC (Fri) by chromatic (guest, #26207) [Link]

Quality can't be retrofit; if it wasn't there to begin with, it can't be added later, especially not by other people.

How can this possibly be true? Consider OpenBSD's auditing process, for example.

New bugs? Old Bugs? Many eyes?

Posted May 11, 2006 16:43 UTC (Thu) by southey (guest, #9466) [Link]

The problem is that it is not always clear when the bug was introduced. I am not knowledgable in the kernel but these are latent bugs that either never got exposed by the kernel or the kernel could previously recover from. In some cases (one very recently) adding a new feature actually resulted in finding a bug. Alternatively these new features may also rewrite code that removes old bugs but may also introduce new ones.

I think the real problem is being able to replicate these bugs. As evident from the comments to this article and the one for X.org.

Kernel bugs: out of control?

Posted May 11, 2006 22:56 UTC (Thu) by vonbrand (subscriber, #4458) [Link] (2 responses)

Please, don't compare the 16 surgical patches to 2.6.16 in the stable series with the huge patches to 2.4.x (at a short glance, they seem to average around 1MiB bzipped; compared to 10KiB tops for 2.6.16.y)

Kernel bugs: out of control?

Posted May 11, 2006 23:34 UTC (Thu) by malor (guest, #2973) [Link] (1 responses)

I can remember only three or four 'emergency' patches to 2.4, after Linus branched to 2.5... ones that needed to be installed immediately because they were DOS or security fixes. The size of most of those patches is adding new drivers. When an emergency fix was needed, 2.4 had it out in a day or two, but for the most part, it rolled up a bunch of stuff at once so that people could test the upcoming kernel. I tested a few of those, but never saw any problems with them. Marcelo was wonderfully conservative in his philosophy about patching. (ie, don't add features, fix the old ones, and just add drivers.)

The vast majority of the 16 updates to 2.6.16 have been security-related, and they've required immediate reboots, at least if you're security-conscious. Whether they're 10KiB or 10MiB, they still are primarily to fix security problems. Running a Linux 2.6.16 free of known security holes this month, in other words, has necessitated a reboot every two or three days.

I don't remember that ever happening on ANY earlier kernel, from 0.8 through 2.4, though you're certainly welcome to correct me if I'm misremembering. All security updates to 2.4, as far as I know, came through _really_ fast, so the patch *releases* should be very comparable in terms of total numbers. 16 releases in five or six weeks versus 32 in 5.5 years looks like total shit, IMO. And Linus didn't even go to 2.5 until about Linux 2.4.10, so you could argue that it's really 16 versus 22.

Kernel bugs: out of control?

Posted May 21, 2006 17:09 UTC (Sun) by nix (subscriber, #2304) [Link]

If you're rebooting whenever *any* security fix to 2.6 -stable comes out, you're wasting your own time. Read the changelogs, or preferably the patches: if you're not even compiling in the code which was fixed, there's no point upgrading.

The patches are *short*. Exploit that. :)

Personally, my firewall is a UML-based virtual machine, and the bridge to the external world has no IP address on the host, so that most attacks don't affect the host at all, but are passed straight through to the UML instance. Immediate security fixes are a matter of bouncing that instance: perhaps a minute and a half of network downtime, and most of *that* is ADSL negotation delay. The only annoyance is the dropping of persistent connections.

If you have vast amounts of state on your firewall, such that rebooting it is hard, you're doing something *very* wrong.

Kernel bugs: out of control?

Posted May 11, 2006 11:43 UTC (Thu) by lacostej (guest, #2760) [Link] (2 responses)

I am an optimist kind of guy. I believe the community has the mean to push the limits of software development. If that means have an somewhat less stable kernel, it's OK. No-one's forcing you to upgrade. Things will improve little by little. In the end everybody will gain of having the release process the way it is today. I truly believe in release early release often.

One day, we may be able to have machines with multiple systems that allow us to boot concurrently 2 systems on the same hardware. This allowing us to check the quality of the newer releases. We may also have a real P2P distributed kernel testing process, allowing to detect regression and issues faster.

Kernel bugs: out of control?

Posted May 11, 2006 11:58 UTC (Thu) by malor (guest, #2973) [Link] (1 responses)

Well, actually, yes they are. They won't do bugfixes after two months, so if Linux develops a security issue, I _have to_ upgrade to get it fixed. Debian only supports a few kernels, and the only one that works for me in all situations is the one in unstable. Yes, RHEL does a good job of backporting security fixes, but I don't _run_ RHEL.

With the fixed holes come new features, with new holes of their own, forcing yet more upgrades. And it requires constant admin attention, to see if something in the latest round of new features is going to be a security issue in the target environment. Admin attention is expensive; the software may be free, but this dev system is still costly. It was much easier when you could schedule a block of time every year or two to learn new features.. now it's forcibly done on an interrupt basis. That's very inconvenient, and very problematic for time management and security.

Kernel bugs: out of control?

Posted May 12, 2006 15:45 UTC (Fri) by snitm (guest, #4031) [Link]

You seem like a high maintenance individual ^H^H^H^H^H^H^H^H^H^H environment deserving of RHEL or some other enterprise distro. If cost is an issue you should look at embracing a RHEL clone. CentOS tracks RHEL closely; run CentOS and upgrade to the RHEL4 update kernels as they are released.

Kernel bugs: out of control?

Posted May 11, 2006 13:50 UTC (Thu) by iabervon (subscriber, #722) [Link]

I think the main issue is that kernel developers are relatively rarely put in direct contact with people who are seeing bugs. From my reading of the mailing list, people who report bugs there tend to get a bunch of attention and a resolution. Bugs reported in tracking software tend not to generate discussion. I think that kernel developers are generally interested in fixing bugs and solving people's problems, but they aren't so excited that they'll search databases, track down users who reported issues, and open a dialogue with them. There's a certain reward to making somebody happy, and this can be a good motivation for fixing bugs, but it depends on having contact with an actual individual with a problem who would be happy if you fixed it.

Kernel bugs: out of control?

Posted May 11, 2006 20:50 UTC (Thu) by edstoner (subscriber, #4496) [Link]

Well, this is a little late in coming with a lot of comments before, but...

I have no problems with the way things are. I have a large network to maintain (over 2600 clients) and I don't have any problems. All of our servers run Gentoo, and get HEAVILY used. I've yet to upgrade a kernel from bugs. When I build the boxes (which I do fairly frequently) I put the latest kernel (that is in Gentoo's portage system as vanilla-sources, but it seems to pretty closely track what the real latest kernel is) on them.

About 100 of our client systems run linux, and fairly shortly I'm expecting to switch almost all over to linux. If money is needed to get a full-time "kernel bug fixer/tracker", I herby right now pledge $5,000.00 a year (assuming that they promise to fix any bugs that I run into with the kernel). It is certainly worth that much to my organization. How can there not be 20 other organizations in the US alone where it is worth at least that much? How much does a "kernel bug fixer/tracker" cost?

Kernel bugs: out of control?

Posted May 12, 2006 10:50 UTC (Fri) by scarabaeus (guest, #7142) [Link] (1 responses)

The kernel folks should invest some work into an automated testing framework, one which allows you to run the tests without booting into the kernel. While automated tests, e.g. unit tests, won't make the problem go away, they are incredibly useful for realizing that your own changes to the code break some other part of it.

Actually, I'm a bit surprised that things do work so well without unit tests even though hardly anyone understands the entire kernel. With code of that complexity, usually when you pull at one end, something breaks at the other end. I guess the reason why the dev process still works so well is the intensive peer review of patches.

With the complex interdependencies within the kernel, writing test cases is certainly a challenge. For example, writing a case which tests the scheduler's behaviour in a certain situation won't be easy. But once a "scheduler test framework" is in place, it can be used for future work on the scheduler, so the work will pay off IMHO.

Kernel bugs: out of control?

Posted May 21, 2006 17:12 UTC (Sun) by nix (subscriber, #2304) [Link]

The problem is that a large number of the stability problems are in specific drivers and in the interaction of drivers with hardware... and you can't test that without having the hardware.

Most things are amenable to automated testing, but kernels are one of those things that aren't entirely so. (The non-driver parts, sure: I can imagine a UML-based kernel core testsuite, for instance. But the driver parts are where the nastiest bugs often lie.)

A positive spin on the kernel bugs issue

Posted May 12, 2006 18:01 UTC (Fri) by pr1268 (guest, #24648) [Link]

I'm not too sure anyone else sees the current rate of kernel bug fixes in the positive light I see them:

The sheer fact that bug fixes are coming faster is indication that people are adopting Linux in increasing numbers. The (relatively) few users of Linux of several years past found (relatively) few reasons to complain about a kernel bug. Fewer bugs were noticed therefore fewer bugs needed fixing.

Fast forward to the present. As pervasive as Linux usage is these days, doesn't it stand to reason that more bugs will get noticed? That's GOOD! Think about how the process works. As long as the bug reports keep flowing in, and the kernel developers keep troubleshooting and fixing the bugs, then the rate at which the bugs are being noticed really shouldn't matter. The kernel has grown in size consistently since before 1.0, and the number of bug reports has grown correspondingly.

I'd almost be more concerned if bug reports slowed to a trickle (or stopped completely). With a piece of software as large, powerful, and complex as the Linux kernel, this would surely indicate lack of usage or apathy. Seeing the bug reports tells me that people are using Linux more frequently, and doing the responsible thing of reporting a bug when they encounter one.

Kernel bugs: out of control?

Posted May 13, 2006 6:05 UTC (Sat) by error27 (subscriber, #8346) [Link]

I think that the 2.6.x releases aren't viewed as "major". The 2.6.0 release was major and people spent tons of time in tracking down bugs.

Back then the bug tracking wasn't in place so people had to hand compile lists of bugs and it was pretty hit or miss. These days people have bugzillas. The kernel.org is pretty worthless, but Fedora's is pretty decent. Suse's bugzilla is still revving up but that will be useful too.

I've got 3 main issues with 2.6.16. For aacraid, there is new debug code in Fedora and -mm that calls BUG(). With mptscsi there was a massive rewrite and now it doesn't work with my nStor. Sky2 is another complete rewrite that doesn't work on my hardware. The sk98lin driver that sky2 is replacing wasn't that great. The code was rough, the Makefile were astonishingly bad and it had issues with bonding. The sky2 developer is active so I'm happy about the progress being made.

Kernel bugs: out of control?

Posted May 13, 2006 15:32 UTC (Sat) by dps (guest, #5725) [Link]

Quite a few bugs affect only one kernels with a specific feature, for exmaple the recent smbfs bug requires you to use smbfs and have a cracked server. If there is an alsa or module exploit my (linux) firewall is not affected becaue it supports neother those features nor any devices not part of the box.

At least my firewall would not be content without the stateful inspection features of iptables. Without this I suspect the firewall would be more complex and provide less protection.

A new version of the mm integer overflow bugs or ping of doom would be much more exciting.

Measure first, fix later

Posted May 15, 2006 14:00 UTC (Mon) by walles (guest, #954) [Link] (1 responses)

As said in the article, since there's no good measurement of how buggy the kernel is (or what BUG() macros that trigger) the kernel's bugginess can't be neither quantified nor measurably improved.

Currently, when I've seen Linux systems panic, they have printed a bunch of information to screen and then given me the option to re-boot.

What *should* happen IMO, to make these things measurable, is to store that information somewhere it can survive the re-boot. At a convenient point in time, the up-and-running re-booted system should ask the admin if (s)he wants the bug to be registered in some central repository.

This way we'd have *a lot* more statistics on what usually goes wrong inside the kernel. And what parts need fixing the most.

That said, for me, Linux kernels usually work very well. But just because I don't have any problems doesn't mean nobody has them...

Measure first, fix later

Posted May 15, 2006 21:18 UTC (Mon) by Richard_J_Neill (subscriber, #23093) [Link]

This would be a very good use for those (almost) redundant floppy drives. After a kernel panic, you shouldn't touch the HDD in case you make it worse. But writing to an FDD would be a great solution.

*Warning* you'd have to make this activated manually, because it would trash anything that was already on a floppy belonging to an unsuspecting user.

Kernel bugs: out of control?

Posted May 18, 2006 21:04 UTC (Thu) by quintesse (guest, #14569) [Link]

So maybe Tanenbaum was right afterall and we should switch to a microkernel? *grin*

http://www.cs.vu.nl/~ast/reliable-os/