Kernel bugs: out of control?

Posted May 11, 2006 8:51 UTC (Thu) by tialaramex (subscriber, #21167)
In reply to: Kernel bugs: out of control? by malor
Parent article: Kernel bugs: out of control?

That's definitely one side of the coin, and if it's the only side that resonates with what you do (e.g. if you have a pile of 3-5 year old web servers running a patched Linux) then I'm glad you're able to enjoy 2.4 series kernels and that you feel so strongly about quality.

However, the OTHER side of the coin is that a lot of people own hardware that wasn't supported five micro versions ago (2.6.11) and didn't work well enough to be really acceptable as a main desktop/laptop system until the last two or three versions, and there are people who are waiting for things in 2.6.17 or so before their hardware will work.

Under the old regime, my laptop would have stopped at "boots, but most things don't work properly" pending the transformation from 2.7.x to 2.8 stable, presumably some time in 2007. Obviously that means I wouldn't have bought this laptop, but instead an older, hard to obtain model with much worse specifications. Most people would have just junked Linux and gone to Windows or OS X.

Secondly it's unclear whether the bugs being fixed in new stable patches are just bugs that would previously have sat quietly in a queue for weeks, or (in the stable kernel series) for months, waiting for a new major release. We're quick to shout "foul" when Microsoft says that dozens of security fixes wrapped up as a Service Pack counts as one patch versus hundreds of separate package updates in a Linux distro, so we should be cautious before assuming that e.g. this week's fixes to SCTP would have been unnecessary in the "stable 2.6 series" world, rather than simply being delayed 12 weeks and then included without fanfare.

Nothing whatsoever prevents you from getting a gang of people together to maintain, say, 2.6.16-stable indefinitely. If you do it for more than a month or two and look serious about it, Linus will probably even bless it.

Kernel bugs: out of control?

Posted May 11, 2006 9:04 UTC (Thu) by kleptog (subscriber, #1183)

I think there can be a middle ground. Your argument is based around the fact that an unstable series would take years before releasing a stable 2.8. So let's pick something easier.

Simply announce that no new features will be accepted for two or three months, while a concerted effort is made to fix bugs and regressions. A few months is nothing in the timescale of hardware development and would (hopefully) produce something that people can trust.

Whether it works is another question. If people don't like fixing bugs then no amount of time will help. But I think it's something worth trying.

Kernel bugs: out of control?

Posted May 11, 2006 9:38 UTC (Thu) by malor (guest, #2973) (7 responses)

Well, other than skill, anyway. I can fix makefiles and the occasional include, but that's about as far as I go with C. Not exactly stable-kernel-maintainer quality. :) I know the Debian kernel maintainers are clueful, and they're having a hell of a time, so thinking that I'd do any better is rather silly.

The lack of hardware support might not have been a problem... they backported many drivers to 2.4. Adding drivers doesn't generally interfere with the functionality of the kernel. As is, to get security patches or new drivers, you're forced to take new features too, maybe features you don't particularly want. With a stable kernel, you could probably have both a nice lack of kernel panics AND hardware support for your new devices. It certainly used to happen during 2.4/2.5.

The problem with the backporting is that it's boring work, and the kernel devs don't like to do it. Stability is also boring work, with a similar outcome. The new kernel development model is for THEM, not for the millions of people who built companies and careers based on the original social contract: "This code is as good as we know how to make it. If it breaks, you keep the pieces, but we'll do our damnedest to make sure it doesn't break."

Linux 2.2 was incredibly stable; it NEVER fell over. (well, at least with the loads at the time, which were pretty minimal :) ) Best piece of software I've ever run. Every once in a while, someone would mention a kernel panic on IRC or Slashdot, and the result would usually be mild incredulity... "oh wow, it crashed?! whoah. Dude. Go buy a lottery ticket!"

The first time this thread came up on Slashdot and I posted this rant, more or less, very few people agreed with me. (even though I'd been struggling with bugs for months.) It came up a couple of months ago, and there were a number of people this time chiming in with similar stories. It came up again a week or so ago, and this time the thread was FULL of people complaining. The haters are being marginalized ("you idiot! go run Windows!")... because people are cluing in that this high-speed development cycle is death for stability.

The single best comment I saw on Slashdot said this, roughly: with the kernel in such a state of flux, different parts of it will be stable at different times, and it will never all be stable at the _same_ time.

Kernel bugs: out of control?

Posted May 11, 2006 16:22 UTC (Thu) by smoogen (subscriber, #97) (2 responses)

Well, I have an idea.. buy yourself a kernel developer to do this work for you. The tenor and tone of your posts seem to be expecting them to do this work for free for you.

Kernel bugs: out of control?

Posted May 11, 2006 18:55 UTC (Thu) by richardr (guest, #14799)

Well, curiously enough, that is pretty much what I do expect. I'm an end user when it comes to Linux
kernels, and what I want to hear is that "it just works". However, I have happily spent money on
distributions I could download for free in the expectation that the money would go to developers to
improve performance and *squash bugs*, and to that extent I have bought (a very small part of) a
kernel developer.

Kernel bugs: out of control?

Posted May 11, 2006 23:13 UTC (Thu) by malor (guest, #2973)

As the other poster said... I have put a great deal of money (thousands of dollars) into Linux and ancillary products over the years. Far, far more than I've spent on Microsoft products. Part of that money has gone to pay kernel devs at places like Red Hat, and likely has indirectly resulted in the creation of Linux-related jobs. Some of that money goes to LWN.

What I'm asking for here benefits all of us... you, me, AND the kernel devs. Stability and security are what got Linux to this point, to where Linux experience is a good thing to have on a resume, to where you can get good jobs knowing only Linux.

That will not remain true if the fundamental strengths of Linux are lost in a chase for 'development speed', which benefits primarily the developers, and not so much the Rest of World.

Kernel bugs: out of control?

Posted May 11, 2006 19:16 UTC (Thu) by oak (guest, #2786) (3 responses)

> The problem with the backporting is that it's boring work, and
> the kernel devs don't like to do it. Stability is also boring work,
> with a similar outcome. The new kernel development model is for THEM

Are you arguing that if the kernel development tools and processes
were more cumbersome for the kernel developers, the code quality would
improve? Let me doubt that...

> Linux 2.2 was incredibly stable; it NEVER fell over.

It also supported a lot less hardware. Note that while the number of
components grows linearly, the possible interactions between them grow
exponentially.
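
As a rough illustration of the growth being described (an editorial sketch, not part of the original comment; the function names are made up for this example): if only pairs of components interact, the count grows quadratically, while the count of all combinations of two or more components grows exponentially. Which model fits a real kernel is an open assumption here.

```python
from math import comb


def pairwise_interactions(n: int) -> int:
    """Number of distinct pairs among n components: n * (n - 1) / 2."""
    return comb(n, 2)


def multiway_interactions(n: int) -> int:
    """Number of subsets containing two or more of n components: 2**n - n - 1."""
    return 2**n - n - 1


if __name__ == "__main__":
    # Compare both counts as the number of components grows linearly.
    for n in (5, 10, 20):
        print(n, pairwise_interactions(n), multiway_interactions(n))
```

Under either model, adding components one at a time makes the space to be tested grow faster than the component count itself.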

> even though I'd been struggling with bugs for months

How good were the bug reports you made of them? Bugs cannot be fixed
if they are not known...

Kernel bugs: out of control?

Posted May 11, 2006 23:52 UTC (Thu) by malor (guest, #2973) (2 responses)

Yes, I think exactly that. Development speed is not the same as code quality. The new process is tuned to let them do more of the 'fun' stuff (writing new code), and force them to do less of the 'unfun' stuff, like making sure things actually work. It's also to force more testers; they've explicitly said one of the reasons they're doing it this way is to force people to test new code.

I don't mind testing code when there's a call for testers (and if I can slot in some time). I do mind being forced to test beta-quality code by them calling it 'stable' and refusing to support code that's more than two months old.

As far as exponentiation goes, you're exactly right... I'm not sure if I hit this idea yet in this thread. What that means is that as the kernel grows, development needs to slow down, to cover all the various interactions. Instead, they're _speeding up_, not testing, and expecting the Rest of World to fix their problems.

I tried to report the APIC bug on that VIA board. I first emailed the ACPI author (got my acronyms confused :) ), who very promptly replied, and politely told me I was talking to the wrong person. Then I tried mailing the APIC maintainers twice, but didn't get a reply. I dropped it after that... probably should have sent it to the catchall address, but forgot it. And now I don't have the board anymore, so a bug report won't be very useful.

The 865 bugs I can't diagnose, because it's all remote, so I haven't even tried to report it. Those machines are production, and I can't afford to take them down for testing. So my bug report wouldn't be very useful. And 2.6.16 has worked well so far, although the unending reboots are painful.

And 2.6.0 through 2.6.8 or so worked great on that same board.... so I really shouldn't have HAD to file bug reports. I was, after all, tracking a 'stable' kernel. Stuff that worked in 2.6.0 should work in 2.6.16.

Kernel bugs: out of control?

Posted May 12, 2006 0:02 UTC (Fri) by malor (guest, #2973)

Oops, I inserted a paragraph in the wrong place. If you swap the last two paragraphs, it'll be more readable, although the concluding note will be in the wrong place. :)

Kernel bugs: out of control?

Posted May 16, 2006 17:49 UTC (Tue) by oak (guest, #2786)

> And 2.6.0 through 2.6.8 or so worked great on that same board....
> so I really shouldn't have HAD to file bug reports. I was, after all,
> tracking a 'stable' kernel. Stuff that worked in 2.6.0 should work
> in 2.6.16.

You cannot really expect that unless you know that there's
a regularly executed test setup:
- with the same HW as yours
- with similar software and the same kind of load as yours

For example, fixing a bug (on a setup the developer has) might make
another bug (e.g. an already existing one) somewhere else in the code
more likely to occur in your setup.

Only testing and error detection inside & outside kernel can
help in catching those. The testing has to be automated,
it should not produce (too many) false positives, and it has
to pinpoint fairly well where the problem happens so that
the bugs can be fixed. Otherwise the only alternative the developer
has is to resolve the bugs as WORKSFORME.

It would be nice if kernel developers provided an automated
test-set for people who "live on the bleeding edge" which they could
run on their test setups before deploying the kernel on production
machines. If the test-set outputs an error, you could just forward
it to kernel.org and the automatically produced bug report would
have all the relevant info; your kernel config, HW info, OOPS etc...

If the automated test-set passed, you could then do your
own tests on the kernel before putting it into real use. And if
those fail, you could propose tests to be added to the automated
test-set so that those kinds of problems are caught earlier.
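
A minimal sketch of what the report-gathering half of such a test-set might look like, assuming a Linux machine. Everything here is hypothetical: the function names, the report fields, and the test name are made up for illustration, and the `/boot/config-<release>` location is only a convention that many distributions follow, not something the kernel guarantees. No such official tool is implied.

```python
import json
import platform
from pathlib import Path


def read_if_present(path: Path, limit: int = 65536):
    """Return up to `limit` characters of a text file, or None if unreadable."""
    try:
        return path.read_text(errors="replace")[:limit]
    except OSError:
        return None


def build_report(test_name: str, failure_output: str) -> dict:
    """Bundle a failed test's output with basic kernel and hardware details."""
    uname = platform.uname()
    return {
        "test": test_name,
        "failure_output": failure_output,  # e.g. the OOPS text, if one was captured
        "kernel_release": uname.release,
        "machine": uname.machine,
        # Assumed location; many distributions install the config here.
        "kernel_config": read_if_present(Path("/boot") / f"config-{uname.release}"),
        "cpuinfo": read_if_present(Path("/proc/cpuinfo")),
    }


if __name__ == "__main__":
    # Hypothetical failing test; a real harness would pass in captured output.
    report = build_report("hypothetical-io-stress", "Oops: example failure text")
    print(json.dumps(report, indent=2)[:500])
```

Fields that cannot be read are recorded as None rather than raising, so a partial report can still be forwarded.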