Kernel bugs: out of control?

Posted May 11, 2006 10:35 UTC (Thu) by malor (guest, #2973)
In reply to: Kernel bugs: out of control? by nix
Parent article: Kernel bugs: out of control?

The present system is 'we release the code, the distros have to make it work'. Quality can't be retrofit; if it wasn't there to begin with, it can't be added later, especially not by other people. Having a stable kernel will not, in and of itself, make the code better, but one would hope the new emphasis on stability just might.

As far as vanilla 2.6 goes.... 2.6.14 broke _traceroute_. I mean, come on.

2.6.15 as distributed by Debian is completely unusable on the Intel i865 machines I've tested it on; it crashes randomly, within an hour, light or heavy load. Every time, without fail. And because the machines are remote, I can't easily troubleshoot. And it's not like the 865 chipset is, you know, rare.

All versions of 2.6 since 2.6.9 or so have been unstable on my (one, personal) KT333-based board... this particular error cost me several hundred dollars, as I replaced a drive that didn't need replacement. It was actually APIC errors that were introduced around that time. I ended up just replacing the motherboard. (I may end up replacing the OS, too.)

The fact that the patches to 2.6.16 are so trivial just means that the code wasn't properly reviewed before being released as 'stable'. And it does not change the fact that I've had to reboot my Debian servers ten times or so in the last month. (Before you start laying into me for using 'unstable'... I can't use the stable kernel, because it doesn't support all the hardware properly, and 2.6.15 in testing crashes after an hour, so the unstable kernel is all I can use.)

Owning words may not be their responsibility, but it's certainly in their self-interest. The existence of OSDL and Linus' present job are a direct outcome of that word ownership. If they lose it, then there will be fewer paid kernel dev jobs created in the future. The more central Linux becomes, the more people trust it and need it in their daily lives, the more jobs to work on it there will be. So it's rather foolish of them _not_ to think about it.


Kernel bugs: out of control?

Posted May 11, 2006 12:26 UTC (Thu) by nix (subscriber, #2304) [Link]

Well, the 2.4 kernels, which you applaud as being so terribly stable: the right word for them would be 'stale'. No devs to speak of ran them, so many old bugs accumulated and were never fixed, and because it diverged so much from the dev tree, many fixes accumulated in the 2.4 tree which were never forward ported!

Simply making a stable tree with no dev tree won't help: the developers will simply use their own private trees, or some other dev's git repository.

Your real complaint appears to be 'it doesn't work on my hardware and I'm not willing to go to the effort needed to track down the problem and get it fixed'. (There are numerous ways to troubleshoot remote machines; look up the network kernel syslog option, for instance. Even panics get dumped out of that.)
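
For anyone unfamiliar with that option: netconsole sends kernel log lines as UDP datagrams to another machine, so the receiving end can be very simple. Below is a minimal sketch of a receiver in Python, assuming the sender targets UDP port 6666 (netconsole's documented default target port). The function names and the list-based collection are illustrative only, not part of any existing tool.

```python
import socket

# Assumption: the crashing machine sends to netconsole's default target port.
NETCONSOLE_DEFAULT_PORT = 6666


def open_listener(port=NETCONSOLE_DEFAULT_PORT, host="0.0.0.0"):
    """Bind a UDP socket that kernel log datagrams can be sent to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def read_messages(sock, max_messages, timeout=5.0):
    """Collect up to max_messages datagrams, stopping early on timeout."""
    sock.settimeout(timeout)
    messages = []
    while len(messages) < max_messages:
        try:
            data, _sender = sock.recvfrom(65535)
        except socket.timeout:
            break  # nothing more arrived within the timeout
        # Kernel output is not guaranteed to be valid UTF-8, so replace bad bytes.
        messages.append(data.decode("utf-8", errors="replace").rstrip("\n"))
    return messages
```

On the remote box, the netconsole module is pointed at this listener's address and port; the exact parameter syntax is described in the kernel's netconsole documentation. In practice a log file or a syslog daemon would replace the in-memory list.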

And as for the 'trivial patches imply improper review', well, anyone with any experience of software maintenance would know this is untrue: many of the most subtle bugs need very small patches to fix them once you've finally worked out the cause (e.g., one extra lock around some data structure on which multiple concurrent accesses are racing: two or three lines, a sod to find and very hard to reproduce.)
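
To illustrate how small such a fix can be, here is a sketch in Python rather than kernel C. The `Counter` class and `run_workers` helper are invented for the example. The entire "patch" is the `with self._lock:` line around the read-modify-write; without it, two threads can read the same old value and one update is silently lost, and only occasionally.

```python
import threading


class Counter:
    """Shared counter updated by several threads at once."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The fix: hold the lock for the whole read-modify-write sequence.
        with self._lock:
            current = self.value
            self.value = current + 1


def run_workers(counter, threads=8, increments=10_000):
    """Hammer the counter from several threads and return the final value."""

    def work():
        for _ in range(increments):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value
```

With the lock in place, `run_workers(Counter())` always returns `threads * increments`. Remove the lock and the result may come up short on some runs and not others, which is exactly what makes this class of bug hard to reproduce.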

Kernel bugs: out of control?

Posted May 11, 2006 12:48 UTC (Thu) by malor (guest, #2973) [Link]

Well, at this point it's stale, sure. But it wasn't when 2.6 was released. If 2.6 had a maintainer and Linus was off in 2.7, like they've always done before, things would be just fine, as they were in 2.4... after Linus quit messing with it, anyway.

My real complaint is that it DID work on my hardware and then STOPPED working for no discernible reason, and I don't generally have time to do much troubleshooting. Things ceasing to work in a STABLE kernel series is very bad. I'm willing to test beta kernels when there's a general call for that, but I don't do it routinely. At least, I didn't, until the new dev system forced me to. I would have no problem with device breakage after switching stable versions, but if something works in 2.6.0, it should still work in 2.6.16, as far as I'm concerned.

As far as patches go... ok, I'll accept your 'short patches don't mean it wasn't hard to figure out' argument... you're correct there.

It would probably be better to say that 'trivial patches imply lack of QA/testing', rather than lack of review. A trivial patch means the design was right, but the implementation was wrong, and that's the sort of thing that should be caught in testing.

It still doesn't change the fact that Debian users on 2.6.16 have seen one hell of a lot of downtime lately.

Kernel bugs: out of control?

Posted May 11, 2006 13:57 UTC (Thu) by k8to (subscriber, #15413) [Link]

I am curious about this downtime. Is it related to specific hardware? I
am a Debian user running 2.6.16 with no perceived problems. Well, other
than the swapping caused by firefox ;-)

Kernel bugs: out of control?

Posted May 11, 2006 14:42 UTC (Thu) by malor (guest, #2973) [Link]

Debian has updated the 2.6.16 kernel many times in the past month, requiring a lot of rebooting (i.e., downtime). If you haven't been updating, you haven't had downtime, but you've also left a number of security and DoS issues unfixed. Not too important for most home users, but quite a different thing for servers.

Kernel bugs: out of control?

Posted May 11, 2006 15:33 UTC (Thu) by vmole (guest, #111) [Link]

Why are you running debian testing/unstable on production servers?

And you're complaining about security patches? You can't have it both ways: either they remain unpatched, or you have downtime. Those bugs existed long before 2.6.16; that just happens to be the one they're looking at and patching. If they'd called (say) 2.6.8 stable, there would still be those patches. Sure, maybe not the exact same ones, but many the same, and others that have been fixed between 2.6.8 and 2.6.16.

Kernel bugs: out of control?

Posted May 11, 2006 23:59 UTC (Thu) by malor (guest, #2973) [Link]

I'm running testing on production servers because testing is current enough to be useful and essentially never breaks things. I'm just using the kernel from unstable, because that's the only one that works in all cases. I actually had to use an Ubuntu kernel for a while when things were really bad with 2.6.15.

Linux 2.4 always pushed security fixes out right away... I don't remember Marcelo sitting on security patches. He'd accumulate a bunch of non-security stuff and roll it out all at once, but security patches were immediate release. And in the 5.5 years of 2.4's existence, it's had 32 total releases... and 10 of those were when Linus was still tinkering with it. So 22 is more accurate.

22 patches in 5 years, I can handle, particularly since many of them were optional... just new drivers, not security fixes, which meant they could be deployed whenever there was time.

16 patches in five weeks, nearly all of them immediate must-install security fixes... that's not so good.
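
Put as rough rates, this is a back-of-the-envelope sketch using only the figures above (22 releases in 5.5 years, 16 in five weeks) and assuming 52 weeks per year. The counts are as stated in this comment, not independently checked.

```python
WEEKS_PER_YEAR = 52  # rounding assumption


def releases_per_week(releases, weeks):
    """Average release rate over a period given in weeks."""
    return releases / weeks


# 2.4 under Marcelo: 22 releases over 5.5 years
rate_24 = releases_per_week(22, 5.5 * WEEKS_PER_YEAR)

# 2.6.16.y: 16 releases over five weeks
rate_2616 = releases_per_week(16, 5)

ratio = rate_2616 / rate_24

print(f"2.4: one release every {1 / rate_24:.0f} weeks")
print(f"2.6.16.y: {rate_2616:.1f} releases per week")
print(f"2.6.16.y releases arrive {ratio:.1f} times as often")
```

That works out to roughly one 2.4 release every 13 weeks against about 3.2 per week for 2.6.16.y, or around 41.6 times the reboot pressure if every release is installed.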

Kernel bugs: out of control?

Posted May 12, 2006 13:51 UTC (Fri) by vmole (guest, #111) [Link]

  1. I want a kernel that supports the latest hardware
  2. I don't want the kernel to change, or have bugs

Pick one.

Kernel bugs: out of control?

Posted May 16, 2006 16:23 UTC (Tue) by hazelsct (guest, #3659) [Link]

No. It's more like, don't call it "stable" unless/until it is.

Kernel bugs: out of control?

Posted May 18, 2006 6:48 UTC (Thu) by gowen (guest, #23914) [Link]

He's running Debian unstable. And he's complaining that it's unstable. Linus is not the only one struggling with nomenclature.

Kernel 2.6 has a stable branch. The stable branch of 2.6.x is called 2.6.x.y, for large values of y.

Kernel bugs: out of control?

Posted May 18, 2006 9:16 UTC (Thu) by malor (guest, #2973) [Link]

You're not reading what I'm saying. I'm using THE KERNEL from Debian unstable, because the kernel from testing doesn't work at all (2.6.15 crashes within an hour in my 865 machines), and the kernel from stable doesn't support all my hardware. I use nothing else from unstable on production servers. I have exactly one machine running the actual unstable distribution in its entirety, because that one clues me in when there's (yet another) kernel patch.

Debian's kernel is pretty much vanilla 2.6.16. Linus et al call Linux 2.6.16 'stable'.

The kernel devs' expectation that 'the distros' will magically fix all their bugs amounts to simple handwaving, shirking of their fundamental responsibility: when they call it stable, it should BE STABLE.

Software that's supported for only two months is not, pretty much by definition, 'stable'.

Kernel bugs: out of control?

Posted May 21, 2006 17:00 UTC (Sun) by nix (subscriber, #2304) [Link]

Actually, stable means `we think it will work'. Length-of-support has nothing to do with it.

Kernel bugs: out of control?

Posted May 18, 2006 9:17 UTC (Thu) by malor (guest, #2973) [Link]

And which 'stable' 2.6 kernel do I choose? And define 'large value of Y'. By your definition, 2.6.16.15 should be 'stable', but it lasted THREE DAYS.

Kernel bugs: out of control?

Posted May 18, 2006 9:46 UTC (Thu) by arcticwolf (guest, #8341) [Link]

You're confusing (unintentionally, I assume) two distinct meanings of the word "stable" here. Stable can mean either:

1) Bug-free enough to not crash on most systems encountered in the wild (i.e., "stable" in the sense of "production-ready");
2) Not undergoing changes.

It's important to keep in mind that these are not related to each other. When you say "it lasted THREE DAYS", you apparently mean that it was replaced with a newer patch (.16) three days later - that's the second definition of stable. So, yes, in that sense, 2.6.16.x isn't stable, but that's just because the developers are actually fixing security issues that are found and releasing patches immediately.

Would you rather have them sit on those patches for weeks or months? Well, if you do, you can still have that; nobody's forcing you to apply those new patches.

But in any case, what Andrew Morton talked about was stability in the first sense, and that's a different beast. How long would it have taken for 2.6.16.15 to crash on your boxen? It's hard to say, but I'd guess that unless you'd have been rather unlucky, it would've been more than three days.

So, the answer to your question is: you choose the latest one that's available. Whether you continue to apply newer patches as they come out is your choice, not ours, and complaining that you have downtime when patching security issues in the *kernel* is pretty silly. That's how things are in the real world. (And it's still true that nobody's forcing you to apply anything, so if you'd rather avoid downtime than patch newly-found issues, just don't apply them.)

Kernel bugs: out of control?

Posted May 18, 2006 10:05 UTC (Thu) by malor (guest, #2973) [Link]

It feels like you've read about three sentences of what I've written here, and you're reacting just to that. Most other replies I've put in this thread address these issues. I'd suggest reading them... I'm not going to repeat all of them here.

The strongest objection I have to the current model is that we are forced to take new features with our bugfixes, because they will not support kernels for more than two months. New features = new bugs. New bugs = new patches. New patches = new features. New features = new bugs. And so on.

'Stability', as defined from the point of view of the Linux kernel, should mean:

1) It's maintained with security patches;
2) No fundamental new features are added;
3) Drivers are added, if possible, without violating #2.

In other words.... do it like 2.4 did it, after Marcelo took over. If a new network card comes out, of course you can add the driver to the source tree... it's not going to affect anyone else. If that new driver requires an update to the memory management model of the kernel, then you don't include it in the stable branch, but rather in the dev tree.

I think they might have retrofit the USB system in 2.4... it's been awhile, and I wasn't following it closely, because I didn't need to. I do know that their backports from 2.5 were done without large-scale overhauls of kernel subsystems; they kept the changes focused and very limited. And, by and large, the 2.4 kernel was very stable. It wasn't as solid as 2.2, but it was quite acceptable.

Basically, the kernel devs had the model NAILED during 2.4. This high-speed 2.6 development, on the other hand, is an absolute disaster. These guys are some of the smartest in the business, but they are still human, and they are running into the limitations of their own intelligence. The code has become too complex for them to maintain... it's hard and nasty and difficult work now, and instead of slowing down development, they're ignoring the bugs and SPEEDING UP instead, apparently because that's more fun.

It's significantly less fun for people trying to keep production machines running.

Andrew Morton is most unhappy about the quality of the kernel. That should tell you something.

Kernel bugs: out of control?

Posted May 21, 2006 16:58 UTC (Sun) by nix (subscriber, #2304) [Link]

Most of the security fixes weren't `immediate must-install'. A goodly number related to single drivers or obscure new protocols: SMB/CIFS, SCTP...

... I mean, the SCTP code is, what, a kernel release old? Thus, it has bugs; some of which may be remotely exploitable (by the nature of network protocol code). How terribly shocking.

Kernel bugs: out of control?

Posted May 21, 2006 16:56 UTC (Sun) by nix (subscriber, #2304) [Link]

Actually, 2.4 was classifiable as `stale' as soon as most devs weren't running it. 2.5 was unusual because the abortive IDE rework made it so unstable that the devs stuck with 2.4; but after that was reverted, 2.4 staled out (to coin a horrible neologism) really rather fast.

Kernel bugs: out of control?

Posted May 11, 2006 13:54 UTC (Thu) by k8to (subscriber, #15413) [Link]

My impression is that of someone who feels that they know how to engineer
software quality more successfully than the linux kernel developers, but
yet does not know how to engineer software. Perhaps I have made a
mistake in my readings.

In complex software, security problems crop up, patches are released, and
installing them is a certain level of annoyance. If the approach of
continuing to re-engineer interfaces and systems to eliminate categories
of problems offends you, then the linux kernel in general should offend
you, since this has been the mode of operation since day one. There _are_
other free unixes which have a much more conservative approach. They are
not horrible.

Kernel bugs: out of control?

Posted May 11, 2006 14:46 UTC (Thu) by malor (guest, #2973) [Link]

Linux used to be stable while also doing those things in a development branch. It no longer is: the development branch and the mainline kernel are one and the same thing, forcing us all into alpha testing.

No, I'm not a developer, but I have been using Linux a long, LONG time (since around kernel 0.8 or 0.9). So I'm certainly qualified to comment on the way it used to be (stable) and the way it is now (unstable). The development process and lack of focus on quality would appear to be the cause.

Do you have an alternate explanation?

Kernel bugs: out of control?

Posted May 11, 2006 15:03 UTC (Thu) by k8to (subscriber, #15413) [Link]

I agree that the kernel development is being handled differently now,
which is resulting in a larger number of releases in the stable line. I
do not agree that this indicates a lower level of quality. I think it is
simply factual that the one does not necessarily imply the other.

Kernel releases with corrected functionality are being created faster
than in the past, enabling users to get these fixes sooner. If users
feel the need to reboot for every one of these updates, then the fixes
may be seen as something of a nuisance. However, the alternative is to
not apply the fixes. In the past, the choice was not available, fixes
were provided less frequently and less rapidly, and so there was a longer
window of vulnerability and no possibility for frequent reboots. You can
simulate the old situation by installing fewer kernels.

Kernel bugs: out of control?

Posted May 13, 2006 20:19 UTC (Sat) by Baylink (guest, #755) [Link]

This sub-thread speaks to a topic near and dear to my heart: what does a version number *mean*?

Let me quote here my contribution to the Wikipedia page on the topic, based on my 20 years of observation of various software packages:

A different approach is to use the major and minor numbers, along with an alphanumeric string denoting the release type, i.e. 'alpha', 'beta' or 'release candidate'. A release train using this approach might look like 0.5, 0.6, 0.7, 0.8, 0.9 == 1.0b1, 1.0b2 (with some fixes), 1.0b3 (with more fixes) == 1.0rc1 (which, if it's stable enough) == 1.0. If 1.0rc1 turns out to have bugs which must be fixed, it turns into 1.0rc2, and so on. The important characteristic of this approach is that the first version of a given level (beta, RC, production) must be identical to the last version of the release below it: you cannot make any changes at all from the last beta to the first RC, or from the last RC to production. If you do, you must roll out another release at that lower level.

The purpose of this is to permit users (or potential adopters) to evaluate how much real-world testing a given build of code has actually undergone. If changes are made between, say, 1.3rc4 and the production release of 1.3, then that release, which asserts that it has had a production-grade level of testing in the real world, in fact contains changes which have not necessarily been tested in the real world at all.
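
The promotion rule can be checked mechanically. The sketch below is illustrative only: the `Release` type, the level names, and the `fingerprint` field (standing in for something like a hash of the source tree) are assumptions made for the example, not part of any real release tooling.

```python
from typing import NamedTuple

# Ordered release levels; a higher number means closer to production.
LEVELS = {"beta": 0, "rc": 1, "production": 2}


class Release(NamedTuple):
    label: str        # e.g. "1.0rc1"
    level: str        # one of the keys in LEVELS
    fingerprint: str  # hypothetical content identifier, e.g. a tree hash


def promotion_violations(train):
    """Return labels of releases that moved up a level with changed content."""
    violations = []
    for previous, current in zip(train, train[1:]):
        promoted = LEVELS[current.level] > LEVELS[previous.level]
        if promoted and current.fingerprint != previous.fingerprint:
            violations.append(current.label)
    return violations


train = [
    Release("1.0b2", "beta", "aaa"),
    Release("1.0b3", "beta", "bbb"),       # fixes within a level are fine
    Release("1.0rc1", "rc", "bbb"),        # identical to the last beta: fine
    Release("1.0", "production", "ccc"),   # changed after the last RC: violation
]
print(promotion_violations(train))
```

Here only `1.0` is flagged: it claims production-grade testing while containing changes that no release candidate ever carried. Under the rule, those changes should have gone out as `1.0rc2` first.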

The assertion here seems to be that an even higher level of overloading on version numbering ("even revision kernels are stable") and its associated 'social contract' are no longer being upheld by the kernel development team.

If that's, in fact, a reasonable interpretation of what's going on, then indeed, it's probably not the best thing. I'm not close enough to kernel development to know the facts, but I do feel equipped to comment on the 'law'.

Kernel bugs: out of control?

Posted May 17, 2006 23:17 UTC (Wed) by k8to (subscriber, #15413) [Link]

I think your comments on versioning are not far from the mark. The fact
of these "minor" stable releases, e.g. 2.6.X.Y, is that they are _smaller_
changes than have ever occurred in the stable series before. It is true
that these smaller changes do not receive widespread real-world
production evaluation, but no non-stable release kernel (rc versions
included) ever receives enough attention to catch even some showstopper
bugs.

So I think you are right to question this change, but the balancing facts
are that the release candidate process for the Linux kernel doesn't seem
very effective, and the changes made in the revision series are
_strongly_ conservative.

It is important to remember that in this particular (highly visible,
highly open) development process, there is very little pressure to
deviate from the conservative perspective in these updates.

Kernel bugs: out of control?

Posted May 11, 2006 19:04 UTC (Thu) by oak (guest, #2786) [Link]

> If the approach of continuing to re-engineer interfaces and
> systems to eliminate categories of problems offends you,
> then the linux kernel in general should offend you, since
> this has been the mode of operation since day one.

This reminds me of the recent change in Glibc, they now
abort programs which do double frees.

Yes, more programs may now "appear unstable", but I personally
prefer applications being terminated rather than silently corrupting
my data when they hobble forward with inconsistent state.
Broken apps should be shot down as soon as possible so that
people know to fix them, this is the Unix way.

If you don't force quality, you don't get it.
You end up with an unmaintainable mess instead.


> There _are_ other free unixes which have a much more
> conservative approach. They are not horrible.

I'm sure the person complaining here would then
complain about the lack of features and HW support...

Kernel bugs: out of control?

Posted May 11, 2006 23:24 UTC (Thu) by malor (guest, #2973) [Link]

I agree with you about forcing quality... that's a great idea. If I thought the new development process would actually DO that, I'd be enthusiastically behind it. Instead, it's just about speed, speed, speed... and avoiding the stuff that's no fun to do, like bugfixing and testing.

Waving your hands in the air and expecting other people to fix your programs is not, in my long experience supporting developers, the way to get it fixed, particularly not properly.

As far as switching OSes goes, I've already stopped using Linux on my firewalls because of the unending stream of security reboots. Netfilter is faster and more featureful than OpenBSD's pf, and its language is more amenable to shell scripting, but the first mission of a firewall is to stay up. I can throw OpenBSD on a firewall and not have to update it again for a couple of years. This means no downtime, which means happy users. I've never seen any Linux kernel that lasted that long without security holes.

FreeBSD is looking better all the time... I've been talking about switching over, but haven't yet. If matters continue as they have, maybe I will. And you'll have one less complaining user, which, from your tone, you may prefer.

Kernel bugs: out of control?

Posted May 15, 2006 4:27 UTC (Mon) by ChristopheC (guest, #28570) [Link]

I think it is unfair to say the kernel developers do not test their patches. However, they can only test them on the few combinations of hardware they have access to.

To discover the bugs, the kernel needs widespread testing. But few people are willing to test the development releases (-rc) - the problem has been mentioned countless times on lkml and here on lwn. So they have to release often to get the needed coverage. (This is a somewhat simplified explanation, of course)

Kernel bugs: out of control?

Posted May 15, 2006 5:48 UTC (Mon) by malor (guest, #2973) [Link]

2.6.14 broke *traceroute*. Give me a break.

Kernel bugs: out of control?

Posted May 21, 2006 17:05 UTC (Sun) by nix (subscriber, #2304) [Link]

Er, how often do you *run* traceroute? I don't run it so often myself that I'd notice immediately if it broke. It could easily be a week or so between runs...

Kernel bugs: out of control?

Posted May 12, 2006 19:53 UTC (Fri) by chromatic (guest, #26207) [Link]

Quality can't be retrofit; if it wasn't there to begin with, it can't be added later, especially not by other people.

How can this possibly be true? Consider OpenBSD's auditing process, for example.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds