Possible changes to longterm kernel maintenance

Posted Aug 15, 2011 20:24 UTC (Mon) by raven667 (subscriber, #5198)
In reply to: Possible changes to longterm kernel maintenance by welinder
Parent article: Possible changes to longterm kernel maintenance

The kernel doesn't change its userspace ABI; that is intentionally kept very stable. So while I agree that for large, complex userspace software following the development head can put you in dependency hell, that problem doesn't really exist for the kernel. The kernel is tied to certain userspace components that are required to change on updates, but those components generally don't have complex dependencies and so aren't going to drive A needing B and C where D also requires C=v1.2 which requires D to be bumped which then requires, blah blah blah. (Actually, the fact that modern package managers can model this kind of interaction at all is sickly amazing.)

I'm not convinced that the status quo of having vendors keep the version number string unchanging while they muck around with the internals backporting security fixes and features from future kernels is actually less disruptive than just QA'ing newer kernel releases periodically and shipping that.


Possible changes to longterm kernel maintenance

Posted Aug 18, 2011 15:27 UTC (Thu) by mrshiny (subscriber, #4266)

The kernel doesn't intentionally break its ABI, but that does happen. And if you think driver regressions don't happen in new kernels, you're sadly mistaken. When I was running Fedora 8, the Intel Wireless driver broke so often that I had to reconfigure Yum to keep a large number of old kernels around, because I could never be sure which one would work for me. And video drivers break too.

And if widely-used consumer hardware can have such obvious regressions, you can't tell me that less-well-tested specialty enterprise hardware won't also, at least occasionally, suffer regressions.

Possible changes to longterm kernel maintenance

Posted Aug 18, 2011 23:28 UTC (Thu) by raven667 (subscriber, #5198)

I think we are talking past one another. I am not suggesting that regressions don't happen and that they blindly ship new releases, damn the torpedoes full speed ahead. The enterprise distributions already pull new versions of the major network and storage drivers, backport them to the older kernel release then QA and ship that. If there are regressions in the drivers then they still have the same problem.

What I'm suggesting is to be less afraid to update the version number and track a kernel.org stable release and update to new kernel.org stable releases over the lifetime of the product rather than maintaining a private stable release that never shows a version number change over the product lifecycle. I've found that endlessly confusing to admins, auditors and consultants, trying to match up feature documentation and bugfixes between kernel.org and vendor kernels.

The enterprise vendors already have a large QA process before shipping new kernel features in service packs. Is there really a meaningful difference in the number of regressions between a large number of private backports and custom integrations and just re-basing on upstream stable releases periodically? Would it really be harmful to have the enterprise vendor just take ownership of maintaining a stable kernel.org release so that patches and whatnot in that release are documented and the version number is bumped along with other kernel.org releases?

For example I'd like to dispense with the fiction that labeling the RHEL6 kernel 2.6.32 is really a meaningful description because it is _not_ accurate. If they said it's 2.6.32.45 maybe that would be better.
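
To make the labeling point concrete, here is a rough sketch of how a release string (the kind `uname -r` prints) could be split into its upstream base and whatever the vendor appended. The helper names and both example strings are made up for illustration and assume the common "base-vendorsuffix" layout; this is not how any distro tooling actually does it.

```python
# Hypothetical helpers, assuming the release string looks like
# "<dotted upstream version>-<vendor suffix>" with an optional suffix.
def split_kernel_release(release):
    """Return (tuple of upstream version numbers, vendor suffix string)."""
    base, _, vendor = release.partition("-")
    # Assumes every dotted component of the base is a plain integer.
    numbers = tuple(int(part) for part in base.split("."))
    return numbers, vendor


def same_upstream_base(a, b):
    """True when both strings share the same first three version numbers."""
    return split_kernel_release(a)[0][:3] == split_kernel_release(b)[0][:3]


# Illustrative strings only, not taken from this thread.
vendor_kernel = "2.6.32-131.0.15.el6.x86_64"
stable_kernel = "2.6.32.45"

print(split_kernel_release(vendor_kernel))
print(split_kernel_release(stable_kernel))
print(same_upstream_base(vendor_kernel, stable_kernel))
```

Both strings compare as the same "2.6.32" base even though the string itself says nothing about which stable fixes or backports went in, which is the ambiguity I'm complaining about.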

Possible changes to longterm kernel maintenance

Posted Aug 19, 2011 0:54 UTC (Fri) by mrshiny (subscriber, #4266)

Having the enterprise distros collaborate on the stable kernel branch is probably not a bad idea. But its lifespan is typically far less than an enterprise distro, on the order of years. So Red Hat still needs to maintain old kernels or, as you say, fully QA a new kernel.

My point is that no matter what the kernel.org people say, the kernel does have regressions often. This is particularly a problem because the drivers must ship with the kernel. This means that if you have a regression anywhere in the codebase that applies to your computer/workload, you need to upgrade or downgrade the whole kernel. This is why people are afraid of changing their kernel. For over a year I was unable to upgrade my kernel because one or two of the drivers I used were unstable from release to release. I didn't have the choice of pinning one driver version that worked, while upgrading all the rest of the kernel.

This is why the stable kernel series, or a stable kernel maintained by the enterprise distros is a needed thing. Customers don't want the new kernel to break something. Even a small bug in a single driver somewhere might severely disrupt operations after an upgrade. So upgrading the whole kernel is a risk because it all changes.

I don't subscribe to the "stable api nonsense" idea. I've been personally bitten by the fact that every driver in the kernel is permanently locked together in one big tarball with the kernel itself. I know I'm not the only one. The enterprise distros don't want this to happen to their customers, so they have no choice but to maintain a specific kernel.

Possible changes to longterm kernel maintenance

Posted Aug 19, 2011 1:52 UTC (Fri) by foom (subscriber, #14868)

And let me just relate my experience, as a perhaps representative example. I'm running mostly CentOS "2.6.18" right now. Upstream 2.6.22 was also used successfully in the past, but it seemed better to stay on a maintained kernel.

I've tried upgrading to a new kernel multiple times, because the new kernels do have some new features I'd like to use (actual new features, not just drivers). But each time I've tried upgrading, I've hit major regressions in my workload. Finding a new kernel version that works is not actually the most important work to do, so there are usually significant delays before someone gets around to figuring out what's wrong, or trying again with a different version.

The following versions have been attempted since 2.6.22 (to the best of my recollection this is right)
2.6.25: File corruption when using writev (introduced in 2.6.23!).
2.6.26: I forget what was wrong with the initial attempt, but something was broken. The debian lenny 2.6.26 appears to work now, though not using it much (since most prod boxes are running centos).
2.6.29: (or thereabouts) introduced serious performance degradation in our workload due (I think) to disk page-in behavior changes in mmapped files. Didn't really attempt to track it down, got distracted by other stuff.
2.6.31: file corruption when using writev in an ext3 fs. (different bug from before)
2.6.32: works, but still worse performance than RH18 and Deb26, though not as bad as 29. Same behavior with Deb2.6.32.
2.6.38: Currently running on a few systems. Performance seems to be better. Haven't found anything critically wrong yet...so far so good?

So anyways, in *my* experience, new upstream kernel versions have a dismal success rate, while new patch releases of a working stable (distro) kernel version have a 100% success rate.

I expect it's not really the case that no releases between 22 and 38 worked, I just never managed to hit one -- unlucky.

Possible changes to longterm kernel maintenance

Posted Aug 19, 2011 2:49 UTC (Fri) by raven667 (subscriber, #5198)

Doing regression testing and QA work on kernel.org isn't something that the average admin wants to be spending a lot of time on, I agree, which is why people pay the big enterprise distributors to do that for them. For example, Red Hat shipped based on 2.6.18. Do you think there was some push to make that a golden, perfect release, different from the standards for 2.6.22-38 that you tested, or do you think it achieved its stability for your workload via QA and testing by Red Hat? Was it really more like a 2.6.18.n release with the major problems fixed that became 2.6.?.n++ as time went on? Do you think that the amount of change between the original shipping RH 2.6.18 and the current RH 2.6.18, given the backported drivers, any infrastructure the new drivers depend on, wholly new subsystems, backported fixes and infrastructure changes, are any different than the changes between 2.6.18 and say 2.6.32 if you gave 2.6.32 the same kind of QA and testing that went into stabilizing 2.6.18 for the enterprise customers?

The vendors and kernel.org have been converging for years, since the low point of RH 2.4.21, which was basically _nothing_ like kernel.org 2.4.21. It'll be years until there is enough market pressure to justify working on a new RHEL7 (I've known people who are barely getting off RHEL3), but I think it's a worthwhile question to ask whether the next major enterprise version shouldn't just totally converge with kernel.org, or at least periodically re-base and do the full QA/test cycle, rather than try to maintain an ever-increasing diff off of some random kernel version that is no more or less pristine and bug-free than any other version.

Possible changes to longterm kernel maintenance

Posted Aug 19, 2011 13:04 UTC (Fri) by foom (subscriber, #14868)

> Do you think that the amount of change between the original shipping RH 2.6.18 and the current RH 2.6.18 [...] are any different than the changes between 2.6.18 and say 2.6.32 if you gave 2.6.32 the same kind of QA and testing that went into stabilizing 2.6.18 for the enterprise customers.

Yes, I do. I think the changes RH makes in their stable updates are significantly smaller than the changes that upstream makes in new releases, and thus significantly less likely to introduce regressions to existing users. And thus less costly for RH to test and qualify, as well.

Possible changes to longterm kernel maintenance

Posted Aug 20, 2011 19:08 UTC (Sat) by BenHutchings (subscriber, #37955)

> The kernel doesn't change its userspace ABI, that is intentionally kept very stable...

This is simply not true. Parts of procfs and sysfs are quite deliberately changed or removed. This generally happens after a deprecation period of years, but userland isn't always updated fast enough (and in the case of proprietary programs there is nothing that distributions can do about it).

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds