The whole idea of separate long-term maintained software releases is, in my usual most humble opinion, quite broken. (The technical term would be "fubar".)
A fear of regressions is of course real, but better addressed by test-driven development, code review, better quality assurance, and other such mechanisms.
The fear of deprecated functionality (ABIs, APIs, drivers) is, of course, real too: but the solution "backport selected fixes" has got it exactly backwards. Instead, the functionality should be *forward-ported* for the duration.
External large patchsets that mess with internals are another problem; that is not readily fixed, but I doubt that backporting fixes them much better. They're not community players, and the community doesn't really benefit from them. So it makes sense to off-load their cost back to them; let them maintain their own backports. But be willing to help them with forward-porting, merging, or even give reasonable consideration to them maintaining an API/ABI they can rely on.
The (real or imagined) need for a "stable" base very much indicates a failure of the development process to me. It is time to fix that, instead of perpetuating the problem.
(And I'd be grateful if instead of responding to implementation details that I am missing here and which I assure you I am quite acutely aware of since this post would otherwise be not so succinct, critics would instead focus on the strategy, not the tactics. Thanks. ;-)
Posted Aug 14, 2011 15:01 UTC (Sun) by vonbrand (subscriber, #4458)
Please read Documentation/stable_api_nonsense.txt before going on about "stability guarantees." The problems stated there are quite real; see, for example, the links to Raymond Chen's blog on Windows at Dan's data.
Possible changes to longterm kernel maintenance
Posted Aug 14, 2011 15:25 UTC (Sun) by lmb (subscriber, #39048)
Thankfully, I didn't say anything about eternally stable APIs.
Possible changes to longterm kernel maintenance
Posted Aug 14, 2011 19:16 UTC (Sun) by giraffedata (subscriber, #1954)
It isn't clear what strategy you're advocating. Can you be more specific?
But you do seem to be saying there shouldn't be a release stream that, for a year or two, doesn't see any changes to add features; and you didn't address the primary reason people do that: every code change carries a risk of unintended regression. Many applications are themselves stable so would not benefit from new features. Hence, a code change to add a feature would be a net loss.
Are you maybe proposing that code changes to the mainline to add features not have bugs?
Possible changes to longterm kernel maintenance
Posted Aug 14, 2011 19:44 UTC (Sun) by lmb (subscriber, #39048)
Backports for "just bugfixes" also carry the risk of unintended regressions: for example, side effects that are either present in the tip as well, or occur only in the backport because the changeset interacts with other patches that have since been applied but not backported. Or whole fixes that would be applicable to the user base but have not yet been identified and thus not backported. Or the fact that many, many users have run the kernels leading up to the current tip, while the user base testing the combination of patches in the backported environment is usually much smaller.
The whole notion that "backports are safer" is, well, a viable business model, but not necessarily sound engineering practice, at least if performed at any non-trivial scale.
Code changes to mainline carry a risk of introducing regressions or new bugs, sure. But so do backports. What I'm proposing is to strengthen tip against regressions by improved QA and process.
Another fallacy is that, because upgrades are scary, you want to do them less often. But that doesn't work out - the delta *keeps* getting larger, and the amount of time that passes during which you *didn't* pay attention does too. The cost does not go down, the effort to get it all working again actually *increases* and needs to be paid in much larger bills than if one had a reasonably fine grained continuous policy.
I already believe that, since code quality *is* getting better over time faster than it is getting worse, upgrading is generally the safer choice, at least if the regression tests pass. (I'm not saying that we're doing all that we can or should; sure, there are things that can be improved.) But people still cling to the enterprisey mindset.
Guys and gals, if the enterprisey mindset worked and was overall the better choice, we'd still be running Solaris, IRIX, UnixWare and the like.
Possible changes to longterm kernel maintenance
Posted Aug 15, 2011 0:11 UTC (Mon) by dlang (✭ supporter ✭, #313)
It would really help if there was support for older kernels for a little longer than there currently is, for those cases where a company's regression tests do run into a problem with the new kernel.
Possible changes to longterm kernel maintenance
Posted Aug 18, 2011 18:48 UTC (Thu) by vonbrand (subscriber, #4458)
You know that if you come up with the manpower to do it, it will happen. Otherwise, good luck...
Possible changes to longterm kernel maintenance
Posted Aug 15, 2011 13:38 UTC (Mon) by NAR (subscriber, #1313)
> I already believe that, since code quality *is* getting better over time faster than it is getting worse
Even if it's true, I do think the improvement is not monotonic. And it's really hard to explain to customers that "in the name of forward progress we've just broken your system and you won't be able to get any work done for a week on the system you've paid thousands of dollars for".
> Guys and gals, if the enterprisey mindset worked and was overall the better choice, we'd still be running Solaris, IRIX, UnixWare and the like.
Actually we still run some of them. And we run stable distributions; for example, on the server I'm working on, the kernel is more than 3 years old.
Possible changes to longterm kernel maintenance
Posted Aug 15, 2011 20:16 UTC (Mon) by raven667 (subscriber, #5198)
In a previous existence we just started looking at migrating workloads from 2.4.21 (RHEL3 when it went unsupported) that hadn't even been run on a 2.6.x system before. The user space changes were more than the kernel changes though.
I run RH derived systems and I think the differences between kernel releases in the 2.6.x and now 3.x world are becoming less and less transformative and disruptive as the kernel is now a very mature software project. I feel somewhat ambivalent between running RH 2.6.18 or RH 2.6.32 (or even the older RH 2.6.9) which is very unlike moving from RH 2.4.21 to 2.6.x which was so clearly better.
I run a vendor kernel and not kernel.org, so I'm not following the head of development, but from reading LWN it seems that newer releases have a higher required quality and fewer, less severe regressions than kernels from, say, 5 years ago. I don't see a technical reason why vendors couldn't move to a new 3.x version during each major service pack release and run 3.x.y for security-only updates during a particular release. As long as the kernel is run in the wild and put through their QA, it shouldn't be that different from the current case of backporting.
The big problem is selling that change to enterprise customers who don't want to see the version number change even though that is the reality of what is going on as the kernel gets backported major new features at new service pack release levels.
Possible changes to longterm kernel maintenance
Posted Aug 14, 2011 23:00 UTC (Sun) by welinder (guest, #4699)
Them pesky customers! :-)
> The (real or imagined) need for a "stable" base very much indicates a
> failure of the development process to me.
Most users don't want to upgrade their software daily -- "the software
is here for me, not the other way around". Users will happily wait
years as long as the old photoshop (gimp, whatever) does what they
need.
But security updates are different: through no fault of the user,
he has to update because he is screwed if he doesn't update soon.
If software A needs to be updated and requires updates to software
B, C, and D, then pretty soon the menus in a program have changed
and two others are broken until the authors update them to work with
the newer libraries.
So clearly a stable base is needed -- for the kernel as well as other
parts of the software stack. Distributions fill this role reasonably
well.
> A fear of regressions is of course real, but better addressed by
> test-driven development
Lovely in theory, but for the kernel with bugs that often depend on
insanely complex interactions between multiple programs and/or
machines, I don't think anyone has a workable inkling about how to
even begin doing that.
> code review, better quality assurance, and other such mechanisms.
Infinite, highly skilled manpower is a pipe dream. (There wouldn't
be any need for security updates if it wasn't.)
Possible changes to longterm kernel maintenance
Posted Aug 15, 2011 20:24 UTC (Mon) by raven667 (subscriber, #5198)
The kernel doesn't change its userspace ABI, that is intentionally kept very stable, so while I agree that for large complex userspace software following the development head can put you in dependency hell, that problem doesn't really exist for the kernel. The kernel is tied to certain userspace components that are required to change on updates, but those components generally don't have complex dependencies and so aren't going to drive A needing B and C where D also requires C=v1.2 which requires D to be bumped which then requires, blah blah blah. (Actually, the fact that modern package managers can model this kind of interaction at all is sickly amazing.)
I'm not convinced that the status quo of having vendors keep the version number string unchanging while they muck around with the internals backporting security fixes and features from future kernels is actually less disruptive than just QA'ing newer kernel releases periodically and shipping that.
Possible changes to longterm kernel maintenance
Posted Aug 18, 2011 15:27 UTC (Thu) by mrshiny (subscriber, #4266)
The kernel doesn't intentionally break its ABI but that does happen. And if you think driver regressions don't happen in new kernels, you're sadly mistaken. When I was running Fedora 8, the Intel Wireless driver broke so often that I had to reconfigure Yum to keep a large number of old kernels around, because I could never be sure which one would work for me. And video drivers break too.
And if widely-used consumer hardware can have such obvious regressions, you can't tell me that less-well-tested specialty enterprise hardware won't also, at least occasionally, suffer regressions.
Possible changes to longterm kernel maintenance
Posted Aug 18, 2011 23:28 UTC (Thu) by raven667 (subscriber, #5198)
I think we are talking past one another. I am not suggesting that regressions don't happen and that they blindly ship new releases, damn the torpedoes full speed ahead. The enterprise distributions already pull new versions of the major network and storage drivers, backport them to the older kernel release then QA and ship that. If there are regressions in the drivers then they still have the same problem.
What I'm suggesting is to be less afraid to update the version number and track a kernel.org stable release and update to new kernel.org stable releases over the lifetime of the product rather than maintaining a private stable release that never shows a version number change over the product lifecycle. I've found that endlessly confusing to admins, auditors and consultants, trying to match up feature documentation and bugfixes between kernel.org and vendor kernels.
The enterprise vendors already have a large QA process before shipping new kernel features in service packs. Is there really a meaningful difference in the number of regressions between a large number of private backports and custom integrations, and just re-basing on upstream stable releases periodically? Would it really be harmful to have the enterprise vendor just take ownership of maintaining a stable kernel.org release so that patches and whatnot in that release are documented and the version number is bumped along with other kernel.org releases?
For example I'd like to dispense with the fiction that labeling the RHEL6 kernel 2.6.32 is really a meaningful description because it is _not_ accurate. If they said it's 2.6.32.45 maybe that would be better.
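To make the labeling point concrete, here is a rough Python sketch that splits a release string into its upstream part and whatever the vendor appended. The function name `split_kernel_release` and both sample strings are made up for illustration; they are not taken from a real system, and real vendor naming schemes may not fit this pattern.

```python
import re


def split_kernel_release(release):
    """Split a release string into (upstream version tuple, vendor suffix).

    Hypothetical helper: assumes a dotted numeric version, optionally
    followed by "-" and a free-form vendor suffix.
    """
    match = re.match(r"^(\d+(?:\.\d+)*)(?:-(.*))?$", release)
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    upstream = tuple(int(part) for part in match.group(1).split("."))
    vendor = match.group(2) or ""
    return upstream, vendor


# Illustrative strings only.
vendor_style = split_kernel_release("2.6.32-131.el6.x86_64")
stable_style = split_kernel_release("2.6.32.45")

# The vendor-style string still reports (2, 6, 32) as its upstream base,
# however much has been backported; only the suffix moves.
print(vendor_style)
print(stable_style)
```

The upstream tuple of the vendor-style string never advances past the base release, which is exactly why it says so little about what the kernel actually contains.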
Possible changes to longterm kernel maintenance
Posted Aug 19, 2011 0:54 UTC (Fri) by mrshiny (subscriber, #4266)
Having the enterprise distros collaborate on the stable kernel branch is probably not a bad idea. But its lifespan is typically far shorter than that of an enterprise distro, on the order of years. So Red Hat still needs to maintain old kernels or, as you say, fully QA a new kernel.
My point is that no matter what the kernel.org people say, the kernel does have regressions often. This is particularly a problem because the drivers must ship with the kernel. This means that if you have a regression anywhere in the codebase that applies to your computer/workload, you need to upgrade or downgrade the whole kernel. This is why people are afraid of changing their kernel. For over a year I was unable to upgrade my kernel because one or two of the drivers I used were unstable from release to release. I didn't have the choice of pinning one driver version that worked, while upgrading all the rest of the kernel.
This is why the stable kernel series, or a stable kernel maintained by the enterprise distros is a needed thing. Customers don't want the new kernel to break something. Even a small bug in a single driver somewhere might severely disrupt operations after an upgrade. So upgrading the whole kernel is a risk because it all changes.
I don't subscribe to the "stable api nonsense" idea. I've been personally bitten by the fact that every driver in the kernel is permanently locked together in one big tarball with the kernel itself. I know I'm not the only one. The enterprise distros don't want this to happen to their customers, so they have no choice but to maintain a specific kernel.
Possible changes to longterm kernel maintenance
Posted Aug 19, 2011 1:52 UTC (Fri) by foom (subscriber, #14868)
And let me just relate my experience, as a perhaps representative example. I'm running mostly CentOS "2.6.18" right now. Upstream 2.6.22 was also used successfully in the past, but it seemed better to stay on a maintained kernel.
I've tried upgrading to a new kernel multiple times, because the new kernels do have some new features I'd like to use (actual new features, not just drivers). But each time I've tried upgrading, I've hit major regressions in my workload. Finding a new kernel version that works is not actually the most important work to do, so there are usually significant delays before someone gets around to figuring out what's wrong, or trying again with a different version.
The following versions have been attempted since 2.6.22 (to the best of my recollection this is right)
2.6.25: File corruption when using writev (introduced in 2.6.23!).
2.6.26: I forget what was wrong with the initial attempt, but something was broken. The Debian lenny 2.6.26 appears to work now, though I'm not using it much (since most prod boxes are running CentOS).
2.6.29: (or thereabouts) introduced serious performance degradation in our workload due (I think) to disk page-in behavior changes in mmapped files. Didn't really attempt to track it down; got distracted by other stuff.
2.6.31: file corruption when using writev in an ext3 fs. (different bug from before)
2.6.32: works, but still worse performance than RH18 and Deb26, though not as bad as 29. Same behavior with Deb2.6.32.
2.6.38: Currently running on a few systems. Performance seems to be better. Haven't found anything critically wrong yet...so far so good?
So anyways, in *my* experience, new upstream kernel versions have a dismal success rate, while new patch releases of a working stable (distro) kernel version have a 100% success rate.
I expect it's not really the case that no releases between 22 and 38 worked, I just never managed to hit one -- unlucky.
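For what it's worth, a basic smoke test for the writev class of problem can be sketched in a few lines of Python. This is only an illustration, assuming a Unix system with `os.writev` available (Python 3.3 or later); the helper name `writev_roundtrip_ok` is made up, and a check this simple is not claimed to reproduce the specific bugs listed above. Note also that the read-back will normally be served from the page cache, so it does not prove the data on disk is intact.

```python
import os
import tempfile


def writev_roundtrip_ok(directory, chunks):
    """Write chunks with one writev() call, read the file back, compare.

    Hypothetical helper for illustration; directory should live on the
    filesystem you want to exercise.
    """
    expected = b"".join(chunks)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            written = os.writev(fd, chunks)
            # Finish any short write with plain write() so the
            # comparison below is against the full payload.
            while written < len(expected):
                written += os.write(fd, expected[written:])
            os.fsync(fd)
        finally:
            os.close(fd)
        with open(path, "rb") as handle:
            return handle.read() == expected
    finally:
        os.unlink(path)


if __name__ == "__main__":
    sample = [b"header", b"x" * 4096, b"trailer"]
    print(writev_roundtrip_ok(tempfile.gettempdir(), sample))
```

A real qualification run would need the actual workload, concurrent writers, and a remount or cache drop before reading back; this only shows the shape of such a check.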
Possible changes to longterm kernel maintenance
Posted Aug 19, 2011 2:49 UTC (Fri) by raven667 (subscriber, #5198)
Doing regression testing and QA work on kernel.org isn't something that the average admin wants to be spending a lot of time on, I agree, which is why people pay the big enterprise distributors to do that for them. For example, Red Hat shipped based on 2.6.18. Do you think there was some push to make that a golden, perfect release, held to different standards than the 2.6.22-38 releases that you tested, or do you think it achieved its stability for your workload via QA and testing by Red Hat? Was it really more like a 2.6.18.n release with the major problems fixed that became 2.6.?.n++ as time went on? Do you think that the amount of change between the original shipping RH 2.6.18 and the current RH 2.6.18, given the backported drivers, any infrastructure the new drivers depend on, wholly new subsystems, backported fixes and infrastructure changes, are any different than the changes between 2.6.18 and say 2.6.32 if you gave 2.6.32 the same kind of QA and testing that went into stabilizing 2.6.18 for the enterprise customers?
The vendors and kernel.org have been converging for years, since the low point of RH 2.4.21, which was basically _nothing_ like kernel.org 2.4.21. It'll be years until there is enough market pressure to justify working on making a new RHEL7; I've known people who are barely getting off RHEL3. But I think it's a worthwhile question to ask whether the next major enterprise version shouldn't just totally converge with kernel.org, or at least periodically re-base and do the full QA/test cycle, rather than try to maintain an ever-increasing diff off of some random kernel version that is no more or less pristine and bug-free than any other version.
Possible changes to longterm kernel maintenance
Posted Aug 19, 2011 13:04 UTC (Fri) by foom (subscriber, #14868)
> Do you think that the amount of change between the original shipping RH 2.6.18 and the current RH 2.6.18 [...] are any different than the changes between 2.6.18 and say 2.6.32 if you gave 2.6.32 the same kind of QA and testing that went into stabilizing 2.6.18 for the enterprise customers.
Yes, I do. I think the changes RH makes in their stable updates are significantly smaller than the changes that upstream makes in new releases, and thus significantly less likely to introduce regressions to existing users. And thus less costly for RH to test and qualify, as well.
Possible changes to longterm kernel maintenance
Posted Aug 20, 2011 19:08 UTC (Sat) by BenHutchings (subscriber, #37955)
> The kernel doesn't change its userspace ABI, that is intentionally kept very stable...
This is simply not true. Parts of procfs and sysfs are quite deliberately changed or removed. This generally happens after a deprecation period of years, but userland isn't always updated fast enough (and in the case of proprietary programs there is nothing that distributions can do about it).