
KS2007: The distributor panel

By Jonathan Corbet
September 6, 2007
LWN.net Kernel Summit 2007 coverage
The opening session at the 2007 Linux kernel developers summit was intended to be a forum in which distributor kernel maintainers could talk about how the development community could better support them. That topic did come up, but not before a lengthy discussion of what the community would like to see from the distributors. In particular, the panel members (Greg Kroah-Hartman, Kyle McMartin, Dave Jones, and Deepak Saxena) got an earful on the difficulties created by enterprise distributions and their stability policies.

There is some real unhappiness in the community about the long lead times required to get code into enterprise kernels. These lead times are the result of pressure from customers, who don't want things to change in their deployed systems - unless the change is a feature they need, of course. They want kernels to be certified by the distributors, and by the vendors whose products run on top of those kernels. If, as has often been expressed, we want to shorten lead times and have enterprise customers running something closer to current kernels, we have to build confidence that kernel updates in stable environments are safe.

One much-criticized distributor practice is the policy of avoiding kernel ABI changes over the lifetime of a distribution release. The mainline kernel, of course, has no such policy, so imposing it on enterprise kernels puts a big drag on the progress of those kernels - and strongly limits things which can be subsequently added. It makes the distributors and vendors do a lot of backporting work which would be better spent improving current kernels - and the releases which result from that backporting work have as much change as an updated kernel would.

Still, raising confidence in current kernels is not going to be an easy task. Dave Jones noted that, every time he rebases the Fedora kernel, at least one user-space package breaks. Deepak Saxena said that there are very similar issues in the embedded world: customers generally don't want completely new kernels for systems which have been certified with older software. He also noted that performance regressions can be a big problem with kernel upgrades; these problems often do not come up until a kernel has been put into a specific real-world situation, and they are very hard to test for. Partly in response to this problem, some big enterprise customers are increasingly running their internal regression tests on current mainline kernels. The tests generally cannot be shared, but the regressions which turn up can be reported; this kind of work can, over time, help to build confidence in mainline kernels.

Companies like Intel are also regularly running regression tests and reporting any bugs found. There are concerns, though, that all this testing can only help so much. It tends not to be visible to users, and, thus, fails to help build confidence. It was said that this testing happens too late, that more testing should be done on the -mm kernels. And it is very hard to find workload-dependent problems with regression testing, so there may always be surprises lurking for those who deploy the wrong kernel.

The discussion moved on to the topic of out-of-tree patches carried by the enterprise distributors. Many of these are patches which seem doomed to never find their way into the mainline, mostly because significant problems have been identified with them. Examples include utrace (currently shipped by Red Hat) and AppArmor (in SUSE kernels). These in-house developments are seen as a problem by some, though others see such code as an issue for the distributors and their customers only.

Then things went quickly back to attempts to impose a stable ABI on enterprise kernels. There were assertions that this policy exists to make life easier for purveyors of binary-only modules, but there is more to it than that. Episodes of working code breaking when recompiled - perhaps due to a change of compiler - are not unheard of. Nervous customers want to be able to continue to run working code unchanged, without even the need to rebuild it.

Ingo Molnar pointed out that system upgrades are a very emotional decision. There is the ever-present fear that an upgrade could break a previously-working system, but there is also the gratification that comes with a welcome new feature. In general, the enterprise kernel problem can be mitigated by reducing the level of fear that customers feel. One thing that could be done to that end is to create an option for enterprise Linux customers to run current mainline kernels if they wish.

Toward the end of the allotted hour, the discussion actually turned to the question of what the community can do to help the distributors. Dave Jones complained about his list of bugs (1500 of them) with nothing being done about them. When developers do respond, they often ask reporters to try current -rc kernels in the hope that the problem has gone away. There is not enough effort going into actually figuring out what the cause of the bug is.

In some cases, the situation is bad enough that some developers fear that we are losing the users who would otherwise be some of our best testers. Perversely, the result of this is fewer bug reports - but that is not the same as fewer bugs. What is really needed, said Ingo Molnar, is some sort of metric of how much testing of kernels is really happening. We need positive reports as well as bug reports. If the number of testers drops, it will be immediately apparent that there is a problem.

Kyle McMartin said that the biggest problem is drivers - there are a lot of drivers that users want, but which are not in mainline kernels. Squashfs was mentioned as a perennially out-of-tree module that everybody uses. There is a fear that the bar for the merging of drivers has been set too high, causing developers to choose to just keep their code out of the mainline. The time required to get code into the mainline is nondeterministic and the process is scary. Maybe, especially in the case of drivers, it should be made easier to get code into the mainline.

This was a controversial idea; a really bad driver can create obscure problems all over the kernel. But the fact of the matter is that an ugly driver is more likely to be fixed in-tree than out. Linus asserted that anytime a distributor ships an out-of-tree driver, the process has failed. We are failing our users, missing an opportunity to get the drivers improved, and driving away testers who need those drivers just to get going. So, if Linus has his way, we may see drivers having an easier time getting into the mainline in the future.



KS2007: The distributor panel

Posted Sep 6, 2007 9:29 UTC (Thu) by jengelh (subscriber, #33263) [Link]

>and the releases which result from that backporting work have as much change as an updated kernel would.

I have to add that backporting is also a risky task. Remember the RH-NPTL noise...

>What is really needed, says Ingo Molnar, is some sort of metrics on how much testing of kernels is really happening.

Andrea Arcangeli wrote klive /just/ for that. A pity the number of users is a bit low.

KS2007: The distributor panel

Posted Sep 6, 2007 10:28 UTC (Thu) by maks (subscriber, #32426) [Link]

It would be really cool to see squashfs go in soon - wireless drivers were, and still are, too often out of tree as well. Even ALSA is behind these days; the current snd-hda is the new ata_piix. The latest ALSA patch is said to work much better.

AFAIK hch's comments on the last squashfs merge request were addressed; the squashfs patch would just need to drop the compatibility code for the old format.

Regarding the bugs: suspend still seems too hard to debug, and yes, Debian also gets its list of unhandled bugs. The current practice for Testing/Unstable users is to ask them to reproduce the problem on the latest -rc images, which are available via apt. If it is still reproducible, we ask them to file a report at bugzilla.kernel.org. bugs.debian.org is able to track the upstream status of a bugzilla bug if it is marked as forwarded with the bug number.

KS2007: The distributor panel

Posted Sep 6, 2007 15:37 UTC (Thu) by mgross (guest, #38112) [Link]

Has anyone considered imposing a kernel ABI stability cycle on the upstream Linux kernel? Something like: only one release per year is allowed to change such things.

I worry that no middle ground was reached in this discussion. Keep in mind that, for everyone but developers, the OSVs *are* Linux in practical terms.

KS2007: The distributor panel

Posted Sep 6, 2007 17:48 UTC (Thu) by mgb (guest, #3226) [Link]

Kernel hackers want a permanent playground rather than producing stable releases as in the past. That's their decision to make but blaming the distros for the consequences of that decision is not appropriate.

KS2007: The distributor panel

Posted Sep 7, 2007 12:50 UTC (Fri) by hmh (subscriber, #3838) [Link]

When you factor in the bad design decisions that made it into mainline, and the need for a way to fix them, you will see it is not just a matter of a "playground" for developers. The playground is probably what produces the badly designed APIs and ABIs that get sent to mainline in the first place.

Now, regressions and breakage of the *userspace* ABI are a major pain, but not one we can avoid completely. Userspace ABI changes need to happen sometimes, or we will end up as another Microsoft Windows, loaded with past mistakes and design errors.

There are ways to lessen that pain, or at least share it better, though. Such ABI breaks should at least be strongly documented, and tested for. Yes, break it. But know that you are breaking it, know exactly how you are breaking it, and document it explicitly in an easy-to-find, standard place.

KS2007: The distributor panel

Posted Sep 7, 2007 18:23 UTC (Fri) by mrshiny (subscriber, #4266) [Link]

The thing is, ABI changes don't need to be forbidden, they should just happen less frequently. With the modern 2.6 kernel, EVERY SINGLE RELEASE can potentially contain ABI changes. At least in the 2.0/2.2/2.4 days there were more guarantees of stability for the lifetime of the kernel. This could be enough to get drivers out the door and into customer hands and simultaneously into the tree.

As another article mentioned, even if a vendor gets their driver into the tree it might be months or years before this driver gets into customer hands. In the meantime, for anyone currently running Linux, the driver may as well not exist because basically nobody runs kernel.org kernels. The only option is if the vendor backports the driver to popular kernels, of which there are dozens.

People keep comparing ABI stability to Windows, but they forget some things:

1. Windows isn't necessarily ABI-compatible from release to release, but it is supposed to be for the lifetime of one release, including service packs.
2. Even if Windows maintains driver ABI forever, Linux doesn't need to do the same thing. But right now there is no ABI and not even a stable API (i.e. even recompiling the code may not work).

Despite GregKH's arguments to the contrary, I still think the kernel devs have made a mistake in abandoning API/ABI stability. We don't need to keep bad APIs or bugs around forever, but _some_ stability would go a long way.

KS2007: The distributor panel

Posted Sep 7, 2007 19:54 UTC (Fri) by vonbrand (subscriber, #4458) [Link]

> Despite GregKH's arguments to the contrary I still think the kernel devs have made a mistake in abandoning API/ABI stability. We don't need to keep bad APIs or bugs around forever, but _some_ stability would go a long way.

Pray tell, how much is "some"? Whatever you decide, some parties will be unhappy...

KS2007: The distributor panel

Posted Sep 7, 2007 20:42 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

> The thing is, ABI changes don't need to be forbidden, they should just happen less frequently. With the modern 2.6 kernel, EVERY SINGLE RELEASE can potentially contain ABI changes. At least in the 2.0/2.2/2.4 days there were more guarantees of stability for the lifetime of the kernel. This could be enough to get drivers out the door and into customer hands and simultaneously into the tree.

Did you actually use the 2.0/2.2/2.4 kernels that you are talking about? I did (I've been using Linux in production environments since around 2.0.30, and on my own systems since before 1.0), and there was no more ABI stability between releases then than there is with the 2.6 kernel; every release changed something.

Now, 2.6 development is much faster, so the amount of change that took place in 2.0 over a year takes place in a few months in 2.6, but it is spread across a larger code base as well.

And you are ignoring the fact that in the 1.2/2.0/2.2/2.4 days the kernels from Red Hat, SuSE, and Linus were frequently not compatible with each other. You had to change userspace noticeably when moving from one kernel series to another (the NPTL stuff in Red Hat is the biggest example, but far from the only one).

The other big problem in the 1.2/2.0/2.2/2.4 days was that you frequently couldn't run a stable kernel on your hardware; the driver (or a _working_ driver) was commonly only available in the development series, and developers, including leading ones like Alan Cox, would tell you to run the development kernel rather than the stable kernel. I ended up deploying a 2.1.166 kernel on a production box that I mailed across the country to the data center, because I needed the 3Com NIC driver to work.

I'll take the current rate of change over these sorts of problems any day.

KS2007: The distributor panel

Posted Sep 8, 2007 20:52 UTC (Sat) by malor (guest, #2973) [Link]

This is precisely correct. This is the real problem. The Linux devs no longer have reliability as the central, core goal. It's A goal, but third- or fourth-tier... the primary goal is writing new code, making it a fun thing for developers to play with. When it went to the fast-release cycle, they declared their permanent derision for end users, the people who have to actually use this stuff in real life. In real life, stuff has to work, and the only way to make sure it does is with testing and slow development. A two- or three-month release cycle is a gigantic middle finger to the people that need to trust the code.

When the 2.6 kernels started to fall to shit, a lot of us complained. We were told that we weren't supposed to run kernel.org kernels, and that the distributions would take care of testing and bugfixing. Instead of actually DEALING with the problem and shipping reliable code, they waved their hands in the air and expected others to handle it for them.

Well, guess what? Others are indeed handling it, but not how the poor kernel devs want. Awwww. The way they've found to deal with it is by not running kernel.org code; it's by using older releases and very, very carefully thinking about what to backport. The kernel people are mad because they got exactly what they asked for. The distributions are bugfixing and providing their customers a product that can actually be used.

Nobody in their RIGHT MINDS runs code straight from Linus anymore, because it's a seething, bug-infested mass of crap. The release cycle is ridiculously fast, the code quality has fallen to shit, and nobody can even be bothered to fix bugs.

And they DARE to complain that people aren't using their most recent product? This is insane. Ship code that WORKS and people will use it. Stuffing crap in a crate and expecting it to turn into gold by lots of user testing just infuriates your users, whose jobs depend on this garbage working. Every time.

Reliability is what made Linux successful in the first place; you could trust the 2.2 kernels with your life. Early 2.4 was terrible, and never got as good as 2.2, but by late in that cycle it was quite robust. 2.6 has been a steaming bag of crap.

Speed of change is not a useful metric of progress. Speed of RELIABLE change might be, but that's going to take a total shakeup of the broken dev process in place now.

KS2007: The distributor panel

Posted Sep 24, 2007 7:16 UTC (Mon) by turpie (guest, #5219) [Link]

I think part of the problem is that Andrew Morton's -mm kernels have gotten out of control. At one stage they were a testing ground for almost-ready code before it was sent to mainline. People didn't mind running them, as they were only slightly less stable than mainline; now there is a much greater chance of problems, so people can't be bothered. What is needed is a "stuff that will be sent to Linus for the next -rc1" branch, sitting between the current anything-goes -mm and Linus's tree.

KS2007: The distributor panel

Posted Sep 13, 2007 20:39 UTC (Thu) by jzbiciak (subscriber, #5246) [Link]

> So, if Linus has his way...

Linus always gets his way, by definition. After all, the Linus kernel is the mainline kernel. What his way is, though, may change based on strong external inputs. ;-)


Copyright © 2007, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds