Making stable kernels more stable

By Jonathan Corbet
October 24, 2018

Improving the quality of stable kernel releases is a perennial subject at the Kernel and Maintainers Summit events, and this year was no exception. This session, led by Fedora kernel maintainer Laura Abbott, discussed a range of ideas but found no silver bullets. There is, it seems, not much that can be done to create better stable kernels except to perform more and better testing.

Abbott's objective in running this session was to discuss ideas for reducing regressions in stable kernels. Those kernels are, after all, supposed to be stable; if they break, users will suffer and their trust in the entire process will be reduced. In the discussions prior to the summit, she had suggested that perhaps stable releases should sit in a release-candidate state for one week prior to release as a way of shaking out any bugs; that idea was not particularly well received. But we should do something, she said; if we are going to tell people that they should be running stable kernels, those people should not need to employ "an army of engineers" to debug those kernels. The stable kernels we are releasing now, she said, are not ready for production use.

Peter Zijlstra started the discussion with an assertion that the problem will never be solved. The only way anybody can ever really know that a kernel will work for their particular combination of hardware and production workload is to try it. Rafael Wysocki said that there is a fundamental conflict here: users want fixes to be aggressively applied to stable kernels, but they also want those kernels to be mature. The end result, Jiri Kosina said, is that the distributors are not using the stable kernel releases anymore.

Ted Ts'o told the group that part of the problem is that the long-term support (LTS) kernels are too successful, so the regular stable kernels are not being used anymore. The support period for those kernels is simply too short. Supporting them for a longer period would help, but that would, of course, increase the amount of work required. So the non-LTS kernels are unlikely to ever be useful for distributors. Those that have tried to use them (he mentioned CoreOS in particular) have ended up shipping regressions to users, who were naturally displeased with what they got.

Greg Kroah-Hartman, the maintainer for most of the stable kernels out there, noted that CoreOS never told him about the problems it was having, so there was not much that he could have done about them. Other stable-kernel users have a different experience. Google, for example, runs each release candidate through "a zillion tests" and, as a result, is able to push updates out to users quickly. But, it was pointed out, obtaining this kind of result requires operating a large test infrastructure. Linaro is building something like it, Kroah-Hartman said, and Red Hat too. This is the only way the use of stable kernels by a distributor can really work, he said.

Abbott pointed out that big companies have the resources to put together this kind of infrastructure, but that is not true of all would-be stable-kernel users. Sasha Levin said that the KernelCI testing project is evolving to the point where small groups should be able to make use of it. Kroah-Hartman said that KernelCI is a Linux-Foundation project now, and that it is working to add more tests; Mark Brown cautioned that KernelCI still needs resources to be able to grow, though. ~~and that it's a bit too soon to advertise it as being ready for widespread use~~

When Ts'o asked Abbott about the bugs reported by Fedora users, she replied that most of them turn up either in the graphics drivers or the KVM virtualization subsystem. Graphics, she noted, has been getting better recently; Kroah-Hartman replied that KVM is "a black hole" in this regard. Linus Torvalds said that Intel graphics, in particular, has improved a lot recently, but there is more to graphics than Intel. Abbott added that AMD graphics seems to be the source of many recent regressions.

Returning to one of her original points, Abbott asked whether companies need to be active in the kernel community to be able to use the stable releases effectively; Kroah-Hartman responded that not all users are active kernel contributors. Zijlstra said that companies don't need experts; they just need to test their workloads on the release candidates and report any bugs they find. Ts'o thought that the core problem might be a documentation issue; if users knew that they needed to test the release candidates, they might do more of it.

Kees Cook, instead, said that if the community is seeing holes that bugs are slipping through, the right response would be to add tests that might catch them — assuming such tests exist. Paul McKenney pointed out that a lot of the existing tests out there are proprietary; in such cases, it's up to the company that owns the tests to run them and report the results. Some companies do indeed do that, Kroah-Hartman said.

Arnd Bergmann observed that more patches seem to be going into the stable releases than was once the case. Kroah-Hartman said that a lot of work has gone into getting maintainers to tag fixes for the stable releases; that work is bearing fruit. But, Bergmann said, many of those "fixes" appear to be bending the rules that had been put in place for the stable kernels. The rules, Kroah-Hartman responded, are there to allow the maintainers to say "no" to specific patches, but he will generally accept a much broader range of patches for stable releases if the maintainers agree. Bergmann asked whether the rules stretch to adding fixes for warnings generated by new compilers; Kroah-Hartman said "no", that the line has to be drawn somewhere. Fixes to disable those warnings in stable-kernel builds might be accepted, though.

Toward the end, Kroah-Hartman was asked if he uses the "Fixes" tag to select patches for backporting to the stable releases; he answered that he does not have the time to do that. Levin's automatic patch-selection code can make use of it, though. Ts'o said that he has started getting CVE numbers for applicable patches for a novel reason: the presence of a CVE number will cause others to do the work backporting the patches to older kernels for him. With regard to the original topic, though, the conclusion reached by the group was clear enough: if we want better stable-kernel releases, there is really no substitute for better testing.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the Maintainers Summit.]

Index entries for this article
Kernel	Development model/Stable tree
Conference	Kernel Maintainers Summit/2018

Making stable kernels more stable

Posted Oct 24, 2018 6:23 UTC (Wed) by luya (subscriber, #50741) [Link] (5 responses)

AMD definitely needs more love in term of better hardware especially the newer models. Example is https://bugzilla.kernel.org/show_bug.cgi?id=198715 where ACPI codes broke both touchscreen and stylus support on HP Envy x360. My experience with one of kernel developers shows how much behind the process really is especially on ACPI side. Main reason is the lack of manpower and accessibly for users willing to test possible solution.

Making stable kernels more stable

Posted Oct 24, 2018 6:42 UTC (Wed) by cpitrat (subscriber, #116459) [Link] (4 responses)

There's an easy solution: give free hardware to kernel devs !

Making stable kernels more stable

Posted Oct 24, 2018 8:06 UTC (Wed) by gioele (subscriber, #61675) [Link] (1 responses)

> There's an easy solution: give free hardware to kernel devs !

How exactly?

For years I've been doing the opposite: buying hardware that I know kernel devs use privately. I (and probably many others) will be very happy to donate a copy of my laptop or desktop to any competent kernel dev. No strings attached.

BTW, ages ago I set up a (successful) pledge.me to buy the exact cell phone I had at the time to Michal Čihař, the main developer of Gammu/Wammu. That cell phone model went from unsupported to fully supported in a few weeks. ;)

Making stable kernels more stable

Posted Oct 24, 2018 10:02 UTC (Wed) by cpitrat (subscriber, #116459) [Link]

As the comment was talking about 'better hardware', I was thinking of constructors providing their newest model for free to kernel developers. This shouldn't be hard to find interested ones. This can also work for donating copy of your own hardware as long as you're not using low-end ones. It's obviously harder to find someone interested in receiving and using a 200$ laptop.

Making stable kernels more stable

Posted Oct 26, 2018 20:44 UTC (Fri) by xtifr (guest, #143) [Link] (1 responses)

It's my understanding that AMD not only gives/loans hardware to, but *pays the salary of* more than one kernel dev.

Making stable kernels more stable

Posted Oct 27, 2018 7:32 UTC (Sat) by cpitrat (subscriber, #116459) [Link]

Yes but of some working on stable kernels ?

Making stable kernels more stable

Posted Oct 24, 2018 7:34 UTC (Wed) by mjthayer (guest, #39183) [Link] (5 responses)

> In the discussions prior to the summit, she had suggested that perhaps stable releases should sit in a release-candidate state for one week prior to release as a way of shaking out any bugs; that idea was not particularly well received.

People have different levels of tolerance for bugs. I use Ubuntu and usually upgrade during the beta period because it is easier to get bugs fixed then, and to know what our users who use Ubuntu will run into. If there is no release candidate state then people who are more sensitive should wait for a while before using a release. Making use of these different tolerance levels (in fact the usual alpha-beta-release cycle) is still an effective way of keeping things stable. On a finer grain, people with higher stability requirements could also hold back non-critical stable kernel updates until they got more testing, while people with less critical needs could use them right away.

On a different note, there shear size of the kernel must be a big problem for stability. I always wonder whether people will try a more micro-kernel-like approach some day. I think that these days many of the problems of micro-kernels have been solved, but that the gains have not yet justified the pain of reworking what we have.

Making stable kernels more stable

Posted Oct 24, 2018 10:16 UTC (Wed) by dgm (subscriber, #49227) [Link] (4 responses)

> On a different note, there shear size of the kernel must be a big problem for stability. I always wonder whether people will try a more micro-kernel-like approach some day.

This is a different kind of stability. Having subsystems run at lower privilege would not make them less prone to regressions. It only gives the microkernel the opportunity to restart them without failing completely, but this is only so much useful.

Take for instance the KVM subsystem or the graphic drivers, that were metioned in the article. Losing KVM would crash the running virtual machines, which we can assume are the key part of the system for the user, so no difference here. Also, I'm not sure about graphics cards. Are desktop environments prepared to cope with losing the grahics context?

Making stable kernels more stable

Posted Oct 24, 2018 12:11 UTC (Wed) by mjthayer (guest, #39183) [Link] (1 responses)

Actually I was thinking less of the process separation and more of the conceptual separation, assuming that it is easier to keep several smaller code-bases stable than one big one. I might be wrong there of course, particularly if the added communication complexity outweighs the benefits.

Making stable kernels more stable

Posted Oct 24, 2018 20:55 UTC (Wed) by k8to (guest, #15413) [Link]

Communication is something to wrangle and have problems, but I'd say it's more about binding between components that can happen readily over a well-ordered communication path.

I definitely have seen wins when a lot of strategies are employed together. There are some systems built in erlang where many benefits were reaped by state management, controlled communication, obliviousness about local vs remote communication by default, and many other things combining to give more managability to the process. It's harder achieve those wins when the system doesn't give you the discipline tools to help ensure those things are done.

I'm suspicious that an operating system may be too low-level to use fancy tooling to help you get all those wins though. I'm definitely convinced that complex piles of C talking over messaging busses implementing a kernel does not give you a simple system or any easy wins.

Making stable kernels more stable

Posted Oct 31, 2018 16:16 UTC (Wed) by anton (subscriber, #25547) [Link] (1 responses)

Are desktop environments prepared to cope with losing the grahics context?

X applications certainly know how to redraw a window when you uniconize it. twm has a "restart" action that redraws all windows (with uniconizing and reiconizing if necessary). I use this when my Intel-based X-Terminal decides to make everything black.

My experience concerning Intel and AMD is that I have graphics problems on Intel (HD Graphics 520/500 on Skylake and Apollo Lake), while AMD (Juniper XT) works flawlessly. This may have to do with the age of the hardware (Juniper XT is from 2009, HD 5xx from 2015), but my experience is certainly the opposite of what others have stated.

Making stable kernels more stable

Posted Oct 31, 2018 17:29 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

Losing the graphics context is not like unmapping a window. When a window is unmapped you still have all the resources (e.g. pixmaps) which were previously registered with the server. For that matter the window itself still exists and things like OpenGL contexts which were generated from it remain intact; all you need to do when the window is uniconified is redraw.

Losing the graphics context is more like losing the connection to the X server. Most applications aren't prepared to deal with that gracefully. There is also the extra complication that the context includes *hardware* resources which are no longer available, and which may have been mapped directly into the application's address space. The backing for XShm mappings doesn't suddenly cease to exist even if you do lose your connection to the server.

Making stable kernels more stable

Posted Oct 24, 2018 10:35 UTC (Wed) by broonie (subscriber, #7078) [Link] (2 responses)

KernelCI is definitely ready for widespread use and in fact has been in widespread use for several use running builds and boots - some of the new functionality that Greg was talking about with running tests is in development and not available for use yet, and some of the additional resources (builders for example) that are expected to arrive soon and will be needed to deploy that aren't yet there.

It'd be great if the story could be updated to reflect this, what's there at the minute is very misleading and not what I said.

Making stable kernels more stable

Posted Oct 24, 2018 10:42 UTC (Wed) by corbet (editor, #1) [Link] (1 responses)

So I took out what I believe was the offending part of that sentence; hopefully things are better now. Apologies for any confusion.

Making stable kernels more stable

Posted Oct 24, 2018 11:25 UTC (Wed) by broonie (subscriber, #7078) [Link]

That's great, thanks!

Making stable kernels more stable

Posted Oct 24, 2018 13:17 UTC (Wed) by sashal (✭ supporter ✭, #81842) [Link] (1 responses)

With regards to distros not using -stable, I think it's important to differentiate community distros and enterprise distros.

Community distros such as Debian and Fedora track the stable tree, because it aligns with their model.

However, enterprise distros don't do that since it doesn't fit in their business model for a few reasons:

1. They don't have a reason or incentive to update anything which they don't think their customers use and pay them to maintain.
2. They might want to backport entire drivers based on customer requests, which essentially forks their kernel tree and makes stable updates more tricky to use in that tree.
3. The appearance of a slow moving kernel is better than one that receives 1000's of commits every month. Customers see few changes as a good thing which means for them that the kernel is "stable".

Making stable kernels more stable

Posted Oct 24, 2018 18:16 UTC (Wed) by nilsmeyer (guest, #122604) [Link]

> 3. The appearance of a slow moving kernel is better than one that receives 1000's of commits every month. Customers see few changes as a good thing which means for them that the kernel is "stable"

If you still have a similar release cadence, for example due to security issues, there really isn't much gained by having a "stable" enterprise kernel. The only thing stable seems to be the major version.

Making stable kernels more stable

Posted Oct 24, 2018 16:04 UTC (Wed) by iabervon (subscriber, #722) [Link]

Regardless of what you call it in each period, I think there's value in having announcements of a kernel both before and after Google runs zillions of tests on it.

More information about Intel GFX CI

Posted Oct 25, 2018 14:13 UTC (Thu) by mupuf (subscriber, #86890) [Link]

If you are interested about Intel's CI, please check out the talk Arek and I gave at FOSDEM earlier this year: https://archive.fosdem.org/2018/schedule/event/intel_ci/

Here is my latest take on the job we are doing on the Intel CI:

Linux's development model has been described by Eric S. Raymond as being
akin to a bazaar, where any developer can make changes to Linux as long
as they strictly improve the state of Linux, without regressing any
application that currently runs on it. This allows Linux users to update
their kernels and benefit from the work of all developers, without
having to fix anything in their applications when a new version comes.
Unfortunately, it is impossible for developers to try their changes on
all the different hardware and userspace combination being used in the wild.

Typically, a developer will mostly test the feature he/she is working on
with the hardware at hand before submitting the patch for review. Once
reviewed, the patch can land in a staging repository controlled by the
maintainer of the subsystem the patch is changing. Validation of the
staging tree is then performed ahead of sending these changes to Linus
Torvalds (or one of his maintainers). Regressions caught at this point
require to bisect the issue, which is time consuming and usually done by
a separate team, which may become a bottleneck. Sometimes they let
regressions through, hoping to be able to fix them during the -rc cycles.

To address this bottleneck, the developer should be responsible for
validating the change completely. This leads to a virtuous cycle as not
only developers can rework their patches until they do not break
anything (saving the time of other people), but they also become more
aware of the interaction their changes have on userspace, which improves
their understanding of the driver which leads to better future patches.

To enable putting the full cost of integration on developers, validation
needs to become 100% automated, have 100% code/HW coverage of the
userspace usecases, and provide timely validation results to even the
most time-pressured developers. To reach these really ambitious
objectives, driver developers and validation engineers need to be
considered as one team. The CI system developers need to provide a
system capable of reaching the objectives, and driver developers need to
develop a test suite capable of reaching the goal of having 100% code
coverage of the whole driver on the CI system provided to them.

Finally, this increase in understanding of how validation is done allows
developers to know if their patch series will be properly validated,
which reduces the risk of letting regressions land in Linux.

The devil however lies in the details, so in this talk, we will explain
how we are going from theory to practice, what is our current status and
what we are doing to get closer to our ambitious goal! We will describe
the current developer workflow and demonstrate how we empowered
developers by providing timely testing as a transparent service to
anyone sending patches to our mailing lists.

One thing missing from this wall of text is: we need to create an open source toolbox for testing. And if we could have an hackable open source infrastructure with a canonical deployment for the project that would run HW-independent tests, along with hosting all the results, we would already have a good start. Then we could have external CI farms providing the results for HW-dependent tests. More work needed there!

Making stable kernels more stable

Posted Oct 26, 2018 13:54 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

For what it's worth, I have contacted Laura and Greg about KVM, and both have been very helpful and appreciative. It turns out that Fedora QA is not only depending on KVM, they are also running it nested, and under CentOS KVM at that! We will definitely add some kind of testing to RHEL and CentOS so that we don't cause trouble for Fedora.

We might also look into adding a pointer to kvm-unit-tests.git as a submodule in the Linux tree, to ease testing on part of the stable tree maintainers. Thanks Jon for reporting this, it's very useful for maintainers that didn't make it to the summit.

Making stable kernels more stable

Posted Oct 31, 2018 16:58 UTC (Wed) by anton (subscriber, #25547) [Link]

I am wondering whether, for software like the Linux kernel which promises not to break the user experience, it is really necessary to run ancient kernels with only bug fixes applied. I.e., we should all be able to switch to a relatively recent kernel version, because it behaves the same way as the ancient one, and if it does not, the kernel maintainers will fix it. And indeed, upgrading the kernel while keeping the rest of the system the same has worked for me (I have not tried the 1993 Yggdrasil distribution with a present-day kernel, though:-).

Of course, if you have high stability requirements, you will not want to run a freshly released kernel (bugs happen), but after a few months, the kernel should be mature enough, and maintaining ancient kernels should be unnecessary. I.e., one (or, transitionally, two) stable branches should be enough.

There is other software (such as gcc) that does not make such promises, and for this software we actually need ancient versions.

Unfortunately, the mainstream distributions tend to treat all software alike: either no rolling release or all rolling release.