A discussion on stable kernel workflow issues

By Jonathan Corbet
November 1, 2016

2016 Kernel Summit

The opening session at the 2016 Kernel Summit, led by Jiri Kosina, had to do with the process of creating stable kernel updates. There is, he said, a bit of a disconnect between what the various parties involved want, and that has led to trouble for the consumers of the stable kernel releases.

Jiri's point of view was centered on his role as a distribution kernel maintainer. Consumers like him want a number of things from the stable kernel releases, including fixes for user-visible functional problems, fixes for bugs that crash the system, and fixes for severe performance regressions. What they do not want are new features or minor performance improvements; the latter have often been shown to regress performance for other workloads.

Perhaps the biggest thing in the "don't want" column, though, is something that has caused quite a bit of trouble in the past: fixes for bugs that are not actually present. There have been a number of cases of bogus "fixes" that have broken things, causing big headaches for distributors, who must spend a lot of time figuring out what has gone wrong. Just because a patch applies cleanly to an older kernel does not mean that it actually belongs there, but that distinction often seems to get lost.

Part of that, perhaps, is a result of what the producers of stable kernels want: a process that scales. Stable releases are done by a small group of developers; they don't have a lot of time to spend on each proposed fix. They want to include all of the fixes that make sense, but depend on others to tell them when fixes actually do make sense.

From his point of view as a distribution maintainer, Jiri said that one of the biggest issues with the stable process is that it is not clear why specific patches got into stable. The "CC: stable" tag applied to patches is an opt-in mechanism, and the decision is often made by people who are neither the author of the patch nor the maintainer of the relevant subsystem. The review process is far too lenient; it is a passive approval approach that lets stuff get in unless somebody goes out of their way to block it. Maintainers often respond to proposed stable inclusions with the equivalent of "oh yeah, whatever" without really thinking about the problem. The review barrier is too low; somebody needs to be thinking about the semantics of every stable patch.

James Bottomley noted that the typical maintainer response is to think that they already reviewed a specific patch when they first accepted it; they aren't sure what else they should do when a patch comes around in a stable release. Mark Brown added that maintainers often can't remember what a particular ancient kernel looked like, so it is hard for them to say whether a specific patch makes sense there or not.

Christoph Hellwig pointed out that not all stable kernels are equal. The current review process is reasonable for recent kernels, he said, but it does not work as well for the long-term stable releases. Stable kernel maintainer Greg Kroah-Hartman asked what "good" meant in general. There are always going to be bugs, what is the acceptable rate when five or six patches are being applied to a two-year-old kernel every day?

Jiri said that there is always room for improvement. One possible way to make things better might be to require a Fixes: tag for every patch going into a stable kernel release. That tag identifies the bug that a patch is meant to fix; without it, the stable maintainers don't know if a patch fixes a bug that is actually present in an older kernel. An alternative might be a new tag specifically related to inclusion into a stable kernel; it would mean "I have thought about this responsibly." That tag should specify the version(s) of the kernel it should be applied to. There could also be a tag for patches that should not be considered for stable releases.

Ben Herrenschmidt asked about whether the stable kernels should include patches to make new hardware work. Jiri responded that those patches aren't really wanted, but he didn't care too much if they don't break things. Greg added that addition of new device IDs and quirks is pretty common in the stable kernels. Laura Abbot noted that there is often a fine line between a "feature" and a "bug fix." The direct rendering subsystem, for example, often comes up with large changes to make specific hardware features perform well; without them, things run slowly. DRM subsystem bug-fix patches can also be quite large.

Rafael Wysocki said that, often, a maintainer knows that things broke, but is not sure of which change caused the problem. Jiri responded that, in that case, the fix should not go into the stable releases. Ben wondered how a problem could be fixed if the developer didn't know what caused the problem. Linus chimed in to note that requiring a Fixes: tag could lead to the addition of bogus tags when the maintainer doesn't know. Maintainers should not do that, he said, but Mark worried that it could happen as a "hoop-jumping" exercise.

Jiri said that, one way or another, somebody has to go through the exercise of thinking about which stable kernels a patch should be applied to. The result can take the form of either a Fixes: tag or a list of relevant versions, as long as that work has been done.

Dan Williams suggested that developers should add more test cases to demonstrate bugs and prove their fixes. These tests could be run against older kernels to show whether they need the fix — and whether the fix works as expected. Steve Rostedt said he will often add tests in response to bugs in his code; he suggested the addition of a new tag listing a test to run for a patch. But Linus was against the addition of new tags, especially for passing information to other humans. That sort of thing should just be explained in the commit message, he said. If we have too many tags, their usage will be inconsistent at best.

Linus went on to point out a specific case of things going wrong. The stable trees recently shipped a three-line patch that made a trivial root exploit possible on older CPUs. He had suggested removing the patch from the stable series, but the stable maintainers all chose to apply a 100-line fix instead. That was, he said, the wrong decision. If a stable tree takes a patch that breaks things, the response should not be to add more patches. Greg said that the stable maintainers would, as a general rule, rather have the same changes that the mainline has so that the bugs are all the same.

Ted Ts'o said that there are a lot of device vendors out there who are not actually using the stable releases; instead, they are cherry-picking patches that look relevant. If, in this case, they got the problematic three-line patch but missed the subsequent fix, the results would be bad. In general, he said, the group had not yet talked much about how the stable kernels are actually being used. Rafael said that broken patches should simply be reverted from the stable kernels rather than fixed further; that way, it will be clear to others where the broken patch was. But Ben protested that patches often come in a series; reverting just one can break things further.

Andy Lutomirski said that the fixup patch in this case was partly his work, even though it didn't have his name on it when it was applied. That fix should never have been applied to the stable kernels, he said; stable maintainers, in general, should never apply complex changes without asking first.

James said that the failure this time around was in the review process, but Linus replied that the patch was, in fact, "fine and correct." The problem is that it depended on another change that went into the 4.6 kernel. He doesn't blame that specific patch for the trouble; the bug was not obvious, and the backport was clean. It would have taken a "superhuman" to realize that the patch would be problematic in 4.1 without the exception-handling change that was applied in 4.6. The failure was when the patch was revealed to have broken things; at that point, it should have been reverted immediately. Mistakes will happen and the stable kernels will not be perfect; but, he said, those kernels are too eager to accept patches, and to accept more rather than reverting a patch that went wrong.

As the discussion wound down, perhaps the one solid conclusion was that problematic patches should indeed be reverted rather than fixed; that policy was immediately applied in subsequent stable kernel releases (example). There will likely be more pressure for each patch destined for the stable releases to carry a Fixes: tag or some version information making it clear that the relevant bug is present. And, hopefully, stable releases will be a little more stable in the future.

Index entries for this article
Kernel	Development model/Stable tree
Conference	Kernel Summit/2016

A discussion on stable kernel workflow issues

Posted Nov 1, 2016 21:42 UTC (Tue) by spender (guest, #23067) [Link] (12 responses)

So now we have the answer to the "a bug" fixed in 4.4.30:

"The stable trees recently shipped a three-line patch that made a trivial root exploit possible on older CPUs."

This is ridiculous.

-Brad

A discussion on stable kernel workflow issues

Posted Nov 1, 2016 22:14 UTC (Tue) by alavaliant (subscriber, #104924) [Link]

I wonder what the 'older CPUs' types are? The number of machines that can be exploited would differ greatly depending on if the effected cpu classes are ones older than 2 years or ones older than 10 years.

A discussion on stable kernel workflow issues

Posted Nov 1, 2016 22:22 UTC (Tue) by bronson (subscriber, #4806) [Link] (1 responses)

And it's ridiculous that this is how the word gets out. I'm really struggling to understand the muddled thinking and communication that have gone into the last few days of stable kernel development.

A discussion on stable kernel workflow issues

Posted Nov 2, 2016 13:24 UTC (Wed) by rich0 (guest, #55509) [Link]

I agree that there needs to be a better way to communicate things like this.

I was working on updating a system to 4.4.29, and then I heard from somebody on IRC I think that somebody had mentioned at the summit to not bother as it had some issue and 4.4.30 was going to be released shortly.

When the -announce list says "upgrade your systems" and the "wait, don't upgrade your systems" is part of the Q&A at some conference, there is a problem somewhere.

I fully get that sometimes there is going to be a vulnerability and you can't always wait to do updates. However, it seems like the last few weeks have had far more updates than can be explained by this. Preventing bugs from ever making their way into mainline is one challenge, but avoiding regressions in stable seems like another.

A discussion on stable kernel workflow issues

Posted Nov 1, 2016 22:22 UTC (Tue) by spender (guest, #23067) [Link] (8 responses)

And to spell it all out:

For the 4.4.x -stable series, the vulnerability affected 4.4.22 to 4.4.29 -- all 4.4 stable kernels from September 24th onward, including the ones everyone updated to to fix DirtyCOW.

Presumably the "older CPUs" refers to pre-Ivy Bridge which lacked SMEP, as the broken fixup results in a jump to a userland address. This post: http://www.mail-archive.com/linux-kernel@vger.kernel.org/... shows a reproducer and the OOPS illustrating the userland instruction pointer address. Grsecurity's i386 KERNEXEC and x64 UDEREF would both prevent this from being exploitable, mmap_min_addr would do nothing.

-Brad

A discussion on stable kernel workflow issues

Posted Nov 1, 2016 23:19 UTC (Tue) by spender (guest, #23067) [Link] (2 responses)

To be specific, 4.4.29 is not vulnerable. It includes the exception table handling patch. For whatever reason (I guess we would have to be present at the kernel summit to know) they decided to revert both patches and leave the infoleak there for 4.4.30 instead.

-Brad

A discussion on stable kernel workflow issues

Posted Nov 2, 2016 0:05 UTC (Wed) by jake (editor, #205) [Link] (1 responses)

> For whatever reason (I guess we would have to be present at the kernel summit to know)
> they decided to revert both patches and leave the infoleak there for 4.4.30 instead.

I think the answer you seek is in the article you are commenting on, Brad.

jake

A discussion on stable kernel workflow issues

Posted Nov 2, 2016 0:15 UTC (Wed) by spender (guest, #23067) [Link]

Unless the answer is dogmatic adherence to a new policy of reverting things that are fixed perfectly well, it's not.

-Brad

A discussion on stable kernel workflow issues

Posted Nov 2, 2016 8:00 UTC (Wed) by citypw (guest, #82661) [Link] (3 responses)

The truth of another "silent fix" ( again-_-)! Awesome, it seems vanilla kernel doesn't care if GNU/Linux users( individual/enterprise/etc) have important digital asset on those 0ld good machines without PaX/Grsecurity's KERNEXEC/UDEREF or SMEP/SMAP. Lacking mitigation to prevent 0ld shitty ret2usr exploit is one thing, while "a trivial root exploit" is another. Trivial root exploit? I wouldn't say any privilege escalation exploit is trivial;-)

A discussion on stable kernel workflow issues

Posted Nov 2, 2016 8:25 UTC (Wed) by intgr (subscriber, #39733) [Link] (2 responses)

> Trivial root exploit? I wouldn't say any privilege escalation exploit is trivial;-)

I assumed "trivial" means trivially easy to exploit... I doubt anyone meant to say that it's unimportant.

A discussion on stable kernel workflow issues

Posted Nov 2, 2016 10:18 UTC (Wed) by citypw (guest, #82661) [Link]

oops, guess I misunderstand it...s0rry guys...

A discussion on stable kernel workflow issues

Posted Nov 15, 2016 13:46 UTC (Tue) by Wol (subscriber, #4433) [Link]

In maths, "trivial" means "easy to reproduce". Doesn't mean it's easy to work out or discover in the first place.

So something like RSA is trivial. BREAKING RSA is "trivial" (in the mathematical sense - it's a well understood problem with a well understood solution). But it's almost impossible to actually do.

Cheers,
Wol

A discussion on stable kernel workflow issues

Posted Nov 3, 2016 12:08 UTC (Thu) by spender (guest, #23067) [Link]

Since the above comment is now being linked to from oss-sec regarding a CVE request for the vuln (though oddly, for the infoleak itself and not the bad fix that results in privilege escalation) I want to add that I've verified that the KERNEXEC plugin also prevents exploitation.

I do like though that the infoleak will receive a CVE, since simply not fixing the infoleak (which could have been done very easily in combination with reverting the "problematic" commits) is unacceptable. We have a very elegant fix for it in grsecurity that doesn't rely on either commit.

Also, since this covering up of security vulnerabilities has really gotten out of hand, we have finally started listing on our website an added benefit for customers that was previously unpublicized: we routinely send out notices for silently-fixed vulnerabilities with detailed analysis and how grsecurity affects exploitation. Mails were sent out to customers for this vulnerability as well as DirtyCOW for instance, the day the commits were public. I will note the information we provided answered at least one of solar's "challenges" and stood in constrast to much of the optimistic misinformation that was being spread about the vulnerability.

Greg KH has publicly alluded that he wants to damage our ability to make a living from our work. Greg, I would be happy for you to make this item of our services irrelevant by committing in the future to stop lying to your users.

-Brad

Rant about unprofessional workflow

Posted Nov 2, 2016 10:58 UTC (Wed) by NAR (subscriber, #1313) [Link] (8 responses)

I worked on projects where the rule was that no commit should go into the main repository without a reference to the bug or task tracker. All commits should be either fixes for bugs (in this case the reference should point to the bug in question) or implementation of new feature (in this case the reference should point to the task tracker describing the new feature). Code cleanups, refactorings, etc. were considered "new feature" for this purpose. There was a field on the bug tracker ("introduced at") where the developer who fixed the bug had to state which version introduced the bug. This helped greatly the decision when we needed to backport a fix or not. Also the bug description obviously described if this is a severe bug (e.g. crash, security problem, etc.) or not.

Granted, this needs discipline. No cowboy-style commit-and-forget programming. For example if a developer finds a bug during code review or implementing some other feature, he had to create the bug report and go through the process - but in the long run it was really useful. I guess the kernel developers don't think that there's a bug or issue tracker that can cope with their workloads (hint: write one).

Rant about unprofessional workflow

Posted Nov 4, 2016 16:22 UTC (Fri) by imMute (guest, #96323) [Link] (7 responses)

>Granted, this needs discipline. No cowboy-style commit-and-forget programming. For example if a developer finds a bug during code review or implementing some other feature, he had to create the bug report and go through the process - but in the long run it was really useful.

And here's where people would just sit on uncommitted code because its too much work having to fill out a bug report on code that is still on a feature branch. If it's a bug found by the second set of eyes to see the code, then I don't think a bug report is worth it - it's just needless paperwork.

Rant about unprofessional workflow

Posted Nov 6, 2016 14:36 UTC (Sun) by NAR (subscriber, #1313) [Link] (6 responses)

It should not be much work to submit a bug report. Less than 5 minutes. And it usually helps to see the bug description in text, because the developer might realize what else is affected (or it's not a bug at all or some logs are missing, etc.). And it really helps a year later when somebody tries to find out why that particular change went into the codebase.

I don't really understand the comment about feature branch: if it's a bug in the new (not yet released) feature, then there should be already an entry in the issue tracker about that feature. If it's a bug in an old (already released) feature, then it should be present on the "main" branch too, so it warrants a new bug entry.

Rant about unprofessional workflow

Posted Nov 7, 2016 12:19 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (5 responses)

Every substantive change, sure, maybe a mention in an issue tracker is important, but for issues like malloc/delete mismatches, whitespace fixes, doc typo fixes, and the like, it is just extra work. That's why "every commit requires an issue" is block headed. Maybe every topic/merge request, but certainly not every commit.

Rant about unprofessional workflow

Posted Nov 7, 2016 14:34 UTC (Mon) by excors (subscriber, #95769) [Link] (1 responses)

If you require every commit to reference an issue, then for trivial commits it might take a couple of minutes to copy a few lines from the commit message into the issue tracker which nobody will ever look at again, so it's wasted time. (For multiple closely-related commits you could combine them into a single issue, like "Fix coding style in xyzzy driver", so it's really an overhead per group of related work, not per commit.)

But if you don't require every commit to reference an issue (and enforce that requirement), someone will occasionally make an important substantive commit without an issue, which will turn out to introduce some bugs or need backporting or need some documentation written about its rationale etc, and someone will have to spend hours searching through mailing lists and LWN articles trying to find all the possibly-years-old discussions about that commit (earlier versions, code reviews, related bug reports, etc). That's also wasted time, which could have been saved if the discussion about the commit had been tracked in a single place (even just as a collection of links to mailing list threads).

If you require only substantive commits to reference an issue, you'll waste time deciding and then arguing with people over whether a commit is really substantive (e.g. a malloc/free bug could be a serious security vulnerability so surely that counts?), and you'll still sometimes fail to track a commit that should have been tracked.

I think it's unclear which approach wastes the most time overall, but the most predictable is to require an issue for every commit - it's a mildly irritating overhead when writing patches, but it's easy to get into the habit of using the issue tracker for everything, and it can save some rare but substantial headaches in the future.

Rant about unprofessional workflow

Posted Nov 7, 2016 18:03 UTC (Mon) by bronson (subscriber, #4806) [Link]

Furthermore, if you're iterating so fast that creating a tracker entry is an unbearable overhead... I'm not sure your work belongs in the stable branch.

Rant about unprofessional workflow

Posted Nov 7, 2016 16:31 UTC (Mon) by NAR (subscriber, #1313) [Link] (2 responses)

Whitespace, typo fixes, merge errors - they don't warrant issue tracker entries. The malloc/delete mismatches could be real, user facing errors (e.g. destructor not called), these should be tracked.

Rant about unprofessional workflow

Posted Nov 7, 2016 17:27 UTC (Mon) by madscientist (subscriber, #16861) [Link] (1 responses)

Merge errors should be tagged with the tracker entry which introduced the merge error, so that an attempt to cherry-pick the broken commit has a chance of noticing the commit that fixes the error as well.

For whitespace cleanups and typos, you can have a catch-all tracker entries (as mentioned above) for "whitespace cleanup" and "comment typos" that is just used for all those types of issues. It incentivizes people to break these out into separate commits anyway, which is a good thing.

When I introduced a similar requirement locally (using Git hooks on the server to reject pushes where the comment didn't provide the proper information) I originally thought it would be good to have an "out" where you could use a special tag like "trivial" or something and avoid creating a tracker entry. Others in the team were against this and wanted to require a real tracker entry for every commit so that's what we did, and it has been fine.

Rant about unprofessional workflow

Posted Nov 10, 2016 18:02 UTC (Thu) by Wol (subscriber, #4433) [Link]

The trouble is, (a) managing the kernel is like herding cats, and (b) many of them are blind :-) Without a spec it's too easy to make a mistake, and nobody wants to specify or document what *should* happen, so all too often it doesn't ...

Cheers,
Wol

A discussion on stable kernel workflow issues

Posted Nov 2, 2016 11:16 UTC (Wed) by daenzer (subscriber, #7050) [Link] (3 responses)

My experience with the stable kernel workflow hasn't been bad overall, but there have been exceptions.

My worst experience so far was when I received "stable review patch" e-mails for backporting e64c952efb8e0c15ae82cec8e455ab4910690ef1 ("drm/radeon: disable runtime pm on PX laptops without dGPU power control") to 4.4 and 4.5. I followed up as soon as I could, reporting that a regression had been bisected to that commit, so it would be better to hold off on backporting it. Greg KH's response to that was "Why not just provide stable with the bugfix as well?", but there was no such fix yet at that time. The change was backported anyway, which resulted in at least one additional bug report (and probably more people unnecessarily running into the regression).

It left me wondering what the purpose of those "stable review patch" e-mails was.

A discussion on stable kernel workflow issues

Posted Nov 3, 2016 0:42 UTC (Thu) by syrjala (subscriber, #47399) [Link] (2 responses)

One thing I'd like to see is some kind of central database of commits that should *not* be backported to stable, even if they have the cc:stable tag. We've had a few cases in i915 land where the cc:stable was possibly applied a bit too optimistically, or perhaps the patch really seemed like stable material at the time but was later found to cause a regression. It's quite tedious having to NAK the same patch over and over again when various people proceed to backport it to whatever stable branch they maintain. One NAK should be enough to block it from all stable trees IMO.

A discussion on stable kernel workflow issues

Posted Nov 3, 2016 13:24 UTC (Thu) by jani (subscriber, #74547) [Link] (1 responses)

Indeed. I think I'm starting to lean towards preferring Fixes: tags over Cc: stable at the time of applying patches, and doing separate stable backport requests later, when we've had more exposure to the candidate fixes. The stable backport requests could be scripted using the Fixes: tags, checking that the later commits don't contain reverts of the "fix" or commits fixing the "fix".

A discussion on stable kernel workflow issues

Posted Nov 6, 2016 18:48 UTC (Sun) by blackwood (guest, #44174) [Link]

Greg apparently already reviews all patches with "Fixes:" tags and cherry-picks for stable what he deems appropriate ...