A discussion on stable kernel workflow issues
Jiri's point of view was centered on his role as a distribution kernel maintainer. Consumers like him want a number of things from the stable kernel releases, including fixes for user-visible functional problems, fixes for bugs that crash the system, and fixes for severe performance regressions. What they do not want are new features or minor performance improvements; the latter have often been shown to regress performance for other workloads.
Perhaps the biggest thing in the "don't want" column, though, is something that has caused quite a bit of trouble in the past: fixes for bugs that are not actually present. There have been a number of cases of bogus "fixes" that have broken things, causing big headaches for distributors, who must spend a lot of time figuring out what has gone wrong. Just because a patch applies cleanly to an older kernel does not mean that it actually belongs there, but that distinction often seems to get lost.
Part of that, perhaps, is a result of what the producers of stable kernels want: a process that scales. Stable releases are done by a small group of developers; they don't have a lot of time to spend on each proposed fix. They want to include all of the fixes that make sense, but depend on others to tell them when fixes actually do make sense.
From his point of view as a distribution maintainer, Jiri said that one of
the biggest issues with the stable process is that it is not clear why
specific patches got into stable. The "CC: stable" tag applied to patches
is an opt-in mechanism, and the decision is often made by people who are
neither the author of the patch nor the maintainer of the relevant
subsystem. The review process is far too lenient; it is a passive approval
approach that lets stuff get in unless somebody goes out of their way to
block it. Maintainers often respond to proposed stable inclusions with the
equivalent of "oh yeah, whatever" without really thinking about the
problem. The review barrier is too low; somebody needs to be thinking
about the semantics of every stable patch.
James Bottomley noted that the typical maintainer response is to think that they already reviewed a specific patch when they first accepted it; they aren't sure what else they should do when a patch comes around in a stable release. Mark Brown added that maintainers often can't remember what a particular ancient kernel looked like, so it is hard for them to say whether a specific patch makes sense there or not.
Christoph Hellwig pointed out that not all stable kernels are equal. The current review process is reasonable for recent kernels, he said, but it does not work as well for the long-term stable releases. Stable kernel maintainer Greg Kroah-Hartman asked what "good" meant in general. There are always going to be bugs, what is the acceptable rate when five or six patches are being applied to a two-year-old kernel every day?
Jiri said that there is always room for improvement. One possible way to make things better might be to require a Fixes: tag for every patch going into a stable kernel release. That tag identifies the bug that a patch is meant to fix; without it, the stable maintainers don't know if a patch fixes a bug that is actually present in an older kernel. An alternative might be a new tag specifically related to inclusion into a stable kernel; it would mean "I have thought about this responsibly." That tag should specify the version(s) of the kernel it should be applied to. There could also be a tag for patches that should not be considered for stable releases.
Ben Herrenschmidt asked about whether the stable kernels should include patches to make new hardware work. Jiri responded that those patches aren't really wanted, but he didn't care too much if they don't break things. Greg added that addition of new device IDs and quirks is pretty common in the stable kernels. Laura Abbot noted that there is often a fine line between a "feature" and a "bug fix." The direct rendering subsystem, for example, often comes up with large changes to make specific hardware features perform well; without them, things run slowly. DRM subsystem bug-fix patches can also be quite large.
Rafael Wysocki said that, often, a maintainer knows that things broke, but is not sure of which change caused the problem. Jiri responded that, in that case, the fix should not go into the stable releases. Ben wondered how a problem could be fixed if the developer didn't know what caused the problem. Linus chimed in to note that requiring a Fixes: tag could lead to the addition of bogus tags when the maintainer doesn't know. Maintainers should not do that, he said, but Mark worried that it could happen as a "hoop-jumping" exercise.
Jiri said that, one way or another, somebody has to go through the exercise of thinking about which stable kernels a patch should be applied to. The result can take the form of either a Fixes: tag or a list of relevant versions, as long as that work has been done.
Dan Williams suggested that developers should add more test cases to demonstrate bugs and prove their fixes. These tests could be run against older kernels to show whether they need the fix — and whether the fix works as expected. Steve Rostedt said he will often add tests in response to bugs in his code; he suggested the addition of a new tag listing a test to run for a patch. But Linus was against the addition of new tags, especially for passing information to other humans. That sort of thing should just be explained in the commit message, he said. If we have too many tags, their usage will be inconsistent at best.
Linus went on to point out a specific case of things going wrong. The stable trees recently shipped a three-line patch that made a trivial root exploit possible on older CPUs. He had suggested removing the patch from the stable series, but the stable maintainers all chose to apply a 100-line fix instead. That was, he said, the wrong decision. If a stable tree takes a patch that breaks things, the response should not be to add more patches. Greg said that the stable maintainers would, as a general rule, rather have the same changes that the mainline has so that the bugs are all the same.
Ted Ts'o said that there are a lot of device vendors out there who are not actually using the stable releases; instead, they are cherry-picking patches that look relevant. If, in this case, they got the problematic three-line patch but missed the subsequent fix, the results would be bad. In general, he said, the group had not yet talked much about how the stable kernels are actually being used. Rafael said that broken patches should simply be reverted from the stable kernels rather than fixed further; that way, it will be clear to others where the broken patch was. But Ben protested that patches often come in a series; reverting just one can break things further.
Andy Lutomirski said that the fixup patch in this case was partly his work, even though it didn't have his name on it when it was applied. That fix should never have been applied to the stable kernels, he said; stable maintainers, in general, should never apply complex changes without asking first.
James said that the failure this time around was in the review process, but Linus replied that the patch was, in fact, "fine and correct." The problem is that it depended on another change that went into the 4.6 kernel. He doesn't blame that specific patch for the trouble; the bug was not obvious, and the backport was clean. It would have taken a "superhuman" to realize that the patch would be problematic in 4.1 without the exception-handling change that was applied in 4.6. The failure was when the patch was revealed to have broken things; at that point, it should have been reverted immediately. Mistakes will happen and the stable kernels will not be perfect; but, he said, those kernels are too eager to accept patches, and to accept more rather than reverting a patch that went wrong.
As the discussion wound down, perhaps the one solid conclusion was that
problematic patches should indeed be reverted rather than fixed; that
policy was immediately applied in subsequent
stable kernel releases (example). There
will likely be more pressure for each
patch destined for the stable releases to carry a Fixes: tag or
some version information making it clear that the relevant bug is present.
And, hopefully, stable releases will be a little more stable in the
future.
Index entries for this article | |
---|---|
Kernel | Development model/Stable tree |
Conference | Kernel Summit/2016 |
Posted Nov 1, 2016 21:42 UTC (Tue)
by spender (guest, #23067)
[Link] (12 responses)
"The stable trees recently shipped a three-line patch that made a trivial root exploit possible on older CPUs."
This is ridiculous.
-Brad
Posted Nov 1, 2016 22:14 UTC (Tue)
by alavaliant (subscriber, #104924)
[Link]
Posted Nov 1, 2016 22:22 UTC (Tue)
by bronson (subscriber, #4806)
[Link] (1 responses)
Posted Nov 2, 2016 13:24 UTC (Wed)
by rich0 (guest, #55509)
[Link]
I was working on updating a system to 4.4.29, and then I heard from somebody on IRC I think that somebody had mentioned at the summit to not bother as it had some issue and 4.4.30 was going to be released shortly.
When the -announce list says "upgrade your systems" and the "wait, don't upgrade your systems" is part of the Q&A at some conference, there is a problem somewhere.
I fully get that sometimes there is going to be a vulnerability and you can't always wait to do updates. However, it seems like the last few weeks have had far more updates than can be explained by this. Preventing bugs from ever making their way into mainline is one challenge, but avoiding regressions in stable seems like another.
Posted Nov 1, 2016 22:22 UTC (Tue)
by spender (guest, #23067)
[Link] (8 responses)
For the 4.4.x -stable series, the vulnerability affected 4.4.22 to 4.4.29 -- all 4.4 stable kernels from September 24th onward, including the ones everyone updated to to fix DirtyCOW.
Presumably the "older CPUs" refers to pre-Ivy Bridge which lacked SMEP, as the broken fixup results in a jump to a userland address. This post: http://www.mail-archive.com/linux-kernel@vger.kernel.org/... shows a reproducer and the OOPS illustrating the userland instruction pointer address. Grsecurity's i386 KERNEXEC and x64 UDEREF would both prevent this from being exploitable, mmap_min_addr would do nothing.
-Brad
Posted Nov 1, 2016 23:19 UTC (Tue)
by spender (guest, #23067)
[Link] (2 responses)
-Brad
Posted Nov 2, 2016 0:05 UTC (Wed)
by jake (editor, #205)
[Link] (1 responses)
I think the answer you seek is in the article you are commenting on, Brad.
jake
Posted Nov 2, 2016 0:15 UTC (Wed)
by spender (guest, #23067)
[Link]
-Brad
Posted Nov 2, 2016 8:00 UTC (Wed)
by citypw (guest, #82661)
[Link] (3 responses)
Posted Nov 2, 2016 8:25 UTC (Wed)
by intgr (subscriber, #39733)
[Link] (2 responses)
I assumed "trivial" means trivially easy to exploit... I doubt anyone meant to say that it's unimportant.
Posted Nov 2, 2016 10:18 UTC (Wed)
by citypw (guest, #82661)
[Link]
Posted Nov 15, 2016 13:46 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
So something like RSA is trivial. BREAKING RSA is "trivial" (in the mathematical sense - it's a well understood problem with a well understood solution). But it's almost impossible to actually do.
Cheers,
Posted Nov 3, 2016 12:08 UTC (Thu)
by spender (guest, #23067)
[Link]
I do like though that the infoleak will receive a CVE, since simply not fixing the infoleak (which could have been done very easily in combination with reverting the "problematic" commits) is unacceptable. We have a very elegant fix for it in grsecurity that doesn't rely on either commit.
Also, since this covering up of security vulnerabilities has really gotten out of hand, we have finally started listing on our website an added benefit for customers that was previously unpublicized: we routinely send out notices for silently-fixed vulnerabilities with detailed analysis and how grsecurity affects exploitation. Mails were sent out to customers for this vulnerability as well as DirtyCOW for instance, the day the commits were public. I will note the information we provided answered at least one of solar's "challenges" and stood in constrast to much of the optimistic misinformation that was being spread about the vulnerability.
Greg KH has publicly alluded that he wants to damage our ability to make a living from our work. Greg, I would be happy for you to make this item of our services irrelevant by committing in the future to stop lying to your users.
-Brad
Posted Nov 2, 2016 10:58 UTC (Wed)
by NAR (subscriber, #1313)
[Link] (8 responses)
Granted, this needs discipline. No cowboy-style commit-and-forget programming. For example if a developer finds a bug during code review or implementing some other feature, he had to create the bug report and go through the process - but in the long run it was really useful. I guess the kernel developers don't think that there's a bug or issue tracker that can cope with their workloads (hint: write one).
Posted Nov 4, 2016 16:22 UTC (Fri)
by imMute (guest, #96323)
[Link] (7 responses)
And here's where people would just sit on uncommitted code because its too much work having to fill out a bug report on code that is still on a feature branch. If it's a bug found by the second set of eyes to see the code, then I don't think a bug report is worth it - it's just needless paperwork.
Posted Nov 6, 2016 14:36 UTC (Sun)
by NAR (subscriber, #1313)
[Link] (6 responses)
I don't really understand the comment about feature branch: if it's a bug in the new (not yet released) feature, then there should be already an entry in the issue tracker about that feature. If it's a bug in an old (already released) feature, then it should be present on the "main" branch too, so it warrants a new bug entry.
Posted Nov 7, 2016 12:19 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link] (5 responses)
Posted Nov 7, 2016 14:34 UTC (Mon)
by excors (subscriber, #95769)
[Link] (1 responses)
But if you don't require every commit to reference an issue (and enforce that requirement), someone will occasionally make an important substantive commit without an issue, which will turn out to introduce some bugs or need backporting or need some documentation written about its rationale etc, and someone will have to spend hours searching through mailing lists and LWN articles trying to find all the possibly-years-old discussions about that commit (earlier versions, code reviews, related bug reports, etc). That's also wasted time, which could have been saved if the discussion about the commit had been tracked in a single place (even just as a collection of links to mailing list threads).
If you require only substantive commits to reference an issue, you'll waste time deciding and then arguing with people over whether a commit is really substantive (e.g. a malloc/free bug could be a serious security vulnerability so surely that counts?), and you'll still sometimes fail to track a commit that should have been tracked.
I think it's unclear which approach wastes the most time overall, but the most predictable is to require an issue for every commit - it's a mildly irritating overhead when writing patches, but it's easy to get into the habit of using the issue tracker for everything, and it can save some rare but substantial headaches in the future.
Posted Nov 7, 2016 18:03 UTC (Mon)
by bronson (subscriber, #4806)
[Link]
Posted Nov 7, 2016 16:31 UTC (Mon)
by NAR (subscriber, #1313)
[Link] (2 responses)
Posted Nov 7, 2016 17:27 UTC (Mon)
by madscientist (subscriber, #16861)
[Link] (1 responses)
For whitespace cleanups and typos, you can have a catch-all tracker entries (as mentioned above) for "whitespace cleanup" and "comment typos" that is just used for all those types of issues. It incentivizes people to break these out into separate commits anyway, which is a good thing.
When I introduced a similar requirement locally (using Git hooks on the server to reject pushes where the comment didn't provide the proper information) I originally thought it would be good to have an "out" where you could use a special tag like "trivial" or something and avoid creating a tracker entry. Others in the team were against this and wanted to require a real tracker entry for every commit so that's what we did, and it has been fine.
Posted Nov 10, 2016 18:02 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Nov 2, 2016 11:16 UTC (Wed)
by daenzer (subscriber, #7050)
[Link] (3 responses)
My worst experience so far was when I received "stable review patch" e-mails for backporting e64c952efb8e0c15ae82cec8e455ab4910690ef1 ("drm/radeon: disable runtime pm on PX laptops without dGPU power control") to 4.4 and 4.5. I followed up as soon as I could, reporting that a regression had been bisected to that commit, so it would be better to hold off on backporting it. Greg KH's response to that was "Why not just provide stable with the bugfix as well?", but there was no such fix yet at that time. The change was backported anyway, which resulted in at least one additional bug report (and probably more people unnecessarily running into the regression).
It left me wondering what the purpose of those "stable review patch" e-mails was.
Posted Nov 3, 2016 0:42 UTC (Thu)
by syrjala (subscriber, #47399)
[Link] (2 responses)
Posted Nov 3, 2016 13:24 UTC (Thu)
by jani (subscriber, #74547)
[Link] (1 responses)
Posted Nov 6, 2016 18:48 UTC (Sun)
by blackwood (guest, #44174)
[Link]
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
> they decided to revert both patches and leave the infoleak there for 4.4.30 instead.
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
Wol
A discussion on stable kernel workflow issues
I worked on projects where the rule was that no commit should go into the main repository without a reference to the bug or task tracker. All commits should be either fixes for bugs (in this case the reference should point to the bug in question) or implementation of new feature (in this case the reference should point to the task tracker describing the new feature). Code cleanups, refactorings, etc. were considered "new feature" for this purpose. There was a field on the bug tracker ("introduced at") where the developer who fixed the bug had to state which version introduced the bug. This helped greatly the decision when we needed to backport a fix or not. Also the bug description obviously described if this is a severe bug (e.g. crash, security problem, etc.) or not.
Rant about unprofessional workflow
Rant about unprofessional workflow
Rant about unprofessional workflow
Rant about unprofessional workflow
Rant about unprofessional workflow
Rant about unprofessional workflow
Rant about unprofessional workflow
Rant about unprofessional workflow
Rant about unprofessional workflow
Wol
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues
A discussion on stable kernel workflow issues