Kernel development
Brief items
Kernel release status
The current development kernel is 4.9-rc3, released on October 29. "It turns out that the bug that we thought was due to the new virtually mapped stacks during the rc2 release wasn't due to that at all, but a block request queuing race condition. So people who turned off the new feature weren't actually avoiding it at all." The new feature appears to be solid, but more testing is always welcome.
The October 30 known regressions list has 14 entries.
Stable updates have had a busy week. 4.8.5 and 4.4.28 were released on October 28, 4.8.6 and 4.4.29 were released on October 31, and 4.4.30 came out later the same day.
Gregg: DTrace for Linux 2016
Brendan Gregg celebrates the capabilities of Linux kernel tracing with BPF. "With the final major capability for BPF tracing (timed sampling) merging in Linux 4.9-rc1, the Linux kernel now has raw capabilities similar to those provided by DTrace, the advanced tracer from Solaris. As a long time DTrace user and expert, this is an exciting milestone! On Linux, you can now analyze the performance of applications and the kernel using production-safe low-overhead custom tracing, with latency histograms, frequency counts, and more."
What comes after ‘iptables’? Its successor, of course: `nftables` (RH blog)
The Red Hat Developers Blog is running an introduction to the nftables packet filtering system. "nftables implements a set of instructions, called expressions, which can exchange data by storing or loading it in a number of registers. In other words, the nftables core can be seen as a virtual machine. Applications like the nftables front end-tool nft can use the expressions offered by the kernel to mimic the old iptables matches while gaining more flexibility."
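For illustration (these commands are not from the article), here is how a minimal ruleset might be built with the nft front-end tool; each rule is translated into the register-based expressions described above:

```
# Create a table, a base chain hooked into the input path, and one rule.
nft add table inet filter
nft add chain inet filter input '{ type filter hook input priority 0; }'
nft add rule inet filter input tcp dport 22 counter accept

# List the ruleset; adding --debug=netlink when inserting rules shows
# the register-based expressions that the kernel will evaluate.
nft list ruleset
```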
Formatted kernel documentation at kernel.org
For the last couple of release cycles, the kernel's ongoing transition to the Sphinx documentation system has left kernel.org behind. Thanks to some work by Konstantin Ryabitsev, that situation has now been remedied, and kernel.org has the formatted documentation generated from the current -rc kernel. The DocBook-generated documents remain available for as long as DocBook stays in use. (For those interested in the linux-next version of the documentation, the version on LWN's server is usually up to date; it currently has the changes that are queued for 4.10.)
Preemption latency of real-time Linux systems (OSADL)
The Open Source Automation Development Lab site has an article describing the use of the cyclictest utility to track down latency problems in a realtime kernel. "When cyclictest is invoked it creates a defined number of real-time threads of given priority and affinity. Every thread starts a loop during which a timed alarm is installed and waited for. Whenever the alarm wakes up the thread, the difference between the expected and the effective time is calculated and entered into a histogram with a granularity of one microsecond."
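That measurement loop is simple enough to sketch in plain C. The following is a minimal illustration of the technique only, not cyclictest itself; the real tool also sets a real-time scheduling policy, calls mlockall(), and pins its threads to CPUs:

```c
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L
#define INTERVAL_NS     1000000L   /* 1 ms measurement period */
#define BUCKETS             200    /* histogram, 1 µs per bucket */

int main(void)
{
	long hist[BUCKETS] = { 0 };
	struct timespec next, now;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (int i = 0; i < 100000; i++) {
		/* Install the "alarm": an absolute deadline one interval away. */
		next.tv_nsec += INTERVAL_NS;
		while (next.tv_nsec >= NSEC_PER_SEC) {
			next.tv_nsec -= NSEC_PER_SEC;
			next.tv_sec++;
		}
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

		/* Latency = effective wakeup time minus expected wakeup time. */
		clock_gettime(CLOCK_MONOTONIC, &now);
		long lat_us = ((now.tv_sec - next.tv_sec) * NSEC_PER_SEC
			       + (now.tv_nsec - next.tv_nsec)) / 1000;
		if (lat_us < 0)
			lat_us = 0;
		hist[lat_us < BUCKETS ? lat_us : BUCKETS - 1]++;
	}
	for (int i = 0; i < BUCKETS; i++)
		if (hist[i])
			printf("%3d us: %ld\n", i, hist[i]);
	return 0;
}
```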
Quote of the week
That's the number of mails related to this project sent to LKML, according to my archive. About 1/3 of those mails are the postings of the patchsets alone. I cannot tell how many offlist mails have been sent around in total on this matter, but at least in my personal mail are close to hundred.
Beers: Uncountable
This applies to both the number of beers consumed and the number of beers owed.
I'm pretty happy with the final outcome of these patches and I want to say thanks to everyone!
Kernel development news
The 2016 Kernel Summit
The 2016 Linux Kernel Summit was held October 31 to November 2 in Santa Fe, New Mexico, USA, alongside the Linux Plumbers Conference. As usual for recent years, the summit was broken into an invitation-only core day and an open "technical topics" day; the latter was planned alongside the Plumbers tracks.
LWN was present for the core-day discussions; the topics discussed there were:
- Stable kernel workflow issues. What
are the problems with the community's stable kernel releases, and how
can things be made better?
- Group maintainership models: different
ways to share the work of subsystem maintenance across a group of
people.
- Development process issues: what is
Linus unhappy about? With a significant emphasis on bug tracking.
- The future of the Kernel Summit. The
development community has changed since the first Summit in 2001; now
the event itself will be changing too.
- Kernel hardening: we have actually
made some progress on increasing the kernel's ability to protect
itself in the last year, but there is a lot to be done still.
- The kernel thread freezer is said to
be out of control. What is the problem and how can it be fixed?
- Documentation; there is a big
transition underway with the kernel's documentation, and some
questions in need of answers.
- Tracepoint challenges: the number of tracepoints is growing rapidly; are we painting ourselves into an ABI corner? With a guest appearance by Batman.
Sessions from the technical day
There were relatively few sessions in this track; much of the interesting discussion moved to a wider forum in the Linux Plumbers Conference.
- Virtual-memory topics: a short but
intense discussion on how the virtual-memory subsystem must evolve to
function properly on current and upcoming systems.
- The perils of printk(); the kernel's message-printing function has a surprising number of problems to address.
Notes posted elsewhere
- Complex dependencies (Luis Rodriguez)
- Task isolation (Chris Metcalf)
- Audio (Liam Girdwood)
Group photo
![Group photo](https://static.lwn.net/images/conf/2016/ks/ksgroup-sm.jpg)
See also: Len Brown's photos from the Kernel Summit.
[Thanks to LWN subscribers for supporting our travel to the event.]
A discussion on stable kernel workflow issues
The opening session at the 2016 Kernel Summit, led by Jiri Kosina, had to do with the process of creating stable kernel updates. There is, he said, a bit of a disconnect between what the various parties involved want, and that has led to trouble for the consumers of the stable kernel releases.
Jiri's point of view was centered on his role as a distribution kernel maintainer. Consumers like him want a number of things from the stable kernel releases, including fixes for user-visible functional problems, fixes for bugs that crash the system, and fixes for severe performance regressions. What they do not want are new features or minor performance improvements; the latter have often been shown to regress performance for other workloads.
Perhaps the biggest thing in the "don't want" column, though, is something that has caused quite a bit of trouble in the past: fixes for bugs that are not actually present. There have been a number of cases of bogus "fixes" that have broken things, causing big headaches for distributors, who must spend a lot of time figuring out what has gone wrong. Just because a patch applies cleanly to an older kernel does not mean that it actually belongs there, but that distinction often seems to get lost.
Part of that, perhaps, is a result of what the producers of stable kernels want: a process that scales. Stable releases are done by a small group of developers; they don't have a lot of time to spend on each proposed fix. They want to include all of the fixes that make sense, but depend on others to tell them when fixes actually do make sense.
From his point of view as a distribution maintainer, Jiri said that one of
the biggest issues with the stable process is that it is not clear why
specific patches got into stable. The "CC: stable" tag applied to patches
is an opt-in mechanism, and the decision is often made by people who are
neither the author of the patch nor the maintainer of the relevant
subsystem. The review process is far too lenient; it is a passive approval
approach that lets stuff get in unless somebody goes out of their way to
block it. Maintainers often respond to proposed stable inclusions with the
equivalent of "oh yeah, whatever" without really thinking about the
problem. The review barrier is too low; somebody needs to be thinking
about the semantics of every stable patch.
James Bottomley noted that the typical maintainer response is to think that they already reviewed a specific patch when they first accepted it; they aren't sure what else they should do when a patch comes around in a stable release. Mark Brown added that maintainers often can't remember what a particular ancient kernel looked like, so it is hard for them to say whether a specific patch makes sense there or not.
Christoph Hellwig pointed out that not all stable kernels are equal. The current review process is reasonable for recent kernels, he said, but it does not work as well for the long-term stable releases. Stable kernel maintainer Greg Kroah-Hartman asked what "good" meant in general. There are always going to be bugs; what is the acceptable rate when five or six patches are being applied to a two-year-old kernel every day?
Jiri said that there is always room for improvement. One possible way to make things better might be to require a Fixes: tag for every patch going into a stable kernel release. That tag identifies the bug that a patch is meant to fix; without it, the stable maintainers don't know if a patch fixes a bug that is actually present in an older kernel. An alternative might be a new tag specifically related to inclusion into a stable kernel; it would mean "I have thought about this responsibly." That tag should specify the version(s) of the kernel it should be applied to. There could also be a tag for patches that should not be considered for stable releases.
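For reference, both tags already have established forms in the kernel's process documentation; the hash and subject below are invented for illustration:

```
Fixes: 0123456789ab ("subsys: commit that introduced the bug")
Cc: <stable@vger.kernel.org> # 4.4.x and later
```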
Ben Herrenschmidt asked whether the stable kernels should include patches to make new hardware work. Jiri responded that those patches aren't really wanted, but he didn't care too much as long as they don't break things. Greg added that the addition of new device IDs and quirks is pretty common in the stable kernels. Laura Abbott noted that there is often a fine line between a "feature" and a "bug fix." The direct rendering subsystem, for example, often comes up with large changes to make specific hardware features perform well; without them, things run slowly. DRM subsystem bug-fix patches can also be quite large.
Rafael Wysocki said that, often, a maintainer knows that things broke, but is not sure of which change caused the problem. Jiri responded that, in that case, the fix should not go into the stable releases. Ben wondered how a problem could be fixed if the developer didn't know what caused the problem. Linus chimed in to note that requiring a Fixes: tag could lead to the addition of bogus tags when the maintainer doesn't know. Maintainers should not do that, he said, but Mark worried that it could happen as a "hoop-jumping" exercise.
Jiri said that, one way or another, somebody has to go through the exercise of thinking about which stable kernels a patch should be applied to. The result can take the form of either a Fixes: tag or a list of relevant versions, as long as that work has been done.
Dan Williams suggested that developers should add more test cases to demonstrate bugs and prove their fixes. These tests could be run against older kernels to show whether they need the fix — and whether the fix works as expected. Steve Rostedt said he will often add tests in response to bugs in his code; he suggested the addition of a new tag listing a test to run for a patch. But Linus was against the addition of new tags, especially for passing information to other humans. That sort of thing should just be explained in the commit message, he said. If we have too many tags, their usage will be inconsistent at best.
Linus went on to point out a specific case of things going wrong. The stable trees recently shipped a three-line patch that made a trivial root exploit possible on older CPUs. He had suggested removing the patch from the stable series, but the stable maintainers all chose to apply a 100-line fix instead. That was, he said, the wrong decision. If a stable tree takes a patch that breaks things, the response should not be to add more patches. Greg said that the stable maintainers would, as a general rule, rather have the same changes that the mainline has so that the bugs are all the same.
Ted Ts'o said that there are a lot of device vendors out there who are not actually using the stable releases; instead, they are cherry-picking patches that look relevant. If, in this case, they got the problematic three-line patch but missed the subsequent fix, the results would be bad. In general, he said, the group had not yet talked much about how the stable kernels are actually being used. Rafael said that broken patches should simply be reverted from the stable kernels rather than fixed further; that way, it will be clear to others where the broken patch was. But Ben protested that patches often come in a series; reverting just one can break things further.
Andy Lutomirski said that the fixup patch in this case was partly his work, even though it didn't have his name on it when it was applied. That fix should never have been applied to the stable kernels, he said; stable maintainers, in general, should never apply complex changes without asking first.
James said that the failure this time around was in the review process, but Linus replied that the patch was, in fact, "fine and correct." The problem is that it depended on another change that went into the 4.6 kernel. He doesn't blame that specific patch for the trouble; the bug was not obvious, and the backport was clean. It would have taken a "superhuman" to realize that the patch would be problematic in 4.1 without the exception-handling change that was applied in 4.6. The failure was when the patch was revealed to have broken things; at that point, it should have been reverted immediately. Mistakes will happen and the stable kernels will not be perfect; but, he said, those kernels are too eager to accept patches, and to accept more rather than reverting a patch that went wrong.
As the discussion wound down, perhaps the one solid conclusion was that problematic patches should indeed be reverted rather than fixed; that policy was immediately applied in subsequent stable kernel releases (example). There will likely be more pressure for each patch destined for the stable releases to carry a Fixes: tag or some version information making it clear that the relevant bug is present. And, hopefully, stable releases will be a little more stable in the future.
Group maintainership models
Traditionally, kernel subsystem maintainership is a solitary job, but there has been a steady increase in the number of subsystems that are using some sort of group model instead. At the 2016 Kernel Summit, Darren Hart and Daniel Vetter talked about how these models work in practice and what their experiences might have to offer other subsystem maintainers.
Darren started by noting that there are a number of motivations behind
group maintainership, starting with the fact that the work, for a busy
subsystem, can often be more than one person can handle. Some sort of load
balancing can help to keep maintainers from burning out. Group models are
also more robust in the face of vacations, illness, or simply a day job
that gets busy. Dan Williams added that group maintainership can also be a
good way to develop new maintainers for the future.
There are, Darren said, two models of group maintainership seen in the kernel community. One of them is the "hands off" model, as exemplified by the arm-soc tree maintained by Arnd Bergmann and Olof Johansson. They manage a single repository, using an IRC channel to take a "lock" when they are ready to apply some changes. They maintain a log file, Olof said, so that they can always see what the other has done.
The other model is "delegation," usually seen in subsystems that use the patchwork patch-management system. Patchwork can delegate the handling of each patch to a specific maintainer; Darren would like to start making more use of it. Mauro Carvalho Chehab said that this is the approach used in the media subsystem; there are two maintainers, and patches are automatically delegated by patchwork. Rafael Wysocki added that the power-management subsystem also uses it; in this case, the power-management mailing list is shared between multiple subsystems, so the automatic delegation in patchwork helps to sort out changes as they arrive.
Daniel Vetter talked a bit about the multiple-committer model used in the i915 graphics driver subsystem; it was a shortened version of the talk that was covered in this article in October. He had been working in a two-person team (with Jani Nikula) for three years, but wanted more help. He had plenty of reviewers, but couldn't find anybody else willing to be named as a co-maintainer. Patch submitters wanted to deal with the maintainer rather than with other reviewers, so he and Jani were becoming a bottleneck in the process.
In response, they decided to try out a group model where many committers
have the ability
to commit changes to the repository. It is generally working well, though
there has been "some fallout." The way that the tree is managed, with
fixes being cherry-picked into another tree, creates trouble with
linux-next; they have some ideas for how to improve that interaction.
Developers are also occasionally confused when a seemingly
random person accepts their patches.
James Bottomley asked what the essential difference is between a committer and a maintainer in this model; Daniel answered that committers work internally, while the maintainer deals with the rest of the world. Committers in general don't want a lot of external visibility — they don't want to be listed in the MAINTAINERS file — so the solution is to call them something other than "maintainers." Ben Herrenschmidt observed that the maintainer's real job, in this model, is to accept the blame when things go wrong.
Olof asked if Daniel had observed problems with developers shopping patches around trying to find an accommodating committer. Daniel responded that, in general, he trusts his committers to say "no." There had been a couple of cases involving managers who have tried to get patches merged that way; it seems to happen once with every new manager. His response is to set up a meeting with that manager and explain how things need to be done. When asked if arm-soc had that problem, Olof responded that their model, where they deal with submaintainers rather than taking patches from developers directly, tends to keep that from happening.
The final part of the discussion centered on the workflow issues in the i915 subsystem that can cause the same patches to show up multiple times — the core of the difficulty with linux-next. Daniel said that the tooling is not up to the job, but Linus responded that the workflow the group was using sounded "really nasty." What i915 is using, he said, is the submaintainer model; he should be taking pull requests from those maintainers rather than sharing a repository with them. Daniel said he is not against the submaintainer model, but it would create some coordination issues in this case; the nature of that driver (and DRM drivers in general) means that a lot of developers are working on the same files simultaneously.
Linus insisted, though, that, with the right habits, the submaintainer model works. Maintainers should make use of topic branches and avoid back merges with upstream trees. Daniel agreed that the i915 model would not work well for proper subsystems, but for a "leaf" like the i915 driver, it works well. The session wound down at that point.
Development process issues
The Kernel Summit traditionally includes a session where Linus Torvalds and the assembled developers discuss how the development process is working and whether there are any issues in need of resolution; the 2016 event was no exception. The picture that emerged is one of a process that is working reasonably well and developers who are mostly content. There are always things that can be better, though, especially when it comes to bug tracking.
Linus started off by saying that there are no serious problems that he can see. There are specific subsystems that are occasionally problematic; he shouts at them, and they generally do better. In recent times, he hasn't even had to shout all that often.
One thing that does bother him is developers who send him fixes in the -rc2
or -rc3 time frame for things that never worked in the first place. If
something never worked, then the fact that it doesn't work now is not a
regression, so the fixes should just wait for the next merge window. Those
fixes are, after all, essentially development work. When they arrive he
usually accepts them, but it's annoying and adds noise to the process.
They add to his "are we getting ready for a release?" stress. In
general, he said, if a fix applies to a feature that is not currently being
used, it should wait for the next development cycle.
Overall, though, he said, things are working smoothly. The largest kernel release ever (measured by the number of commits) is currently in progress. This cycle has been a little painful, but size has nothing to do with it; instead, he ran into a ten-year-old bug that took a lot of work to track down. It is one of those things that happens occasionally, rather than any sort of process issue.
He suggested that one reason for the size of the 4.9 development cycle is the pre-announcement by Greg Kroah-Hartman that it would be a long-term-support release.
When asked if he should be more vocal about the above-mentioned mid-cycle fixes, Linus added that they are not really a huge issue for him. Additionally, some fixes are fine, especially for really new code. For certain areas, such as cellphones that have a short shelf life, it makes sense to push (and fix) drivers aggressively. Laptop support was also mentioned; he would like non-technical users to be able to install Linux on a new laptop they just bought, so those kinds of fixes are welcome almost anytime. But that is not what he's seeing; instead, he sees fixes for enterprise features that, due to the conservative nature of that sector, are not likely to be used for some time yet.
Bugged by Bugzilla
The bulk of the discussion, though, related to the kernel Bugzilla instance hosted on kernel.org. Laura Abbott said that bug reporting is a problem, in that users who report bugs never quite know what kind of response they are going to get. Some subsystem maintainers watch the Bugzilla and respond to issues there; others want nothing to do with it. As a result, users often have a poor experience, and are often subjected to shot-in-the-dark attempts to track down bugs from people who are not closely related to the subsystem in question.
James Bottomley said that the root of the problem is that the Bugzilla has no integration with email, which is the primary means by which kernel developers communicate. Perhaps it's time to look into using a different tool? Al Viro added that he couldn't imagine being paid enough to deal with Bugzilla. Linus said that there are groups that are more accepting and make use of it; they tend to be happy with it. But the rest of the community tends to hate it.
Darren Hart said that the Bugzilla is a good bridge between developers and users. That said, there is not much that he (as the x86 platform maintainer) can do with most of the bugs filed there. He only has five laptops, so he will be unable to reproduce most of the problems that have been reported. Some of the bug reporters can be convinced to move to email, but not all of them will do that. If the Bugzilla goes away, he said, we will lose some useful information.
There was some talk of modifying the Bugzilla to refer users to the appropriate mailing list for their problem. But kernel.org administrator Konstantin Ryabitsev said that he wants to avoid local modifications to Bugzilla if at all possible. Tweaks make the upgrade path much more complicated; he would rather remain "as vanilla as possible." He suggested, instead, that somebody should be hired to do bug triage.
Those upgrades, James said, are contributing to the current problems; the Bugzilla used to have better email integration, but that was lost in an upgrade. Konstantin said there was little alternative; kernel.org lacks the staff to backport security patches into a custom Bugzilla deployment.
Len Brown said that he likes the Bugzilla system well enough. It is not the best tool, but it can be made to work. Staying on top of things helps a lot; the power-management developers have managed to close over 3,000 bugs in recent years. The most important bugs, he said, are the most recent ones. If a developer responds immediately to a bug, there is a good chance of getting useful information back. A month later, it probably isn't going to work. Old bugs should be scrubbed aggressively; if a reporter doesn't reply to requests for information, the bug should just be closed. That keeps the list short and manageable.
The Bugzilla is a good place to collect ancillary information (such as screen shots) from bug reporters. Some developers said that email works well for that, but Linus said he would much rather deal with Bugzilla than try to fish through email archives for information. Len also said that it's generally not a good idea to put the best developers on Bugzilla duty. Instead, put a new developer there; it's a good opportunity for them to learn, and for everybody else to see how they work.
Linus said that he gets emails from the Bugzilla, and that he thinks it works OK much of the time. Many bug reports start in distribution bug trackers, though. So one ends up hopping through various links in different trackers; it is a painful process and he hates it. In the end, he would rather not see the Bugzilla be the primary bug tracking system for the kernel.
James asked if the kernel needed a dedicated person for bug management. Len replied that it would take somebody who is really good; such a person could also be a subsystem maintainer. What's really needed, he said, is a community to take on this task. Ted Ts'o said that a "bug ombudsperson team" would be a good thing to have, but it would need to be prepared to grow; as they get better at dealing with bugs, usage of the bug tracker will go up. He expressed doubts that this kind of work could be funded; it is hard to put together a business case for it. Konstantin suggested that perhaps Core Infrastructure Initiative (CII) funds could be found for kernel-bug management.
Len said that it would be nice to augment the Bugzilla to obtain other relevant information, such as the kernel version the user is running. Ben Hutchings noted that Debian's reportbug tool can run package-specific scripts to obtain the needed information. Laura added that Fedora's automatic bug reporting has useful information, but is very noisy. The best reports, she said, come directly from users who took the time to prepare them.
There was talk of eliminating the kernel's Bugzilla entirely. One approach would be to direct users to their distribution's trackers; that would not be helpful for users running mainline kernels, though. An alternative would be to replace it with a set of subsystem-specific trackers for the subsystems that are interested. The problem there, though, is that the relevant subsystem often changes as a bug is understood; moving an entry between separate trackers would be painful.
Konstantin said that kernel.org will soon be upgrading to Bugzilla 5; it is fairly different, he said, with a nicer user interface. He suggested doing that upgrade first, then perhaps seeking an intern from CII. Then, at least, we would get to a point where users who file bugs will see a response. And that was more or less how the session closed; the next step will be to see how the upgrade of the kernel's Bugzilla goes.
The future of the Kernel Summit
The first Kernel Summit was held in 2001. A lot has changed since then. In particular, said Summit organizer Ted Ts'o, the kernel development community has grown considerably over that time. Those changes are now leading to changes in how the Summit itself is run.
The growth in the kernel and its community, Ted said, means that it has become nearly impossible to discuss technical issues at the Summit. There is just no way to be sure that all of the right people are in the room. Meanwhile, Linus has been increasingly interested in the process-oriented discussions that have tended to dominate recent gatherings. But the Summit, as it is currently organized, isn't necessarily the best group for those discussions either.
So the 2017 Kernel Summit will be different. It will be held in Prague, co-located with the Open Source Summit Europe (formerly LinuxCon Europe). It will be a short half-day event, with far fewer people present. In particular, the attendees are likely to be approximately thirty top-level subsystem maintainers, chosen directly by Linus. They will generally be the maintainers he pulls directly from. And, naturally, this event will focus mostly on process-oriented issues.
Ted said that he still thinks it is important to have a broader gathering
of kernel developers, though. Often the hallway discussions that result
from simply having developers in the same place are the most important part
of the event. He also said that, over the years, managers have been
trained to think that it is important to send developers to the Kernel
Summit, and that training should not go to waste. So there will be an open
technical track in Prague as well; it will consist of some presentations,
but also more discussion-oriented topics.
The end of the Kernel Summit as it has been run for so many years did not appear to bother many people, but there was some grumbling about one aspect of this plan: the co-location with the Open Source Summit. For many developers, the technical content of that event has fallen to the point where they are not really interested in attending. They would rather see the new kernel event attached to a more technical gathering, such as the Linux Plumbers Conference (as was done this year).
Ted responded that, for 2017, things are already locked in place. The lead time for event planning has gotten longer, so it's too late to change things for next year. The Linux Plumbers Conference will be in Los Angeles, co-located with the North American Open Source Summit, and that cannot be changed at this time. For the longer term, there has been a fair amount of discussion about joining the Kernel Summit with either Plumbers or the Linux Storage, Filesystem, and Memory-Management Summit. But that is for later; 2017 will be a transitional year, as 2016 was in the end: the morning of the 2016 core day was organized much like future events might be.
Rik van Riel noted that co-location with other conferences helps to bring people into the kernel community. James Morris said that the Linux Security Summit has grown over the years, to the point that it now has over 120 attendees. This event wants to co-locate with Plumbers next year rather than with the Kernel Summit, on the theory that it will get a mix of attendees better suited to its security focus.
James Bottomley described the basic conflicts that arise when one tries to hold conferences together. If too many events are held in parallel, attendees have to choose between the sessions they are most interested in. If they are held serially, though, the resulting event becomes too long; few people are willing to dedicate more than a week to a set of conferences. Mark Brown, somewhat cynically, noted that there is an advantage to co-location with the Open Source Summit: attendees don't care if they miss it. The Open Source Summit is easy to attend even at the last minute; co-location with events like Plumbers, which routinely sells out quickly, makes it hard for last-minute attendees to come.
The session was more of an information-sharing exercise than one where decisions would be made. Kernel-oriented events are in a period of change; how that will play out will have to be seen over the next year or two.
The status of kernel hardening
At the 2015 Kernel Summit, Kees Cook said, he talked mostly about the things that the community could be doing to improve the security of the kernel. In 2016, instead, he was there to talk about what had actually been done. Kernel hardening, he reminded the group, is not about access control or fixing bugs. Instead, it is about the kernel protecting itself, eliminating classes of exploits, and reducing its attack surface. There is still a lot to be done in this area, but the picture is better than it was one year ago.
One area of progress is in the integration of GCC plugins into the build system. The plugins in the kernel now are mostly examples, but there will be more interesting ones coming in the future. Plugins are currently supported for the x86, arm, and arm64 architectures; he would like to see that list grow, but he needs help from the architecture maintainers to validate the changes. Plugins are also not yet used for routine kernel compile testing, since it is hard to get the relevant sites to install the needed dependencies.
Linus asked how much plugins would slow the kernel build process; linux-next maintainer Stephen Rothwell also expressed interest in that question, noting that "some of us do compiles all day." Kees responded that there hadn't been a lot of benchmarking done, but that the cost was "not negligible." It is, though, an important part of protecting the kernel.
Probabilistic protections
The kernel has adopted a number of probabilistic protections over the last year. These protections only work if the attacker doesn't know something about the system. They include kernel address-space layout randomization (KASLR) and stack protection. Probabilistic protections can be defeated if the information leaks out, but they are still effective and worth doing.
One improvement is in the randomization of the kernel text base; it was added to arm64 in the 4.6 release and MIPS in 4.7. But the text base is only the beginning; more memory areas need to be randomized. One possibility is to randomize the kernel's link order at boot time. That would be a lot of work, but it would mean that an attacker would need more than a single information leak to defeat the whole thing.
Linus said that randomization can be a pain for debugging; it is not fun to
track down a problem that only happens in one boot out of every 300 or so. Al
Viro worried that changing the link order would also change the order in
which the kernel's initialization calls are made, with unpredictable
effects. Kees responded that this particular change isn't coming anytime
soon. Andi Kleen suggested just doing the link randomization and dropping
KASLR altogether; the kernel's addresses tend to leak via all kinds of paths
anyway. Linus responded that the address leaks are being plugged over time; KASLR does indeed work poorly against local attackers, but it is more useful against remote attackers.
Kees went on to say that the kernel got KASLR for its memory areas in 4.8 for the x86_64 architecture.
Work is being done on free-list randomization, which makes the layout of the heap less predictable. Perhaps more controversial is struct layout randomization. That cannot be done in a general way without causing all kinds of problems, but there is one place where it is especially useful: structs consisting of only function pointers. Such structs are one of the most prized targets for attackers, and the kernel has a lot of them. A GCC plugin can be used to detect these structures and randomize their order. In general, the kernel shouldn't care about that ordering, and changing it should not have performance effects.
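To see why these structures matter, consider a made-up example of the pattern: kernel code only ever accesses the members by name, so a per-build shuffle of the field order costs nothing, while an exploit that jumps through "the third pointer in the struct" breaks:

```c
#include <stddef.h>
#include <sys/types.h>

/*
 * A made-up "ops" structure of the kind targeted by layout
 * randomization: nothing but function pointers.  Callers write
 * ops->read(...), never "the pointer at offset 8", so the field
 * order can safely vary from build to build.
 */
struct sample_ops {
	int	(*open)(void *obj);
	ssize_t	(*read)(void *obj, char *buf, size_t len);
	void	(*release)(void *obj);
};
```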
Linus was not entirely convinced; he said that most people are running distributor kernels, so the specific ordering used will always be available to an attacker. The value, Kees responded, is forcing attackers to identify specific kernel builds; that is "excruciating" for them. It greatly expands the number of settings their exploit has to work in.
Deterministic protections
While probabilistic protections only work if some key data remains secret, Kees said, deterministic protections work all the time. These include things like read-only memory; if memory is read-only, it is always protected from being changed. Bounds checking to head off overflows is another form of deterministic protection.
One useful protection is the CONFIG_DEBUG_RODATA configuration option which, Kees said, is badly named. It ensures that executable memory is not writable anywhere in the kernel; it should be mandatory on all systems that support it. It is turned on by default on the x86 architecture as of 4.6, and will be for arm64 as of 4.9.
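Concretely, the badly named options in a 4.8-era .config look like this; the second applies the same policy to loadable modules:

```
CONFIG_DEBUG_RODATA=y            # despite the name, not a debug feature
CONFIG_DEBUG_SET_MODULE_RONX=y   # the same protection for module memory
```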
Another important protection keeps the processor from accessing user-space memory while it is running in a privileged mode. By far the most
common way
to exploit the kernel, he said, is to get the kernel to execute code
that has been placed somewhere in user-space memory. If the kernel cannot
access that memory, such exploits will not work. Processor vendors have
worked to provide such protections using technologies like SMAP and SMEP
(on x86) and PAN (on ARM), but there is a problem: such protections are not
widely available yet. There are no Xeon processors with
SMAP on the
market; PAN was added to the ARMv8.1 specification, but no hardware is
shipping yet.
So, he said, the kernel needs emulation of those features instead; it is, he said, a fundamental need. Linus replied, though, that he hates the emulation patches with a passion. And, he said, it is not necessary, in that the kernel's support for SMEP protects systems that lack SMEP too. That is because it forces all kernel paths that access user-space memory to be verified, preventing accidental accesses. So, he said, the emulation does not buy much. Kees disagreed, saying that the emulation can protect systems that will not have hardware protection for a few years yet.
Work is being done on hardened usercopy, which performs sanity checking on operations that copy data to and from user space. The current patch set contains about 1/3 of the PaX USERCOPY protections, which is a start. Next steps include segregating the slab caches; objects that are exposed to user space should be stored apart from those that are purely internal to the kernel. The problem here is to find a clean way to deal with exceptions. An inode object, for example, should not be copyable to or from user space, but there can be reasons to copy the file name stored within that structure. The PaX code does such copies by way of the stack, which is generally seen as being the wrong approach; Kees said that a more maintainable API for exceptions is needed. Linus added that this kind of problem is exactly why he has never seriously considered merging the grsecurity patch set; it's full of "this kind of craziness."
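A conceptual sketch of the bounds check at the heart of hardened usercopy (an illustration in plain C, not the actual patch set): given a heap object and a requested copy, refuse anything that would spill outside the object:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative hardened-usercopy check (not the kernel's code):
 * a copy to or from user space must stay entirely within the
 * bounds of the heap object it starts in.
 */
static bool usercopy_ok(const char *obj, size_t obj_size,
			const char *ptr, size_t n)
{
	size_t offset;

	if (ptr < obj)
		return false;
	offset = (size_t)(ptr - obj);
	return offset <= obj_size && n <= obj_size - offset;
}
```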
Memory wiping is useful, in that it can block information leaks and some types of use-after-free exploits. The slab allocator can do poisoning of memory, but not zeroing, which would be nice to add. After Linus asked, Kees said that the advantage of zeroing is that the kernel often needs to allocate zeroed pages; if freed memory has already been zeroed, those allocations can be optimized. A problem with zeroing is that some objects are allocated and freed so often that the performance hit becomes prohibitive, so there needs to be a way to make exceptions. There is a GCC plugin out there to do stack clearing, which is worth looking at.
"Constification" — making unchanging data constant — can protect against some types of exploits. The lowest-hanging fruit here is structs full of function pointers; the "constify" GCC plugin tries to make those const by default. As of 4.6, the kernel can make data read-only after initialization, but that feature is not yet widely used in the kernel. There would be value in identifying "write-rarely" data that would be read-only most of the time, and only made writable during explicit updates.
Kees's final topic was reference-count hardening. If an attacker is able to force a reference count to overflow, a use-after-free exploit is usually not far away. Most of these attacks can be blocked if atomic variables can be kept from overflowing. The hardening patches out there will kill the responsible process when an overflow is detected, and the counter involved is permanently blocked at a high value. In this way, an exploit is downgraded to a denial-of-service situation.
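A conceptual sketch of such an overflow check, in user-space C rather than the actual hardening patches: the increment refuses to wrap, pinning the counter at its maximum value instead:

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdio.h>

/*
 * Illustrative overflow-checked reference count (not the kernel's
 * implementation).  On overflow the counter saturates at UINT_MAX
 * instead of wrapping to zero, so a bug that would have led to a
 * use-after-free becomes a memory leak / denial of service instead.
 */
struct refcount { atomic_uint refs; };

static void refcount_inc(struct refcount *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == UINT_MAX) {
			fprintf(stderr, "refcount saturated; leaking object\n");
			return;	/* stay pinned at the maximum */
		}
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
}
```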
Kees's slides are available for the curious.
The problematic kthread freezer
The kernel thread ("kthread") freezer, as its name would suggest, is charged with freezing kernel threads during a system hibernation cycle. At the 2016 Kernel Summit, Jiri Kosina took the stage (for the second time) to say that the usage of the kthread freezer is "out of control" and "broken everywhere." It is time, he said, to bring things under control, then get rid of the freezer altogether.The first problem, he said, is that the freezer's semantics are not well defined; nobody really knows what it means for a kthread to be frozen. Most of the current uses of the freezer are superfluous. In many cases, the purpose is to have filesystems be in a consistent state during hibernation; that can be better achieved with the filesystem freeze mechanism. It doesn't make sense to freeze I/O operations in general, since they are needed to write out the hibernation image. There is a lot of freezing in drivers too, a situation which, he said, makes no sense. There is a well-defined set of power-management callbacks in place to put drivers into a suspended state during hibernation.
The kernel, he said, is the victim of a massive copy-and-paste cargo cult. Uses of the kthread freezer are spreading like a disease, a situation that has to stop.
There are two especially pathological uses that he called out. One is try_to_freeze() calls for threads that have not been marked freezable in the first place; those calls will never have any effect. The other is try_to_freeze() calls after starting I/O, but without waiting for that I/O to complete.
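In code, the first pathology looks something like the sketch below (do_work() is a hypothetical stand-in): try_to_freeze() does nothing in a thread that has not called set_freezable() first, so the second variant is dead weight:

```c
#include <linux/freezer.h>
#include <linux/kthread.h>

static void do_work(void);	/* hypothetical work function */

/* A freezable kthread done correctly. */
static int good_thread(void *data)
{
	set_freezable();	/* clear PF_NOFREEZE for this thread */

	while (!kthread_should_stop()) {
		do_work();
		try_to_freeze();	/* safe point: no I/O in flight here */
	}
	return 0;
}

/* The first pathology: never marked freezable, so this call is a no-op. */
static int bad_thread(void *data)
{
	while (!kthread_should_stop()) {
		do_work();
		try_to_freeze();	/* never freezes anything */
	}
	return 0;
}
```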
The solution is to eliminate use of the kthread freezer wherever possible. It is not needed in threads that will not generate disk I/O. It is also not needed — indeed, its use is a bug — in I/O helper threads. The best solution would be to move the entire hibernation subsystem to use filesystem freezing instead, and simply get rid of the kthread freezer. It might be necessary to keep it around for NFS, he said, but there's not much else that should need it. But the first step is to stop its use from spreading.
Ben Herrenschmidt spent a while talking about the history of the freezer, which, he said, was invented as "a big, fat band-aid" without which the system could not suspend properly. Now, instead, we simply need to make our drivers cope properly with I/O during a suspend operation. As the session closed, Linus agreed that the best approach was to get rid of the kthread freezer altogether and to use filesystem freezing where it is really needed. So one should expect development to go in that direction.
Kernel documentation update
The kernel's documentation "subsystem" has undergone some changes over the past few releases, as we reported in late October. The author of that report and the kernel's documentation maintainer, Jonathan Corbet, gave a presentation at this year's Kernel Summit to describe some of those changes. He was joined by Mauro Carvalho Chehab, who has done much of the work (along with Daniel Vetter, Jani Nikula, and Markus Heiser) to make it all happen.
Corbet started by noting the 4.8-rc1 release
announcement, where Linus Torvalds highlighted that "over 20% of
the patch is documentation updates, due to conversion of the drm and
media documentation from docbook to the Sphinx doc format". Those
changes were unusual in that documentation changes have never been anywhere
near that large in previous merge windows.
Corbet set out on the Sphinx transition with several goals. The first was to eliminate the hand-rolled DocBook-based toolchain that was being used to generate the documentation. At a Kernel Summit a few years ago, he asked kernel developers how many had gotten the toolchain set up and less than half indicated that they had. In the end, it is simply "the wrong way to go"; the kernel project should not be developing its own tools for creating documentation, it should use something that is developed and maintained by others.
Another goal was to have integrated documentation with nice output formatted in multiple ways. But, he wanted to be able to do that without a complicated markup language and a bunch of toolchain dependencies. Beyond that, he wanted to clean up the Documentation directory so that it "doesn't look like my daughter's bedroom", he said, complete with a photo of said messy bedroom. All of that will make it easier for developers to write documentation, which should, in theory, lead to better documentation.
So, starting in the 4.7 cycle, the documentation began being switched to use the Sphinx documentation generator, which uses the reStructuredText markup language. There are, he said, LWN articles about the history of the change and how it all works. In addition, the kernel documentation now has a Linux Kernel Documentation book that describes how to build (and write) documentation for the kernel.
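For a taste of the format (an illustrative fragment; the source path here is just a plausible example), documents are plain reStructuredText, and a kernel-specific kernel-doc directive pulls API documentation out of source comments:

```
Writing kernel documentation
============================

Headings are underlined; ``literal text`` uses double backquotes.

.. kernel-doc:: drivers/base/core.c
   :export:
```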
Open questions
There are, of course, some open questions. The organization of the documentation tree leaves a lot to be desired. It used to have around 300 files in the top-level directory, but he has slowly been moving things around. One move that he has been nervous about is the SubmittingPatches file, which is being incorporated into the development-process book. Chehab has submitted a patch to move the file and leave a three-line file pointing to the new location in its place, but Corbet is worried about dangling references to the file. He asked if there were objections to making that move.
At that point, Torvalds said with a grin: "No one in this room has ever read anything in the Documentation directory." He said it was really up to the users of the kernel and its documentation to decide if the move made sense. There were murmurs of disagreement in the room and Darren Hart said: "I do use it, read it, and cite it by section." He said that he liked what had been done so far, especially that he could now cite sections by URL.
That last piece is thanks to kernel.org maintainer Konstantin Ryabitsev, who built the documentation from the tree and put it up at kernel.org, Corbet said. Furthermore, a look at mailing list postings shows that the documentation is cited rather frequently. "So they may not read it, but they tell everyone else to read it", he said. Based on the reaction in the room, it appears that no one is "too upset" with moving SubmittingPatches, so he will leave that patch in for 4.10. The goal is to eventually have a top-level Documentation directory that looks like all of the others in the kernel tree.
Olof Johansson asked about having stub files pointing to the proper place, as was done for SubmittingPatches. Corbet replied that he has done that for some of the more important files, but not for every one. David Howells also cautioned against moving memory-barriers.txt. Corbet said that when he had broached the subject with Paul McKenney, he was told to "keep away", so he plans to work with McKenney on that down the road.
For something perhaps a bit more controversial, Corbet noted that there is only one directory in the top-level kernel directory that is capitalized: Documentation. Part of the reorganization will be adding more subdirectories, thus lengthening the path names for files of interest—in addition to an already-long top-level name—which has led some to ask that he consider renaming the directory to doc.
That immediately led to discussion of tab-key-based auto-completion, as well as bikeshedding over a new name. But, as was also pointed out, those names are often used in places (e.g. email) where auto-completion does not work. H. Peter Anvin noted that files like README are capitalized in part to help newbies, who will often be attracted to those files because their names stand out.
After some more discussion, it was suggested that Corbet call for a vote, which he did. Roughly half of those assembled voted against, while about the same voted for the change. That made it obvious there was "no clear consensus" on the question, so things would stay the way they are. Shuah Khan was glad to hear that; she voted against changing the name because of the large number of blog posts and other types of information that she and others have written that would suddenly become outdated.
Adding complications
Moving to another topic, Corbet said he had set out to get to something simpler and the community had accomplished that, but now things have started to get more complicated again. A change that was made for 4.9 meant that LaTeX is required to build the HTML version of the documentation. He will be pushing to get that particular problem fixed.
There are a number of files that some want to pre-process to get them into the Sphinx format. There was a request to add a Sphinx directive that would run an arbitrary shell command as part of the documentation build process, but he rejected that particular mechanism. It has also been suggested that the MAINTAINERS file be processed into the Sphinx format.
Since the media subsystem is where some of the push for pre-processing is coming from, he asked Chehab to explain what he would like to do there. Chehab has a patch that takes the ABI files from the media subsystem to convert them into the Sphinx format, which allows creating documentation that is sorted and arranged in various ways. That is useful for distributions and others, he said; "it adds value" to the documentation.
For the MAINTAINERS file, it would be nice if interested users could find out where they can get the latest development tree for a subsystem, which could be added into the information already there. It makes it easier for users if it is part of the documentation and it "comes almost for free", he said.
Corbet said that a decision will need to be made about how much more complicated the documentation toolchain should be allowed to get. Nikula has suggested that any changes made for the kernel should be upstreamed into Sphinx. While that is a nice idea, Corbet said, some of the changes are pretty kernel-specific so it may be hard to convince the Sphinx developers to accept them.
Another area of disagreement is about what to do with old and obsolete documentation, much of which has not even seen typo fixes in the Git era. For example, some instructions from Larry McVoy in 1996 on how to manually bisect a problem in the 1.3 kernel seem like they are past their prime. We don't keep old code around, Corbet said, so we should do the same with documentation.
Torvalds wryly noted that pull requests that remove lines from the kernel get high priority. But he had a different complaint as well: "Can we get rid of PDF in the kernel source?" It is, he said, "binary crap" and those files are simply PDF versions of the SVG files sitting right next to them.
Chehab said that the media subsystem needed some PDF files for the DocBook version of its documentation. Those may not be needed for Sphinx, he said. There are roughly ten PDF files that showed up recently, Torvalds said. Those files are not editable and have no reason to be in the kernel source.
Image files are similar, Torvalds said after a question from Corbet. A binary file that no one can edit should not be in the tree, Torvalds said. He suggested putting them on a web site, noting that there is a reason the kernel tree is called a source tree. It was agreed that solutions could be found to have images with the documentation without requiring binary images in the kernel tree.
Rafael Wysocki asked that Corbet consult with him before moving any files in the power management part of the documentation tree. That is standard operating procedure, Corbet said. He will let the appropriate maintainer know what he is planning to do and won't do it over the objection of that maintainer.
Another request came from Hannes Reinecke, who would like to see the return values of kernel functions get added into the documentation. Right now, some free-form text could be added to the kerneldoc comments associated with the function, Corbet said, and something more structured could be worked out later. But, in order to get a full list of the return values, the entire set of kernel functions needs to be annotated, David Woodhouse said, so that return values from functions that are called can be incorporated into the list. But it was suggested that even just annotating the leaf functions (those that call no others) would be a good place to start. At that point, things kind of wound down; Corbet and Chehab left the stage in favor of Batman.
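For reference, the free-form text mentioned above goes in a kerneldoc Return: section; the function in this sketch is hypothetical:

```c
/**
 * frob_widget() - set a widget's frobnication level
 * @w: the widget to adjust (hypothetical type)
 * @level: the new level
 *
 * Return: 0 on success, -EINVAL if @level is out of range, or
 * -ENODEV if the widget has disappeared.
 */
int frob_widget(struct widget *w, int level);
```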
Tracepoint challenges
The final core-day session at the 2016 Kernel Summit, run by Steve "Batman" Rostedt and Shuah Khan, concerned the use of tracepoints in the kernel. It started with a discussion of tracepoint performance issues, but quickly came around to the perennial area of concern about tracepoints: whether they form part of the kernel's user-space ABI or not.
Steve started by noting that he is seeing an "explosion" in the number of tracepoints being added. The problem is that, while the cost of tracepoints has been made as low as possible, they are still not free. Each tracepoint hurts performance slightly. So it may eventually become necessary to limit the addition of tracepoints into the kernel.
David Howells noted that a number of maintainers have been seen to push
back on the addition of printk() calls to the kernel, saying that
tracepoints should be used instead. Steve responded that they should push
back on tracepoints too. Each tracepoint should have its own rationale
justifying its existence. Chris Mason suggested that the best way to cut
down on tracepoints is to require developers to document them.
Mel Gorman reminded the group that tracepoints can be inserted dynamically into a running kernel. Mark Brown said that dynamic tracepoints require more tooling; that may be fine for a server system, but is harder on a phone. But Steve said that no special tools are required to insert tracepoints; it can all be done with echo commands.
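The echo-based interface in question is the kprobe-events file in tracefs; something like the following (exact paths vary by system) creates and enables a dynamic probe with no extra tooling:

```
# Create a dynamic probe on do_sys_open, enable it, watch the output.
echo 'p:myprobe do_sys_open' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
cat /sys/kernel/debug/tracing/trace_pipe
```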
Shuah brought things around to the ABI issue by saying that tracepoints can be highly effective for debugging problems on deployed systems. But, she asked, if we add tracepoints, do we have to maintain them forever? Ted Ts'o noted that the current work with eBPF makes tracepoints far easier to use, a change with both good and bad aspects. On the good side, the kernel now has dynamic tracing capabilities approaching those of DTrace. On the other hand, that means that people are starting to use these capabilities, and system administrators are starting to depend on them. So the ABI issue is no longer theoretical.
Peter Zijlstra said that there are tracepoints in the scheduler now that he would like to remove, but fears he can't without breaking things. Linus, though, said that problematic tracepoints should simply be taken out, especially if they are hindering development. This should happen even if the removal would break the LatencyTOP tool. Greg Kroah-Hartman protested that, in the past, Linus had blocked a tracepoint change that broke the PowerTOP utility. Linus's answer is that the community was still figuring out how to work with tracepoints then, and that there was no actual need to break PowerTOP at that time.
But, he said, tracepoints are still a view into the kernel's internals. They have to be able to change over time. If the removal of a particular tracepoint proves to be painful for user space, that removal will have to be reconsidered, but only then. That, he said, has always been the ABI rule: we can change things, but, if the result is broken user space, we'll change it back. Additionally, he said, LatencyTOP users tend to be people who compile their own kernels anyway, while PowerTOP users generally are not. So LatencyTOP users can better adjust to a tracepoint change.
And, in the end, Linus said, if a tracepoint becomes so useful that it becomes part of the ABI, there is probably a good reason for it and it likely should be kept. But the way to find out is to change things and see who screams.
Ted suggested that now would be a good time to look at Brendan Gregg's perf-tools set to see which tracepoints it depends on. If those tracepoints need adjustment to be supportable in the long run, now is the time to make those changes before the usage of those tools increases further.
Some maintainers may feel better now about allowing tracepoints in the code they are responsible for, but others have not changed their view. Al Viro made it clear that his policy would not be changing, and that he would not be allowing any tracepoints in the virtual filesystem layer. He is worried about how some developers may use those tracepoints, and does not want to see a day in the future where systems are unable to boot with newer kernels as the result of tracepoint changes.
The session concluded with Linus saying that, in the history of kernel development, nobody has ever screamed about a change to a tracepoint. He allowed that this might happen as the use of tracepoints increases. But, he said, there is no point in making a big deal about that possibility before it proves to be a problem.
