Kernel development
Brief items
Kernel release status
The current development kernel is 3.12-rc7, released on October 27. Linus says: "The KS week is over, and thus the seventh - and likely the last - rc for 3.12 is out, and I'm back on the normal Sunday schedule." He also warned that upcoming travel is likely to turn the 3.13 merge window into a relatively messy affair.
Stable updates: 3.2.52 was released on October 27.
Quotes of the week
- user space does something stupid and wrong (example: "nice -19 X"
to work around some scheduler oddities)
- user space does nothing at all, and the kernel people say "hey, user space _could_ set this value Xyz, so it's not our problem, and it's policy, so we shouldn't touch it".
I think we in the kernel should say "our defaults should be what everybody sane can use, and they should work fine on average". With "policy in user space" being for crazy people that do really odd things and can really spare the time to tune for their particular issue.
The Linux Foundation Technical Advisory Board election results
The 2013 elections for five members of the Linux Foundation's Technical Advisory Board were held on October 23. The election, which almost certainly drew more interest than in any preceding year, re-elected Greg Kroah-Hartman, Thomas Gleixner, and Jonathan Corbet to the board; they will be joined by new members Sarah Sharp and Matthew Garrett.
Kernel development news
The 2013 Kernel Summit
The 2013 Kernel Summit was held October 23-25 in Edinburgh, UK. Following the pattern set in recent years, the 2013 summit was divided into a minisummit day (the 23rd), a core, invitation-only day on the 24th, and a plenary day that finished things out on the 25th. What follows is LWN's coverage from the event, supplemented in places with information posted by others.
The minisummit day
Several minisummits were held on the first day. Naturally, your editor was only able to attend one of them:
- Power-aware scheduling: how to improve CPU power management and tie it more firmly into the CPU scheduler.
Other minisummits held that day include:
- Media; notes posted by Mauro Carvalho
Chehab.
- ARM architecture maintainers (Notes by Grant Likely).
The core day
The second day of the summit was attended by 70 or so invited developers. Topics covered this day include:
- The kernel/user-space boundary: where
do we draw the line between the kernel and user space, especially when
it comes to ABI guarantees?
- The Outreach Program for Women and the
kernel.
- Control groups are under heavy
development; this session covered where this subsystem is headed and
the numerous issues that still need to be worked out.
- The linux-next and -stable trees; two
sessions on our most important non-mainline trees.
- Testing: in particular, Trinity and
Fengguang Wu's build-and-boot test robot.
- On saying "no": are we accepting too
much marginal code into the kernel?
- Bugzilla, lightning talks, and future summits. The status of the kernel's bug tracker, some random subjects, and discussion of the Kernel Summit itself.
The plenary day
A larger group met for the final day of the kernel summit. Among the topics discussed there were:
- Minisummit reports, with details from
the ARM minisummit in particular.
- Git tree maintenance: how the tip and
arm-soc trees are managed.
- Scalability techniques: four talks on
the scalability mechanisms available in the kernel.
- Device tree bindings and how to
resolve the current mess.
- Checkpoint/restart in user space; what
is the status of this functionality?
- A kernel.org update; current and
future changes to the community's development repository site.
- Security practices: what to do when a
security problem is found.
- Lightning talks; a number of brief presentations to finish out the day.
[Your editor would like to thank the Linux Foundation for assistance with his travel to Edinburgh to attend the Kernel Summit].
The kernel/user-space boundary
H. Peter Anvin and Miklos Szeredi kicked off the invitation-only day of the 2013 Kernel Summit with a question: where, exactly, does the boundary between the kernel and user space lie? And, in particular, when is it possible to make an incompatible change to the kernel ABI with the understanding that the actual, supported ABI is provided by user-space code that is closely tied to the kernel? The answer they got was clear but, perhaps, not exactly what they wanted.
Peter started by saying that we have a clear "don't break user space"
policy. For the most part, living up to that policy is relatively
straightforward; one avoids making incompatible changes to system calls and
things continue to work. We are also getting better with other kernel
interfaces like sysfs and /proc. But there was, he said, an
interesting corner case last year: the GRUB2 bootloader was making a habit
of looking at the kernel configuration files during installation for the
setup of its menus. The restructuring of some internal kernel code broke
GRUB2. At this point, Linus jumped
in to claim that the kernel's configuration files do not constitute
a part of the kernel's ABI. When somebody does something that stupid, he
said, one really cannot blame the kernel.
Peter moved on to another problem, one he himself introduced some sixteen years ago. The automounter ABI had issues that caused it to fail with a 32-bit user space on a 64-bit kernel. A new type of pipe had to be introduced to fix this problem; it was, he said, an example of how far we are willing to go to avoid breaking applications.
What about, he asked, cases where we need to shift to a new ABI altogether? Changes to the pseudo terminal (pty) interface are needed to get ptys to work from within containers; it's still not clear how to handle the master device in such situations. The control group interface is in flux, and there have been some disagreements with the systemd folks over who "owns" the container hierarchy as a whole. When it was suggested that systemd "wants to take over" control groups, Linus was quick to state that no such thing was going to happen. James Bottomley jumped in to note that the issue had been discussed and that a mutually acceptable solution was at hand.
Another ABI issue is the current limitation, built into the Linux virtual filesystem layer, that no single I/O operation can transfer more than 2GB of data. As systems and memory sizes get larger, that limit may eventually hurt, he said, but Linus said that this limit would not be lifted. We are, he said, better than OS X, which causes overly large I/O requests to fail outright; Linux, instead, just transfers the maximum allowed amount of data. There are huge security implications to allowing larger I/O operations, to the point that there is no excuse for removing the limit. A whole lot of potential problems will simply be avoided if filesystem code just never sees 64-bit I/O request sizes. And, he said, if you try to do a 4GB write, "you're a moron." Such requests will not be any faster, there is just no reason to do it.
In general, Linus said, he is fundamentally opposed to doing anything that might break user space; he was not sure why the topic was being discussed at all. The old issue of tracepoints came up, and Linus said that, if we break something with a tracepoint change, that is a problem and we will fix it. Greg Kroah-Hartman pointed out that some subsystem maintainers — himself and Al Viro, for example — are refusing to add tracepoints because they are afraid of being committed to supporting them forever. Others thought that this policy was excessively cautious, noting that actual problems with tracepoint ABI compatibility have been few and far between. No-tracepoints policies, Ingo Molnar said, are simply not justified.
What about changes to printk() calls that break scripts that grep through the system logs? Linus answered that printk() is not special, and that problems there will be fixed as well. Masami Hiramatsu suggested that the sort of string-oriented data found in the logs is relatively easy to work with, and changes are easy to adapt to, but that hints that, perhaps, users are just coping with problems there rather than complaining about them. It would be interesting to see what would happen if a user were to actually complain about broken scripts resulting from a printk() change. Linus closed things off by complaining that the kernel developers have spent far more time worrying about this problem than they ever have dealing with actual issues.
Miklos stepped up to ask more specifically: where is the boundary that sets
the kernel ABI? Some parts of the operating system live in the kernel,
while others can be found in libraries and other user-space code. Sometimes things
move; rename() was once in the C library, now it's a system call
provided by the kernel. NFS servers have been found on both sides of the
divide, graphics drivers have been moving into the kernel, sound drivers
have moved in both directions, and filesystems can be found on both sides.
Miklos may have been hoping for some sort of discussion of whether the interface between the kernel and some of these low-level components could be considered to be internal and not part of the kernel's ABI, but things didn't go in that direction. Instead, the discussion wandered a bit, covering whether parts of NetworkManager should be in the kernel (no, they would just have to call out to user space for authentication and such), drivers that require closed user-space components (still considered unwelcome), and the implementation of protocols like MTP, which, evidently, has more stuff in user space than should really be there.
[Next: Outreach Program for Women]
The Outreach Program for Women
Sarah Sharp led a session to update the group on the status of the Outreach Program for Women (OPW) and the kernel. The kernel project has just completed its first round of participation in this program, which funds three-month internships for women looking to work with a free software project. OPW runs internships twice each year, in the June-to-September and December-to-March time frames. There is a month-long application process prior to the beginning of each period, during which prospective participants must contact a mentor, submit a contribution to the target project, and get it accepted.
In the first round, the kernel applicants mostly sent patches for the staging tree; the result was 93 patches in the 3.12 kernel and a tutorial on how to participate in the program. It was, Sarah said, a successful beginning.
For those who want to help, the project is always in need of mentors for participants. Also needed are people who can hang out on the OPW IRC channel and answer basic questions as they come up, and "patient people" who can do pre-posting review of patches.
Dave Airlie asked how OPW projects were picked. The answer is that it starts with the mentors, who generally have specific areas in which they are comfortable helping new developers. After that, it's up to the applicants to suggest specific projects they would like to work on. In general, the most successful projects seem to be those that do not require a lot of subsystem-specific knowledge.
What is expected of mentors? Sarah said that the way she worked was to have weekly phone meetings with her intern, and that she spent three or four hours per week looking at patches, responding to questions, etc. All told, she said, it is a commitment of about eight hours per week.
Returning to how the kernel's first OPW experience went: for the most part,
the participants "did pretty well." A couple of them have not yet
completed their projects, but they are still working on them. At least six
of them are looking for work in the kernel area. It was "pretty
successful" overall. James Bottomley asked whether any of the interns are
thinking about turning around and taking a turn as mentors; Sarah answered
that some of them are helping out on the IRC channel, but none of them are
ready to be full mentors yet.
Linus raised a general concern he had which, he said, had little to do with OPW specifically; it's something he has seen in other groups. There were, he said, a lot of trivial one-line patches from the OPW participants; the same fix applied to ten files showed up as ten separate patches. It makes the numbers look good, he said, but is not necessarily helpful otherwise. He is worried about people gaming the system to look good by having a lot of commits; Arnd Bergmann agreed that splitting things into too many patches tends to impede review.
From there, the conversation became a bit more unfocused. Ted Ts'o suggested that mentoring could be a good recruiting tool for companies that are forever struggling to hire enough kernel developers. There were complaints that three months is too short a period for an intern to really dig deeply into the kernel; it is, it was suggested, driven a little too much by the American university schedule. Dave added that universities in other parts of the world will often place students into this kind of project for longer periods. Mauro Carvalho Chehab suggested that it would be good to place a greater emphasis on places like Africa and South America where we have few developers now; Sarah agreed that interns can come from anywhere, we simply are not advertising well enough in those areas.
Ted asked what the limiting factor for the program was; funding seems to be the biggest issue. That will be especially true during the next cycle, when the Linux Foundation, which funded several interns the first time around, will not be able to participate. There is a need for more employers to kick in to support interns; the cost for one participant is $5,750. Information on how to sponsor OPW interns has been posted on the KernelNewbies site.
[Next: Control Groups].
The evolution of control groups
The control group (cgroup) subsystem is currently under intensive development; some of that work will lead, eventually, to ABI changes that are visible from user space. Given the amount of controversy around this subsystem, it was not surprising to see control groups show up on the 2013 Kernel Summit agenda. Tejun Heo led a session to describe the consensus that had been reached on the cgroup interface, only to find that there are still a few loose ends to be tied down.
Tejun started by reminding the group that the multiple hierarchy
feature of cgroups, whereby processes can be placed in multiple, entirely
different
hierarchies, is going away. The unified hierarchy work is not entirely
usable yet, though, because it requires that all controllers be enabled for
the full hierarchy. Some controllers still are not hierarchical at all;
they are being fixed over time. The behavior of controllers is being made
more uniform as well.
One big change that has been decided upon recently is to make cgroup controllers work on a per-process basis; currently they apply per-thread instead. Among other things, that means that threads belonging to the same process can be placed in different control groups, leading to various headaches. Of all the controllers only the CPU controller has any business working with individual threads. For that case, some sort of special interface will be introduced that will, among other things, allow processes to set CPU policies for their own threads.
That interface, evidently, might be implemented with yet another special-purpose virtual filesystem. There was some concern about how the cgroup subsystem may be adding features that, essentially, constitute new system calls without review; there were also concerns about how the filesystem-based interface suffers from race conditions. Peter Zijlstra worried about how the new per-thread interface might look, saying that there were a lot of vague details that still need to be worked out. Linus wondered if it was really true that only the CPU controller needs to look at individual threads; some server users, he said, have wanted per-thread control for other resources as well.
Linus also warned that it might not be possible to remove the old cgroup interface for at least ten years; as long as somebody is using it, it will need to be supported. Tejun seemed unworried about preserving the old interface for as long as it is needed. Part of Tejun's equanimity may come from a feeling that it will not actually be necessary to keep the old interface for that long; he said that even Google, which has complained about the unified hierarchy plans in the past, has admitted that it can probably make that move. So he doesn't see people needing the old interface for a long time.
In general, he said, the biggest use for multiple hierarchies has been to work around problems in non-hierarchical controllers; once those problems are fixed, there will be less need for that feature. But he still agrees that it will need to be maintained for some years, even though removal of multiple hierarchy support would simplify things a lot. Linus pointed out that, even if nobody is using multiple hierarchies currently, new kernels will still need to work on old distributions for a long time. Current users can be fixed, he said, but Fedora 16 cannot.
Hugh Dickins worried that, if the old interface is maintained, new users may emerge in the coming years. Should some sort of warning be added to tell those users to shift to the new ABI? James Bottomley said, to general agreement, that deprecation warnings just don't work; distributions just patch them out to avoid worrying their users. Tejun noted that new features will only be supported in the new ABI; that, hopefully, will provide sufficient incentive to use it. Hugh asked what would happen if somebody submitted a patch extending the old ABI; Tejun said that the bar for acceptance would be quite high in that case.
From the discussion, it was clear that numerous details are still in need of being worked out. Paul Turner said that there is a desire for a notification interface for cgroup hierarchy changes. That, he said, would allow a top-level controller to watch and, perhaps, intervene; he doesn't like that idea, since Google wants to be able to delegate subtrees to other processes. In general, there seems to be a lack of clarity about who will be in charge of the cgroup hierarchy as a whole; the systemd project has plans in that area, but that creates difficulties when, for example, a distribution is run from within a container. Evidently some sort of accord is in the works there, but there are other interesting questions, such as what happens when the new and old interfaces are used at the same time.
All told, there is a fair amount to be decided still. Meanwhile, Tejun said, the next concrete step is to fix the locking, which is currently too strongly tied to the internal locking of the virtual filesystem layer. After that is done, it should be possible to post a prototype showing how the new scheme will work. That posting may happen by the end of the year.
[Next: Linux-next and -stable].
The linux-next and -stable trees
While most kernel development focuses on the mainline, Linus's tree is not the only one out there. Two 2013 Kernel Summit sessions focused on a couple of the other important trees: linux-next and the stable tree; coverage of both those sessions has been combined into this article.
linux-next
Linux-next maintainer Stephen Rothwell attended the Kernel Summit as the
last stop in a long journey away from the office — and from maintenance of
the linux-next tree. That tree continued to function in his absence, the
first time it has ever done so. Stephen's session covered the current
state of this tree and how things could maybe be made to work a little
better.
He started by thanking Thierry Reding and Mark Brown for keeping linux-next going during his break. Linux-next is a full-time job for him, so it has been hard to hand off in the past, but the substitute maintainership appears to have worked this time.
Stephen routinely looks at the code that flows into the mainline during merge windows to see how much of it had previously appeared in linux-next. For recent kernels, that figure has been approaching 90%; it would probably be difficult, he said, to do better than that. What might be improved a bit, though, is how long that code is in linux-next. In general, only 72% of the code that appears in linux-next is there one week before the merge window opens; that figure drops to less than 60% two weeks prior. So he tends to be busy in the last couple weeks of the development cycle, dealing with lots of merge conflicts and build failures. It leads to some long days. It would, he allowed, be nice if more of that code got into linux-next a bit earlier.
There are, he said, 181 active trees merged into linux-next every day. Some of those are second-level trees that, eventually, reach mainline by way of another subsystem maintainer's tree.
The worst problem he encounters is whole-tree changes that affect files across the kernel. Those create a lot of little messes that he must then try to clean up. It would be nice, he said, to find a better way to do things like API changes that require changes all over. Linus added that the reformatting of files to fix coding-style issues before applying small changes is "really nasty," leading to "conflicts from hell." It would be better to separate things like white-space changes from real work by a full release — or to not do the trivial changes at all. Ben Herrenschmidt suggested that white-space changes could be done at the end of a series, making them easy to drop, but Linus said that doesn't help, that the conflicts still come about. The best way to do white-space changes if they must be done, he said, is to do them to otherwise quiescent code.
Stephen said that he still sees more rebasing of trees than he would like; rebasing should be avoided whenever possible. Grant Likely asked about one of the common cases for rebasing: the addition of tags like Acked-by to patches that already appear in linux-next. Linus said that this is a bit of a gray area but that, in general, if those tags do not show up in a timely manner, it's usually best not to bother with them. Stephen added that the addition of tags to patches in a published git tree may be an indication that the tree has been published prematurely; such trees should not be fed into linux-next until they are deemed ready for merging.
James Bottomley pointed out that developers often publish git trees to get attention from Fengguang Wu's automated build-and-test system. But such trees do not need to be fed into linux-next to get that kind of testing; they can be put out on a separate testing branch. Ingo Molnar said that some tags, like Tested-by, can arrive fairly late; we really do not want to drop them and fail to credit our testers. Ted Ts'o added that employers often count things like Reviewed-by tags and that it is important to get them in.
Linus's response was that timeliness matters; a too-late review is essentially useless anyway. That said, he also said that some people do take the "no-rebase" policy a little bit too far. There are times when rebasing a tree can be justified, especially if other developers are not basing their own trees on the rebased tree. For example, when a patch turns out to be a bad idea, it can be better to make it disappear from the history and "pretend that all crap just didn't happen." Ben added that he asks his submitters to base their trees on the mainline; then he can rebase his tree and they will not be adversely affected by it.
Dave Jones complained about bugs which are claimed to be "fixed in linux-next," but those fixes then sit in linux-next for months until the next development cycle comes around. The right response to that problem, Andrew Morton said, was to "steal them" and forward them directly to Linus for immediate merging.
In general, Linus said, he has been using the linux-next tree a lot more than he used to because it is working a lot better. It has become a good indication of what will be coming in the next merge window. In general, the group agreed that linux-next is a valuable resource that has done a lot to make the development process work more smoothly.
The stable tree
Stable tree maintainer Greg Kroah-Hartman gave a quick update on the state of the various stable trees he runs. One of the biggest changes in stable tree maintenance, he said, is that he is starting to delay the inclusion of patches until they have appeared in one of Linus's -rc kernels; that delay will typically be one or two weeks. There have been a few incidents recently where "stable" patches have caused regressions; Greg hopes that, by inserting a small delay, he can flush out the problematic patches before shipping them in a stable release.
Among the other problems that Greg is trying to address is maintainers who
never mark
any patches for the stable tree. "We need to fix that," he said. There is
also the problem Dave mentioned, where stable fixes live in linux-next for
months; "don't do that." If a patch is tagged for stable, he said, that
means it should go out soon, not languish for months. James said that
sometimes he will hold stable fixes because he needs people to test them;
he does not have every piece of SCSI hardware, and so cannot verify that
every patch works as advertised.
Greg went on to say that it would be nice to have some way to automate the task of figuring out how far back a given patch needs to be backported. To that end, some developers are proposing the addition of a "Fixes:" tag to bug-fix patches. That tag would include the SHA hash of the commit that caused the original problem, along with that patch's subject line. Including the hash of the bad commit is better than just citing a kernel version; it helps maintainers of non-mainline trees figure out if the fix applies to their version of the kernel or not.
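For illustration only — the commit hash and subject line below are invented — such a tag in a patch changelog would look something like:
    Fixes: 1234567890ab ("subsys: fix error path in frobnication")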
Linus jumped in to say that he would like everybody to run this command in their repositories:
git config core.abbrev 12
That causes git to abbreviate commit hashes to 12 characters. The default of seven characters is too small to prevent occasional hash collisions in the kernel; it was, he said, a big mistake made early in git's history. He also noted that he spends a lot of time fixing up hashes in patches, many of which are "clearly bogus." Most of the problems, most likely, are caused by the rebasing of trees.
James asked: what should be done about patches that should have been marked for stable but, for whatever reason, did not get tagged? The answer was to send the relevant mainline git IDs to stable@vger.kernel.org; the rest will be taken care of.
Arnd Bergmann asked whether there had been complaints about the dropping of support for the 3.0.x series. Greg answered that people are mostly happy and not complaining. One person is considering picking up 3.0 maintenance, but Greg does not think that will happen.
Other problems that Greg mentioned include people reporting that things have been broken in stable when the real problem is in the mainline. Those reports usually come, he said, from people who are not testing Linus's -rc releases. He also mentioned "certain distributors" who are not good about sending in the fixes they apply to their own kernels. Those fixes tend to be especially useful, since they were applied in response to a real problem that somebody encountered. If anybody wants to help out the stable process, he said, digging through distributor kernels for these fixes would be a useful thing to do.
As the session wound to a close, Greg was asked about what he does with patches that do not apply to older kernels. His response is that he will bounce them back to the maintainer for backporting. Subsystem maintainers, he said, need not worry about patches being tweaked on their way into -stable.
[Next: Testing].
Two sessions on kernel testing
Over the last couple of years, the amount of testing applied to pre-release kernels has quietly been increased in a big way; this work has had a significant impact on kernel release quality. Two of the developers behind that work — Dave Jones and Fengguang Wu — ran sessions to talk about what they are doing and their plans for the future.
Trinity
Dave's "Trinity" fuzz-testing tool has been around for some time, but the
pace of development has increased in the last year or two. Dave introduced
himself as the guy who "has broken lots of people's stuff" and who plans to
continue doing so; Trinity, he said, is getting better and growing in
scope. From the beginning, Trinity has tried to perform intelligent fuzz
testing of system calls by avoiding calls that will obviously get an
EINVAL error from the kernel. So, for example, system calls expecting a
file descriptor will get a file descriptor rather than a random number.
Work is continuing in that direction; the idea is to get Trinity to do
things that real programs would do.
One of the targets for the future is to add more subsystem-specific testing. There will also be more use of features like control groups. Among other things, these additional tests will require that Trinity be run as root — something that has been discouraged until now. He wants the ability to fuzz things that only root can do, he said, expressing confidence that there will be "all kinds of horrors" waiting to be found.
Dave was asked about using the fault injection framework for testing; he responded that, every time he tries, he feels like he is the first to use it. "Things blow up everywhere." Dave Airlie asked about fuzz-testing in 32-bit mode on a 64-bit kernel; the answer was that this mode was broken for a while, but should work now. When asked about testing user namespaces, Dave noted that a lot of problems have been found in that area. Trinity does not run within them now; it would be nice if somebody would submit a wrapper to make that work.
Ted Ts'o remarked on the difficulty of finding the real cause of a lot of Trinity-caused crashes. Quite a few of them, he suspects, are really the result of memory corruption left behind by a previous test; the place where the crash actually happens may have nothing to do with the real problem. Dave agreed that reproducibility is a problem. There is a lot that changes between runs, even after recent work that is careful to save random seeds so that the random number sequence used will be the same. It is, he said, "the number-one thing that sucks" about Trinity, but fixing it has proved to be far harder than he thought it would be.
The build-and-boot robot
Fengguang Wu has 63 Reported-by credits in the 3.12 kernel — over 12% of the total. These bug reports are the result of the extensive testing setup that he has been building; he ran a session at the Kernel Summit to describe his work.
Essentially, Fengguang's system works by pulling and merging a large number
of git trees, building the resulting kernel, then booting it. There are a
number of
tests that are then run, looking for bugs and performance regressions.
When a problem comes up, Fengguang's (large) systems can run up to 1000 KVM
instances to quickly bisect the history and determine which patch caused
the problem. The result is an automated email message, of which he sends
about ten each day. Fengguang noted that a lot of developers send
apologetic emails in response, but, he said, "it's a robot, you don't have
to reply." Linus jibed that most of that mail was probably an automated
"thank you" script run by Greg Kroah-Hartman.
Of the problems reported by Fengguang's system, about 10% are build errors, 20% are build warnings and documentation issues, 60% are generated by the sparse utility, and 10% come from static checkers like smatch and Coccinelle. The number of error reports going out has been dropping over time, he said; it seems that more developers are running their own tests before making their code public.
There were various questions, starting with: which compiler does he use? Fengguang said that it's gcc from the Debian "sid" distribution. Are any branches excluded from testing? Those which hold only ancient commits or which are based on old upstream releases are not tested; any branch that has "experimental" in its name will also not be tested. Otherwise, once Fengguang's system finds your repository, no branch will go untested. How does he find trees to test? Mostly from mailing lists and git logs; as Ted put it, "you can run, but you can't hide."
One of the more recent changes is the running of performance tests. These tests are time consuming, though; Fengguang would like more tests that can run quickly. The best performance tests, he said, have a --runtime flag to control how long they run; that leads to predictable behavior on both fast and slow systems. He also noted that both the size of the kernel and the time required to boot are increasing over time.
The session ended with general agreement in the room that this work is helpful and welcome.
[Next: Saying "no"].
On saying "no"
Are we getting too lax in our acceptance of new code in the kernel? Christoph Hellwig ran a session to explore that question. It may be, he said, that kernel maintainers are getting old and just don't have the energy to fight against substandard code the way they once did; there is very little pushback against ideas now. The review process, he said, is now focused on white-space issues rather than on whether the patch is needed at all. Nobody is willing to just decide that a problem is too hard, resulting in things like the "Linux security modules debacle." People don't want to make decisions or stick their necks out to turn stuff down. There are, Christoph said, two types of maintainer currently: those who merge everything, and those who ignore everything.
Olof Johansson claimed that he and Arnd Bergmann have gotten pretty good at pushing crap out of the ARM tree. There are a lot of new subsystems with energetic new maintainers, which helps. The jury is still out, he said, but things are running well at this point.
Ted Ts'o asked for specific examples of code that should not have been
merged. Christoph responded that, once upon a time, it used to be hard to
merge patches that bloated core kernel data structures; now everything just
keeps growing. He also pointed to the various filesystem notification
subsystems in the kernel, leading to a threat from Linus: he will kick
anybody who tries to submit another notification scheme.
Linus did allow that we are letting a lot of stuff through; he relies on
the top-level maintainers to push back and that isn't always happening,
despite the fact that he asks maintainers to say "no" more often.
Ted said that, ten years ago, patches went through a lot more review by developers other than the maintainer of the affected subsystem, but we just can't scale to that kind of review anymore. So maintainers are not paying much attention to what is happening outside of their own subsystems, and a lot of people have tuned out of the linux-kernel mailing list entirely. Ingo Molnar suggested that we are seeing some of the natural results of a distributed development model; the social structure, too, is more distributed. It was also pointed out that heavy criticism is frowned upon more than it used to be.
Christoph went on to mention the problems with O_TMPFILE, which he described as "a trainwreck." There are still critical bugs being fixed with that code. In general, he said, it often seems that code going into the kernel has been rushed and hasn't had time for proper review or testing. Dave Jones added that, sometimes, maintainers abuse Linus's trust and merge code that really is not ready to go in.
Is the model of having a single maintainer for a subsystem still appropriate? Christoph noted that having multiple maintainers seems to help to ensure adequate review of patches. Perhaps group maintenance should become more of the rule than the exception. Tejun Heo responded that we just don't have the manpower to have multiple maintainers for a lot of subsystems. He cited the workqueue code which, he said, really needs somebody else with a deep understanding of what is going on there, but, which, for years, was looked at by nobody but him.
Andrew Morton said that he often struggles with the question of whether patches should be merged at all. He pointed to the "zstuff" — transcendent memory and related in-memory compression technologies; he kept pushing back against the pressure to merge that code, hoping for an explanation of why we needed it. In the end, a couple of distributors picked it up; that tipped the balance and made it hard to keep that code out. Peter Zijlstra described this as a sort of side-channel attack: distributors will carry almost anything if their users bother them enough. So, by harassing distributors enough, developers can get controversial code merged into the kernel.
What about patches that cause performance regressions? The sense in the room was that things have gotten a little bit better in that area. Some distributors are running more performance tests; this includes Mel Gorman's work at SUSE and the tests run by Martin Petersen at Oracle. Red Hat, too, has been pushed to run more tests, but the people involved apparently feel a little ignored at the moment.
Andrew came back in to repeat that he could really use more help in deciding which features should or shouldn't be merged. While being wary of the idea that every problem can be solved with a new mailing list, he thought that perhaps some sort of "graybeards@" list might be helpful for this kind of question. Linus agreed, suggesting that it should be a non-optional list and that all Kernel Summit attendees should be subscribed. If anybody complains loudly, they could be removed, and the community would know that they aren't interested in core issues and don't want to attend any more summits.
There was some agreement on the creation of this list, but it was also agreed that, somehow, the volume would need to be kept down. Linus suggested that it should not allow messages that are copied to any other list. There may also be a rule that no patches are posted to the list; anything posted there would include links to discussions elsewhere. Ted closed out the session by saying that he would go ahead and create the list and the members could work out the rules from there.
[Next: Bugzilla, lightning talks, and future summits].
Bugzilla, lightning talks, and future summits
The final sessions at the 2013 Kernel Summit were given over to a brief discussion of the kernel's bug tracker, followed by a series of lightning talks and a discussion of the summit itself.
Konstantin Ryabitsev is one of the kernel.org administrators hired after the Linux Foundation took over responsibility for that site. One of the many services he is charged with managing is the bugzilla system. He came to the Kernel Summit asking whether the community was happy with things and how the kernel's bug tracker could be made more useful.
Rafael Wysocki responded that his group is using it to track ACPI and power
management bugs; it works well and they are happy with it. Ted Ts'o said
that he tends to use specific bugs as a sort of miniature mailing-list
archive. He often forgets to close out the bugs when things are resolved;
as a result, he's not particularly interested in getting status report
messages out of the system.
There was a request for better ways of tracking patches posted in response to tracked bugs. One developer complained that he would like to be able to assign bugs to others, but the system won't allow that. It seems that the "edit bugs" permission is required; Konstantin said he is willing to give that permission to anybody who requests it, but it was also suggested that perhaps that permission should just be given to anybody who has a kernel.org account.
Russell King asked how he could get out of receiving mail for bugs in subsystems that he no longer maintains; Konstantin's response was "procmail." More seriously, though, it's just a matter of sending him a note and such issues will be taken care of.
Dave Jones noted that bugzilla is the second-worst bug tracker available, with the absolute worst being "everything else." He went on to say that Fedora developers are interested in using upstream bug trackers more and that, in particular, Fedora's kernel developers would like to track bugs in the kernel.org bugzilla. Beyond that, though, they would like to feed bugs from Fedora's "Automatic Bug Reporting Tool" directly into the system. Anybody who has ever had to wade through a morass of ABRT bugs in the Fedora tracker might be forgiven for being worried about this idea, but, Dave said, things have gotten much better and there aren't a whole lot of duplicate bugs anymore. So the assembled group was reasonably accepting of this idea, seeing it as a replacement for the much-lamented kerneloops.org service if nothing else.
Lightning talks
Linus wanted the group to know that the 3.13 merge window is likely to be
"chaos." He has a bunch of travel coming up, including an appearance at
the Korea
Linux Forum. So he will be traveling during the merge window,
sometimes in places without any sort of reasonable network access.
Ted asked whether it would just make sense to delay the release of 3.12 until Linus is back. Linus responded that he could do that if people wanted, but that it really wouldn't make much difference in the end. There's going to be a period where he just isn't looking at pull requests, but, he emphasized, maintainers should still get their pull requests in early. Greg added that, during this time, urgent stable fixes could, contrary to normal policy, get into the stable tree prior to being merged into the mainline; he won't hold them while Linus is off the net.
Dave Airlie asked: who out there is pining for kerneloops.org? Anybody who would like to see better tracking of the problems being encountered by users might want to take a look at retrace.fedoraproject.org. It has plots of reports, information on specific problems, and more. It's a useful resource, he said, but he still wishes the kernel project had somebody tracking regressions.
Dave Jones talked about his work digging through the reports generated by Coverity's static analysis tool. During the 3.11 merge window, he said, the merging of the staging tree alone added 435 new issues to the database. He's been working on cleaning out the noise, but there are still over 5000 open issues there. A lot of them are false positives or failures on the tool's part to understand kernel idioms; that noise makes it hard to see the real bugs. He's trying to clean things up over time.
He has been given permission to allow others to see the problem list; interested developers are encouraged to contact him. He is also able to see who else is looking at the problems. A few of them are kernel developers, he said, but most of the people looking at problem reports have no commits in the tree. Instead, they work for government agencies, defense contractors, and commercial exploit companies. The list "gave him the creeps," he said. Should those people be kicked out of the system? It might be possible, but they could then just buy their own Coverity license or run a tool like smatch instead, so there would be little value in doing so.
Wireless network drivers are the leading source of problem reports, followed by misc drivers, SCSI drivers, and network drivers as a whole. In general, he said, the "dark areas of the kernel" are the ones needing the most attention.
Mathieu Desnoyers asked whether people working on their own static analysis tools should be looking at Coverity's results. That was acknowledged to be a gray area. Coverity does not appear to be a litigious company, but people working in that area might still want to steer clear of Coverity's results.
Future summits
Ted Ts'o closed out the day by asking: were developers happy with how the kernel summit went, and what would they like to see changed? The participants seemed generally happy, but there were a couple of complaints that the control group discussion was boring to many. That follows a longstanding Kernel Summit trend: highly technical subjects are generally considered to be better addressed in a more specialized setting.
Much of the discussion was related to colocation with other events. From a show of hands, only about 20% of the Kernel Summit attendees found their way over to LinuxCon, which was happening that same week. The separation of the venues (LinuxCon was a ten-minute walk away) certainly didn't help in that regard. In general, there were grumbles about how there was too much going on, or that the "cloud stuff" (CloudOpen was also running at the same time) is taking over.
Should the Kernel Summit be run as a standalone event? The consensus seemed to be that an entirely standalone summit doesn't work, but that, with enough minisummits, it might be able to go on its own. Should the summit be colocated with the Linux Plumbers Conference, and possibly away from everything else? One problem with that is that, like the summit, LPC is a high-intensity event; the two together make for a long week. Still, a straw poll indicated that most of the participants favored putting those two conferences together.
Eventually the discussion wound down. The group headed off for the group photo, followed by a nice dinner and the TAB election.
[Next: Minisummit reports].
Minisummit reports
Following the usual tradition, the plenary day at the 2013 Kernel Summit started with a series of reports from various kernel minisummits, most of which were held earlier in the week. For some of those meetings, there is reporting available elsewhere, so that information will not be duplicated here.
The events with reports found elsewhere are:
- The power-aware scheduling minisummit; this meeting was covered separately here on LWN.
- The media minisummit; see Mauro Carvalho
Chehab's notes for details.
- The Linux security summit held in New Orleans in September; LWN posted several articles from that event.
The ARM minisummit
Grant Likely reported from the two-day ARM minisummit held in Edinburgh.
This gathering, he said, was "mostly boring"; for the most part, it was
"normal engineering stuff". Grant said that it was nice not to have "big
news items" to have to deal with. Notes are promised, but have not been
posted as of this writing.
One of the items discussed was the status of the "single zImage" work, which aims to create a single binary kernel that can boot on a wide variety of ARM hardware. Work is progressing in this area, with support being added for a number of ARM processor types. For the curious, there is an online spreadsheet showing the current status of many ARM-based chipsets.
Some time went into the problem of systems with non-discoverable topologies; this is an especially vexing issue in the server area. There was some talk of trying to push the problem into the firmware, but the simple fact is that it is not possible to get the firmware to hide the issue on all systems.
As anybody who has been unlucky enough to be subscribed to the relevant mailing lists knows, the big issue at the 2013 gathering was the problems with the device tree transition. Grant gave an overview of the discussion as part of his report; more details on the device tree issue came out during a separate session later in the day.
The big problem with device trees is their status as part of the kernel's ABI. As an ABI element, device tree bindings should not change in incompatible ways, but that constraint creates a problem: as the developers learn more about the problem, they need to be able to evolve the device tree mechanism to match. That has led to a situation where driver development has been stalled; the need to create perfect, future-proof device tree bindings has caused work to be hung up in the review process. The number of new bindings is large, while the number of capable reviewers is small. The result is a logjam that is slowing development as a whole.
There is a plan to resolve some of those issues which was discussed later in the day. In this session, though, Grant raised the question: might device trees be a failed experiment? Should the kernel maybe switch to something else? The alternatives are few, however. The "board file" scheme used in the past has proved to not scale and is an impediment to the single zImage goal. ACPI has its own problems in the ARM space, even if it were to become universally available. One might contemplate the possibility of something completely new, but there are no proposals on the table now. It seems that we are stuck with device trees for now.
So the ARM developers plan to focus on making things work better in that area. That means that much of the work in the coming year will be aimed at improving processes rather than inventing interesting new technologies.
[Next: Git tree maintenance]
Git tree maintenance
Git has transformed the kernel development process since its introduction in 2005. While this tool is well integrated into most developers' workflows, there are still substantial differences in how maintainers use it. A session in the 2013 Kernel Summit gave maintainers of two of the more active trees a chance to talk about their management processes and what they have learned about the best ways to shepherd large numbers of patches into the mainline.
H. Peter Anvin is one of the maintainers of the "tip" tree, which
takes its name from the first names of the group that manages it: Thomas
Gleixner, Ingo Molnar, and Peter. This tree was started in 2007; it was
initially focused on the x86 architecture tree, but has since expanded into
other, mostly core-kernel areas. They made a lot of mistakes early on,
Peter said, that caused Linus to "go very Finnish" on them, but things are
working smoothly now.
There are three types of branches maintained in the tip tree. "Topic branches" contain patches that are intended to be pushed during the next merge window. "Urgent branches" contain bug fixes that need to go in before the merge window, while "queue branches" hold patches that will be pushed in some merge window after the next one. So, as of this writing, when the 3.12 development cycle is nearing its end, topic branches will hold changes for 3.13, while queue branches hold changes for 3.14 or later.
All of these branches are periodically integrated into the tip "master" branch; Peter described master as their version of linux-next. This merge is done by hand, usually by Ingo, who then feeds the result to his extensive testing setup.
Other tip practices include tip-bot, a program which sends out notifications when patches are added to a tip branch. Those notifications used to only go to the patch author, but they have since been expanded to include the linux-kernel list as well. Patches in tip routinely include a "Link:" tag pointing to the relevant mailing list discussion. There is a status board in the works, based on Fengguang Wu's testing setup.
Olof Johansson talked about the management of the arm-soc tree, which was
started by Arnd Bergmann in July, 2011. Olof joined that effort later that
year; more recently, Arnd
has been on paternity leave, so Kevin Hilman has joined the team
to help keep things going. This tree, which is focused on system-on-chip
support for the ARM architecture, is run with no master branch. Instead,
there is a large set of branches, mostly with a "next/" prefix for
patches in a number of categories, including cleanups, non-urgent fixes,
SoC support additions, board support, device tree changes, and driver
changes. All of these branches are merged into a for-next branch
which is then fed into linux-next.
All of these branches lead to a lot of merges — about 150 of them for each kernel development cycle. Olof said that newcomers tend to have a bit of a rough start as they figure out how the arm-soc tree works, but, after a while, things tend to run smoothly.
Olof mentioned a few "pain points" that the arm-soc maintainers have to live with. At the top of his list was the time period around when Linus releases -rc6; that's when a whole lot of new code comes in. It gets hard to pick a reasonable time to cut things off for the upcoming merge window. Having two levels of trees tends to add latency to the system, which doesn't help. There is also an ongoing stream of merge conflicts, both within arm-soc and with linux-next, and troubles with dependencies on external trees that get rebased by their maintainers.
Repeating a common lament, Olof said that the arm-soc maintainers are unable to keep up with the traffic on the ARM mailing lists. So they depend on the submaintainers to review patches and keep inappropriate changes out.
Arnd closed the session with a quick discussion of the process of moving most device drivers out of the ARM tree and into the regular kernel drivers tree. This work has caused a lot of merge conflicts, he said. But he expressed a hope that, once all the drivers are gone, there will be little need for a separate arm-soc tree and they will be able to stop maintaining it.
[Next: Scalability techniques]
Scalability techniques
The plenary day at the 2013 Kernel Summit included an hour-long block of time for the discussion of various scalability techniques. It was, in a sense, a set of brief tutorials for kernel developers wanting to know more about how to use some of the more advanced mechanisms available in the kernel.
Memory barriers
Paul McKenney started with a discussion of memory barriers — processor
instructions used to ensure that memory operations are carried out in a
specific order. Normally, Paul said, memory barriers cannot be used by
themselves; instead, they must be used in pairs. So, for example, a
typical memory barrier usage would follow a scheme like this:
- The writer side fills a structure with interesting data, then sets a
flag to indicate that the structure's contents are valid. It is
important that no other processor see the "valid" flag as being set
until the structure changes are visible there. To make that happen,
the writer process will execute a write memory barrier between filling
the structure and setting the flag.
- The reader process knows that, once it sees the flag set, the contents of the structure are valid. But that only works if the instructions reading the flag and the structure are not reordered within the processor. To ensure that, a read barrier is placed between the reading of the flag and subsequent operations.
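For readers who have not used these primitives, here is a rough sketch of that pairing in kernel-style C; the structure and function names are invented, a single writer is assumed, and this is not code that was shown in the session:

    struct message {
        int payload;
        int ready;              /* set only after payload has been filled in */
    };

    static struct message msg;

    /* Writer side: fill in the data, then publish the flag. */
    void publish(int value)
    {
        msg.payload = value;
        smp_wmb();              /* make payload visible before the flag */
        msg.ready = 1;
    }

    /* Reader side: only trust payload after the flag has been observed. */
    int consume(int *value)
    {
        if (!msg.ready)
            return 0;
        smp_rmb();              /* pairs with the writer's smp_wmb() */
        *value = msg.payload;
        return 1;
    }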
Paul noted that memory barriers can be expensive; might there be something cheaper? That may not be possible in an absolute sense, but there is a mechanism by which the cost can be shifted to one side of the operation: read-copy-update (RCU). RCU splits time into "grace periods"; any critical section that begins before the start of a grace period is guaranteed to have completed by the end of that grace period. Code that is concerned with concurrency can use RCU's synchronization calls to wait for a grace period to complete in the knowledge that all changes done within that grace period will be globally visible at the end.
Doing things in this way shifts the entire cost to the side making the synchronization call, which is sufficient in many situations. For the cases where it is not, one can use RCU callbacks, but that leads to some other interesting situations. But that was the subject of the next talk.
RCU-correct data structures
Josh Triplett took over to try to make the task of creating data structures
that function properly with RCU a less-tricky task. The mental model for
ordinary locking, he said, is relatively easy for most developers to
understand. RCU is harder, with the result that most RCU-protected data
structures are "cargo-culted." If the data structure looks something like
a linked list, he said, it's pretty easy to figure out what is going on.
Otherwise, the process is harder; he described it as "construct an ordering
scenario where things go wrong, add a barrier to fix it, repeat, go
insane."
There is a simpler way, he said. Developers should forget about trying to get a handle on overlapping operations, possible reordering of operations, etc., and just assume that a reader can run atomically between every pair of writes. That leads to a pair of relatively straightforward rules:
- On the reader side: rcu_dereference() should be used for
pointer traversal, smp_rmb() should be used to place barriers
between independent reads, and the entire critical section should be
enclosed within an RCU read lock.
- For writers, there are two possibilities. If writes are done in the same order that readers will read the data, then synchronize_rcu() should be used between writes. If they are done in the opposite order, use smp_wmb() or rcu_assign_pointer() to insert a write memory barrier. There is no need for an expensive synchronize call in this case.
Those two rules, Josh contends, are all that is ever needed to create safe data structures protected by RCU.
Josh walked the group through a simple linked-list example. Suppose you have a singly linked list, with each element pointing to the next. To insert an item into that list without taking any locks, you would start by setting the "next" pointer within the new item to point at the element that will follow it. Once that is done, the list itself can be modified to include the new item by changing the predecessor's "next" pointer to point at it.
Any code traversing the list concurrently will either see the new item or it will not, but it will always see a correct list and not go off into the weeds — as long as the two pointer assignments described above are visible in the correct order. To ensure that, one should apply Josh's rules. Since these pointer assignments are done in the opposite order that a reader will use to traverse the list, all that is needed is a memory barrier between the writes and all will be well.
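To make that concrete, here is a minimal sketch (assuming kernel context and <linux/rcupdate.h>; the item structure and function names are invented, and error handling is omitted) of the insertion alongside a concurrent traversal:

    struct item {
        int data;
        struct item *next;
    };

    static struct item *list_head;           /* RCU-protected list */

    /* Writer: the two stores happen in the reverse of traversal order,
     * so the write barrier implied by rcu_assign_pointer() suffices. */
    void insert_at_head(struct item *new)
    {
        new->next = list_head;               /* not yet reachable by readers */
        rcu_assign_pointer(list_head, new);  /* publish the new item */
    }

    /* Reader: traverse inside an RCU read-side critical section, using
     * rcu_dereference() for each pointer loaded from the shared list. */
    int list_contains(int key)
    {
        struct item *p;
        int found = 0;

        rcu_read_lock();
        for (p = rcu_dereference(list_head); p; p = rcu_dereference(p->next))
            if (p->data == key) {
                found = 1;
                break;
            }
        rcu_read_unlock();
        return found;
    }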
Removing an item from the list reverses the above process. First, the list is modified to route around the item to be taken out: the predecessor's "next" pointer is changed to point past it. Once it is certain that no threads are still using the to-be-removed item, its "next" link can be cleared and the item itself can be freed. In this case, the writes are happening in the same order that the reader would use, so it is necessary to use synchronize_rcu() between the two steps to guarantee that the doomed item is truly no longer in use before freeing it. It is also possible, of course, to just use call_rcu() to complete the job and free the item asynchronously after the end of the grace period.
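A companion sketch of the removal path (same invented names and assumptions; kfree() comes from <linux/slab.h>) might look like:

    /* Remove the item following "prev".  The writes occur in the same
     * order a reader traverses, so synchronize_rcu() must separate the
     * unlinking from the teardown of the removed item. */
    void remove_after(struct item *prev)
    {
        struct item *doomed = prev->next;

        rcu_assign_pointer(prev->next, doomed->next);  /* route around it */
        synchronize_rcu();       /* wait for all pre-existing readers */
        doomed->next = NULL;     /* no reader can still see the item */
        kfree(doomed);
        /* Alternatively, embed a struct rcu_head in struct item and use
         * call_rcu() to free it asynchronously after the grace period,
         * avoiding the blocking synchronize_rcu() call. */
    }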
Lock elision
Andi Kleen talked for a while about the use of transactional memory to
eliminate the taking of locks in many situations; Andi described this technique in some detail in an
LWN article last January. Lock elision, he said, is much simpler to work
with than RCU and, if the conditions are right, it can also be faster.
Transactional memory, he said, is functionally the same as having an independent lock on each cache line in memory. It is based on speculative execution within CPUs, something that they have been doing for years; transactional memory just makes that speculation visible. This feature is rolling out on Intel processors now; it will be available throughout the server space within a year. There are a lot of potential uses for transactional memory, but he's restricting his work to lock elision in order to keep the existing kernel programming models.
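As an illustration of the technique (this is a deliberately simplistic user-space sketch built on Intel's RTM intrinsics, not the kernel's actual elision code; build with -mrtm and run on a TSX-capable processor):

    #include <immintrin.h>    /* _xbegin(), _xend(), _xtest(), _xabort() */

    /* Test-and-set spinlock with elision: if the transaction commits,
     * the lock word is never written, so other CPUs can elide the same
     * lock at the same time. */
    static void elided_lock(volatile int *lock)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (*lock == 0)
                return;          /* lock free: proceed transactionally */
            _xabort(0xff);       /* lock held: abort and fall back */
        }
        /* Fallback path: contention or an aborted transaction. */
        while (__sync_lock_test_and_set(lock, 1))
            while (*lock)
                ;                /* spin until the lock looks free */
    }

    static void elided_unlock(volatile int *lock)
    {
        if (_xtest())
            _xend();                     /* commit the transaction */
        else
            __sync_lock_release(lock);   /* really held: release it */
    }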
With regard to which locks should be elided, Andi said that he prefers to just enable it for everything. It can be hard to predict which locks will elide well when the kernel runs. Ben Herrenschmidt complained that in some cases prediction is easy: overly large critical sections will always abort, forcing a fallback to regular locking. Memory-mapped I/O operations will also kill things.
Will Deacon asked whether the lock-elision code took any steps to ensure fairness among lock users. Andi replied that there is no need; lock elision only happens if there is no contention (and, thus, no fairness issue) for the lock. Otherwise things fall back to the regular locking code, which can implement fairness in the usual ways.
Linus said that, sometimes, lock elision can be slower than just taking the lock, but Andi disagreed. The only time when elision would be slower is if the transaction aborts and, in that case, there's contention and somebody would have blocked anyway. Linus pointed out that Intel still has not posted any performance numbers for lock elision within the kernel; he assumes that means that the numbers are bad. Andi did not address the lack of numbers directly, but he did say that elision allows developers to go back to coarser, faster locking.
He concluded by suggesting that, rather than add a bunch of hairy scalability code to the kernel, it might be better to wait a year and just use lock elision.
SRCU
The final talk of the scalability session was given by Lai Jiangshan, who
discussed the "sleepable" variant of the RCU mechanism. Normally, RCU
critical sections run in atomic context and cannot sleep, but there are
cases where a reader needs to block while holding an RCU read lock. There
are also, evidently, situations where a separate RCU domain is useful, or
when code is running on an offline CPU that does not take part in the grace
period mechanism.
SRCU was introduced in 2006; Paul McKenney documented it on LWN at that time. It turned out to be too slow, however, requiring a lot of expensive calls and a per-CPU counter wait for every critical section. So SRCU was reworked in 2012 by Paul and Lai. Updates can happen much more quickly now, with no synchronization calls required; it also has a new call_srcu() primitive.
There are about sixty users of SRCU in the 3.11 kernel, the biggest of which is the KVM hypervisor code. Lai provided an overview of the SRCU API, but it went quickly and it's doubtful that many in the audience picked up much of it. Consulting the code and the documentation in the kernel tree would be the best way to start working with the SRCU mechanism.
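For reference, the basic shape of the API is straightforward; what follows is a hedged sketch (the config structure and function names are invented) of a reader and an updater sharing a private SRCU domain:

    #include <linux/srcu.h>
    #include <linux/slab.h>

    struct config {
        int value;
    };

    DEFINE_SRCU(config_srcu);                /* a separate SRCU domain */
    static struct config __rcu *cur_config;

    /* Reader: unlike a plain RCU critical section, this one may sleep. */
    int config_value(void)
    {
        int idx, val;

        idx = srcu_read_lock(&config_srcu);
        val = srcu_dereference(cur_config, &config_srcu)->value;
        srcu_read_unlock(&config_srcu, idx);
        return val;
    }

    /* Updater: publish a replacement, wait out the readers, then free
     * the old object.  call_srcu() could be used instead to avoid
     * blocking in this function. */
    void config_replace(struct config *new)
    {
        struct config *old = rcu_dereference_protected(cur_config, 1);

        rcu_assign_pointer(cur_config, new);
        synchronize_srcu(&config_srcu);
        kfree(old);
    }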
Device tree bindings
The device tree issue was omnipresent during the 2013 Kernel Summit, with dedicated minisummit sessions, hallway discussions, and an interminable mailing list thread all in the mix. Despite all the noise, though, some progress was seemingly made on the issue of how to evolve device tree bindings without breaking systems that depend on them. A group of developers presented those results to the plenary session.
Grant Likely and David Woodhouse started by reiterating the problems that
led to the adoption
of device trees for the ARM architecture in the first place. It comes down
to undiscoverable hardware — hardware that does not describe itself to the
CPU, and which thus cannot be enumerated automatically. This hardware is
not just a problem with embedded ARM systems, Grant said; it is showing up
in desktop and server systems too. In many situations, we are seeing the
need for a general hardware description mechanism. The problem is coming
up with the best way of doing this description while supporting systems
that were shipped with intermediate device tree versions.
The solution comes down to a set of process changes, starting with a statement that device tree bindings are, indeed, considered to be stable by default. Once a binding has been included in a kernel release, developers should not break systems that are using that binding. That said, developers should not get hung up on creating perfect bindings now; we still do not know all of the common patterns and will need to make changes as we learn things. That means that bindings can, in fact, change after they have been released in a kernel; the key is to make those changes in the correct way.
Another decision that has been made is that configuration data is allowed within device tree bindings. This has been a controversial area; many developers feel that device trees should describe the hardware and nothing else. Grant made the claim that much configuration data should be considered part of the hardware design; there may be a region of memory intended for use as graphics buffers, for example.
There will be a staging-like mechanism for unstable bindings, but it is expected that this mechanism will almost never be used. The device tree developers will be producing a document describing the recommended best practices and processes around device trees; there will also be a set of validation tools. Much of this work, it is hoped, will be completed within the next year.
The current rule that device tree bindings must be documented will be reinforced. The documentation lives in Documentation/devicetree/bindings in the kernel tree. The device tree maintainers would prefer to see these documents posted as a separate patch within a series so they can find it quickly. Bindings should get an acknowledgment from the device tree maintainers, but there is already too much review work to be done in this area. So, if the device tree maintainers are slow in getting to a patch, subsystem maintainers are considered empowered to merge bindings without an ack. These changes should go through the usual subsystem tree.
The compatibility rules say that new kernels must work with older device trees. If changes are required, they should be put into new properties; the old ones can then be deprecated but not removed. New properties should be optional, so that device trees lacking those properties continue to work. The device tree developers will provide a set of guidelines for the creation of future-proof bindings.
If it becomes absolutely necessary to introduce an incompatible change, Grant said, the first step is that the developer must submit to the mandatory public flogging. After that, if need be, developers should come up with a new "compatible" string and start over, while, of course, still binding against the older string if that is all that is available. DTS files (which hold a complete device tree for a specific system) should contain either the new or the old compatible string, but never both.
If all else fails, it is still permissible to add quirks in the code for specific hardware. If this is done with care, it should not reproduce the old board file problem; such quirks should be relatively rare.
Ben Herrenschmidt worried about the unstable binding mechanism; it is
inevitable, he thought, that manufacturers would ship hardware using
unstable bindings. David replied that bad manufacturer behavior
is not limited to bindings; they ship a lot of strange code as well. But,
he said, manufacturers have learned over time that things go a lot easier
if they work with upstream-supported code. He didn't think that the
unstable binding mechanism would ever be used; it is a "political
compromise" that should never need to be employed. Arnd Bergmann added
that, should this ever happen, it will not be the end of the world; the
kernel community just has to make the consequences of shipping unstable
bindings clear. In such cases, users will just have to update the device
tree in their hardware before they can install a newer kernel.
What about the reviewer bandwidth problem? The main change in this area, it seems, is that the device tree reviewers will only look at the binding documentation; they will not look at the driver code itself. That is part of why they want the documentation in a separate patch. That means that subsystem maintainers will have to be a bit more involved in ensuring that the code matches the documentation — though there will be some tools that will help in that area as well.
Checkpoint/restart in user space
There has long been a desire for the ability to checkpoint a set of processes (save their state to disk) and restore those processes at some future time, possibly on a different system. For almost as long, Linux has lacked that feature, but those days are coming to an end. Pavel Emelyanov led a session during the 2013 Kernel Summit's plenary day to update the audience on the status of this functionality.
Pavel started with the history of this feature. Early attempts to add checkpoint/restart went
with an entirely in-kernel approach. The resulting patch set was large and
invasive; it looked like a maintenance burden and never got much acceptance
from the broader development community. Eventually, some developers
realized that the APIs provided by the kernel were nearly sufficient to
allow the creation of a checkpoint/restore mechanism that ran almost
entirely in user space. All that was needed was a few additions here and
there; as of the 3.11 kernel, all of those additions have been merged and
user-space checkpoint/restart works. Live migration is supported as well.
Pavel had some requests for developers designing kernel interfaces in the future. Whenever new resources are added to a process, he asked, please provide a call to query the current state. A classic example is timers; developers added interfaces to create and arm timers, but nothing to query them, so the checkpoint/restart developers had to fill that in. He also requested that any user-visible identifiers exposed by the kernel not be global; instead, they should be per-process identifiers like file descriptors. If identifiers must be global — he gave process IDs as an example — it will be necessary to create a namespace around them so that the same identifiers can be restored with a checkpointed process.
Now that the basic functionality works, some interesting new features are being worked on. One of these checkpoints all processes in the system, but keeps the contents of their memory in place. It then boots into a new kernel with kexec and restores the processes quickly, using the saved memory whenever possible. This, Pavel said, is the path toward a seamless kernel upgrade.
Andrew Morton expressed his amazement that all of this functionality works, especially given that the checkpoint/restore developers added very little in the way of new kernel code. Is there, he asked, anything that doesn't work? Pavel responded that they have tried a lot of stuff, including web servers, command-line utilities, huge high-performance computing applications, and more. Almost everything will checkpoint and restore just fine.
Andrew then refined his question: could you write an application that is not checkpointable? The answer is "yes"; the usual problem is the use of external resources that cannot be checkpointed. For example, Unix-domain sockets where one end is held by a process that is not being checkpointed will block things; syslog can apparently be a problem in this regard. Work is being done to solve this problem for a set of known services; the systemd folks want it, Pavel added. Unknown devices are another problematic resource; there is a library hook that can be used to add support for specific devices if their state can be obtained and restored.
Beyond that, though, this long-sought functionality seems to work at last.
A kernel.org update
An update on the status of the kernel.org infrastructure is a traditional Kernel Summit feature; the 2013 event upheld that tradition, but with a new speaker. Kernel.org admin Konstantin Ryabitsev started out by saying that he knows nothing about the 2011 security incident; he has deliberately avoided reading the forensics reports that are (apparently) available. His choice is to focus on making kernel.org be as good as it can be now without being driven by past problems.
He gave a tour of the site's architecture, which your editor will not
attempt to reproduce here. In general terms, there is an extensive backend
system with a set of machines providing specific services and a large
storage array; it is protected by a pair of firewall systems. The front
end consists of a pair of servers, each of which runs two virtual machines;
one of them handles git and dynamic content, while the other serves static
content.
The front end systems are currently located in Palo Alto, CA and Portland, OR. One will be added in Seoul sometime around the middle of 2014, and another one in Beijing, which will only serve git trees, "soon." Work is also proceeding on the installation of a front end system in Montreal.
There is an elaborate process for accepting updates from developers and propagating them through the system. This mechanism has been sped up considerably in recent times; code pushed into kernel.org can be generally available in less than a minute. The developers in the session expressed their appreciation of this particular change.
Konstantin was asked about the nearly devastating git repository corruption problem experienced by the KDE project; what was kernel.org doing to avoid a similar issue? It comes down to using the storage array to take frequent snapshots and to keep them for a long period of time. In the end, the git repository is smaller than one might think (about 30GB), so keeping a lot of backups is a reasonable thing to do. There are also frequent git-fsck runs and other tests done to ensure that the repositories are in good shape.
With regard to account management, everybody who wants an account must appear in the kernel's web of trust. That means having a key signed by Linus, Ted Ts'o, or Peter Anvin, or by somebody who has such a key. Anybody who has an entry in the kernel MAINTAINERS file will automatically be approved for an account; anybody else must be explicitly approved by one of a small set of developers.
With regard to security, two-factor authentication is required for
administrative access everywhere. All systems are running SELinux in
enforcing mode — an idea which caused some in the audience to shudder.
System logs are stored on write-once media.
There is also an extensive
alert system that calls out unusual activity; that leads to kernel.org
users getting an occasional email asking about their recent activity on the
system.
Plans for the next year include faster replication through the mirror network and an updated Bugzilla instance. Further out, there are plans for offsite backups, a git mirror in Europe, a new third-party security review, and the phasing out of the bzip2 compression format.
Security practices
Kees Cook has been actively trying to improve the security of the Linux kernel for some time. His talk during the plenary day at the 2013 Kernel Summit was split into two parts. The first, on security antipatterns, was the same as the talk he gave at the Linux Security Summit in September; LWN covered that talk at the time, so there is no need to repeat that material here. The second half, instead, was a new talk on what a developer should do in response to a security-relevant bug. This talk, he said, was predicated on the assumption that kernel developers had made an ethical choice in favor of fixing flaws; otherwise their response may differ.
So what are the goals when dealing with a security fix? The wish, of
course, is to get the fix out to end users as quickly as possible. If time
is available, identifying the severity of the issue can be helpful, but
that process is also error-prone. If the bug turns out to be serious, it
is worthwhile to try to minimize the time that the public is exposed once
the bug has been disclosed.
If a developer is unsure about the impact of a given bug, the best thing to do is to simply ask. Help is available in two places: the security@kernel.org list (which consists of a small number of kernel developers) and linux-distros@vs.openwall.org, which is made up of representatives from distributors. Mail to the latter list must include the string "[vs]" in the subject line to get past the spam filters. Both lists are private. Members of those lists will attempt to handle serious bugs in a coordinated manner. For less serious issues, the best approach is usually to just take the problems directly to the relevant public list.
When possible, security-related fixes should be tagged for the stable tree; a "Fixes:" tag to identify the commit that introduced the problem is also helpful. If possible, the CVE number assigned to the bug should go into the commit changelog; numbers can be assigned by a number of vendors, or from the oss-security mailing list.
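As a hedged illustration (the SHA, subject, and version annotation here are invented), the relevant tags in such a changelog look something like the following, with the CVE number mentioned in the changelog text itself:

    Fixes: 1234567890ab ("subsys: commit that introduced the bug")
    Cc: stable@vger.kernel.org   # applies to 3.10 and later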
It's worth noting that patience for embargoes is limited in the kernel community. Any problem sent to security@kernel.org can be kept under embargo for a maximum of five days; the limit on linux-distros is ten days. The whole point of the process is to get fixes out to users quickly; developers are sick of long delays in that regard.
For distributors and manufacturers who are concerned about getting security fixes, Kees had a simple piece of advice: don't bother with tracking CVE numbers. Instead, just pull the entire stable tree and ship everything in it. A lot of security problems will never have CVE numbers assigned to them; if you only take patches with CVEs, you'll miss a lot of important fixes.
At the end, Dave Jones jumped in to say that he would very much like to know about security bugs that the Trinity tool did not catch; that will help to refine the tests to catch similar problems in the future. Dan Carpenter expressed a similar wish with regard to the smatch utility. It will probably never be possible to find all security bugs automatically, but any progress in that direction seems like a good thing.
Plenary day lightning talks
The 2013 Kernel Summit closed out with a set of lightning talks covering a wide range of topics including random numbers, configuration options, and ARM board testing.
Ted Ts'o started things off with a discussion of the kernel's random number generators. There has been, he noted dryly, a significant increase in interest in the quality of the kernel's random numbers recently. His biggest concern in this area remains embedded devices that create cryptographic keys shortly after they boot; there may be little or no useful entropy in the system at that point. Some fixes have been added recently, including adding more entropy sources and mixing in system-specific information like MAC addresses, but that still may not be enough entropy to do a whole lot of good. That can be especially true on systems where the in-kernel get_cycles() function returns no useful information, a problem which was covered here in September.
MIPS was one of the architectures that had just that problem. Since MIPS chips are used in devices like home routers, this is a real concern. In that case, the developers were able to find a fine-grained counter that, while useless for timekeeping, can be used to add a bit of entropy to the random pool. A new interface has been added to allow architecture code to provide access to such counters. But the best solution, he said, was for vendors to put hardware random number generators on their chips.
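That interface is a simple override hook; a hedged sketch of the pattern (the generic fallback is, roughly, a wrapper around get_cycles(), while the architecture override shown here uses a purely hypothetical counter-reading helper) looks like:

    /* Generic fallback (roughly what include/linux/timex.h provides):
     * use the cycle counter unless the architecture supplies something
     * better. */
    #ifndef random_get_entropy
    #define random_get_entropy()    get_cycles()
    #endif

    /* Illustrative architecture override in <asm/timex.h> (not the
     * actual MIPS code): any fine-grained counter will do, even one
     * that is useless for timekeeping. */
    static inline unsigned long random_get_entropy(void)
    {
        return read_some_fast_counter();   /* hypothetical helper */
    }
    #define random_get_entropy random_get_entropy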
Josh Triplett presented a proposal to get rid of the various "defconfig" files found in the kernel tree. These files are supposed to contain a complete, bootable configuration for a given architecture. He would like to move that information into the Kconfig files that define the configuration options themselves. There would be a new syntax saying whether a given option should be enabled by default whenever the default system configuration was requested by the user.
Linus didn't like the idea, though, saying that it would clutter the Kconfig files and still not solve the problem. He also noted that most defconfig files are nearly useless; the x86 one, he said, is essentially a random configuration used by Andi Kleen several years ago. A lot of the relevant configuration settings are architecture-dependent, so it would be necessary to add architecture-specific selectors and such.
The plan at this point is to move further discussion to the mailing list, but, without some changes, this idea probably will not get too far.
Peter Senna talked briefly about the Coccinelle semantic analysis tool which, he said, is finding a few bugs in each kernel development cycle. He would like to add more test cases to the system; interested developers are directed toward coccinellery.org for examples of how to use this tool. (One could also see this LWN article for an introduction to Coccinelle). Dan Carpenter talked briefly about his smatch tool, which is also improving over time. His biggest goal at this point is to provide more user-friendly output; the warnings that come out of smatch now can be rather difficult to interpret.
The final talk was presented by Paul Walmsley; it covered automatic testing of ARM kernels. He is running a testing lab that builds and boots a number of trees, generating reports when things go wrong. Olof Johansson runs an elaborate testing setup; among other things, it performs fuzz testing with Trinity. There is also a 20-board testing array run by Kevin Hilman; he is doing power consumption tests as well.
These testing rigs, Paul said, are catching a lot of bugs, often before the relevant patches get very far. It takes a lot of work to keep them all going, though. Part of that problem may be related to the fact that the bisection of problems must currently be done manually; work is being done to automate that process.
After that there was a brief discussion of the Kernel Summit itself; some developers complained about communications, saying that they didn't always know about everything that was going on. There was also some discussion of the Linux Foundation Technical Advisory Board election held the night before, which was a somewhat noisy and chaotic affair. Thereafter, the group picture was taken and the developers headed out in search of dinner and beer.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet