Kernel development

Brief items

Kernel release status

The current development kernel is 3.5-rc3, released on June 16. "The week started calm with just a few small pulls, with people apparently really trying to make my life easier during travels - thank you. But it kind of devolved at some point, and I think more than half the pull requests came in the last two days and they were bigger too. Oh well.." It's mostly fixes, but there is also a new network driver for Tile-Gx systems.

Stable updates: the 3.0.35 and 3.4.3 stable kernel updates were released on June 17; 3.2.21 was released on June 19. These updates all contain the usual set of important fixes.

The 3.0.36 and 3.4.4 updates are in the review process as of this writing; they can be expected on or after June 22.

Quotes of the week

Sorry, it can't always be constructive, but I'll try my best. I'll also try to not cast aversions about your cat, but if you taunt me, all bets are off.
Greg Kroah-Hartman

Hooks and notifiers are a form of "COME FROM" programming, and they make it very hard to reason about the code. The only way that that can be reasonably mitigated is by having the exact semantics of a hook or notifier -- the preconditions, postconditions, and other invariants -- carefully documented. Experience has shown that in practice that happens somewhere between rarely and never.
H. Peter Anvin

2012 Kernel Summit: Call for Participation

The planning process for the 2012 Kernel Summit (August 27-29, San Diego) has begun. "This year, in order to make the selection process more transparent, we're trying a new mechanism where we'll be selecting this year's attendees from amongst those who submit proposals to attend as described below." There is no formal deadline for proposals, but sooner is better.

Brown: A Nasty md/raid bug

Neil Brown has written a blog post about a nasty RAID bug in some versions of the Linux kernel. "The bug only fires when you shutdown/poweroff/reboot the machine. While the machine remains up the bug is completely inactive. So you will only notice the bug when you boot up again. The effect of the bug is to erase important information from the metadata that is stored on the disk drives. In particular the level, chunksize, number of devices, data_offset and role of each device in the array are erased ... and probably some other information too. This means that if you know those details you can recover your data, but if you don't, it will be harder. Hence the "mdadm -E" command suggested earlier."

Kernel development news

Supporting block I/O contexts

By Jonathan Corbet
June 18, 2012
Memory storage devices, including flash, are essentially just random-access devices with some peculiar restrictions. Given direct access to the device, Linux kernel developers could certainly come up with drivers that would provide optimal performance and device lifetime. In the real world, though, these devices are hidden behind their own proprietary operating systems and software stacks; much of the real (commercial) value seems to be in the software bundled inside. As a result, the kernel must try to coax the device's firmware into doing an optimal job. Over time, the storage industry has added various mechanisms by which an operating system can pass hints down to the device; the "trim" or "discard" mechanism is one of those. Newer eMMC and Universal Flash Storage (UFS) devices add a new hint in the form of "contexts"; patches exist to support this feature, but they seem to have raised more questions than they have answered.

The standards documents describing contexts do not appear to be widely available, or at least findable. From what your editor has been able to divine, "contexts" are small numbers added to I/O requests that are intended to help the device optimize the execution of those requests. They are meant to distinguish between types of I/O, keeping large, sequential operations separate from small, random requests. I/O can be placed into a "large unit" context, where the operating system promises to send large requests and, possibly, not attempt to read the data back until the context has been closed.

Saugata Das recently posted a small patch set adding context support to the ext4 filesystem and the MMC block driver. At the lower level, context numbers are associated with block I/O requests by storing the number in the newly-added bi_context (in struct bio) and context (in struct request) fields. The virtual filesystem layer takes responsibility for setting those fields, but, in the end, it defers to the actual filesystems to come up with the proper context numbers. There is a new address space operation (called get_context()) by which the VFS can call into the filesystem code to obtain a context number for a specific request. The block layer has been modified to avoid merging block I/O requests if those requests have been assigned to different contexts.
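The merging rule described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the kernel's code: the Request class, its field names, and merge_requests() are hypothetical stand-ins for struct request and the block layer's merge logic.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for a block I/O request."""
    sector: int      # first sector covered by the request
    nr_sectors: int  # number of sectors covered
    context: int     # context number attached by the filesystem


def can_merge(first: Request, second: Request) -> bool:
    """Allow a merge only for contiguous requests in the same context."""
    contiguous = first.sector + first.nr_sectors == second.sector
    return contiguous and first.context == second.context


def merge_requests(requests):
    """Coalesce adjacent requests, never across a context boundary."""
    merged = []
    for req in sorted(requests, key=lambda r: r.sector):
        if merged and can_merge(merged[-1], req):
            last = merged[-1]
            merged[-1] = Request(last.sector,
                                 last.nr_sectors + req.nr_sectors,
                                 last.context)
        else:
            merged.append(req)
    return merged
```

In this model, two contiguous requests in context 1 collapse into one, while a contiguous request in context 2 stays separate even though it would otherwise be mergeable.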

There was little discussion of the lower-level changes, which apparently make sense to the developers who have examined them. The filesystem-level changes have seen rather more discussion, though. Saugata's patch set only touches the ext4 filesystem; those changes cause ext4 to use the inode number of the file under I/O as the context number. Thus, all I/O requests to a single file will be assigned to the same context, while requests to different files would go into different contexts (within limits: eMMC hardware, for example, only supports 15 contexts, so many inode numbers will be mapped onto a single context number at the lower levels). The question that came up was: is using the inode number the right policy? Coming up with an answer involves addressing two independent questions: (1) what does the "context" mechanism actually do, and (2) how can Linux filesystems provide the best possible context information to the storage devices?
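To see why many files end up sharing a context, consider a minimal sketch of how an inode number might be folded into the 15 available context numbers. The modulo mapping and the reservation of 0 for "no context" are assumptions made for illustration; the patch set may well do this differently.

```python
NUM_CONTEXTS = 15  # eMMC limit mentioned in the article


def context_for_inode(inode_number: int) -> int:
    """Fold an inode number into the range 1..15.

    Assumption: 0 is kept free to mean "no context"; the modulo
    mapping itself is illustrative, not taken from the patches.
    """
    return (inode_number % NUM_CONTEXTS) + 1


# Inodes 1 and 16 collide on the same context number.
colliding = context_for_inode(1) == context_for_inode(16)
```

With any mapping of this kind, unrelated files will regularly share a context, which is one reason the choice of policy matters.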

Arnd Bergmann (who has spent a lot of time understanding the details of how flash storage works) has noted that the standard is deliberately vague on what the context mechanism does; the authors wanted to create something that would outlive any specific technology. He went on to say:

That said, I think it is rather clear what the authors of the spec had in mind, and there is only one reasonable implementation given current flash technology: You get something like a log structured file system with 15 contexts, where each context writes to exactly one erase block at a given time.

The effect of such an implementation would be to concentrate data written under any one context into the same erase block(s). Given that, there are at least a couple of ways to use contexts to optimize I/O performance.

For example, one could try to concentrate data with the same expected lifetime, so that, when part of an erase block is deleted, all of the data in that erase block will be deleted. Using the inode number as the context number could have that effect; deleting the file associated with that inode will delete all of its blocks at the same time. So, as long as the file is not subject to random writes (as, say, a database file might be), using contexts in this manner should reduce the amount of garbage collection and read-modify-write cycles needed when a file is deleted.
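A toy simulation can make the effect concrete. It assumes the implementation Arnd describes (one open erase block per context) and a hypothetical erase block holding four pages; none of the names below come from real firmware or from the kernel.

```python
from collections import defaultdict

PAGES_PER_ERASE_BLOCK = 4  # hypothetical, chosen to keep the example small


def place_writes(writes, use_contexts):
    """Place one-page writes into erase blocks.

    writes is a sequence of file ids in arrival order. With
    use_contexts, each file gets its own open erase block;
    without, all writes share a single open block.
    """
    open_blocks = defaultdict(list)
    blocks = []
    for file_id in writes:
        key = file_id if use_contexts else 0
        block = open_blocks[key]
        block.append(file_id)
        if len(block) == PAGES_PER_ERASE_BLOCK:
            blocks.append(block)    # block is full; seal it
            open_blocks[key] = []   # and open a fresh one
    blocks.extend(b for b in open_blocks.values() if b)
    return blocks


def blocks_needing_gc(blocks, deleted_file):
    """Count blocks holding both deleted and still-live data."""
    return sum(1 for b in blocks
               if deleted_file in b and any(f != deleted_file for f in b))


interleaved = ["A", "B"] * 4  # two files written at the same time
mixed = place_writes(interleaved, use_contexts=False)
separated = place_writes(interleaved, use_contexts=True)
```

In this model, deleting file A leaves two half-dead erase blocks to garbage-collect when no contexts are used, and none when each file has its own context.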

Another helpful approach might be to use contexts to separate large, long-lived files from those that are shorter and more ephemeral. The larger files would be well-placed on the medium, and the more volatile data would be concentrated into a smaller number of erase blocks. In this case, using the inode number to identify contexts may or may not work well. Large files would be nicely separated, but the smaller files could be separated from each other as well, which may not be desirable: if several small files would fit into a single erase block, performance might be improved if all of those files were written in the same context. In this case, some other policy might be more advisable.

But what should that policy be? Arnd suggested that using the inode number of the directory containing the file might work better. Various commenters thought that using the ID of the process writing to the file could work, though there are some potential difficulties when multiple processes write the same file. Ted Ts'o suggested that grouping files written by the same process in a short period of time could give good results. Also useful, he thought, might be to look at the size of the file relative to the device's erase block size; files much smaller than an erase block would be placed into the same context, while larger files would get a context of their own.
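Ted's size-based idea could look something like the following sketch. The erase block size, the threshold for "much smaller," and the way the remaining contexts are shared out are all assumptions chosen for illustration; a real policy would need to learn the erase block size from the device.

```python
ERASE_BLOCK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB erase block
SMALL_FILE_DIVISOR = 4              # assumption: "much smaller" means < 1/4 block
SHARED_SMALL_FILE_CONTEXT = 1       # all small files share this context
NUM_CONTEXTS = 15                   # eMMC limit mentioned in the article


def context_for_file(inode_number: int, file_size: int) -> int:
    """Pick a context from the file size relative to the erase block."""
    if file_size * SMALL_FILE_DIVISOR < ERASE_BLOCK_SIZE:
        return SHARED_SMALL_FILE_CONTEXT
    # Larger files are spread over the remaining contexts, 2..15.
    return 2 + inode_number % (NUM_CONTEXTS - 1)
```

Here a 4KB file lands in the shared context 1, while a 64MB file gets one of the other fourteen contexts, selected by its inode number.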

A related idea, also from Ted, was to look at the expected I/O patterns. If an existing file is opened for write access, chances are good that a random I/O pattern will result. Files opened with O_CREAT, instead, are more likely to be sequential; separating those two types of files into different contexts would likely yield better results. Some flags used with posix_fadvise() could also be used in this way. There are undoubtedly other possibilities as well. Choosing a policy will have to be done with care; poor use of contexts could just as easily reduce performance and longevity instead of increasing them.
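The open-flags heuristic is simple enough to sketch directly. The context numbers and the function name are hypothetical; only the O_CREAT distinction comes from the discussion.

```python
import os

NO_CONTEXT = 0             # hypothetical: no hint for read-only opens
RANDOM_IO_CONTEXT = 1      # hypothetical: existing files opened for writing
SEQUENTIAL_IO_CONTEXT = 2  # hypothetical: files opened with O_CREAT


def context_for_open(flags: int) -> int:
    """Guess the likely I/O pattern from the flags passed to open()."""
    if not flags & (os.O_WRONLY | os.O_RDWR):
        return NO_CONTEXT
    if flags & os.O_CREAT:
        return SEQUENTIAL_IO_CONTEXT
    return RANDOM_IO_CONTEXT
```

A heuristic like this will guess wrong at times (a database may well pass O_CREAT on a file it then updates randomly), which is part of why the choice of policy needs care.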

Figuring all of this out will certainly take some time, especially since devices with actual support for this feature are still relatively rare. Interestingly, according to Arnd, there may be an opportunity in getting ext4 to supply context information early:

Having code in ext4 that uses the contexts will at least make it more likely that the firmware optimizations are based on ext4 measurements rather than some other file system or operating system. From talking with the emmc device vendors, I can tell you that ext4 is very high on the list of file systems to optimize for, because they all target Android products.

Ext4 is, of course, the filesystem of choice for current Android systems. So, conceivably, an ext4 implementation could drive hardware behavior in the same way that much desktop hardware is currently designed around what Windows does.

Given that the patches are relatively small and that policies can be changed in the future without user-space compatibility issues, chances are good that something will be merged into the mainline as soon as the 3.6 development cycle. Then it will just be a matter of seeing what the hardware manufacturers actually do and adjusting accordingly. With luck, the eventual result will be longer-lasting, better-performing memory storage devices.

msync() and subtle behavioral tweaks

By Jonathan Corbet
June 19, 2012
Some kernel behavior is determined by standards like POSIX; other behavior is simply a function of what the kernel developers implemented. The latter type of behavior can, in theory, be changed if there is a good reason to do so, but there is always a risk of breaking an application that depended on the previous behavior. Even worse, this kind of problem can be impossible to find during development and, indeed, may lurk until long after the new code has been deployed. A system call patch currently under consideration shows how hard it can be to know when a change is truly safe.

The msync() system call exists to request that a file-backed memory region created with mmap() be written back to persistent storage. Once upon a time, msync() was the only way to guarantee that modified pages would be saved to disk in any reasonable period of time; the kernel could not always detect on its own that they had been changed by the application. That problem has long since been dealt with, but msync() is still a good way to inform the kernel that now would be a good time to flush modified pages to disk.

Paolo Bonzini recently posted a small patch set making a couple of changes to msync(). The actual API does not change at all, but how the system call implements the API changes in subtle and interesting ways.

There are a few options to msync(), one of which (MS_ASYNC) asks that the writeback of modified pages be "scheduled," but not necessarily completed immediately. It is meant to be a non-blocking system call that sets the necessary actions in motion, but does not wait for them to complete. Current kernels will write back dirty pages as part of the normal writeback process; the system behaves, in other words, as if msync(MS_ASYNC) were being called on a regular basis on every mapping. Writeback of dirty pages is already scheduled as soon as the page is dirtied. Given that, there's not much work for an explicit MS_ASYNC call from user space to do, and, indeed, the kernel essentially ignores such calls.

Paolo's patch causes the kernel to immediately start I/O on modified pages in response to MS_ASYNC calls. The result is to get those pages to persistent storage a bit more quickly than would otherwise happen, but still avoid blocking the calling process. The change seems reasonable, but Andrew Morton worried that this behavioral change might be a problem for some users:
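For readers who want to experiment, here is a rough sketch of issuing msync(MS_ASYNC) on a mapping from Python by calling the C library through ctypes. It assumes a Linux-like system where MS_ASYNC has the value 1 and where ctypes.CDLL(None) resolves msync(); the msync_async() helper is hypothetical. Whether the call starts any I/O depends on the kernel, as described above.

```python
import ctypes
import mmap
import os
import tempfile

MS_ASYNC = 1  # assumed value, as found in Linux <sys/mman.h>

_libc = ctypes.CDLL(None, use_errno=True)
_libc.msync.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
_libc.msync.restype = ctypes.c_int


def msync_async(mm, offset, length):
    """Request non-blocking writeback of mm[offset:offset + length].

    offset must be a multiple of the page size, since msync()
    requires a page-aligned address.
    """
    view = (ctypes.c_char * len(mm)).from_buffer(mm)
    try:
        address = ctypes.addressof(view) + offset
        if _libc.msync(address, length, MS_ASYNC) != 0:
            errno_value = ctypes.get_errno()
            raise OSError(errno_value, os.strerror(errno_value))
    finally:
        del view  # drop the buffer export so the mapping can be closed
```

The function returns as soon as the kernel has accepted the request; it makes no claim about when, or whether, the data has reached the disk.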

Means that people will find that their msync(MS_ASYNC) call will newly start IO. This may well be undesirable for some. Also, it hardwires into the kernel behaviour which userspace itself could have initiated, with sync_file_range(). ie: reduced flexibility.

Most users are unlikely to notice the change at all. But it's entirely possible that somebody out there has a precisely-tuned system that will choke if the underlying I/O behavior changes. Users complain about exactly this kind of change at times, but usually when the change shows up in a new enterprise kernel, years too late. That said, many patches make behavioral changes that can affect users in surprising ways. The only thing that is different about this one is that the nature of the change is understood from the beginning. Andrew's concerns were not echoed by others and may not be enough to keep this change from being merged.

The other change is potentially more troubling. msync() takes two parameters indicating the starting address and length of the memory area to be written back (in effect, an offset and length within the mapped file). But the kernel has always ignored those parameters, choosing instead to just write back all modified pages in the file, and the related metadata as well. Paolo's patch changes the implementation to only synchronize the specific pages requested by the user.
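A range-limited sync is easy to request from user space. The sketch below uses Python's mmap module, whose flush(offset, size) method calls msync() on the given range on Unix systems; the write_and_sync_range() helper is hypothetical. How much the kernel actually writes back in response is exactly what the patch would change.

```python
import mmap
import os
import tempfile


def write_and_sync_range(path, offset, data):
    """Modify part of a mapped file and sync only the touched pages."""
    page = mmap.PAGESIZE
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[offset:offset + len(data)] = data
            # flush() wants a page-aligned start; round the range outward.
            start = (offset // page) * page
            end = ((offset + len(data) + page - 1) // page) * page
            end = min(end, len(mm))
            mm.flush(start, end - start)
```

An application that modifies pages elsewhere in the mapping and relies on this call to push them out is depending on the old whole-file behavior, which is the kind of program Andrew's concern is about.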

It would be hard to argue that the new behavior breaks the documented API; the offset and length parameters are there for a reason, after all. Still, once again, Andrew worried that applications could break in especially unpleasant ways:

Would be nice, but if applications were previously assuming that an msync() was syncing the whole file, this patch will secretly and subtly break them.

No developer should have written a program with the assumption that msync() would write pages outside of the range it was given. Any such program would clearly be buggy. But, programs written that way will work under current kernels. Changing msync() to not write some pages that it currently writes could cause such programs to fail in strange and difficult-to-reproduce ways.

In general, the kernel tries not to break existing applications, even if those applications can be said to have been written in a buggy manner. If something works now, it should continue to work with future kernels. If the msync() changes described here break those programs, the changes should probably not be merged into the kernel. The problem, of course, is that it can be very difficult to know if a specific change will break somebody's application. Any problems caused by subtle changes are relatively unlikely to turn up before the changes are included in a released kernel. So it is necessary to proceed with care. That said, it is not practical to hold back every change that might break a badly-written program somewhere; kernel development would likely be slowed considerably by such a constraint. So these changes will probably go in unless an affected user happens to notice a problem in the near future.

Proposals for Kernel Summit discussions

By Jake Edge
June 20, 2012

As preparation for this year's Kernel Summit gets underway, a new "more transparent" process is being used to select the 80-100 participants. The Summit will take place August 27-29, just prior to LinuxCon North America in San Diego. Those interested in attending are being asked to describe the technical expertise they will bring to the meeting, as well as to suggest topics for discussion. All of that is taking place on the ksummit-2012-discuss mailing list since the announcement on June 14, so it seems worth a look to see what kinds of topics may find their way onto the agenda.

Development process issues are a fairly common topic at the summit and they figure in a number of the suggestions for this year. One of the hot topics is the role of maintainers, with multiple, at least partly related, ideas for discussions in that area. Thomas Gleixner noted a few concerns that he had in a mini-rant:

So the main questions I want to raise on Kernel Summit are:

- How do we cope with the need to review the increasing amount of (insane) patches and their potential integration?

- How do we prevent further insanity to known problem spaces (like cpu hotplug) without stopping progress?

A side question, but definitely related is:

- How do we handle "established maintainers" who are mainly interested in their own personal agenda and ignoring justified criticism just because they can?

As one might guess, that kicked off a bit of a conversation about those problems on the list, but also led several developers to concur about the need to discuss the problems at the summit. Somewhat more diplomatically, Trond Myklebust suggested a related discussion on a possible restructuring of the maintainer's role:

Currently, the Linux maintainer appears to be responsible for filling all of the traditional roles of software architect, software developer, patch reviewer, patch committer, and software maintainer.

My question is whether or not there might be some value in splitting out some of these roles, so that we can assign them to different people, and thus help to address the scalability issues that Thomas raised?

Greg Kroah-Hartman also wants to talk about maintainership and offered to "referee" a discussion. He has some ideas that he described at LinuxCon Japan and in a recent linux-kernel posting that he thinks "will go a long ways in helping smooth this out". John Linville also expressed interest in that kind of discussion.

Another area that is generating a lot of interest is the stable tree. Kroah-Hartman is interested in finding out how the process is working for the other kernel developers:

[...] is it going well for everyone? Are there things we can do differently? How can I kick maintainers who don't mark patches for stable backports in ways that do not harm them too much? How can I convey decisions about the longterm kernel selection process in a better way so that it isn't surprising to people?

Based on the number of other submissions that mentioned the stable tree, there seems to be a fair amount to discuss. The relationship between the stable tree and the distributions is one fertile area. Kroah-Hartman said that he often has to go "digging through distro kernel trees" to find patches to apply, to which Andrew Morton responded that the "distro people need a vigorous wedgie" for not making that easier. Various distribution kernel maintainers (e.g. Josh Boyer and Jiri Kosina) agreed that the distributions could do better, but that some discussion of the process would be worthwhile.

In addition, some discussion of how distributions could better work with the upstream kernel for regression tracking and bug reporting was proposed by Boyer. Kosina wants to discuss the stable review process with an eye toward helping distributions decide which patches to merge into their kernels. Mark Brown is also interested but from the perspective of embedded rather than enterprise distributions. Others also expressed interest in having stable/longterm tree discussions.

How to track bugs and regressions was a topic proposed by Rafael Wysocki, who has been reporting to the summit on that topic for many years. He was joined by Dave Jones, who would like to report on bugs and regressions, both those found by his "trinity" stress-testing tool and ones that have been found in the Fedora kernel over the last year. Like Wysocki, Kosina is also interested in discussing whether the kernel bugzilla is the right tool for tracking bugs and regressions.

Kernel testing is another area that seems ripe for a discussion. Fengguang Wu would like to report on his efforts to test kernels as each new commit is added:

And I would like a chance to talk about doing kernel tests in a timely fashion: whenever one pushes new commits to, build/boot/stress tests will be kicked off and possible errors be notified back to the author within hours.

This fast develop-test feedback cycle is enabled by running a test backend that is able to build 25000 kernels and runtime test 3000 kernels (assuming 10m boot+testing time for each kernel) each day. Just capable enough to outrace our patch creation rate ;-)

On an average day, 1-2 build errors are caught in the 160 monitored kernel trees.
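The numbers in that posting imply a fair amount of parallelism. A quick back-of-the-envelope calculation, using only the figures quoted above (the function name is made up for the purpose), looks like this:

```python
import math

MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60


def parallel_slots_needed(kernels_per_day, minutes_per_kernel):
    """Minimum number of kernels that must be under test at once."""
    return math.ceil(kernels_per_day * minutes_per_kernel / MINUTES_PER_DAY)


# 3000 runtime tests per day at 10 minutes each
test_slots = parallel_slots_needed(3000, 10)

# 25000 builds per day, expressed as average seconds between completed builds
seconds_per_build = SECONDS_PER_DAY / 25000
```

That works out to at least 21 kernels being boot-tested at any given moment, and a build finishing roughly every 3.5 seconds on average.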

Wu's posting spawned a long thread where various developers described their test setups and what could be done better. Jones mentioned the Coverity scanner in that thread, which led Jason Wessel to highlight Jones's comment as well as give more information on the tool and the kinds of information it can provide. More and better automated kernel testing is definitely on the minds of a lot of potential summit attendees.

James Bottomley would like to eliminate "kernel work creation schemes"; in particular, he targeted the amount of code that is needed to support CONFIG_HOTPLUG:

[...] the massive proliferation of __dev... _mem... __cpu... and their ilk are getting out of control. Plus, the amount of memory they save is tiny (a few pages at best) and finally virtually no-one compiles without CONFIG_HOTPLUG, so they're mostly nops anyway. However, for that very case, we've evolved a massive set of tools to beat ourselves up whenever we violate the rules of using these tags. What I'd like to explore is firstly, can we just eliminate CONFIG_HOTPLUG and make it always y (this will clear up the problem nicely) or, failing that, can we just dump the tags and the tools and stop causing work for something no-one cares about.

There were few defenders of CONFIG_HOTPLUG=n in the thread, but he was also interested in finding ways to avoid constructs that lead to a lot of code churn to no good end. In a somewhat similar vein, H. Peter Anvin would like to discuss the baseline requirements for the kernel. Supporting some of the niche uses of Linux (on exotic hardware or with seriously outdated toolchains) creates an ongoing cost for kernel hackers that Anvin would like to see reduced or eliminated.

Several PCI topics were proposed, including PCI root bus hotplug issues by Yinghai Lu and a PCI breakout session that Benjamin Herrenschmidt suggested. In the latter, Lu's work, some PCI-related KVM issues, cleaning up some PowerPC special cases, and the rework of the PCI hotplug core could all be discussed. As Herrenschmidt put it: "I think there's enough material to keep us busy and a face to face round table with a white board might end up being just the right thing to do".

Memory management topics also seem popular. Glauber Costa proposed several topics, including kmem tracking and per-memory-control-group kmem memory shrinking, while Hiroyuki Kamezawa suggested memory control group topics. Johannes Weiner is also interested in talking about a separate memory management tree that would supplement the work that Morton does with the -mm tree. The ever-popular memory control group writeback topic was also suggested by Wu and Weiner.

Srivatsa S. Bhat would like to present a newcomer's perspective on kernel development with an eye toward reducing some of the challenges new developers face. Josef Bacik has a similar idea, and would like to discuss how to make it easier for new contributors. In addition to a report on work in the USB subsystem (and USB 3.0 in particular), Sarah Sharp would like to "do a brief readout" about what she learns at AdaCamp in July:

AdaCamp is a conference focused on gathering tech women together to work on solutions for getting women into open technology fields, and retaining them. I think this would be of interest to the Linux kernel community, since we have very few women kernel developers. I hope to keep this read out focused on positive changes we can make.

As one can see, these proposals (and many more that were not mentioned) range all over the kernel map. There tends to be a focus on more process and social aspects of the kernel at the summit, mostly because the hardcore technical topics are generally better handled by a more focused group. The summit tries to address global concerns, and there seem to be plenty to choose from.

Patches and updates

  • Lucas De Marchi: kmod 9. (June 19, 2012)

Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds