Brief items

The latest kernel prepatch was released on June 16. "The week started calm with just a few small pulls, with people apparently really trying to make my life easier during travels - thank you. But it kind of devolved at some point, and I think more than half the pull requests came in the last two days and they were bigger too. Oh well.." It's mostly fixes, but there is also a new network driver for Tile-Gx systems.
Kernel development news
The standards documents describing contexts do not appear to be widely available—or at least findable. From what your editor has been able to divine, "contexts" are a small number added to I/O requests that are intended to help the device optimize the execution of those requests. They are meant to differentiate different types of I/O, keeping large, sequential operations separate from small, random requests. I/O can be placed into a "large unit" context, where the operating system promises to send large requests and, possibly, not attempt to read the data back until the context has been closed.
Saugata Das recently posted a small patch set adding context support to the ext4 filesystem and the MMC block driver. At the lower level, context numbers are associated with block I/O requests by storing the number in the newly-added bi_context (in struct bio) and context (in struct request) fields. The virtual filesystem layer takes responsibility for setting those fields, but, in the end, it defers to the actual filesystems to come up with the proper context numbers. There is a new address space operation (called get_context()) by which the VFS can call into the filesystem code to obtain a context number for a specific request. The block layer has been modified to avoid merging block I/O requests if those requests have been assigned to different contexts.
There was little discussion of the lower-level changes, which apparently make sense to the developers who have examined them. The filesystem-level changes have seen rather more discussion, though. Saugata's patch set only touches the ext4 filesystem; those changes cause ext4 to use the inode number of the file under I/O as the context number. Thus, all I/O requests to a single file will be assigned to the same context, while requests to different files will go into different contexts (within limits: eMMC hardware, for example, only supports 15 contexts, so many inode numbers will be mapped onto a single context number at the lower levels). The question that came up was: is using the inode number the right policy? Coming up with an answer involves addressing two independent questions: (1) what does the "context" mechanism actually do?, and (2) how can Linux filesystems provide the best possible context information to the storage devices?
Arnd Bergmann (who has spent a lot of time understanding the details of how flash storage works) has noted that the standard is deliberately vague on what the context mechanism does; the authors wanted to create something that would outlive any specific technology. One plausible implementation he described would have the device map the data written under any one context into its own set of erase blocks; the effect would be to concentrate each context's data into the same erase block(s). Given that, there are at least a couple of ways to use contexts to optimize I/O performance.
For example, one could try to concentrate data with the same expected lifetime, so that, when part of an erase block is deleted, all of the data in that erase block will be deleted. Using the inode number as the context number could have that effect; deleting the file associated with that inode will delete all of its blocks at the same time. So, as long as the file is not subject to random writes (as, say, a database file might be), using contexts in this manner should reduce the amount of garbage collection and read-modify-write cycles needed when a file is deleted.
Another helpful approach might be to use contexts to separate large, long-lived files from those that are shorter and more ephemeral. The larger files would be well-placed on the medium, and the more volatile data would be concentrated into a smaller number of erase blocks. In this case, using the inode number to identify contexts may or may not work well. Large files would be nicely separated, but the smaller files could be separated from each other as well, which may not be desirable: if several small files would fit into a single erase block, performance might be improved if all of those files were written in the same context. In this case, some other policy might be more advisable.
But what should that policy be? Arnd suggested that using the inode number of the directory containing the file might work better. Various commenters thought that using the ID of the process writing to the file could work, though there are some potential difficulties when multiple processes write the same file. Ted Ts'o suggested that grouping files written by the same process in a short period of time could give good results. Also useful, he thought, might be to look at the size of the file relative to the device's erase block size; files much smaller than an erase block would be placed into the same context, while larger files would get a context of their own.
A related idea, also from Ted, was to look at the expected I/O patterns. If an existing file is opened for write access, chances are good that a random I/O pattern will result. Files opened with O_CREAT, instead, are more likely to be sequential; separating those two types of files into different contexts would likely yield better results. Some flags used with posix_fadvise() could also be used in this way. There are undoubtedly other possibilities as well. Choosing a policy will have to be done with care; poor use of contexts could just as easily reduce performance and longevity instead of increasing them.
Figuring all of this out will certainly take some time, especially since devices with actual support for this feature are still relatively rare. Interestingly, according to Arnd, there may be an opportunity in getting ext4 to supply context information early.
Ext4 is, of course, the filesystem of choice for current Android systems. So, conceivably, an ext4 implementation could drive hardware behavior in the same way that much desktop hardware is currently designed around what Windows does.
Given that the patches are relatively small and that policies can be changed in the future without user-space compatibility issues, chances are good that something will be merged into the mainline as soon as the 3.6 development cycle. Then it will just be a matter of seeing what the hardware manufacturers actually do and adjusting accordingly. With luck, the eventual result will be longer-lasting, better-performing memory storage devices.
The msync() system call exists to request that a file-backed memory region created with mmap() be written back to persistent storage. Once upon a time, msync() was the only way to guarantee that modified pages would be saved to disk in any reasonable period of time; the kernel could not always detect on its own that they had been changed by the application. That problem has long since been dealt with, but msync() is still a good way to inform the kernel that now would be a good time to flush modified pages to disk.
Paolo Bonzini recently posted a small patch set making a couple of changes to msync(). The actual API does not change at all, but how the system call implements the API changes in subtle and interesting ways.
There are a few options to msync(), one of which (MS_ASYNC) asks that the writeback of modified pages be "scheduled," but not necessarily completed immediately. It is meant to be a non-blocking system call that sets the necessary actions in motion, but does not wait for them to complete. Current kernels will write back dirty pages as part of the normal writeback process; the system behaves, in other words, as if msync(MS_ASYNC) were being called on a regular basis on every mapping. Writeback of dirty pages is already scheduled as soon as the page is dirtied. Given that, there's not much work for an explicit MS_ASYNC call from user space to do, and, indeed, the kernel essentially ignores such calls.
Paolo's patch causes the kernel to immediately start I/O on modified pages in response to MS_ASYNC calls. The result is to get those pages to persistent storage a bit more quickly than would otherwise happen, but still avoid blocking the calling process. The change seems reasonable, but Andrew Morton worried that this behavioral change might be a problem for some users.
Most users are unlikely to notice the change at all. But it's entirely possible that somebody out there has a precisely-tuned system that will choke if the underlying I/O behavior changes. Users complain about exactly this kind of change at times, but usually when the change shows up in a new enterprise kernel, years too late. That said, many patches make behavioral changes that can affect users in surprising ways. The only thing that is different about this one is that the nature of the change is understood from the beginning. Andrew's concerns were not echoed by others and may not be enough to keep this change from being merged.
The other change is potentially more troubling. msync() takes two parameters indicating the offset and length of the memory area to be written back. But the kernel has always ignored those parameters, choosing instead to just write back all modified pages in the file, and the related metadata as well. Paolo's patch changes the implementation to only synchronize the specific pages requested by the user.
It would be hard to argue that the new behavior breaks the documented API; the offset and length parameters are there for a reason, after all. Still, once again, Andrew worried that applications could break in especially unpleasant ways.
No developer should have written a program with the assumption that msync() would write pages outside of the range it was given. Any such program would clearly be buggy. But, programs written that way will work under current kernels. Changing msync() to not write some pages that it currently writes could cause such programs to fail in strange and difficult-to-reproduce ways.
In general, the kernel tries not to break existing applications, even if those applications can be said to have been written in a buggy manner. If something works now, it should continue to work with future kernels. If the msync() changes described here break those programs, the changes should probably not be merged into the kernel. The problem, of course, is that it can be very difficult to know if a specific change will break somebody's application. Any problems caused by subtle changes are relatively unlikely to turn up before the changes are included in a released kernel. So it is necessary to proceed with care. That said, it is not practical to hold back every change that might break a badly-written program somewhere; kernel development would likely be slowed considerably by such a constraint. So these changes will probably go in unless an affected user happens to notice a problem in the near future.
As preparation for this year's Kernel Summit gets underway, a new "more transparent" process is being used to select the 80-100 participants. The Summit will take place August 27-29, just prior to LinuxCon North America in San Diego. Those interested in attending are being asked to describe the technical expertise they will bring to the meeting, as well as to suggest topics for discussion. All of that is taking place on the ksummit-2012-discuss mailing list since the announcement on June 14, so it seems worth a look to see what kinds of topics may find their way onto the agenda.
Development process issues are a fairly common topic at the summit and they figure in a number of the suggestions for this year. One of the hot topics is the role of maintainers; several at least partly related ideas for discussion in that area have been put forward. Thomas Gleixner noted a few concerns that he had in a mini-rant:
- How do we cope with the need to review the increasing amount of (insane) patches and their potential integration?
- How do we prevent further insanity to known problem spaces (like cpu hotplug) without stopping progress?
A side question, but definitely related is:
- How do we handle "established maintainers" who are mainly interested in their own personal agenda and ignoring justified criticism just because they can?
As one might guess, that kicked off a bit of a conversation about those problems on the list, but also led several developers to concur about the need to discuss the problems at the summit. Somewhat more diplomatically, Trond Myklebust suggested a related discussion on a possible restructuring of the maintainer's role:
My question is whether or not there might be some value in splitting out some of these roles, so that we can assign them to different people, and thus help to address the scalability issues that Thomas raised?
Greg Kroah-Hartman also wants to talk about maintainership and offered to "referee" a discussion. He has some ideas that he described at LinuxCon Japan and in a recent linux-kernel posting that he thinks "will go a long ways in helping smooth this out". John Linville also expressed interest in that kind of discussion.
Another area that is generating a lot of interest is the stable tree. Kroah-Hartman is interested in finding out how the process is working for the other kernel developers.
Based on the number of other submissions that mentioned the stable tree, there seems to be a fair amount to discuss. The relationship between the stable tree and the distributions is one fertile area. Kroah-Hartman said that he often has to go "digging through distro kernel trees" to find patches to apply, to which Andrew Morton suggested that the "distro people need a vigorous wedgie" for not making that easier. Various distribution kernel maintainers (e.g. Josh Boyer and Jiri Kosina) agreed that the distributions could do better, but that some discussion of the process would be worthwhile.
In addition, some discussion of how distributions could better work with the upstream kernel for regression tracking and bug reporting was proposed by Boyer. Kosina wants to discuss the stable review process with an eye toward helping distributions decide which patches to merge into their kernels. Mark Brown is also interested but from the perspective of embedded rather than enterprise distributions. Others also expressed interest in having stable/longterm tree discussions.
How to track bugs and regressions was a topic proposed by Rafael Wysocki, who has been reporting to the summit on that topic for many years. He was joined by Dave Jones, who would like to report on bugs and regressions, both those found by his "trinity" stress-testing tool and ones that have been found in the Fedora kernel over the last year. Like Wysocki, Kosina is also interested in discussing whether the kernel bugzilla is the right tool for tracking bugs and regressions.
Kernel testing is another area that seems ripe for a discussion. Fengguang Wu would like to report on his efforts to test kernels as each new commit is added:
This fast develop-test feedback cycle is enabled by running a test backend that is able to build 25000 kernels and runtime test 3000 kernels (assuming 10m boot+testing time for each kernel) each day. Just capable enough to outrace our patch creation rate ;-)
On an average day, 1-2 build errors are caught in the 160 monitored kernel trees.
Wu's posting spawned a long thread where various developers described their test setups and what could be done better. Jones mentioned the Coverity scanner in that thread, which led Jason Wessel to highlight Jones's comment as well as give more information on the tool and the kinds of information it can provide. More and better automated kernel testing is definitely on the minds of a lot of potential summit attendees.
James Bottomley would like to eliminate "kernel work creation schemes"; in particular, he targeted the amount of code that is needed to support CONFIG_HOTPLUG.
There were few defenders of CONFIG_HOTPLUG=n in the thread, but he was also interested in finding ways to avoid constructs that lead to a lot of code churn to no good end. In a somewhat similar vein, H. Peter Anvin would like to discuss the baseline requirements for the kernel. Supporting some of the niche uses of Linux (on exotic hardware or with seriously outdated toolchains) creates an ongoing cost for kernel hackers that Anvin would like to see reduced or eliminated.
Several PCI topics were proposed, including PCI root bus hotplug issues by Yinghai Lu and a PCI breakout session that Benjamin Herrenschmidt suggested. In the latter, Lu's work, some PCI-related KVM issues, cleaning up some PowerPC special cases, and the rework of the PCI hotplug core could all be discussed. As Herrenschmidt put it: "I think there's enough material to keep us busy and a face to face round table with a white board might end up being just the right thing to do".
Memory management topics also seem popular. Glauber Costa proposed several topics, including kmem tracking and per-memory-control-group kmem memory shrinking, while Hiroyuki Kamezawa suggested memory control group topics. Johannes Weiner is also interested in talking about a separate memory management tree that would supplement the work that Morton does with the -mm tree. The ever-popular memory control group writeback topic was also suggested by Wu and Weiner.
Srivatsa S. Bhat would like to present a newcomer's perspective on kernel development with an eye toward reducing some of the challenges new developers face. Josef Bacik has a similar idea, and would like to discuss how to make it easier for new contributors. In addition to a report on work in the USB subsystem (and USB 3.0 in particular), Sarah Sharp would like to "do a brief readout" about what she learns at AdaCamp in July.
As one can see, these proposals (and many more that were not mentioned) range all over the kernel map. There tends to be a focus on more process and social aspects of the kernel at the summit, mostly because the hardcore technical topics are generally better handled by a more focused group. The summit tries to address global concerns, and there seem to be plenty to choose from.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet
Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds