The current development kernel is 3.5-rc3, released on June 16. "The week
started calm with just a few small pulls, with people apparently really
trying to make my life easier during travels - thank you. But it kind of
devolved at some point, and I think more than half the pull requests came
in the last two days and they were bigger too. Oh well.."
The release is mostly fixes, but there is also a new network driver for
Tile-Gx systems.
Stable updates: the 3.0.35 and 3.4.3 stable kernel updates were released on
June 17; 3.2.21 was released on
June 19. These updates all contain the usual set of important fixes.
The 3.0.36 and 3.4.4 updates are in the review process as of
this writing; they can be expected on or after June 22.
Comments (none posted)
Sorry, it can't always be constructive, but I'll try my best. I'll
also try to not cast aspersions about your cat, but if you taunt me,
all bets are off.
— Greg Kroah-Hartman
Hooks and notifiers are a form of "COME FROM" programming, and they
make it very hard to reason about the code. The only way that that
can be reasonably mitigated is by having the exact semantics of a
hook or notifier -- the preconditions, postconditions, and other
invariants -- carefully documented. Experience has shown that in
practice that happens somewhere between rarely and never.
— H. Peter Anvin
Comments (22 posted)
The planning process for the 2012 Kernel Summit (August 27-29, San Diego)
has begun. "This year, in order to make the selection process more transparent,
we're trying a new mechanism where we'll be selecting this year's
attendees from amongst those who submit proposals to attend."
There is no formal deadline for proposals, but sooner is better.
Full Story (comments: none)
Neil Brown has written a blog post about a nasty RAID metadata bug
in some versions of the Linux kernel. "The bug only
fires when you shutdown/poweroff/reboot the machine. While the machine
remains up the bug is completely inactive. So you will only notice the bug
when you boot up again. The effect of the bug is to erase important
information from the metadata that is stored on the disk drives. In
particular the level, chunksize, number of devices, data_offset and role
of each device in the array are erased ... and probably some other
information too. This means that if you know those details you can recover
your data, but if you don't, it will be harder. Hence the "mdadm -E"
command suggested earlier."
Comments (23 posted)
Kernel development news
Memory storage devices, including flash, are essentially just random-access
devices with some peculiar restrictions. Given direct access to the
device, Linux kernel developers could certainly come up with drivers that
would provide optimal performance and device lifetime. In the real world,
though, these devices are hidden behind their own proprietary operating systems and
software stacks; much of the real (commercial) value seems to be in the
software bundled inside. As a result, the kernel must try to
coax the device's firmware into doing an optimal job. Over time, the
storage industry has added various mechanisms by which an operating system
can pass hints down to the device; the "trim" or "discard" mechanism is one
of those. Newer eMMC and unified flash storage (UFS) devices add a
new hint in the form of
"contexts"; patches exist to support this feature, but they seem to have
raised more questions than they have answered.
The standards documents describing contexts do not appear to be widely
available—or at least findable. From what your editor has been able to
divine, "contexts" are a small number added to I/O requests that are
intended to help the device optimize the execution of those requests. They
are meant to differentiate different types of I/O, keeping large,
sequential operations separate from small, random requests. I/O can be
placed into a "large unit" context, where the operating system promises to
send large requests and, possibly, not attempt to read the data back until
the context has been closed.
Saugata Das recently posted a small patch
set adding context support to the ext4 filesystem and the MMC block
driver. At the lower level, context numbers are associated with block I/O
requests by storing the number in the newly-added bi_context (in
struct bio) and context (in struct request)
fields. The virtual filesystem layer takes responsibility for setting
those fields, but, in the end, it defers to the actual filesystems to come
up with the proper context numbers. There is a new address space operation
(called get_context()) by which the VFS can call into the
filesystem code to obtain a context number for a specific request. The
block layer has been modified to avoid merging block I/O requests if those
requests have been assigned to different contexts.
There was little discussion of the lower-level changes, which apparently
make sense to the developers who have examined them. The filesystem-level
changes have seen rather more discussion, though. Saugata's patch set only
touches the ext4 filesystem; those changes cause ext4 to use the inode
number of the file
under I/O as the context number. Thus, all I/O requests to a single file
will be assigned to the same context, while requests to different files
will go into different contexts (within limits—eMMC hardware, for
example, only supports 15 contexts, so many inode numbers will be mapped
onto a single context number at the lower levels).
The question that came up was: is using the inode number the right policy?
Coming up with an answer involves addressing two independent questions:
(1) what does the "context" mechanism actually do?, and (2) how
can Linux filesystems provide the best possible context information to the
device?
Arnd Bergmann (who has spent a lot of time
understanding the details of how flash storage works) has noted that the standard is deliberately vague
on what the context mechanism does; the authors wanted to create something
that would outlive any specific technology. He went on to say:
That said, I think it is rather clear what the authors of the spec
had in mind, and there is only one reasonable implementation given
current flash technology: You get something like a log structured
file system with 15 contexts, where each context writes to exactly
one erase block at a given time.
The effect of such an implementation would be to concentrate data written
under any one context into the same erase block(s). Given that, there are at
least a couple of ways to use contexts to optimize I/O performance.
For example, one could try to concentrate data with the same expected
lifetime, so that, when part of an erase block is deleted, all of the data
in that erase block will be deleted. Using the inode number as the context
number could have that effect; deleting the file associated with that inode
will delete all of its blocks at the same time. So, as long as the file is
not subject to random writes (as, say, a database file might be), using
contexts in this manner should reduce the amount of garbage collection and
read-modify-write cycles needed when a file is deleted.
Another helpful approach might be to use contexts to separate large,
long-lived files from those that are shorter and more ephemeral. The
larger files would be well-placed on the medium, and the more volatile data
would be concentrated into a smaller number of erase blocks. In this case,
using the inode number to identify contexts may or may not work well.
Large files would be nicely separated, but the smaller files could be
separated from each other as well, which may not be desirable: if
several small files would fit into a single erase block, performance might
be improved if all of those files were written in the same context.
In this case, some other policy might be more advisable.
But what should that policy be? Arnd suggested that using the inode number
of the directory containing the file might work better. Various commenters
thought that using the ID of the process writing to the file could work,
though there are some potential difficulties when multiple processes write
the same file. Ted Ts'o suggested that
grouping files written by the same process in a short period of time could
give good results. Also useful, he thought, might be to look at the size
of the file relative to the device's erase block size; files much smaller
than an erase block would be placed into the same context, while larger
files would get a context of their own.
A related idea, also from Ted, was to look
at the expected I/O patterns. If an existing file is opened for write
access, chances are good that a random I/O pattern will result. Files
opened with O_CREAT, instead, are more likely to be sequential;
separating those two types of files into different contexts would likely
yield better results. Some flags used with posix_fadvise() could
also be used in this way. There are undoubtedly other possibilities as
well. Choosing a policy will have to be done with care; poor use of
contexts could just as easily reduce performance and longevity instead of
improving them.
Figuring all of this out will certainly take some time, especially since
devices with actual support for this feature are still relatively rare.
Interestingly, according to Arnd, there may
be an opportunity in getting ext4 to supply context information early:
Having code in ext4 that uses the contexts will at least make it
more likely that the firmware optimizations are based on ext4
measurements rather than some other file system or operating
system. From talking with the emmc device vendors, I can tell you
that ext4 is very high on the list of file systems to optimize for,
because they all target Android products.
Ext4 is, of course, the filesystem of choice for current Android systems.
So, conceivably, an ext4 implementation could drive hardware behavior in
the same way that much desktop hardware is currently designed around what
Windows does.
Given that the patches are relatively small and that policies can be
changed in the future without user-space compatibility issues, chances are
good that something will be merged into the mainline as early as the 3.6
development cycle. Then it will just be a matter of seeing what the
hardware manufacturers actually do and adjusting accordingly. With luck,
the eventual result will be longer-lasting, better-performing memory
storage devices.
Comments (7 posted)
Some kernel behaviors are determined by standards like POSIX; others are
simply a function of what the kernel developers implemented. The latter
type of behavior can, in theory, be changed if there is a good reason to do
so, but there is always a risk of breaking an application that depended on
the previous behavior. Even worse, this kind of problem can be impossible
to find during development and, indeed, may lurk until long after the new
code has been deployed. A system call patch currently under consideration
shows how hard it can be to know when a change is truly safe.
The msync() system call exists to request that a file-backed
memory region created with mmap() be written back to persistent
storage. Once upon a time, msync() was the only way to guarantee
that modified pages would be saved to disk in any reasonable period of time;
the kernel could not always detect
on its own that they had been changed by the application. That problem has
long since been dealt with, but msync() is still a good way to
inform the kernel that now would be a good time to flush modified pages to
persistent storage.
Paolo Bonzini recently posted a small patch
set making a couple of changes to msync(). The actual API
does not change at all, but how the system call implements the API changes
in subtle and interesting ways.
There are a few options to msync(), one of which
(MS_ASYNC) asks that the writeback of modified pages be
"scheduled," but not necessarily completed immediately. It is meant to be
a non-blocking system call that sets the necessary actions in motion, but
does not wait for them to complete. Current kernels will write back dirty
pages as part of the normal writeback process; the system behaves, in other
words, as if msync(MS_ASYNC) were being called on a regular basis
on every mapping. Writeback of dirty pages is already scheduled as soon as
the page is dirtied.
Given that, there's not much work for an explicit
MS_ASYNC call from user space to do, and, indeed, the kernel
essentially ignores such calls.
Paolo's patch causes the kernel to immediately start I/O on modified pages
in response to MS_ASYNC calls. The result is to get those pages
to persistent storage a bit more quickly than would otherwise happen, but
still avoid blocking the calling process. The change seems reasonable,
but Andrew Morton worried that this behavioral change might be a
problem for some users:
Means that people will find that their msync(MS_ASYNC) call will
newly start IO. This may well be undesirable for some. Also, it
hardwires into the kernel behaviour which userspace itself could
have initiated, with sync_file_range(). ie: reduced flexibility.
Most users are unlikely to notice the change at all. But it's entirely
possible that somebody out there has a precisely-tuned system that will
choke if the underlying I/O behavior changes. Users complain about exactly
this kind of change at times, but usually when the change shows up in a new
enterprise kernel, years too late.
That said, many patches make behavioral changes that can affect users in
surprising ways. The only thing that is different about this one is that
the nature of the change is understood from the beginning. Andrew's
concerns were not echoed by others and may not be enough to keep this
change from being merged.
The other change is potentially more troubling. msync() takes two
parameters indicating the offset and length of the memory area to be
written back. But the kernel has always ignored those parameters, choosing
instead to just write back all modified pages in the file, and the related
metadata as well. Paolo's patch changes the implementation to only
synchronize the specific pages requested by the user.
It would be hard to argue that the new behavior breaks the documented API;
the offset and length parameters are there for a reason, after all. Still,
once again, Andrew worried that
applications could break in especially unpleasant ways:
Would be nice, but if applications were previously assuming that an
msync() was syncing the whole file, this patch will secretly and
subtly break them.
No developer should have written a program with the assumption that
msync() would write pages outside of the range it was given. Any
such program would clearly be buggy. But, programs written that way will
work under current kernels. Changing msync() to not write some
pages that it currently writes could cause such programs to fail in strange
and difficult-to-reproduce ways.
In general, the kernel tries not to break existing applications, even if
those applications can be said to have been written in a buggy manner. If
something works now, it should continue to work with future kernels. If
the msync() changes described here break those programs, the
changes should probably not be merged into the kernel.
The problem, of course, is that it can be very difficult to know if
a specific change will break somebody's application. Any problems caused
by subtle changes are relatively unlikely to turn
up before the changes are included in a released kernel. So it is
necessary to proceed with care. That said, it is not practical to hold
back every change that might break a badly-written program
somewhere; kernel development would likely be slowed considerably by such a
constraint. So these changes will probably go in unless an
affected user happens to notice a problem in the near future.
Comments (18 posted)
As preparation for this year's Kernel Summit gets underway, a new "more
transparent" process is being used to
select the 80-100 participants. The Summit will take place August 27-29,
just prior to
LinuxCon North America in San Diego. Those interested in attending are
being asked to describe the technical expertise they will bring to the
meeting, as well as to suggest topics for discussion. All of that is
taking place on the ksummit-2012-discuss
mailing list since the announcement on June
14, so it seems worth a look to see what kinds of topics may find their way
onto the agenda.
Development process issues are a fairly common topic at the summit and they
figure in a number of the suggestions for this year. One of the hot topics
is the role of maintainers with multiple, at least partly related, ideas
about discussions in that area. Thomas Gleixner noted
a few concerns that he had in a mini-rant:
So the main questions I want to raise on Kernel Summit are:
- How do we cope with the need to review the increasing amount of
(insane) patches and their potential integration?
- How do we prevent further insanity to known problem spaces (like
cpu hotplug) without stopping progress?
A side question, but definitely related is:
- How do we handle "established maintainers" who are mainly
interested in their own personal agenda and ignoring justified
criticism just because they can?
As one might guess, that kicked off a bit of a conversation about those
problems on the list, but also led several developers to concur about the
need to discuss the problems at the summit. Somewhat more diplomatically, Trond
Myklebust proposed a related discussion on a possible restructuring of the
maintainer's role:
Currently, the Linux maintainer appears to be responsible for filling
all of the traditional roles of software architect, software developer,
patch reviewer, patch committer, and software maintainer.
My question is whether or not there might be some value in splitting out
some of these roles, so that we can assign them to different people, and
thus help to address the scalability issues that Thomas raised?
Greg Kroah-Hartman also wants
to talk about maintainership and offered to "referee" a discussion. He has
some ideas that he described at LinuxCon
Japan and in a recent linux-kernel
posting that he thinks "will go a long ways in helping smooth this
out". John Linville also expressed
interest in that kind of discussion.
Another area that is generating a lot of interest is the stable tree.
Kroah-Hartman is interested
in finding out how the process is working for the other kernel developers:
[...] is it going well for
everyone? Are there things we can do differently? How can I kick
maintainers who don't mark patches for stable backports in ways that
do not harm them too much? How can I convey decisions about the
longterm kernel selection process in a better way so that it isn't
surprising to people?
Based on the number of other submissions that mentioned the stable tree,
there seems to be a fair amount to discuss. The relationship between the
stable tree and the distributions is one fertile area. Kroah-Hartman said
that he often has to go "digging through distro kernel trees" to find
patches to apply, to which Andrew Morton suggested
that the "distro people
need a vigorous wedgie" for not making that easier. Various distribution
kernel maintainers (e.g. Josh Boyer and Jiri Kosina) agreed that the
distributions could do better, but that some discussion of the process
would be worthwhile.
In addition, some discussion of how distributions could better work with
the upstream kernel for regression tracking and bug reporting was proposed
by Boyer. Kosina wants
to discuss the stable review process with an eye toward helping
distributions decide which patches to merge into their kernels.
Mark Brown is also interested
but from the perspective of embedded rather than enterprise distributions.
Others also expressed interest in having stable/longterm tree discussions.
How to track bugs and regressions was a topic proposed
by Rafael Wysocki, who has been reporting to the summit on that topic for
many years. He was joined by Dave Jones, who would like to report
on bugs and regressions, both those found by his "trinity"
stress-testing tool and ones that have been found in the Fedora kernel over
the last year. Like Wysocki, Kosina is also interested in discussing whether
the kernel bugzilla is the right tool for tracking bugs and regressions.
Kernel testing is another area that seems ripe for a discussion. Fengguang
Wu would like to report
on his efforts to test kernels as each new commit is added:
And I would like a chance to talk about doing kernel tests in a timely fashion:
whenever one pushes new commits to git.kernel.org, build/boot/stress tests will
be kicked off and possible errors be notified back to the author within hours.
This fast develop-test feedback cycle is enabled by running a test
backend that is able to build 25000 kernels and runtime test 3000
kernels (assuming 10m boot+testing time for each kernel) each day.
Just capable enough to outrace our patch creation rate ;-)
On an average day, 1-2 build errors are caught in the 160 monitored kernel trees.
Wu's posting spawned a long thread where various developers described their
test setups and what could be done better. Jones mentioned
the Coverity scanner in that thread, which led Jason Wessel to highlight
Jones's comment as well as give more information on the tool and the kinds
of information it can provide. More and better automated kernel testing is
definitely on the minds of a lot of potential summit attendees.
James Bottomley would like
to eliminate "kernel work creation schemes"; in particular, he targeted
the amount of code that is needed to support CONFIG_HOTPLUG:
massive proliferation of __dev... _mem... __cpu... and their ilk are
getting out of control. Plus, the amount of memory they save is tiny (a
few pages at best) and finally virtually no-one compiles without
CONFIG_HOTPLUG, so they're mostly nops anyway. However, for that very
case, we've evolved a massive set of tools to beat ourselves up whenever
we violate the rules of using these tags. What I'd like to explore is
firstly, can we just eliminate CONFIG_HOTPLUG and make it always y (this
will clear up the problem nicely) or, failing that, can we just dump the
tags and the tools and stop causing work for something no-one cares
about?
There were few defenders of CONFIG_HOTPLUG=n in the thread, but he
was also interested in finding ways to avoid constructs that lead to a lot
of code churn to no good end. In a somewhat similar vein, H. Peter Anvin
would like to discuss the baseline
requirements for the kernel. Supporting some of the niche
uses of Linux (on exotic hardware or with seriously outdated toolchains)
creates an ongoing cost for kernel hackers that Anvin would like to see
reduced or eliminated.
Several PCI topics were proposed, including PCI
root bus hotplug issues by Yinghai Lu and a PCI
breakout session that Benjamin Herrenschmidt suggested. In the latter,
Lu's work, some PCI-related KVM issues, cleaning up some PowerPC special
cases, and the rework of the PCI hotplug
core could all be discussed. As Herrenschmidt put it: "I think there's enough material to keep us busy and a face to face round
table with a white board might end up being just the right thing to do".
Memory management topics also seem popular. Glauber Costa proposed
several topics, including kmem tracking and per-memory-control-group kmem
memory shrinking, while Hiroyuki Kamezawa suggested
memory control group topics. Johannes Weiner is also interested in talking
about a separate memory management tree that would supplement the work
that Morton does with the -mm tree. The ever-popular memory control group
writeback topic was also suggested by Wu and Weiner.
Srivatsa S. Bhat would like to present
a newcomer's perspective on kernel development with an eye toward
reducing some of the challenges new developers face. Josef Bacik has a similar
idea, and would like to discuss how to make it easier for new
contributors. In addition to a report on work in the USB subsystem
(and USB 3.0 in particular), Sarah Sharp would like
to "do a brief
readout" about what she learns at AdaCamp in July:
AdaCamp is a conference
focused on gathering tech women together to work on solutions for
getting women into open technology fields, and retaining them. I think
this would be of interest to the Linux kernel community, since we have
very few women kernel developers. I hope to keep this read out focused
on positive changes we can make.
As one can see, these proposals (and many more that were not mentioned)
range all over the kernel map. There tends to be a focus on more process
and social aspects of the kernel at the summit, mostly because the hardcore
technical topics are generally better handled by a more focused group. The
summit tries to address global concerns, and there seem to be plenty
to choose from.
Comments (2 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
- Lucas De Marchi: kmod 9 (June 19, 2012)
Page editor: Jonathan Corbet