The current 2.6 development kernel is 2.6.31-rc3, released by Linus on
July 13. There are lots of fixes, but this prepatch also includes a new
driver for GPIO-based matrix keypads, the "osdblk" driver (which presents
an object on an object storage device as a Linux block device), support for
the loading of kernel module symbols in the performance counter tools, and
the removal of the Intel Langwell USB OTG driver (because it depends on
infrastructure which is not yet present in the mainline). The short-form
changelog is in the announcement; full details can be found in the full changelog.
There have been no stable kernel updates over the last week.
Kernel development news
To many kernel developers, the most scary ideas/patches are not the
ones which are totally outlandish and crazy (such as making the
kernel be able to parse and execute SQL-like statements as
arguments to a modified readdir system call, in the grand but
quixotic goal of unifying filesystems and databases), but the ones
which are just sane enough to be initially appealing, but which
ends up being a millstone making future development and backwards
compatibility efforts extremely difficult. At least, this is the
fear which I suspect drives the often harsh reception some have
received (regardless of whether that reception was justified or
not).
-- Ted Ts'o
This is why changing core MM is so worrisome - there's so much
secret and subtle history to it, and performance dependencies are
unobvious and quite indirect and the lag time to discover
regressions is long.
-- Andrew Morton
Having this name is very convenient, people review my drivers like
-- Linus Walleij
(Thanks to Alejandro
It appears that vger.kernel.org, the busy system which
handles the bulk of the kernel-oriented mailing lists, is about to get a new mailing list manager. The
new system is called "listmanager"; it is meant to be somewhat more
efficient than the existing majordomo installation. The interface should
stay about the same, though.
An early-stage version of the software has been posted for those who would
like to play with it. Matti Aarnio, the author of the code, says:
"Somebody will want to know if the sources will be available. Yes,
but it is only my 3rd day of hacking at it yet..."
Btrfs and RAID. Speaking of early-stage software, David Woodhouse
has posted an initial
implementation of RAID5 and RAID6 support for Btrfs. It's not exactly
functional yet, but a number of the pieces are in place. Some additional
good news is that this work does not involve the addition of another RAID
implementation to the kernel; instead, David has moved the MD implementation into
common code so that it can be used in both places.
VFAT. Andrew Tridgell continues to work on the VFAT patent
workaround patch. He has set up a directory on kernel.org which
contains the latest version of the patch; there is also a README file which
describes the interoperability problems which have been identified so far.
Meanwhile, some developers are pushing for a return to the previous version of the
patch, which simply took away the ability to create long file names.
That patch removes some useful functionality, but it is also pretty well
guaranteed not to cause interoperability problems.
Tridge does not appear to have given up on the long-name-only approach,
though. Stay tuned; he may yet come up with a version which interoperates
well.
Kmemleak. The kmemleak code was merged for
2.6.31; that means that testers are now beginning to post memory leak
reports. At this stage, it seems that kmemleak is still putting out a fair
number of false reports - but it is also turning up real leaks. People who
are interested in playing with kmemleak should (1) run 2.6.31-rc3 or
later, and (2) look at these suggestions
for evaluating leak reports posted by kmemleak author Catalin Marinas.
The 2009 kernel summit is planned for October in Tokyo. Over the years,
your editor has observed that the discussion on what to discuss at the
summit can sometimes be as interesting as the summit itself. Recently, the
question of how user-space programmers can communicate requirements to the
kernel community was raised. The ensuing discussion was short on
definitive answers, but it did begin to clarify a problem in an interesting
way. For the curious, the entire thread can be found in the
ksummit-2009-discuss archives. Matthew Garrett started things this way:
I've just run a session at the desktop summit in Gran Canaria on
the functionality that userspace developers would like to see from
the kernel. Some interesting things came out of it, but one of the
major points was that people seemed generally unclear on how they
could communicate those requirements to kernel developers. Worth discussing?
Dave Jones's response was instructive:
What exactly is the problem? They know where
firstname.lastname@example.org is, so why aren't they talking to us?
To developers who are used to the ways of linux-kernel, and who are well
established in that community, a question like this might make sense. If
one were to poll developers who do not normally hang out in the kernel
community, though, one might get an answer something like this:
- The volume on linux-kernel is far too high for ordinary people to cope with.
- Even if we could keep up with linux-kernel, the volume is still likely
to bury anything we might post there.
- Kernel people speak their own language, making it hard to follow
discussions, much less participate in them.
- If somebody does notice our request, they will probably flame it to a
cinder without necessarily taking the time to understand it first.
- If they don't flame it, they will probably tell us to send a patch,
but we're not kernel developers and thus not in a position to do that.
There can be no doubt that some communications problems can be easily
blamed on the requesting side. If a feature request is phrased as a
demand, it is unlikely to be received well. Kernel developers are beholden
to demands from their employers, but to nobody else; like most other
developers, they take a dim view of people who feel entitled to free work
just because they want it. A classic example here would be the early
Carrier Grade Linux specifications produced by OSDL; they read like a "to
do" list handed to the kernel community, even if OSDL eventually claimed
that it was not intended that way.
Another problem can be poorly-expressed requirements. Consider the early TALPA proposal, which
presented a very clear set of low-level requirements. Unfortunately, they
were too low-level, requiring features like the ability to intercept
file close operations. Instead, TALPA (now fanotify) needed to express
requirements like "we need a clean way to support proprietary
malware-scanning software, and this is why." That disconnect set the
project back significantly, and could well have killed a project whose
developers showed less persistence or less willingness to learn. Clearly
expressing requirements at the right level is never an easy task, but it is a necessary one.
Finally, some ideas just don't make sense in the kernel. Perhaps they
cannot be implemented in a way which avoids security problems, does not
break other features, and does not create long-term maintenance problems.
Or perhaps there are better solutions in user space. A developer who goes
to linux-kernel with this kind of request is likely to go away feeling like
kernel developers are completely unwilling to listen to reason.
All of the above notwithstanding, there is some recognition that user-space
developers have real difficulties in bringing requirements to the kernel
community. Kernel developers tend to be busy, focused on their own
projects, and not always entirely open to requests from outside the
community. There is no mechanism for tracking feature requests, so it is
very easy for them to be buried in the flood of email. The tone of
discussions can be harsh, even though it truly has improved over the
years. And so on.
This is not good.
The kernel exists to provide for the needs of user space; if the kernel
development community is not hearing what those needs are, it can only fail
to satisfy them. So thinking about how to make it easier for user-space
developers to communicate their requirements would seem to be worthwhile;
chances are that space will be made at the summit for that topic.
But there is no need to wait for the summit to start talking about how
things could be improved.
Matthew Wilcox suggested the creation of a
document on how user-space developers can interact with the kernel
community. The idea makes sense (your editor may just try to help there),
but this is not a problem which can be solved by documents alone.
James Bottomley described three broad
categories of users needing changes to the kernel:
- Sophisticated developers who can write their own kernel extensions.
- Users who can get a kernel developer interested in their desired
  feature.
- Users who want features that no developers are interested in.
James points out that categories 1 and 2 can be helped with documentation
and general outreach. He worries, though, that we have no way to help the
third category of users, who are generally left with no way to get the
kernel changed to meet their needs.
Ted Ts'o had a different taxonomy, which he put forward as a way to help
understand the problem:
- Core kernel developers (or those who have access to such people).
Core developers have the advantage of a high degree of trust in the
community; that allows them to get features into the kernel with a
relatively small amount of trouble. They are able to merge code which
might well not pass muster if it came from a different source.
- Competent, but non-core kernel developers (and, again, people who have
access to them). These developers have to work harder to justify
their changes, but they are generally in a position to get changes
merged as long as the work is good.
- Potentially competent developers with "patently bad design taste."
Ted suggests that the frank nature of the kernel review process is
intended mainly to weed out bad patches from this source.
- Users with no access to kernel development expertise, who must thus
try to convince somebody else in the community to implement their
desired feature for them. Ted divided this category into two
subcategories, depending on whether there is an active kernel
developer working in the user's area of interest or not.
Ted's thought is that this taxonomy can help users to understand why
certain patches and ideas are treated the way they are. It can also be
used to help
develop ways to reach out to each specific group of users. Certainly the
different groups need to hear different messages. One could argue that the
existing documentation should be sufficient for people with kernel
development skills, but there is relatively little help out there for those
who must find a developer to do their work for them.
It is that last group which is most likely to be intimidated by the
prospect of walking into linux-kernel and asking for features. The kernel
community could really use a person who would take on the task of working
with these users, helping them to clarify their requirements, connecting
them with the appropriate developers, and tracking requests. The good news
is that we do have such a person; the bad news is that it's Andrew Morton,
who has one or two other things to do as well. The community would benefit
from a person who worked something close to full time on the task of
helping users and user-space developers get what they need from the kernel.
That sort of position tends to be very hard to fund, though; as a result,
it tends to stay vacant.
As was noted at the outset, this conversation did not produce much in the
way of concrete conclusions. It is far from complete, though. If not
before, it will be resumed in October at the full summit. Needless to say,
your editor plans to be there; stay tuned.
Complexity has long been known to be the enemy of security. A code base
which is small and straightforward can be verified, with a reasonable
degree of trust, to do what it is supposed to do and only that. As the
code becomes more complex and harder to understand, that sort of
verification becomes increasingly difficult. For this reason, developers
try to separate code requiring privilege from that which doesn't. Any code
which can be run in a nonprivileged mode is code which is relatively
unlikely to create security problems, and the code which is left can,
hopefully, be reviewed to a level sufficient to give the required degree of
confidence in its security.
Now consider the X window system. It is a massive body of code with a
relatively small development community. Some of the code is truly ancient
and hasn't seen much real developer attention in years. As with most
projects, some of the code is of higher quality than the rest. The X
server performs a complex task - based on a complicated protocol - for
almost every Linux user. Sometimes it is even exposed to the full
Internet. And the X server runs as root, with full privileges. The actual
number of security problems found in X over the years has been relatively small,
but it is not hard to believe that it is more a matter of luck and a lack of
attackers than the inherent quality of the X code base.
Worries about the X server are not new; that is why there has been
discussion of running it in a nonprivileged mode for several years. Much
of the work which has been done on display graphics has had this goal in
mind. All that notwithstanding, Linux distributions still install X as a
setuid program. It just has not been possible to enable a nonprivileged X
server to get its job done without opening up the system as a whole.
Those days are just about done. The Moblin project is now claiming that it will be the first
distribution to ship with a nonprivileged X server. Where Moblin goes,
others will certainly follow.
Some details of how this work
was done were posted to the xorg-devel list by Jesse Barnes at the
beginning of July. According to Jesse, finishing out this multi-year job
was "pretty easy." It seems that the pieces are in place now - at least
for some graphics hardware - to the point that a few hours of work got the
job done. It seems like an almost anti-climactic end to such a long-term
challenge. But, of course, the work which has made this result possible
has been ongoing for a long time.
The biggest piece is the kernel mode setting (KMS) code which was merged
for 2.6.29. Prior to KMS, the X.org server was charged with finding the
hardware and driving it directly from user space. Needless to say, this
sort of access requires root privilege, since it can easily be used to
compromise the system. KMS (and the associated graphical memory management
code) turns graphical hardware into something closer to a normal device
with a normal kernel driver - albeit a rather complex and specialized sort
of "normal" device. The hardware manipulations requiring privilege have
been isolated into a relatively small piece of kernel code; they are now
separated from the rather larger body of code implementing the X protocol.
That means that X code accessing the hardware can now be run unprivileged
as long as it has the ability to open the appropriate device file.
The server must also have the ability to open related device files: input
devices, the virtual console, and so on. But that is a problem that has
been solved for years; the login process can easily change the ownership of
those files so that an unprivileged server can access them but the world as
a whole cannot.
What's left is some detail work. Some of the ioctl() calls for
direct rendering are currently root-only; Jesse thinks that they can be
made generally available in a safe way. There may be some small additions
to the driver ABI to allow the final root-only operations to be pushed onto
the kernel side. But that's about it.
Of course, user-mode X will currently only work with Intel chipsets, since
those are the only ones with full KMS support at this time. Radeon drivers
are acquiring that support quickly, though, and may be able to support
no-root operation in the relatively near future. That leaves NVIDIA as the
usual odd chipset out of the big three; the current Nouveau feature
matrix suggests that it will be some time yet before the requisite
features are available there.
It may also be a little while until we see no-root support in more
general-purpose distributions. Moblin has a relatively narrow focus on
Intel hardware, by virtue of the fact that it's still mostly Intel people
who are doing the work. Distributors who need to make things work on
whatever hardware happens to be present may approach a change of this
magnitude with a bit more caution. Still, X without root is clearly in the
future, and the near future at that.
Changes are happening in the way the virtual filesystem and virtual memory
subsystems interact. One of the goals of this work is to close a race in
page_mkwrite() (called when a previously read-only page table
entry (PTE) is about to
become writable), namely to make sure that file blocks are properly
zero-filled if a truncate operation increases the file size.
As a part of improving page_mkwrite(), Jan Kara posted a set of patches. However, these
patches introduced a new lock to resolve the problem. Nick Piggin
thinks this can be done by using the page lock instead of a
new lock. As a first step toward a resolution of the problem, he posted a set of patches
to improve the truncate sequence. The new truncate sequence is simpler
to understand, flexible in usage, and most
important of all, handles errors gracefully.
The truncate() and ftruncate() system calls are used to set a file
to the specified size. If the file is larger
than the argument passed, the file size is truncated and the part of the
file greater than the passed file size is lost. If the current file size is
smaller than the passed file size argument, the file size is increased
and the area greater than the previous file size is filled with zeros. The
file size argument passed cannot be greater
than the maximum possible file size on the filesystem.
A user space truncate() call is handled, inside the kernel, by
do_sys_truncate(), which is responsible for weeding out all
error cases (such as "the inode is a directory" or permission errors). It
breaks leases of files locked with flock() and calls
do_truncate(). do_truncate(), in turn,
creates a new attribute structure with
the new length of the file and calls notify_change() with the dentry
and the new attributes, under inode->i_mutex. notify_change() calls
the generic inode_setattr(), either explicitly or through the
filesystem's implementation of setattr(). Then, inode_setattr() calls
vmtruncate() to set the inode size and unmap the pages mapped beyond the new file
size. After unmapping the pages, the associated filesystem's
truncate() operation is called to free the disk blocks associated
with the file.
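The pre-change call sequence described above can be summarized as follows
(a simplified sketch; locking details and error paths are omitted):

```
sys_truncate()
  do_sys_truncate()       /* weed out error cases, break leases */
    do_truncate()         /* build a struct iattr carrying the new size */
      notify_change()     /* called under inode->i_mutex */
        inode_setattr()   /* directly, or via the filesystem's setattr() */
          vmtruncate()    /* set i_size, unmap pages past the new size */
            ->truncate()  /* filesystem frees the disk blocks */
```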
According to Nick, this approach has problems:
Big problem with the previous calling sequence: the filesystem is not called
until i_size has already changed. This means it is not allowed to fail the
call, and also it does not know what the previous i_size was. Also, generic
code calling vmtruncate to truncate allocated blocks in case of error had
no good way to return a meaningful error (or, for example, atomically handle
errors).
Nick's new truncate sequence introduces a way to
better communicate error conditions and consolidates the checks
which most filesystems currently perform individually. The
original intention was to add a
new truncate() operation in struct inode_operations which
would be called
directly for a truncate operation in inode_setattr(). Reviewers
disagreed with the call sequence, stating that the new truncate
function should be called from notify_change(), and not from
inode_setattr(), which is only the default implementation of setattr().
Nick felt that clearing ATTR_SIZE before calling generic
setattr is not unusual (discussed later), so he decided to
introduce his changes with a flag called new_truncate in
struct inode_operations, rather than adding a wholly new truncate
function. The new_truncate flag indicates that the
truncate() function in the inode operations handles the new format.
Nick admits that this is a nasty hack when he introduces the variable in
inode_operations. However, it will be required until
all filesystems transition to the new truncate sequence. Filesystem code which
does not implement the new convention will automatically initialize
new_truncate to zero, indicating that it has not transitioned. The first
patch in the patch series introduces new functions to facilitate the
change. inode_newsize_ok() performs simple checks: that
the intended new file size is within limits defined by the filesystem, and that the file is
not a swap file:
int inode_newsize_ok(struct inode *inode, loff_t offset)
These checks are currently done by individual filesystems. Using this
function results in cleanups in individual filesystem code.
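As a rough illustration, the consolidated checks might look like this.
This is a hedged sketch based on the description above, not the actual
patch; resource-limit handling and signal delivery are omitted:

```c
/* Sketch only: an approximation of the checks described above */
int inode_newsize_ok(struct inode *inode, loff_t offset)
{
	if (inode->i_size < offset) {
		/* Growing the file: honor the filesystem's maximum size */
		if (offset > inode->i_sb->s_maxbytes)
			return -EFBIG;
	} else {
		/* Shrinking: an active swap file must not be truncated */
		if (IS_SWAPFILE(inode))
			return -ETXTBSY;
	}
	return 0;
}
```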
The truncate_pagecache() function truncates the inode pages and
unmaps the pages in the range beyond the new filesystem size:
void truncate_pagecache(struct inode *inode, loff_t old, loff_t new);
truncate_pagecache() should ideally be called before the filesystem releases
the data blocks associated with the inode. This way the page cache will
always be in sync with the on-disk format and the filesystem will
not have to deal with situations such as writepage() being called for
a page that has had its underlying blocks deallocated.
The vmtruncate() function is consolidated for NUMA and non-NUMA
architectures in mm/truncate.c. However,
vmtruncate() is deprecated. Instead,
truncate_pagecache() and inode_newsize_ok() introduced in
the first patch should be used.
The third patch is the main patch of the series which
uses the new truncate operation. It introduces simple_setsize(),
which performs the equivalent of vmtruncate().
simple_setsize() is called by
inode_setattr() when ATTR_SIZE is passed. So filesystems implementing
their own truncate code in setattr must clear ATTR_SIZE
before calling the generic inode_setattr().
To follow the new standards of the truncate operation, individual
filesystems must implement their own setsize() function,
which performs the
file size validation checks, truncates the page cache, and truncates the data
blocks associated with the inode. Filesystems must not trim
off blocks past i_size using vmtruncate(). Instead,
they must handle the truncate in the filesystem code using
truncate_pagecache(). This creates a
better opportunity to catch errors. The
inode_operations.truncate field will go away once all filesystems
have been converted.
To demonstrate the change, the final patch in the series
modifies the ext2 filesystem to use the new truncate interface. The patch
introduces ext2_setsize() to set the inode size of the file,
truncate the pagecache, and, finally, trim the data blocks on the
filesystem. If ATTR_SIZE is set, ext2_setattr()
calls ext2_setsize() to
perform the truncate; ATTR_SIZE is then cleared so that
inode_setattr() does not perform the operations again.
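The resulting shape of ext2_setattr() would be roughly the following.
This is a simplified sketch of the pattern just described, not the actual
patch; error paths and other attribute handling are elided:

```c
/* Sketch of the converted setattr pattern; error paths simplified */
static int ext2_setattr(struct dentry *dentry, struct iattr *attr)
{
	struct inode *inode = dentry->d_inode;
	int error;

	if (attr->ia_valid & ATTR_SIZE) {
		/* check limits, truncate the page cache, free blocks */
		error = ext2_setsize(inode, attr->ia_size);
		if (error)
			return error;
		/* keep inode_setattr() from truncating a second time */
		attr->ia_valid &= ~ATTR_SIZE;
	}
	return inode_setattr(inode, attr);
}
```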
The new truncate patchset has gone through a fair share of review and
is pretty likely to get merged. However, it would require the "nasty
hack" until all filesystems have transitioned to the new way of
truncating files, after which the hack will be removed. The patches
are part of the improvements Nick wants to see in the VM layer. Based
on the new truncate patches, Nick posted an RFC
on how he would
close a race condition in page_mkwrite when a file is
truncated beyond the current file size. Closing races is a good thing, so
expect this work to proceed apace.
Patches and updates
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet