The current 2.6 development kernel is 2.6.29-rc4, released
on February 8. It
contains a long list of fixes merged over the course of a week and a half.
The short-form changelog is in the announcement, or see the
full changelog for all the details.
The current stable 2.6 kernel is 2.6.28.4, released (along with 2.6.27.15) on February 6.
Both updates contain a long list of fixes. The 2.6.27.16 and 2.6.28.5 updates - with yet another long list
of fixes - are in the review process as of this writing; they will most
likely be released on February 12.
Kernel development news
So, it was never a cpumask at all; just a remnant of the use of
sigaction for interrupt handlers. We've been happily setting it
throughout the kernel since 1995.
On the assumption that it has failed to coerce the spirits of our
ancestors to land among us, I'll create a patch to remove it.
-- Rusty Russell
Please write good changelogs. This is not some pointless
book-keeping exercise. People will make decisions about which
kernel versions patches should be merged into, and they will want
to know if a particular patch addresses a particular problem which
they are experiencing. For this, they need information.
-- Andrew Morton
Once upon a time, the way to get a patch into the mainline kernel was to
email it to Linus Torvalds. A hopeful developer would then wait for Linus
to release a new kernel tree to see whether the patch had been included or
not. In the latter case, the more persistent developers would resend the
patch. Often, developers had to be persistent indeed if they wanted their
code to be merged. The system was, in other words, lossy; we'll never know
how much useful code was simply dropped.
The use of git (and BitKeeper before it) has brought an end to that era.
Once a change gets into somebody's tree, it is relatively unlikely to be
lost. It's a much better way of doing things for everybody involved;
important fixes no longer get lost, and developers, rather than checking
for their patches and resending them, can now devote themselves to the
creation of new bugs to be fixed.
Beyond that, though, things have changed in that, for most developers, the
way to get a patch into the kernel is no longer to send it to Linus.
Instead, they will pass their work through a subsystem tree. This
mechanism is reasonably well understood, but, to your editor's knowledge,
nobody has taken a hard look at what the flow of patches into the mainline
looks like now. With that in mind, your editor set out with the
complementary goals of (1) charting the paths patches take on their
way to Linus, and (2) figuring out how Graphviz works. A certain amount of
success was achieved on both fronts.
Back in the BitKeeper days, your editor asked Larry McVoy if there was any
way to track which repositories a specific changeset had passed through;
unfortunately, that information was not preserved by BitKeeper. As it
turns out, git does a better job of keeping that information around -
though it is not a perfect record keeper either. When Linus pulls a tree
from some other developer, git will (usually) add a "merge commit" to the repository
which indicates where the other tree came from. This commit has (at least)
two parent commits; one is whatever was at the tip of Linus's tree prior to
the merge, while the other points to the tip of the stream of changesets
which came from the pulled tree. Multiple trees can be merged at once; in
this case, there will be more than two parent commits.
By following the links from each commit to its parent, one can determine
which tree each commit came from. Merges, too, are propagated up through
pull operations, so it is possible to follow this history back through an
arbitrary number of trees. The gitk tool does a nice job of displaying how
the various paths come together into a given repository; the resulting
graph can be quite complex. What your editor has done is to generate a
statistical view of this process; this view loses information about
specific patches, but provides, instead, an overall view of how patches get
into the mainline.
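The parent-following idea can be shown on a toy history. What follows is a minimal sketch, not the gitdm code: the commit ids, the tree names, and the `MERGE_SOURCE` table (standing in for the origin that a real merge commit records) are all invented for illustration. It walks the mainline's first-parent chain and credits each merge's side branch to the tree it was pulled from.

```python
from collections import Counter

# Toy history: commit id -> parent ids, first parent listed first.
# Every id and tree name below is hypothetical.
PARENTS = {
    "m0": [],
    "n1": ["m0"], "n2": ["n1"],   # work done in a "net" tree
    "m1": ["m0"],                 # patch applied directly to the mainline
    "M1": ["m1", "n2"],           # mainline merge commit pulling "net"
    "x1": ["m1"], "x2": ["x1"],   # work done in an "x86" tree
    "M2": ["M1", "x2"],           # mainline merge commit pulling "x86"
}

# Assumption: the tree each merge commit was pulled from is already known.
MERGE_SOURCE = {"M1": "net", "M2": "x86"}


def ancestors(commit):
    """Return the set containing `commit` and everything reachable from it."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def changesets_by_tree(tip):
    """Count changesets per source tree along the first-parent chain."""
    counts = Counter()
    commit = tip
    while commit is not None:
        parents = PARENTS[commit]
        counts["mainline"] += 1            # the commit (or merge) itself
        if len(parents) > 1:
            already_there = ancestors(parents[0])
            for other in parents[1:]:
                brought_in = ancestors(other) - already_there
                counts[MERGE_SOURCE[commit]] += len(brought_in)
        commit = parents[0] if parents else None
    return dict(counts)


print(changesets_by_tree("M2"))
```

This sketch lumps everything arriving through a merge into the outermost tree; resolving nested pulls would mean applying the same walk recursively to each side branch.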
A piece of the resulting graph can be seen on the right; click on the
thumbnail to see the whole thing, which is quite large. It is, arguably, a
messy picture, but some interesting things jump out of it. At the top of
the list is the fact that the graph is quite shallow: it shows 107 trees,
almost all of which feed directly into the mainline. For the 2.6.29
development cycle, only a handful of trees are pulled into a separate
subsystem tree before going to Linus, and exactly one tree feeds patches
through two other layers. For the most part, subsystem maintainers are
going straight to Linus without dealing with middle managers.
975 of 11,260 changesets went directly into the mainline without existing
in any subsystem tree at all. Some of those are the merge changesets
created by Linus as he pulls trees; many of the rest are the patches which
go by way of Andrew Morton. Linus wrote a very small number of them
himself. And, occasionally, Linus merges a patch sent directly from a
developer, but that is a relatively uncommon occurrence.
When interpreting these numbers, there is one important thing which must be
kept in mind: by default, git will not record merge information when it is
doing a "fast forward" merge. If a developer pulls down the current
mainline repository, adds some patches on top, then gets Linus to pull the
patches before anything else changes in the mainline, those patches can be
added directly to the mainline without the need for a merge commit to hold
things together. Fast-forward merges into the mainline are (probably)
fairly rare, but they may well happen more often at the subsystem level.
So this kind of information, when generated from a git repository, will
never be 100% complete; some merges (and the repositories they came from)
will be invisible.
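The fast-forward case can be modeled in a few lines. This is a simplified sketch of git's default pull behavior over a toy parent table, not git's implementation; the function names and commit ids are hypothetical.

```python
def is_ancestor(parents, old, new):
    """True if `old` is reachable from `new` by following parent links."""
    seen, stack = set(), [new]
    while stack:
        current = stack.pop()
        if current == old:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def pull(parents, tip, other, merge_id):
    """Return the new tip after pulling `other` into a tree at `tip`."""
    if is_ancestor(parents, other, tip):
        return tip      # nothing new to bring in
    if is_ancestor(parents, tip, other):
        return other    # fast forward: no merge commit, no record of the pull
    parents[merge_id] = [tip, other]   # diverged: record a merge commit
    return merge_id


# Hypothetical history: a developer adds a1 and a2 on top of mainline m0.
history = {"m0": [], "a1": ["m0"], "a2": ["a1"]}
new_tip = pull(history, "m0", "a2", "M")
```

In the fast-forward branch the returned tip is simply the developer's own commit, so nothing in the resulting history says that a pull ever happened; only the diverged case leaves a two-parent commit behind for later analysis.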
For 2.6.29, two networking trees maintained by David Miller were the
biggest waypoint for changesets (1910 of them) headed into the mainline; of
those, many came from John Linville's wireless tree. After that, the
"linux-2.6-tip" tree (the tree maintained by Ingo Molnar and company for a
few subsystems, including the x86 architecture and the scheduler)
contributed 1270 changesets to this development cycle. Other large sources
of changes were the btrfs tree (910 changesets - the entire btrfs
development history), the Video4Linux tree, the sound tree, and the ARM
architecture tree. At the other end of the scale, twelve trees were the
source of five or fewer changes.
For the curious, the statistics are available in text form along with the full names of the
relevant git repositories. The code which generated this information is
available as part of the gitdm repository at
git://git.lwn.net/gitdm.git. An obvious place for future
improvement is to track information about branches within repositories;
this would increase the resolution of the whole picture. But that's for
another development cycle; stay tuned.
The relationship between embedded system developers and the kernel
community is known for being rough, at best. Kernel developers complain
about low-quality work and a lack of contributions from the embedded side;
the embedded developers, when they say anything at all, express
frustrations that the kernel development process does not really keep their
needs in mind. A current discussion involving developers from the Android
project gives some insight into where this disconnect comes from.
Android, of course, is Google's platform for mobile telephones. The
initial Android stack was developed behind closed doors; the code only made
it out into the world when the first deployments were already in the
works. The Android developers have done a lot of kernel work, but very
little code has made the journey into the mainline. The code which
has been merged all went into the staging tree without a whole lot
of initiative from the Android side. Now, though, Android developer Arve
Hjønnevåg is making an effort to merge a piece of that
project's infrastructure through the normal process. It is not proving to
be an easy ride.
The most controversial bit of code is a feature known as "wakelocks." In
Android-speak, a "wakelock" is a mechanism which can prevent the system
from going into a low-power state. In brief, kernel code can set up a
wakelock with something like this:
wake_lock_init(struct wake_lock *lock, int type, const char *name);
The type value describes what kind of wakelock this is;
name gives it a name which can be seen in
/proc/wakelocks. There are
two possibilities for the type: WAKE_LOCK_SUSPEND prevents the system from
suspending, while WAKE_LOCK_IDLE prevents going into a low-power
idle state which may increase response times. The API for acquiring and
releasing these locks is:
void wake_lock(struct wake_lock *lock);
void wake_lock_timeout(struct wake_lock *lock, long timeout);
void wake_unlock(struct wake_lock *lock);
There is also a user-space interface. Writing a name to
/sys/power/wake_lock establishes a lock with that name, which
can then be written to /sys/power/wake_unlock to release the
lock. The current patch set
only allows suspend locks to be taken from user space.
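From user space, that interface might be wrapped as follows. This is a minimal sketch assuming only what is described above: a name written to `wake_lock` takes the lock, and the same name written to `wake_unlock` releases it. The helper name and the `power_dir` parameter are hypothetical; the parameter exists so the code can be pointed at a scratch directory on a machine without the patches.

```python
import os
from contextlib import contextmanager


@contextmanager
def user_wake_lock(name, power_dir="/sys/power"):
    """Hold the named suspend wakelock for the duration of a with-block."""
    with open(os.path.join(power_dir, "wake_lock"), "w") as f:
        f.write(name)          # create and take the lock
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, but not if the
        # process is killed outright while holding the lock.
        with open(os.path.join(power_dir, "wake_unlock"), "w") as f:
            f.write(name)      # release the lock
```

A caller would write `with user_wake_lock("my-download"):` around the work that must not be interrupted by a suspend. The `finally` clause covers ordinary error paths only; a process that dies abruptly never reaches it.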
This submission has not been received particularly well. It has, instead,
drawn comments like this from Ben Herrenschmidt:
looks to me like some people hacked up some ad-hoc trick for
their own local need without instead trying to figure out how to fit
things with the existing infrastructure (or possibly propose changes to
the existing infrastructure to fit their needs).
or this one from Pavel Machek:
Ok, I think that this wakelock stuff is in "can't be used properly"
area on Rusty's scale of nasty interfaces.
There's no end of reasons to dislike this interface. Much of it duplicates
the existing pm_qos (quality of service) API; it seems that pm_qos does not meet Android's needs, but it
also seems that no effort was made to fix the problems. The scheme seems
over-engineered when all that is really needed is a "do not suspend" flag -
or, at most, a counter. The patches disable the existing
/sys/power/state interface, which does not play well with
wakelocks. There is no way to recover if a user-space process exits
while holding a wakelock. The default behavior for the system is to
suspend, even if a process is running; keeping a system awake may involve a
chain of wakelocks obtained by various software components. And so on.
The end result is that this code will not make it into the mainline
kernel. But it has been shipped on large numbers of G1 phones, with many
more yet to go. So users of all those phones will be using out-of-tree
code which will not be merged, at least not in anything like its current
form. Any applications which depend on the wakelock sysfs interface will
break if that interface is brought up to proper standards. It's a bit of a
mess, but it is a very typical mess for the embedded systems community.
Embedded developers operate under a set of constraints which makes proper
kernel development hard. For example:
- One of the core rules of kernel development is "post early and often."
Code which is developed behind closed doors gets no feedback from the
development community, so it can easily follow a blind path for a long
time. But embedded system vendors rarely want to let the world know
about what they are doing before the product is ready to ship; they
hope, instead, to keep their competitors in the dark for as long as
possible. So posting early is rarely seen as an option.
- Another fundamental rule is "upstream first": code goes into the
mainline before being shipped to customers. Once again, even if an
embedded vendor wants to send code into the mainline, they rarely want
to begin that process before the product ships. So embedded kernels
are shipped containing out-of-tree code which almost certainly has a number of
problems, unsupportable APIs, and more.
- Kernel developers are expected to work with the goal of improving the
kernel for everybody. Embedded developers, instead, are generally
solving a highly-specific problem under tight time constraints. So
they do not think about, for example, extending the existing
quality-of-service API to meet their needs; instead, they bash out
something which is quick, dirty, and not subject to upstream review.
One could argue that Google has the time, resources, and in-house kernel
development knowledge to avoid all of these problems and do things right.
Instead, we have been treated to a fairly classic example of how things can
go wrong.
The good news is that Google developers are now engaging with the community
and trying to get their code into the mainline. This process could well be
long, and require a fair amount of adjustment on the Android side. Even if
the idea of wakelocks as a way to prevent the system from suspending is
accepted - which is far from certain - the interface will require
significant changes. The associated "early suspend" API - essentially a
notification mechanism for system state changes - will need to be
generalized beyond the specific needs of the G1 phone. It could well be a
lot of work.
But if that work gets done, the kernel will be much better placed to handle
the power-management needs of handheld devices. That, in turn, can only
benefit anybody else working on embedded Linux deployments. And,
crucially, it will help the Android developers as they port their code to
other devices with differing needs. As the number of Android-based phones
grows, the cost of carrying out-of-tree code to support each of them will
also grow. It would be far better to generalize that support and get it
into the mainline, where it can be maintained and improved by the
community.
Most embedded systems vendors, it seems, would be unwilling to do that
work; they are too busy trying to put together their next product. So this
sort of code tends to languish out of the mainline, and the quality of
embedded Linux suffers accordingly. Perhaps this case will be different,
though; maybe Google will put the resources into getting its specialized
code into shape and merged into the mainline. That effort could help to
establish Android as a solid, well-supported platform for mobile use, and
that should be good for business. Your editor, ever the optimist, hopes
that things will work out this way; it would be a good demonstration of how
the embedded community can work better with the kernel community, getting a
better kernel in return.
A longstanding out-of-tree kernel feature—used by half-a-dozen or more
virus scanners—Dazuko has
recently changed its modus operandi in an effort to be included into
the mainline. Dazuko, and now DazukoFS,
are mechanisms to control access to files, which are generally used to stop
viruses from propagating on Linux servers. The goal is similar in many
ways to that of fsnotify/fanotify/TALPA, but the
DazukoFS implementation as a stackable filesystem is a completely different
approach.
The Dazuko project started almost exactly seven years ago as an effort to
allow user-space programs—Windows-style anti-virus scanners
mostly—to make file access decisions. One of the reasons to have
the scanning in user space—aside from the zero probability of
getting one added to the kernel—is to keep it vendor-neutral by not
favoring any particular anti-virus engine. But the means to that end was
system call hooking, which is a technique that is seriously frowned upon by
kernel hackers. Dazuko made an
abortive move to the LSM
API, but ran into various problems, including the inability to stack
multiple security modules. Eventually, the project started looking at a
stackable filesystem as a solution
that would be palatable for moving into the mainline.
Originally suggested for Dazuko by Christoph Hellwig in 2004, a stackable
filesystem has a number of advantages over the other solutions. It is a
self-contained solution that won't require core kernel code changes if
anti-virus developers wish to add new features. It also would add another
stackable filesystem to the kernel, which may help foster a more general
stackable filesystem framework. But the main reason is
that the project sees it as the most likely
path into the mainline. Main developer John Ogness explains:
Nearly seven years of out-of-tree development were more than
enough to prove that out-of-tree kernel drivers have an unnecessarily
large maintenance cost (which increases with each new kernel
release). With DazukoFS mainline, anti-virus vendors would finally
have an official interface and implementation on which to base their
online scanning applications.
DazukoFS is mounted atop an already-mounted filesystem in order to handle file
access decisions for files in the underlying filesystem. For example:
mount -t dazukofs /opt /opt
sets up /opt
for checking by user-space processes that open a special /dev
file. All of the scanning application's interaction with DazukoFS
is done through /dev
files, all of which is
documented in Documentation/filesystems/dazukofs.txt.
File access decisions are made by processes or threads which make up a
"group". Groups act as a pool of available scanners to allow multiple
outstanding file access decisions. Once the pool is fully occupied, file
accesses will block until one becomes available. Groups are registered by
writing "add=MyGroupName" to /dev/dazukofs.ctrl. A
group id will then be assigned, which can be parsed from the output of reading
the dazukofs.ctrl file. Group ids are then used to access the
proper device for providing access decisions.
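The registration step might look like this from a scanner's point of view. It is a minimal sketch that relies only on the "add=MyGroupName" write described above; the function name is hypothetical, and since the format of the text read back from the control file is not given here, the sketch returns it unparsed.

```python
def register_group(name, ctrl_path="/dev/dazukofs.ctrl"):
    """Ask DazukoFS to create a scanner group; return the raw control listing."""
    with open(ctrl_path, "w") as ctrl:
        ctrl.write(f"add={name}")
    # The assigned group id has to be picked out of this text by the
    # caller; its layout is described in the DazukoFS documentation,
    # not assumed here.
    with open(ctrl_path) as ctrl:
        return ctrl.read()
```

The `ctrl_path` parameter is there only so the function can be exercised against an ordinary file; on a system with DazukoFS loaded the default path would be used.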
Based on the group id (N), a /dev/dazukofs.N file is created.
Each process in the group registers itself by opening that device. It
should then block in a read of the device waiting for a file access event.
Each event has three pieces of information that are read from the device file:
an event id, the process id of the accessing program, and the number of an
open file descriptor that can be used to read the contents of the file.
The scanning process should then perform whatever actions it requires to
make the decision whether to allow or deny the access.
Because it gets passed an open file descriptor, the scanning process does
not need any special privileges beyond those required to access the
/dev/dazukofs* files. Once it has made the decision, the scanning
process writes a string indicating the result to the device. It is then
responsible for closing the file descriptor for the accessed file.
There are a few additional things that can be done via the user-space API:
deleting groups, providing for some crash protection within groups, and
gaining access to protected files from within DazukoFS, all of which are described in the
documentation.
There is also a major caveat that goes with this release of DazukoFS:
DazukoFS does not support writing to memory mapped files. This should not
cause the kernel to crash, but will instead result in the application
failing to perform the writes (although mmap() will appear to be
successful from the application's viewpoint!).
That is done, at least partially, to avoid race conditions where a
malicious program overwrites
the file contents between the scanning and the actual access. This is a
general Achilles' heel for virus scanning mechanisms, but silently ignoring
writes to mapped files is a rather extreme reaction to that problem.
TALPA, which has subsequently become fanotify, defines this problem away as
not being a part of the threat model it is handling. Perhaps DazukoFS
should do something similar.
It would seem likely that only one of the two proposed solutions for
user-space file scanning will end up in the mainline. Ogness mentions
fanotify in his patch submission:
I am aware of the current work of Eric Paris to implement a file
access control mechanism within a unified inotify/dnotify
framework. Although I welcome any official interface to provide a file
access control mechanism for userspace applications in Linux, I feel
that DazukoFS provides a more elegant solution. (Note that the two
projects do not conflict with each other.)
So far, there has been no comment on the v2 patch submission, but there
were some suggestions to the first
submission back in December. The kernel filesystem hackers are pretty
busy folks in general, but right now there are numerous filesystems in
various states of review: btrfs, POHMELFS, DST, FS-Cache, and others.
Those may be
using up all of the available review bandwidth. Ogness recently announced
that he will be dropping support for the 2.x version of Dazuko—based
on system call hooks—to focus on
DazukoFS. In it he notes the lack of review:
As you probably know, DazukoFS has been submitted for inclusion in the
mainline Linux kernel. Unfortunately it is getting practically no
attention. I do not know if the silence is because I am not CC'ing the
correct people, because those people refuse to look at it, or because
no one has any time for it.
From the announcement, it seems clear that Ogness has the patience
necessary to shepherd DazukoFS through the kernel inclusion process. It
would seem that spending some time working with Eric Paris to try to find
some common ground between their two solutions might be time well spent as
well.
Patches and updates
Core kernel code
Filesystems and block I/O
- Boaz Harrosh: exofs.
(February 9, 2009)
Benchmarks and bugs
Page editor: Jonathan Corbet