Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.29-rc4, released on February 8. It contains a long list of fixes merged over the course of a week and a half. The short-form changelog is in the announcement, or see the full changelog for all the details.

The current stable 2.6 kernel is 2.6.28.4, released (along with 2.6.27.15) on February 6. Both updates contain a long list of fixes. The 2.6.28.5 and 2.6.27.16 updates - with yet another long list of fixes - are in the review process as of this writing; they will most likely be released on February 12.

Kernel development news

Quotes of the week

So, it was never a cpumask at all; just a remnant of the use of sigaction for interrupt handlers. We've been happily setting it throughout the kernel since 1995.

On the assumption that it has failed to coerce the spirits of our ancestors to land among us, I'll create a patch to remove it.

-- Rusty Russell

Please write good changelogs. This is not some pointless book-keeping exercise. People will make decisions about which kernel versions patches should be merged into, and they will want to know if a particular patch addresses a particular problem which they are experiencing. For this, they need information.
-- Andrew Morton

How patches get into the mainline

By Jonathan Corbet
February 10, 2009
Once upon a time, the way to get a patch into the mainline kernel was to email it to Linus Torvalds. A hopeful developer would then wait for Linus to release a new kernel tree to see whether the patch had been included or not. In the latter case, the more persistent developers would resend the patch. Often, developers had to be persistent indeed if they wanted their code to be merged. The system was, in other words, lossy; we'll never know how much useful code was simply dropped.

The use of git (and BitKeeper before it) has brought an end to that era. Once a change gets into somebody's tree, it is relatively unlikely to be lost. It's a much better way of doing things for everybody involved; important fixes no longer get lost, and developers, rather than checking for their patches and resending them, can now devote themselves to the creation of new bugs to be fixed.

Beyond that, though, things have changed in that, for most developers, the way to get a patch into the kernel is no longer to send it to Linus. Instead, they will pass their work through a subsystem tree. This mechanism is reasonably well understood, but, to your editor's knowledge, nobody has taken a hard look at what the flow of patches into the mainline looks like now. With that in mind, your editor set out with the complementary goals of (1) charting the paths patches take on their way to Linus, and (2) figuring out how Graphviz works. A certain amount of success was achieved on both fronts.

Back in the BitKeeper days, your editor asked Larry McVoy if there was any way to track which repositories a specific changeset had passed through; unfortunately, that information was not preserved by BitKeeper. As it turns out, git does a better job of keeping that information around - though it is not a perfect record keeper either. When Linus pulls a tree from some other developer, git will (usually) add a "merge commit" to the repository which indicates where the other tree came from. This commit has (at least) two parent commits; one is whatever was at the tip of Linus's tree prior to the merge, while the other points to the tip of the stream of changesets which came from the pulled tree. Multiple trees can be merged at once; in this case, there will be more than two parent commits.

[Figure: tree plot]

By following the links from each commit to its parent, one can determine which tree each commit came from. Merges, too, are propagated up through pull operations, so it is possible to follow this history back through an arbitrary number of trees. The gitk tool does a nice job of displaying how the various paths come together into a given repository; the resulting graph can be quite complex. What your editor has done is to generate a statistical view of this process; this view loses information about specific patches, but provides, instead, an overall view of how patches get into the mainline.
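As a rough illustration of that parent-following walk (this is not the gitdm code), the sketch below labels the commits in a tiny in-memory graph with the tree they arrived through. The `struct commit` layout, the `pulled_from` field, and the function names are all hypothetical, invented for this example; a real tool would read this information from the repository's merge commits.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARENTS 4

/* Hypothetical, stripped-down commit record; real git objects hold far more. */
struct commit {
    const char *pulled_from;             /* tree named by a merge commit, else NULL */
    struct commit *parents[MAX_PARENTS]; /* parents[0] is the pre-merge tip */
    int nparents;
    const char *source;                  /* result: tree this commit arrived through */
};

struct commit *first_parent(const struct commit *c)
{
    return c->nparents > 0 ? c->parents[0] : NULL;
}

void attribute(struct commit *tip, const char *tree);

/* Handle the merges found on a first-parent chain, oldest first. */
void visit_merges(struct commit *c, const struct commit *stop, const char *tree)
{
    if (c == stop)
        return;
    visit_merges(first_parent(c), stop, tree);
    /* Each additional parent of a merge is the tip of a pulled tree. */
    for (int i = 1; i < c->nparents; i++)
        attribute(c->parents[i], c->pulled_from ? c->pulled_from : tree);
}

/* Label every still-unlabeled commit reachable from tip. */
void attribute(struct commit *tip, const char *tree)
{
    struct commit *c;

    assert(tree != NULL);
    /* Pass 1: the first-parent chain was committed in the tree being walked. */
    for (c = tip; c != NULL && c->source == NULL; c = first_parent(c))
        c->source = tree;
    /* Pass 2: descend into whatever each merge on that chain pulled in. */
    visit_merges(tip, c, tree);
}
```

Calling `attribute(&tip, "mainline")` on the newest mainline commit labels the whole graph; counting commits per label then yields the sort of per-tree totals discussed here. The recursion is kept for brevity and would need to become iterative for a history of kernel size.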

A piece of the resulting graph can be seen on the right; click on the thumbnail to see the whole thing, which is quite large. It is, arguably, a messy picture, but some interesting things jump out of it. At the top of the list is the fact that the graph is quite shallow: it shows 107 trees, almost all of which feed directly into the mainline. For the 2.6.29 development cycle, only a handful of trees are pulled into a separate subsystem tree before going to Linus, and exactly one tree feeds patches through two other layers. For the most part, subsystem maintainers are going straight to Linus without dealing with middle managers.

975 of 11,260 changesets went directly into the mainline without existing in any subsystem tree at all. Some of those are the merge changesets created by Linus as he pulls trees; many of the rest are the patches which go by way of Andrew Morton. Linus wrote a very small number of them himself. And, occasionally, Linus merges a patch sent directly from a developer, but that is a relatively uncommon occurrence.

When interpreting these numbers, there is one important thing which must be kept in mind: by default, git will not record merge information when it is doing a "fast forward" merge. If a developer pulls down the current mainline repository, adds some patches on top, then gets Linus to pull the patches before anything else changes in the mainline, those patches can be added directly to the mainline without the need for a merge commit to hold things together. Fast-forward merges into the mainline are (probably) fairly rare, but they may well happen more often at the subsystem level. So this kind of information, when generated from a git repository, will never be 100% complete; some merges (and the repositories they came from) will be invisible.
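The fast-forward condition can be stated compactly: a pull can be fast-forwarded when the current tip is an ancestor of the tip being pulled, so nothing needs to be joined. A minimal sketch of that test follows, using a hypothetical in-memory `struct commit` and function names made up for this example; git itself uses far more efficient machinery to answer the same question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PARENTS 2

/* Hypothetical minimal commit: only the parent links matter here. */
struct commit {
    struct commit *parents[MAX_PARENTS];
    int nparents;
};

/* True if anc can be reached from c by following parent links (or anc == c). */
bool is_ancestor(const struct commit *anc, const struct commit *c)
{
    if (c == NULL)
        return false;
    if (c == anc)
        return true;
    for (int i = 0; i < c->nparents; i++)
        if (is_ancestor(anc, c->parents[i]))
            return true;
    return false;
}

/*
 * A pull of "theirs" into a branch whose tip is "ours" can fast-forward
 * when our tip is already part of their history: no merge commit is
 * needed, which is why no record of the pull remains.
 */
bool can_fast_forward(const struct commit *ours, const struct commit *theirs)
{
    assert(ours != NULL && theirs != NULL);
    return is_ancestor(ours, theirs);
}
```

The naive recursion may revisit commits in a graph with many merges; it is meant only to show the condition, not to scale.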

For 2.6.29, two networking trees maintained by David Miller were the biggest waypoint for changesets (1910 of them) headed into the mainline; of those, many came from John Linville's wireless tree. After that, the "linux-2.6-tip" tree (the tree maintained by Ingo Molnar and company for a few subsystems, including the x86 architecture and the scheduler) contributed 1270 changesets to this development cycle. Other large sources of changes were the btrfs tree (910 changesets - the entire btrfs development history), the Video4Linux tree, the sound tree, and the ARM architecture tree. At the other end of the scale, twelve trees were the source of five or fewer changes.

For the curious, the statistics are available in text form along with the full names of the relevant git repositories. The code which generated this information is available as part of the gitdm repository at git://git.lwn.net/gitdm.git. An obvious place for future improvement is to track information about branches within repositories; this would increase the resolution of the whole picture. But that's for another development cycle; stay tuned.

Wakelocks and the embedded problem

By Jonathan Corbet
February 10, 2009
The relationship between embedded system developers and the kernel community is known for being rough, at best. Kernel developers complain about low-quality work and a lack of contributions from the embedded side; the embedded developers, when they say anything at all, express frustrations that the kernel development process does not really keep their needs in mind. A current discussion involving developers from the Android project gives some insight into where this disconnect comes from.

Android, of course, is Google's platform for mobile telephones. The initial Android stack was developed behind closed doors; the code only made it out into the world when the first deployments were already in the works. The Android developers have done a lot of kernel work, but very little code has made the journey into the mainline. The code which has been merged all went into the staging tree without a whole lot of initiative from the Android side. Now, though, Android developer Arve Hjønnevåg is making an effort to merge a piece of that project's infrastructure through the normal process. It is not proving to be an easy ride.

The most controversial bit of code is a feature known as "wakelocks." In Android-speak, a "wakelock" is a mechanism which can prevent the system from going into a low-power state. In brief, kernel code can set up a wakelock with something like this:

    #include <linux/wakelock.h>

    void wake_lock_init(struct wake_lock *lock, int type, const char *name);

The type value describes what kind of wakelock this is; name gives it a name which can be seen in /proc/wakelocks. There are two possibilities for the type: WAKE_LOCK_SUSPEND prevents the system from suspending, while WAKE_LOCK_IDLE prevents going into a low-power idle state which may increase response times. The API for acquiring and releasing these locks is:

    void wake_lock(struct wake_lock *lock);
    void wake_lock_timeout(struct wake_lock *lock, long timeout);
    void wake_unlock(struct wake_lock *lock);

There is also a user-space interface. Writing a name to /sys/power/wake_lock establishes a lock with that name, which can then be written to /sys/power/wake_unlock to release the lock. The current patch set only allows suspend locks to be taken from user space.
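From user space, then, taking and releasing a suspend lock amounts to two writes. The sketch below shows one way that might look; the two sysfs paths are those named above, while the helper names are made up for this example. Whether the kernel expects a trailing newline after the name is not stated here, so writing the bare name is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Paths as described above; they exist only on kernels carrying the patches. */
#define WAKE_LOCK_PATH   "/sys/power/wake_lock"
#define WAKE_UNLOCK_PATH "/sys/power/wake_unlock"

/* Write a lock name to a control file; returns 0 on success, -1 on error. */
int write_name(const char *path, const char *name)
{
    size_t len;
    ssize_t written;
    int fd;

    assert(path != NULL && name != NULL);
    len = strlen(name);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    /* Assumption: the bare name, with no trailing newline, is accepted. */
    written = write(fd, name, len);
    if (close(fd) != 0 || written != (ssize_t)len)
        return -1;
    return 0;
}

/* Hypothetical convenience wrappers around the two control files. */
int take_wake_lock(const char *name)
{
    return write_name(WAKE_LOCK_PATH, name);
}

int release_wake_lock(const char *name)
{
    return write_name(WAKE_UNLOCK_PATH, name);
}
```

A process would call `take_wake_lock("my_download")` before starting work that must not be interrupted by a suspend, and `release_wake_lock("my_download")` once done. If the process dies in between, the lock stays held.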

This submission has not been received particularly well. It has, instead, drawn comments like this from Ben Herrenschmidt:

looks to me like some people hacked up some ad-hoc trick for their own local need without instead trying to figure out how to fit things with the existing infrastructure (or possibly propose changes to the existing infrastructure to fit their needs).

or this one from Pavel Machek:

Ok, I think that this wakelock stuff is in "can't be used properly" area on Rusty's scale of nasty interfaces.

There's no end of reasons to dislike this interface. Much of it duplicates the existing pm_qos (quality of service) API; it seems that pm_qos does not meet Android's needs, but it also seems that no effort was made to fix the problems. The scheme seems over-engineered when all that is really needed is a "do not suspend" flag - or, at most, a counter. The patches disable the existing /sys/power/state interface, which does not play well with wakelocks. There is no way to recover if a user-space process exits while holding a wakelock. The default behavior for the system is to suspend, even if a process is running; keeping a system awake may involve a chain of wakelocks obtained by various software components. And so on.

The end result is that this code will not make it into the mainline kernel. But it has been shipped on large numbers of G1 phones, with many more yet to go. So users of all those phones will be using out-of-tree code which will not be merged, at least not in anything like its current form. Any applications which depend on the wakelock sysfs interface will break if that interface is brought up to proper standards. It's a bit of a mess, but it is a very typical mess for the embedded systems community. Embedded developers operate under a set of constraints which makes proper kernel development hard. For example:

  • One of the core rules of kernel development is "post early and often." Code which is developed behind closed doors gets no feedback from the development community, so it can easily follow a blind path for a long time. But embedded system vendors rarely want to let the world know about what they are doing before the product is ready to ship; they hope, instead, to keep their competitors in the dark for as long as possible. So posting early is rarely seen as an option.

  • Another fundamental rule is "upstream first": code goes into the mainline before being shipped to customers. Once again, even if an embedded vendor wants to send code into the mainline, they rarely want to begin that process before the product ships. So embedded kernels are shipped containing out-of-tree code which almost certainly has a number of problems, unsupportable APIs, and more.

  • Kernel developers are expected to work with the goal of improving the kernel for everybody. Embedded developers, instead, are generally solving a highly-specific problem under tight time constraints. So they do not think about, for example, extending the existing quality-of-service API to meet their needs; instead, they bash out something which is quick, dirty, and not subject to upstream review.

One could argue that Google has the time, resources, and in-house kernel development knowledge to avoid all of these problems and do things right. Instead, we have been treated to a fairly classic example of how things can go wrong.

The good news is that Google developers are now engaging with the community and trying to get their code into the mainline. This process could well be long, and require a fair amount of adjustment on the Android side. Even if the idea of wakelocks as a way to prevent the system from suspending is accepted - which is far from certain - the interface will require significant changes. The associated "early suspend" API - essentially a notification mechanism for system state changes - will need to be generalized beyond the specific needs of the G1 phone. It could well be a lot of work.

But if that work gets done, the kernel will be much better placed to handle the power-management needs of handheld devices. That, in turn, can only benefit anybody else working on embedded Linux deployments. And, crucially, it will help the Android developers as they port their code to other devices with differing needs. As the number of Android-based phones grows, the cost of carrying out-of-tree code to support each of them will also grow. It would be far better to generalize that support and get it into the mainline, where it can be maintained and improved by the community.

Most embedded systems vendors, it seems, would be unwilling to do that work; they are too busy trying to put together their next product. So this sort of code tends to languish out of the mainline, and the quality of embedded Linux suffers accordingly. Perhaps this case will be different, though; maybe Google will put the resources into getting its specialized code into shape and merged into the mainline. That effort could help to establish Android as a solid, well-supported platform for mobile use, and that should be good for business. Your editor, ever the optimist, hopes that things will work out this way; it would be a good demonstration of how the embedded community can work better with the kernel community, getting a better kernel in return.

DazukoFS: a stackable filesystem for virus scanning

By Jake Edge
February 11, 2009

Dazuko, a longstanding out-of-tree kernel feature used by half a dozen or more virus scanners, has recently changed its modus operandi in an effort to be included in the mainline. Dazuko, and now DazukoFS, are mechanisms to control access to files; they are generally used to stop Windows viruses from propagating on Linux servers. The goal is similar in many ways to that of fsnotify/fanotify/TALPA, but the DazukoFS implementation as a stackable filesystem is a completely different approach.

The Dazuko project started almost exactly seven years ago as an effort to allow user-space programs—Windows-style anti-virus scanners mostly—to make file access decisions. One of the reasons to have the scanning in user space—aside from the zero probability of getting one added to the kernel—is to keep it vendor-neutral by not favoring any particular anti-virus engine. But the means to that end was system call hooking, which is a technique that is seriously frowned upon by kernel hackers. Dazuko made an abortive move to the LSM API, but ran into various problems, including the inability to stack multiple security modules. Eventually, the project started looking at a stackable filesystem as a solution that would be palatable for moving into the mainline.

Originally suggested for Dazuko by Christoph Hellwig in 2004, a stackable filesystem has a number of advantages over the other solutions. It is a self-contained solution that won't require core kernel code changes if anti-virus developers wish to add new features. It also would add another stackable filesystem to the kernel, which may help foster a more general stackable filesystem framework. But the main reason is that the project sees it as the most likely path into the mainline. Main developer John Ogness explains:

Nearly seven years of out-of-tree development were more than enough to prove that out-of-tree kernel drivers have an unnecessarily large maintenance cost (which increases with each new kernel release). With DazukoFS mainline, anti-virus vendors would finally have an official interface and implementation on which to base their online scanning applications.

DazukoFS is mounted atop an already-mounted filesystem in order to handle file access decisions for files in the underlying filesystem. For example:

    mount -t dazukofs /opt /opt

sets up the /opt filesystem for checking by user-space processes that open a special /dev file. All interaction between scanning applications and DazukoFS goes through /dev files; the interface is documented in Documentation/filesystems/dazukofs.txt.

File access decisions are made by processes or threads which make up a "group". Groups act as a pool of available scanners to allow multiple outstanding file access decisions. Once the pool is fully occupied, file accesses will block until a scanner becomes available. Groups are registered by writing "add=MyGroupName" to /dev/dazukofs.ctrl. A group id will then be assigned, which can be parsed from the output of reading the dazukofs.ctrl file. Group ids are then used to access the proper device for providing access decisions.

Based on the group id (N), a /dev/dazukofs.N file is created. Each process in the group registers itself by opening that device. It should then block in a read of the device waiting for a file access event. Each event has three pieces of information that are read from the device file: an event id, the process id of the accessing program, and the number of an open file descriptor that can be used to read the contents of the file. The scanning process should then perform whatever actions it requires to make the decision whether to allow or deny the access.
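The two strings involved in joining a group can be built as in the following sketch. The "add=" request and the /dev/dazukofs.N naming come from the description above; the helper names are hypothetical. Parsing the group id out of dazukofs.ctrl is left out, since the layout of that output is not described here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DAZUKOFS_CTRL "/dev/dazukofs.ctrl"   /* control device named above */

/* Build the "add=<name>" registration request; returns its length or -1. */
int format_add_group(char *out, size_t size, const char *name)
{
    int n;

    assert(out != NULL && name != NULL);
    if (strlen(name) == 0)
        return -1;
    n = snprintf(out, size, "add=%s", name);
    /* Treat truncation as an error rather than sending a partial name. */
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Build the per-group device path "/dev/dazukofs.<N>"; returns its length or -1. */
int group_device_path(char *out, size_t size, int group_id)
{
    int n;

    assert(out != NULL);
    if (group_id < 0)
        return -1;
    n = snprintf(out, size, "/dev/dazukofs.%d", group_id);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

A scanner would write the first string to `DAZUKOFS_CTRL`, learn its group id by reading that file back, then open the path produced by `group_device_path()` and block in `read()` waiting for events.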

Because it gets passed an open file descriptor, the scanning process does not need any special privileges beyond those required to access the /dev/dazukofs* files. Once it has made the decision, the scanning process writes a string indicating the result to the device. It is then responsible for closing the file descriptor for the accessed file.
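The read-decide-respond cycle might be handled along the lines of the sketch below. The exact text exchanged with the device is specified in the documentation file rather than here, so both string layouts in this example (the "id=/fd=/pid=" event and the "id=... r=..." reply) are assumptions made for illustration, as are the struct and function names.

```c
#include <assert.h>
#include <stdio.h>

/* The three pieces of information delivered with each access event. */
struct dazukofs_event {
    unsigned long id;   /* event id, echoed back in the response */
    int fd;             /* open descriptor for the file being accessed */
    int pid;            /* process id of the accessing program */
};

/*
 * ASSUMED event layout: "id=<id>\nfd=<fd>\npid=<pid>\n".
 * Returns 0 on success, -1 if the text does not match.
 */
int parse_event(const char *text, struct dazukofs_event *ev)
{
    assert(text != NULL && ev != NULL);
    /* A space in the scanf format matches any run of whitespace. */
    if (sscanf(text, "id=%lu fd=%d pid=%d", &ev->id, &ev->fd, &ev->pid) != 3)
        return -1;
    return 0;
}

/*
 * ASSUMED response layout: "id=<id> r=<n>", with 0 meaning allow and
 * 1 meaning deny. Returns the string length, or -1 if it does not fit.
 */
int format_response(char *out, size_t size, unsigned long id, int deny)
{
    int n;

    assert(out != NULL);
    n = snprintf(out, size, "id=%lu r=%d", id, deny ? 1 : 0);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

In a real scanner, the text passed to `parse_event()` would come from a blocking `read()` on the group device; after scanning the contents available through `ev.fd`, the process would write the formatted response back to the device and then close `ev.fd`.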

There are a few additional things that can be done via the user-space API: deleting groups, providing for some crash protection within groups, and handling accesses to protected files from within DazukoFS, all of which are described in the Documentation file.

There is also a major caveat that goes with this release of DazukoFS:

DazukoFS does not support writing to memory mapped files. This should not cause the kernel to crash, but will instead result in the application failing to perform the writes (although mmap() will appear to be successful from the application's viewpoint!).

That is done, at least partially, to avoid race conditions where a malicious program overwrites the file contents between the scanning and the actual access. This is a general Achilles' heel for virus scanning mechanisms, but silently ignoring writes to mapped files is a rather extreme reaction to that problem. TALPA, which has subsequently become fanotify, defines this problem away as not being a part of the threat model it is handling. Perhaps DazukoFS should do something similar.

It would seem likely that only one of the two proposed solutions for user-space file scanning will end up in the mainline. Ogness mentions fanotify in his patch submission:

I am aware of the current work of Eric Paris to implement a file access control mechanism within a unified inotify/dnotify framework. Although I welcome any official interface to provide a file access control mechanism for userspace applications in Linux, I feel that DazukoFS provides a more elegant solution. (Note that the two projects do not conflict with each other.)

So far, there has been no comment on the v2 patch submission, but there were some suggestions in response to the first submission back in December. The kernel filesystem hackers are pretty busy folks in general, but right now there are numerous filesystems in various states of review: btrfs, POHMELFS, DST, FS-Cache, and others. Those may be using up all of the available review bandwidth. Ogness recently announced that he will be dropping support for the 2.x version of Dazuko (which is based on system call hooks) to focus on DazukoFS. In that announcement, he notes the lack of review:

As you probably know, DazukoFS has been submitted for inclusion in the mainline Linux kernel. Unfortunately it is getting practically no attention. I do not know if the silence is because I am not CC'ing the correct people, because those people refuse to look at it, or because no one has any time for it.

From the announcement, it seems clear that Ogness has the patience necessary to shepherd DazukoFS through the kernel inclusion process. Working with Eric Paris to find some common ground between their two solutions might also be time well spent.

Patches and updates

Filesystems and block I/O

  • Boaz Harrosh: exofs. (February 9, 2009)

Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds