
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.23-rc3; the 2.6.23-rc4 release is somewhat overdue as of this writing. There has been a steady flow of fixes into the mainline git repository over the last week.

The current -mm tree is 2.6.23-rc3-mm1. Recent changes to -mm include the long-awaited ath5k wireless driver (mac80211-based support for Atheros 5xxx wireless cards), a long list of x86_64 patches, a new PID namespaces patch, and the PIE executable randomization patch from ExecShield.

The current stable 2.6 kernel was released on August 20. It contains a single patch - a security fix for the signal vulnerability which allows, under some circumstances, an arbitrary signal to be sent to a setuid process. (Note that the process for the next stable update has already begun, with that update due on or after the 23rd.)

For older kernels: a stable update was released on August 16 with a couple dozen fixes. The next 2.6.20 stable update is currently in review; it is a large patch set with quite a few fixes.

Comments (none posted)

Kernel development news

Quotes of the week

Why can DOS delete an infinite number of files and rm can't? Because rm was written using the "vi" editor and it causes brain damage and that's why after 20 years rm hasn't caught up with del.
-- Marc Perkel has solutions for all our problems

Asserting that critics should patch the holes in your handwaving is unlikely to impress anybody; arrogance is not in short supply around here and yours is not even original.
-- Al Viro responds (thanks to Dag Bakke).

Comments (49 posted)

Distributed storage

By Jonathan Corbet
August 21, 2007
Evgeniy Polyakov is not an easily discouraged developer. He has been the source of a great deal of interesting kernel code - including a network channels implementation, an asynchronous crypto framework, the kevent subsystem, the "network tree" memory management layer, and the netlink connector code. Of all of those patches, only the netlink connector has made it into the mainline kernel - and that was back in 2005. Undeterred, Evgeniy has come forward with another significant patch set for consideration. His ambitions are no lower this time around: he would like to replace much of the functionality offered by the device mapper, iSCSI, and network block device (NBD) layers. He calls the new subsystem distributed storage, or DST for short. The goal is to allow the creation of high-performance storage networks in a reliable and easy manner.

At the lowest level, the DST code implements a simple network protocol which allows block devices to be exported across a network. The number of operations supported is small: block read and write operations and a "how big is your disk?" information request is about it. But it is intended to be fast, non-blocking, and able to function without copying the data on the way through. The zero-copy nature of the code allows it to perform I/O operations with no memory allocations at all - though the underlying network subsystem might do some allocations of its own.

There is no data integrity checking built into the DST networking layer; it relies on the networking code to handle that aspect of things. There is also no real security support at all. If a block device is exported for use in DST, it is exported to anybody who can reach the host. The addition of explicit export lists could certainly be done in the future, but, for now, hosts exporting drives via DST are probably best not exposed to anything beyond an immediate local network.

The upper layer of the DST code enables the creation of local disks. A simple ioctl() call creates a local disk from a remote drive, essentially reproducing the functionality offered by NBD. Evgeniy claims better performance than NBD, though, with non-blocking processing, no user-space threads, and no busy-wait loops. There is also a simple failure recovery mechanism which will reconnect to remote hosts that go away temporarily.

Beyond that, though, the DST code can be used to join multiple devices - both local and remote - into larger arrays. There are currently two algorithms available: linear and mirrored. In a linear array, each device is added to the end of what looks like a much larger block device. The mirroring algorithm replicates data across each device to provide redundancy and generally faster read performance. There is infrastructure in place for tracking which blocks must be updated on each component of a mirrored array, so if one device drops out for a while it can be quickly brought up to date on its return. Interestingly, that information is not stored on each component; this is presented as a feature, in that one part of a mirrored array can be removed and mounted independently as a sort of snapshot. Block information also does not appear, in this iteration, to be stored persistently anywhere, meaning that a crash of the DST server could make recovery of an inconsistent mirrored array difficult or impossible.

Storage arrays created with DST can, in turn, be exported for use in other arrays. So a series of drives located on a fast local network can be combined in a sort of tree structure into one large, redundant array of disks. There is no support for the creation of higher-level RAID arrays at this time. Support for more algorithms is on the "to do" list, though Evgeniy has said that the Reed-Solomon codes used for traditional RAID are not fast enough for distributed arrays. He suggests that WEAVER codes might be used instead.

At this level, DST looks much like the device mapper and MD layers already supported by Linux. Evgeniy claims that the DST code is better in that it does all processing in a non-blocking manner, works with more network protocols, has simple automatic configuration, does not copy data, and can perform operations with no memory allocations. The zero-allocation feature is important in situations where deadlocks are a worry - and they are often a worry when remote storage is in use. Making the entire DST stack safe against memory-allocation deadlocks would require some support in the network layer as well - but, predictably, Evgeniy has some ideas for how that can be done.

This patch set is clearly in a very early state; quite a bit of work would be required before it would be ready for production use with data that somebody actually cares about. Like all of Evgeniy's patches, DST contains a number of interesting ideas. If the remaining little details can be taken care of, the DST code could eventually reach a point where it is seen as a useful addition to the Linux storage subsystem.

Comments (13 posted)

Who maintains this file?

By Jonathan Corbet
August 21, 2007
Kernel developers are generally encouraged to split patches into small pieces before posting them to the mailing lists. Making each change self-contained and easy to understand helps reviewers do their job and is thus a good thing. That said, anybody who doubted that one can get too much of a good thing surely learned the truth when Joe Perches submitted this patch set made up of almost 550 patches, all to the same file. It is fair to say that this deluge of patches was not universally welcomed.

Packaging aside, the ultimate goal of Joe's patch was not particularly controversial: he would like to make it possible to easily find out who is the maintainer of a specific file in the kernel tree. So, for each entry in the MAINTAINERS file, he added one or more lines with patterns describing which files belong to that entry. With that information in place, his script can quickly identify who is responsible for any file in the tree. No more digging through MAINTAINERS or trying to extract email addresses from copyright notices in the source.

It's an appealing idea, but nobody seems to be entirely clear on how to implement it. Keeping this information in a central file has a number of obvious disadvantages. It would clearly go out of date quickly, for example. The MAINTAINERS file tends to get stale as it is; the chances of it being patched for every new or renamed file seem quite small. If developers, contrary to expectations, do keep this file up to date, one can expect large numbers of conflicts as all the resulting patches try to touch the same file.

The patch conflict problem could be mitigated by splitting up the MAINTAINERS file into per-directory versions, much like what was done with the kernel configuration file in the past. There are now over 400 Kconfig files in the mainline tree; some developers have expressed dismay at the idea of similar numbers of MAINTAINERS files being scattered around the tree. And, in any case, per-directory files aren't much more likely to be updated than the single, central file.

So around came another idea: why not just put the maintainer information into the source files? The result would be nicely split documentation which gets put in front of the relevant developers every time they edit the file. The record for maintenance of documentation in the code is far from perfect, but it is much better than the record for completely out-of-line documentation.

One question which comes up when this approach is considered is whether the resulting information should go into the binary kernel image or not. It would be easy to define a new tag like:

    MODULE_MAINTAINER("Your name here");

The provided information could then go into a special section in the kernel image where special tools could find it. Doing things this way would make it possible for people who don't have a kernel tree handy to look up a maintainer. On the other hand, it would bloat the kernel image and fix information in a binary, widely-distributed form where it could persist long after it goes out of date. So ex-maintainers could continue receiving mail for years after they have changed all of the relevant documentation.

An alternative would be to just put the maintainer information at the top of the file as a comment. Then it would only be in the source, and would, presumably, be relatively easy to keep up to date. At least, until, say, a mailing list for a major subsystem moves and all of the associated source files have to be changed. For example, Adrian Bunk noted that the move of the netdev mailing list to vger would have forced patches to about 1300 files.

Yet another approach is to find a way to store the information in the git repository. Git already maintains quite a bit of metadata about source files; to some it seems natural to add maintainer information as well. So far, the git developers have not shown a lot of appetite for adding this sort of feature. But Linus did point out that one could already use git to a similar effect with a simple command:

Do a script like this:

	git log --since=6.months.ago -- "$@" |
		grep -i '^    [-a-z]*by:.*@' |
		sort | uniq -c |
		sort -r -n | head

and it gives you a rather good picture of who is involved with a particular subdirectory or file.

The advantage of doing things this way is that the resulting output gives a current picture of who has actually been working on a file - a picture which requires no explicit maintenance at all. That list of people is probably a much better group to send copies of patches to than whoever might be listed in a maintainers file; they are the ones who know about what is happening in that part of the tree now.

No real resolution has been reached on this topic. It may be that Linus's approach will be the one taken by default; it already works without the need to merge any patches at all. The question may well stay around for a while, though. Approximately 2,000 developers put patches into the mainline over the course of one year; keeping track of which of those developers is the best to notify of changes to a particular file is never going to be easy.

Comments (5 posted)

Network transmit batching

By Jonathan Corbet
August 22, 2007
At the core of most network drivers is the hard_start_xmit() method, which is called once for every packet which is to be transmitted. This method will normally acquire locks and insert the packet into the adapter's transmit queue. As a rule, outgoing packets do not accumulate in the kernel; they are handed to the driver, one at a time, when they are ready to go. There are times, though, when packets cannot be handed off immediately. If, for example, the hardware transmit queue is currently full, the networking subsystem will have to hold on to the packet until things clear out. Once the driver is able to accept packets for the device again, the one-at-a-time behavior will resume.

The networking developers are always looking for ways to squeeze a little more performance from their code. Krishna Kumar took a look at the behavior described above and wondered: why not pass the list of accumulated packets to the driver in a single call? Batching of transmission operations in this way has the potential to minimize the cost of locking and device preparation overhead, making packet transmission as a whole more efficient. To explore this idea, Krishna has posted a few versions of the SKB batching patch set.

Implementing SKB batching requires a couple of driver API changes - but they are small and only required for batching-aware drivers. The first step is to set the NETIF_F_BATCH_SKBS bit in the features field of the net_device structure. That flag tells the network stack that the driver can handle batched transmissions.

The prototype for hard_start_xmit() is:

    int (*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);

That prototype does not change, but a driver which has indicated that batching is acceptable for dev may find its hard_start_xmit() method called with skb set to NULL. The NULL value is an indication that there is a batch of packets to transmit; that batch will be found enqueued on the (new) list found at dev->skb_blist. So the (much simplified) form of a batching-aware driver's hard_start_xmit() function will look something like:

    if (skb)
        ret = send_a_packet(internal_dev, skb);
    else {
        while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
            ret = send_a_packet(internal_dev, skb);
            if (ret)
                break;
        }
    }
The reality of the situation can be a bit more complicated, especially if the driver implements optimizations like suppressing completion interrupts until the last packet of the batch has been sent. But the core of the change is as described here - not a whole lot to it.

As of this writing, the networking developers are still trying to determine what the performance effects of this patch are. There is particular interest in seeing how batching compares with TCP segmentation offloading, which is also, at its core, a transmission batching mechanism. The proof is very much in the benchmarks for a patch like this; if the results are good enough, the patch will likely be merged.

Comments (none posted)

Patches and updates

Kernel trees


Core kernel code

Development tools

Device drivers


Filesystems and block I/O

Memory management



Virtualization and containers


Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds