User: Password:
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.26-rc9, released on July 5. "Enough changes that we needed another -rc, and the regression list isn't emptying fast enough either (probably because a number of people, including reporters, are vacationing)." Along with the fixes there's a new driver for cameras which implement the standard USB video class spec. The long-format changelog has the details.

A few dozen changesets have been merged since 2.6.26-rc9, as of this writing. They are mostly fixes, but there is also a printk() extension which allows for higher-level format string specifiers; see below for details.

The current -mm kernel is 2.6.26-rc8-mm1. Recent changes to -mm include the MMU notifiers patch, a new alloc_pages_exact() function for more efficient large allocations, a lot of tweaks, and a patch series ending with "revert-revert-revert-revert-linux-next-revert-bootmem-add-return-value-to-reserve_bootmem_node.patch" that one probably doesn't want to know about.

The current stable 2.6 release is, released on July 2. "It contains a number of assorted bugfixes all over the tree. And once again, any users of the 2.6.25 kernel series are STRONGLY encouraged to upgrade to this release."

Comments (2 posted)

Kernel development news

Quotes of the week

The problem is that SystemTap hasn't really benefited from community based innovation largely because it doesn't have much of a community. The bigger picture problem Red Hat didn't see when they accepted the cash was that this project wouldn't generate a community just from the usual publish the code and they will come philosophy. The result is what we see to day: a klunky and accident prone tool that has sun engineers writing trite little homilies on the benefits of a planned political operating system.
-- James Bottomley

Think about some of the evil perpetrated by hal and the userspace suspend-resume scripts (and how much complexity with random XML fragments getting parsed by various dbus plugins), and tell me with a straight face that you would trust these modern-day desktop application writers with this interface. Because they *will* find some interesting way to (ab)use it.....
-- Ted Ts'o

It's readily obvious that people (ie: top-level maintainers) aren't even compile-testing their own stuff once it's merged into linux-next. You say (but don't provide evidence that) linux-next is too unstable to develop against. Well guess why? Because people are choosing to let it be that way.
-- Andrew Morton

Comments (none posted)

Quotes of the week part 2: the firmware flamewar

Not 15 minutes after David posted his note, we're now up to 11 reports; and this is only from an -mm patch series. Can you imagine the number of bug reports if this were allowed to ship in a mainline release? One good thing is that we can definitely show that there people that are downloading, compiling and trying to build the -mm kernel. :-)
-- Ted Ts'o

External firmware is by design an error prone system, even with versioning. But by being built and linked into the driver, it is fool proof.

On a technical basis alone, we would never disconnect a crucial component such as firmware, from the driver. The only thing charging these transformations, from day one, is legal concerns.

-- David Miller

All forms of change introduce _some_ risk of breakage, of course. In this case, as usual, I've tried to be careful to avoid regressions. The most important part, obviously, was having a way to build firmware into the static kernel.

When it comes to _modules_, doing that would introduce a certain amount of complexity which just doesn't seem necessary -- if you can load modules, then you have userspace, and you can use hotplug for firmware too. Especially given that so many modern drivers already _require_ you to do that, so the users understand it, and the tools like mkinitrd already cope with it -- checking MODULE_FIRMWARE() for the modules they include and including the appropriate files automatically. [...]

You need to stay sober for long enough to say 'Y' when it asks you if you want to build the required firmware into the kernel. And we even made that the _default_ now, for the benefit of those who can't stay sober that long. (Perhaps we'll make 'allyesconfig' the default next?)

-- David Woodhouse

Comments (none posted)

The current development kernel is...linux-next?

By Jonathan Corbet
July 8, 2008
One of the development process advantages brought by git (and by BitKeeper before it) is the ability to see the up-to-the-second, bleeding-edge status of Linus's tree. So any developer who wants to know where the front edge of development lies can grab that tree and make patches fit into it. But the value of the mainline repository for development would appear to be less than it once was. The mainline is no longer where the action is.

Consider, for example, this response from Andrew Morton after finding that a patch posted to linux-kernel would not compile for him:

I assume this patch was prepared against some ancient out-of-date kernel such as current Linus mainline. Guys, we have a new development tree now.

He followed up with this statement:

But what I am repeatedly seeing is people cheerfully raising 2.6.27 patches against the 2.6.26 tree when we have a nice 2.6.27 tree for developing against. Those days are over, guys.

So the message would appear to be clear: development work should be done against the linux-next tree rather than against the mainline kernel. There are some clear advantages to having work done in this way. Patches developed against linux-next should merge cleanly during the next merge window. Developers will be testing each other's trees as they work, causing bugs to turn up earlier in the process. And, of course, Andrew won't have to complain about patches which fail to build for him - at least, not as often.

Linux-next is a somewhat strange base on which to try to develop, though. It is built anew every day from over 100 subsystem trees, each of which can, itself, change from one day to the next. So linux-next is a moving target, just like the mainline is. But, unlike the mainline, linux-next has no consistent or coherent history. Every day's linux-next tree is a completely new creation with a unique - and transient - history.

Consider a developer who bases some work on a mainline release - 2.6.26-rc9, say. That developer's work will be derived from a specific commit in the mainline tree, known as b7279469d66b55119784b8b9529c99c1955fe747 in this case. The history from 2.6.26-rc9 is well defined, and that series of patches can be merged into any other repository which also contains 2.6.26-rc9; the identity of that commit is consistent and immutable across all repositories. With such a development tree, it is (relatively) easy to track the mainline as it advances, and to merge one's work when the time comes. A git tree based on the mainline sits on a solid foundation.

It is not possible to base a tree on linux-next in the same way. Development can begin at a specific commit, but tomorrow's linux-next tree may not contain that commit at all. The various component trees will have advanced independently of the previous day's linux-next tree, which can, in itself, complicate things. But the process of making all those trees come together can involve tasks like moving patches from one tree to another, or fixing intermediate patches which break things. That makes the end result better, but at the cost of rebasing those trees. Rebasing completely rewrites the development history, causing the old history to disappear from the tree. So a patch series based on the previous history loses its foundation.

And, since linux-next is built from its components every day, a patch developed on top of linux-next may, when integrated into that tree, be merged somewhere in the middle of the sequence; in other words, the patch will be merged into a tree which differs considerably from the tree on which it was developed. As Stephen Rothwell, the maintainer of the linux-next tree, put it:

One downsides of the way linux-next works is that, because it is recreated every day, you cannot really base anything on it that is to be merged into it.

Another interesting aspect of linux-next development involves API changes. The longstanding rule in kernel development is that internal kernel interfaces can be changed if there is a good reason to do so, but that the person making the change is obligated to fix all in-tree code broken by that change. If an API change is introduced into linux-next, though, the developer is simply not able to fix any code which enters linux-next by way of the other subsystem trees. If the developer does get patches into those trees for the API change, they can no longer be built on top of kernels which lack that change - the mainline, for example. API changes have, in other words, become harder to do - a situation which some may see as a good thing.

What all this means is that API changes must be handled through techniques like the creation of backward-compatibility layers; those layers can then be removed a development cycle or two later once the transition is complete. Or changes can be split up and added to individual subsystem trees; that, however, can lead to interesting ordering dependencies between the trees. In some cases, we are seeing 2.6.27 changes being merged into 2.6.26 in stub form as a way of making all of the pieces fit together.

Then, there is the simple matter that developers like to have a stable base upon which to create their code. The linux-next tree, since it contains large amounts of relatively new code, will also contain its share of new bugs. That makes developers, who are often having enough trouble just tracking down their own bugs, somewhat grumpy. Development against the mainline tends to have a lower probability of forcing developers to look for bugs which are not of their own making.

Many of these complaints have an easy answer: the pain which comes from making all the pieces fit together in linux-next must be faced at some point anyway. The real difference is that linux-next allows those problems to be dealt with at leisure, while the older "merge everything in the mainline" model compressed much of that work into the merge window. How beneficial that really is will be seen for the first time in the 2.6.27 merge window; if linux-next is serving its intended function, 2.6.27 should come together with rather less hassle than its immediate predecessors did.

But, regardless of the value provided by linux-next for integration and testing purposes, the fact remains that it is a difficult platform upon which to develop patches. That process is somewhat like building a house on a sand bar; overnight the tide comes in and completely reshapes the land underneath you. That is why most (possibly all) of the subsystem trees used to assemble linux-next are, themselves, based on the mainline.

The solution to that problem will have to evolve over time. The linux-next tree is a new institution which is still finding its proper place in the development process. Easier ways to develop patches against the linux-next tree will certainly be worked out; it may well turn out that quilt-like tools work better for this task than git. But, for now, linux-next is an excellent integration and testing resource, but it has not quite yet managed to become the true Linux kernel development tree.

Comments (23 posted)

Multiqueue networking

By Jonathan Corbet
July 8, 2008
One of the fundamental data structures in the networking subsystem is the transmit queue associated with each device. The core networking code will call a driver's hard_start_xmit() function to let the driver know that a packet is ready for transmission; it is then the driver's job to feed that packet into the hardware's transmit queue. The result is a data structure which looks vaguely like this:

[Network transmit queue]

"Vaguely" because the list of sk_buff structures (SKBs - the internal representation of packets) does not exist in this form within the kernel; instead, the driver maintains the queue in a way that the hardware can process it.

This is a scheme which has worked well for years, but it has run into a fundamental limitation: it does not map well to devices which have multiple transmit queues. Such devices are becoming increasingly common, especially in the wireless networking area. Devices which implement the Wireless Multimedia Extensions, for example, can have four different classes of service: video, voice, best-effort, and background. Video and voice traffic may receive higher priority within the device - it is transmitted first - and the device can also take more of the available air time for such packets. On the other hand, the queues for this kind of traffic may be relatively short; if a video packet doesn't get sent on its way quickly, the receiving end will lose interest and move on. So it might be better to just drop video packets which have been delayed for too long.

On the other hand, the "background" level only gets transmitted if there is nothing else to do; it is well-suited to low-priority traffic like bittorrent or email from the boss. It would make sense to have a relatively long queue for background packets, though, to be able to take full advantage of a lull in higher-priority traffic.

Within these devices, each class of service has its own transmit queue. This separation of traffic makes it easy for the hardware to choose which packet to transmit next. It also allows independent limits on the size of each queue; there is no point in filling the device's queue space with background traffic which is not going to be transmitted in any case. But the networking subsystem does not have any built-in support for multiqueue devices. This hardware has been driven using a number of creative techniques which have gotten the job done, but not in an optimal way. That may be about to change, though, with the advent of David Miller's multiqueue transmit patch series.

The current code treats a network device as the fundamental unit which is managed by the outgoing packet scheduler. David's patches change that idea somewhat, since each transmit queue will need to be scheduled independently. So there is a new netdev_queue structure which encapsulates all of the information about a single transmit queue, and which is protected by its own lock. Multiqueue drivers then set up an array of these structures. So the new data structure can, with sufficient imagination, be seen to look something like this:

[Multiqueue tx structure]

Once again, the actual lists of outgoing packets normally exist in the form of special data structures in device-accessible memory. Once the device has these queues set up for it, the various policies associated with each class of service can be implemented. Each queue is managed independently, so more voice packets can be queued even if some other queue (background, say) is overflowing.

David would appear to have worked hard to avoid creating trouble for network driver developers. Drivers for single-queue devices need not be changed at all, and the addition of multiqueue support is relatively straightforward. The first step is to replace the alloc_etherdev() call with a call to:

    struct net_device *alloc_etherdev_mq(int sizeof_priv, 
                                         unsigned int queue_count);

The new queue_count parameter describes the maximum number of transmit queues that the device might support. The actual number in use should be stored in the real_num_tx_queues field of the net_device structure. Note that this value can only be changed when the device is down.

A multiqueue driver will get packets destined for any queue via the usual hard_start_xmit() function. To determine which queue to use, the driver should call:

    u16 skb_get_queue_mapping(struct sk_buff *skb);

The return value is an index into the array of transmit queues. One might well wonder how the networking core decides which queue to use in the first place. That is handled via a new net_device callback:

    u16	(*select_queue)(struct net_device *dev, struct sk_buff *skb);

The patch set includes an implementation of select_queue() which can be used with WME-capable devices.

About the only other required change is for multiqueue drivers to inform the networking core about the status of specific queues. To that end, there is a new set of functions:

    struct netdev_queue *netdev_get_tx_queue(struct net_device *dev,
                                             u16 index);

    void netif_tx_start_queue(struct netdev_queue *dev_queue);
    void netif_tx_wake_queue(struct netdev_queue *dev_queue);
    void netif_tx_stop_queue(struct netdev_queue *dev_queue);

A call to netdev_get_tx_queue() will turn a queue index into the struct netdev_queue pointer required by the other functions, which can be used to stop and start the queue in the usual manner. Should the driver need to operate on all of the queues at once, there is a set of helper functions:

    void netif_tx_start_all_queues(struct net_device *dev);
    void netif_tx_wake_all_queues(struct net_device *dev);
    void netif_tx_stop_all_queues(struct net_device *dev);

Naturally, there are a few other details to deal with, and the multiqueue interface is likely to evolve somewhat over time. At one point, David was hoping to have this feature ready for inclusion into 2.6.27, but that goal looks overly ambitious now. It does seem that much of the ground work will be merged in the next development cycle, though, meaning that full multiqueue support should be in good shape for merging in 2.6.28.

Comments (9 posted)

Enhanced printk() merged

By Jake Edge
July 9, 2008

A change very late in the development cycle for 2.6.26 provides a framework for extending printk() to handle new kinds of arguments. Linus Torvalds just merged the change—after -rc9—presumably partially because he knew he could trust the author, but also because it should have no effect on the kernel. It will provide for better debugging output once code is changed to take advantage of it.

The core idea is to extend printk() so that kernel data structures can be formatted in kernel-specific ways. In order to get some compile-time checking, the %p format specifier has been overloaded. For example, %pI might be used to indicate that the associated pointer is to be formatted as a struct inode, which could print the most interesting fields of that structure. GCC will be able to check for the presence of a pointer argument, but because it does not understand the I part, cannot enforce that it is a pointer of the right type.

Extending printk() in this manner allowed Torvalds—who authored the patch—to add two new types to printk(): %pS for symbolic pointers and %pF for symbolic function pointers. In both cases, the code uses kallsyms to turn the pointer value into a symbol name. Instead of a kernel developer having to read long address strings and then trying to find them in the system map, the kernel will do that work for them.

The %pF specifier is for architectures like ppc and ia64 that use function descriptors rather than pointers. For those architectures, a function pointer points to a structure that contains the actual function address. By using the %pF specifier, the proper dereferencing is done.

As an example of how the augmented printk() could be used, Torvalds converted printk_address(). The CONFIG_KALLSYMS dependency and the kallsyms_lookup() were removed, essentially leaving a one-line function:

    printk(" [<%016lx>] %s%pS\n", address, reliable ? "": "? ", (void *) address);
If kallsyms is not present, the new printk() just reverts to printing the address in hexadecimal, which allows the special case handling to be done there.

The clear intent is to allow additional extensions to printk() to support other kernel data structures. The change to vsprintf(), which underlies printk(), actually allows for any sequence of alphanumeric characters to appear after the %p. The new pointer() helper function currently only implements the two new specifiers, but others have been mentioned.

The mostly likely additions are for things like IPv4, IPv6, and MAC addresses. Torvalds specifically mentions using %p6N as a possibility for IPv6 addresses. Some would rather have seen a different syntax be used, %p{feature} was suggested, but that would conflict with some current uses of %p in the kernel. Torvalds is happy with his choice:

I _expressly_ chose '%p[alphanumeric]*' because it's basically totally insane to have that in a *real* printk() string: the end result would be totally unreadable.

The patch took an interesting route to the kernel, with much of the discussion evidently going on in private between Torvalds, Andrew Morton, and others before popping up on the linuxppc-dev and linux-ia64 mailing lists. The patch itself has not been posted to linux-kernel in its complete form, but was committed on July 6. While it is a bit strange to see such a change this late in the development cycle, it is a change that should have no impact as there are no plans to actually use the new specifiers in 2.6.26.

Comments (6 posted)

Patches and updates

Kernel trees


Core kernel code

Development tools

Device drivers


Filesystems and block I/O

Memory management



Virtualization and containers

Benchmarks and bugs


Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds