
Kernel development

Brief items

Kernel release status

The current 2.6 kernel remains 2.6.19. The 2.6.20 merge window has opened, and the first pile of patches has been merged (see below); it will probably be at least another week before 2.6.20-rc1 comes out, however.

There have also been no -mm releases over the last week. Andrew Morton has posted the -mm merge plan for 2.6.20, however, so one can see how -mm is expected to shrink as patches move to the mainline.

Older release news: was released on December 1. It contains a couple dozen important fixes.

Adrian Bunk has released; it contains a rather long list of fixes.

Willy Tarreau has announced 2.4.34-rc1 with one security update and a relatively small number of other fixes.


Kernel development news

Quote of the week

-static void stli_dohangup(void *arg)
+static void stli_dohangup(struct work_struct *ugly_api)
-    stliport_t *portp = (stliport_t *) arg;
+    stliport_t *portp = container_of(ugly_api, stliport_t, tqhangup);
-- Al Viro adapts to the new workqueue API


The 2.6.20 cycle begins

Toward the end of the 2.6.19 cycle, there was a brief linux-kernel discussion on whether 2.6.20 should be a bugfix-only release. Just in case anybody thought that might actually happen, the patches merged for 2.6.20 will make the situation clear. There will be a lot of new stuff in the next stable kernel release.

That said, the rate of patches into the kernel has been lower than in some previous cycles. It may be that the workqueue patches have created some conflicts which are slowing things down.

As of this writing, the user-visible changes merged include:

  • New drivers for NetXen 1G/10G Ethernet controllers, Atmel MACB Ethernet modules, Tsi108/9 Ethernet controllers, and Chelsio Ethernet controllers (but without TCP offload support).

  • Numerous serial and parallel ATA driver improvements.

  • SCSI busses can optionally be scanned asynchronously. On large systems with many SCSI peripherals, this can speed the bootstrap process considerably.

  • The set of TCP congestion control algorithms which can be selected by unprivileged processes has been restricted to those which are known to be robust and fair. The system administrator can still select any algorithm supported by Linux.

  • Various improvements have been made to the DCCP code, including SELinux support.

  • Some obsolete, unsupported, and presumably unused capabilities have been removed, including the frame diverter and the floppy tape (ftape) driver.

  • MD5 protection for TCP sessions (RFC 2385) has been added; this capability is normally only used with the BGP routing protocol.

  • The UDP-Lite protocol (RFC 3828) is now supported; see the UDP-Lite page for more information on this protocol, which is oriented toward the needs of streaming multimedia applications.

Changes visible to kernel developers include:

  • The workqueue API changes have been merged, resulting in changes throughout the tree. David Howells has posted a detailed set of instructions on how to fix code broken by these changes.

  • Much of the sysfs-related code has been changed to use struct device in place of struct class_device. The latter structure will eventually go away as the class and device mechanisms are merged.

  • There is a new function:

        int device_move(struct device *dev, struct device *new_parent);

    This function will reparent the given device to new_parent, making the requisite sysfs changes and generating a special KOBJ_MOVE event for user space.

  • The networking subsystem has been heavily annotated for automated checking using sparse.

  • A number of kernel header files which included other headers no longer do so. For example, <linux/fs.h> no longer includes <linux/sched.h>. These changes should speed kernel build times by getting rid of a large number of unneeded includes, but might break some out-of-tree modules which do not explicitly include all the headers they need.

The merge window should stay open for another week or so, so there's plenty of time for more stuff to be added. Those who can't wait might want to take a look at Andrew Morton's -mm merge plan posting for some previews of what's coming.


The timer API: size or type safety?

The timer API allows kernel code to request that a function be called at some point in the future. At its core is the timer_list structure, which contains a few fields of interest:

    struct timer_list {
	unsigned long expires;
	void (*function)(unsigned long);
	unsigned long data;
	/* ... */
    };

To request an action in the future, a kernel function places an absolute expiration time (expressed in jiffies, so typically the current value of jiffies plus the desired delay) in expires and some sort of useful private value in data. function() is a pointer to a routine which will be called once jiffies has reached (at least) that value; data will be its only parameter. After the timer_list structure has been set up, a call to add_timer() puts the request into the system.
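The mechanics can be tried out in user space. What follows is a minimal model, not kernel code: the jiffies counter, the pending list, the next field, and tick() are all stand-ins invented for illustration, and only the three fields shown above come from the real structure.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model only: the names mimic the kernel's, nothing more. */
static unsigned long jiffies;            /* stand-in for the tick counter */

struct timer_list {
	struct timer_list *next;         /* hypothetical: simple pending list */
	unsigned long expires;           /* absolute jiffies value */
	void (*function)(unsigned long);
	unsigned long data;
};

static struct timer_list *pending;

/* Queue a timer; the model keeps an unsorted singly-linked list. */
static void add_timer(struct timer_list *t)
{
	t->next = pending;
	pending = t;
}

/* Advance the clock one tick and fire every timer whose time has come. */
static void tick(void)
{
	struct timer_list **pp = &pending;

	jiffies++;
	while (*pp) {
		struct timer_list *t = *pp;

		/* wraparound-safe "jiffies >= expires" comparison */
		if ((long)(jiffies - t->expires) >= 0) {
			*pp = t->next;           /* unlink before calling */
			t->function(t->data);
		} else {
			pp = &t->next;
		}
	}
}

struct blinker { int fired; };

static void blink(unsigned long data)
{
	/* the pointer round-trips through unsigned long */
	struct blinker *b = (struct blinker *)data;

	b->fired++;
}

/* Arm a timer three ticks ahead and report how often it fired. */
static int demo_timer(void)
{
	struct blinker b = { 0 };
	struct timer_list t = { 0 };

	t.expires = jiffies + 3;
	t.function = blink;
	t.data = (unsigned long)&b;
	add_timer(&t);

	tick();
	tick();
	assert(b.fired == 0);            /* too early */
	tick();
	return b.fired;
}
```

Calling demo_timer() from a main() function shows the callback firing exactly once, on the third tick. The cast in blink() is the pattern at issue in the discussion below.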

This API has not changed much in some time; as a result, the description of timers in Chapter 7 of Linux Device Drivers is still useful for those wanting details. It may, in fact, be the only part of LDD3 which is not yet thoroughly obsolete.

That situation may change soon, however, as there are developers with their eyes on this interface. Interestingly, there are two very different ideas of how the timer API should be changed.

The conversation was started by Al Viro who, for some time now, has been working on improving the type safety of the kernel API. He notes that the unsigned long argument to timer functions is, in fact, almost always a pointer value. So there is a lot of code in the kernel which is busily casting pointers to unsigned long values and back - or engaging in lazy trickery to avoid having to do those casts. Casts like this make compile-time type checking almost impossible, so every one is an opportunity to introduce hard-to-find bugs.

Al would like to fix this problem by creating a more type-safe interface to the kernel timer subsystem. His approach involves changing the type of the timer function argument to void *, reflecting the fact that it's usually a pointer type. He then has a SETUP_TIMER() macro which involves the following bit of code:

    typeof(*data) *p = data;
    timer->function = (void (*)(void *)) func;
    timer->data = (void *) p;
    (void)(0 && (func(p), 0));

The middle two lines are simply initializing the relevant fields of the timer_list structure. What the last line is doing, however, is creating a call to the timer function with the provided argument; if there is a type mismatch between that argument and the function's prototype, the compiler will complain. The call is written in such a way that it will be optimized out, so that call does not make it through to the kernel image. But, in the running kernel, it will be known that the timer function is receiving an appropriately-typed argument.
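To see the trick in action outside the kernel, here is a hedged reconstruction. The macro wrapper, its parameter names, and the struct port example are assumptions made for illustration; only the four statements quoted above come from Al's posting. It relies on the GNU C typeof extension (spelled __typeof__ here), and calling the function through the cast pointer assumes, as the kernel does, that all object pointers are passed the same way.

```c
#include <assert.h>

/* User-space model: a timer whose callback takes void *, as Al proposes. */
struct timer_list {
	unsigned long expires;
	void (*function)(void *);
	void *data;
};

/*
 * Hypothetical wrapper around the four quoted lines. The final statement
 * is never executed (0 && ...), but the compiler still checks that
 * func accepts an argument of p's type.
 */
#define SETUP_TIMER(timer, func, arg)                                   \
	do {                                                            \
		__typeof__(*(arg)) *p = (arg);                          \
		(timer)->function = (void (*)(void *))(func);           \
		(timer)->data = (void *)p;                              \
		(void)(0 && ((func)(p), 0));                            \
	} while (0)

struct port { int hangups; };

/* The callback is fully typed; no cast is needed in its body. */
static void port_hangup(struct port *portp)
{
	portp->hangups++;
}

static int demo_setup_timer(void)
{
	struct port pt = { 0 };
	struct timer_list t = { 0 };

	SETUP_TIMER(&t, port_hangup, &pt);
	/* SETUP_TIMER(&t, port_hangup, &t) would draw a compiler diagnostic. */

	t.function(t.data);      /* what the timer core would do at expiry */
	return pt.hangups;
}
```

The only cast lives inside the macro, where it is guarded by the dead call; callers and callbacks stay cast-free.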

There are a lot of timers in the kernel, so this is the sort of change which makes people nervous. Al's plan involves creating the SETUP_TIMER() macro, but leaving the callback function's prototype unchanged. Then parts of the kernel could be converted at leisure, with the callback function prototype being changed once the conversion of in-kernel code is complete.

Thomas Gleixner joined in with an alternative suggestion: remove the data value from struct timer_list altogether, and pass a pointer to the timer_list structure into the callback function. If that structure is embedded within some other structure which has the information the callback really needs, a simple recast with container_of() will yield the needed pointer. The result would be a smaller timer_list structure. This approach mirrors the proposed workqueue API changes discussed here last week.
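A user-space sketch of that alternative might look like the following. The container_of() definition is a simplified version of the kernel's macro, and struct port with its hangup_timer field is a made-up example; the timer_list shown here models the proposal, not any merged code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of(): step back from a member to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Model of the proposal: no data field, the callback gets the timer. */
struct timer_list {
	unsigned long expires;
	void (*function)(struct timer_list *);
};

struct port {
	int hangups;
	struct timer_list hangup_timer;   /* embedded, not pointed to */
};

static void port_hangup(struct timer_list *t)
{
	/* Nothing verifies that t really lives inside a struct port. */
	struct port *portp = container_of(t, struct port, hangup_timer);

	portp->hangups++;
}

static int demo_container_of(void)
{
	struct port pt = { 0 };

	pt.hangup_timer.function = port_hangup;
	pt.hangup_timer.function(&pt.hangup_timer);   /* simulated expiry */
	return pt.hangups;
}
```

The structure shrinks by one word per timer, at the cost of one container_of() call in every callback.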

Al doesn't like that idea. He has been working to get rid of casts in the kernel, but this API would require the introduction of hundreds more of them. There is little type safety built into container_of(). To him, the space required for a pointer is more than justified by the extra compile-time checking that comes from its use.

Ingo Molnar, in disagreeing, makes the tradeoff clear:

The question is: which is more important, the type safety of a container_of() [or type cast], which if we get it wrong produces a /very/ trivial crash that is trivial to fix - or embedded timers data structure size all around the kernel? I believe the latter is more important.

Not too many other developers have joined the discussion so far. It's an important one, though; how this decision goes could shape how kernel APIs are designed in the future. Perhaps somebody will come up with a way to have both type safety and smaller size. Until such a time, however, there is a tradeoff to be made, and it's not clear which way the decision will go.


Secure deletion and trash bin support

A look at the man page for the chattr command reveals some interesting functionality; users may set special bits on files to request either that the file be undeletable, or that deletion be "secure" - meaning that the file's contents truly disappear from the disk. The key word here, however, is "request." Those bits have existed for many years, but few - if any - Linux filesystems actually implement those features. The undeletable and secure deletion flags are just placeholders for a "would be nice" feature to be added in the future. Someday.

That day may be a little closer thanks to this patch posted by Nikolai Joukov. It adds support for those two flags to ext4 in a relatively simple and straightforward way.

The patch works like this: whenever the last link is removed from a file, the undeletable and secure deletion flags are checked. Should either one be set, the file will be moved over to the .trash/<uid>/ directory in the root of the filesystem. Each per-uid directory has restrictive permissions, keeping users from perusing each others' deleted files. There are no subdirectories, so the path information is lost; preserving paths might be added in a future version. A number is appended to the file name when collisions with files already in the trash happen.
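The naming rule can be modeled in a few lines of user-space C. This is an illustration only, not the patch's code: the ".N" suffix format, the exists callback, and the retry limit are all assumptions, since the description says only that a number is appended on collision.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical lookup callback: nonzero if the path is already taken. */
typedef int (*exists_fn)(const char *path);

/*
 * Build ".trash/<uid>/<name>", appending ".1", ".2", ... on collision.
 * Returns 0 on success, -1 if the buffer is too small or no free name
 * turns up within the (arbitrary) retry limit.
 */
static int trash_path(char *buf, size_t len, unsigned int uid,
		      const char *name, exists_fn exists)
{
	int n = snprintf(buf, len, ".trash/%u/%s", uid, name);

	if (n < 0 || (size_t)n >= len)
		return -1;
	for (unsigned int i = 1; exists(buf); i++) {
		if (i > 9999)
			return -1;
		n = snprintf(buf, len, ".trash/%u/%s.%u", uid, name, i);
		if (n < 0 || (size_t)n >= len)
			return -1;
	}
	return 0;
}

/* Toy stand-in for a directory lookup: two names are already present. */
static int toy_exists(const char *path)
{
	return strcmp(path, ".trash/1000/notes.txt") == 0 ||
	       strcmp(path, ".trash/1000/notes.txt.1") == 0;
}
```

With toy_exists() as the lookup, a third deletion of notes.txt by uid 1000 lands at .trash/1000/notes.txt.2. Only the base name is passed in, matching the current lack of path preservation.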

That's it for the kernel side. Undeletion is easily handled from user space by simply moving the file back out of the trash. The secure deletion feature is also to be done in user space. A special daemon can overwrite the file data in whatever way best suits the user's paranoia, then delete the file for real. A possible addition to the patch is a notification mechanism to force that daemon to run when filesystem space gets tight. In any case, all of the policy decisions on how to handle secure deletion requests would live in user space.
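As a rough idea of what the simplest such daemon might do with each file it finds in the trash, consider the sketch below. The function name and the single pass of zeros are assumptions; the patch leaves the overwrite policy entirely to user space. An in-place overwrite is also not guaranteed to reach every copy of the data on journaling or copy-on-write filesystems.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrite a file with zeros, flush it to disk, then unlink it.
 * Returns 0 on success, -1 on any failure (the file is then left alone). */
static int scrub_and_unlink(const char *path)
{
	static const char zeros[4096];
	struct stat st;
	off_t left;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0)
		goto fail;
	for (left = st.st_size; left > 0; ) {
		size_t chunk = left < (off_t)sizeof(zeros) ?
			       (size_t)left : sizeof(zeros);
		ssize_t n = write(fd, zeros, chunk);

		if (n <= 0)              /* a real daemon would retry on EINTR */
			goto fail;
		left -= n;
	}
	if (fsync(fd) < 0)               /* push the overwrite out to storage */
		goto fail;
	close(fd);
	return unlink(path);
fail:
	close(fd);
	return -1;
}
```

A more paranoid policy (multiple passes, random data) would change only the loop; the kernel side stays the same.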

One might wonder why the trash can needs to be implemented in the kernel. The desktop projects have, after all, had a trash can available for some time. There seem to be two reasons why this patch adds that functionality. The first is that it comes for free with this approach to secure deletion. More importantly, however: it is not really possible for a user-space solution to intercept every attempt to delete a file. The nicest file manager available will not be able to do anything about an "rm" command typed into a shell, or an unlink() call from within a non-cooperating application. Catching file deletion within the kernel ensures that none will slip through the cracks.

The patch has not received a whole lot of comments as of this writing. One question which has come up is: why not do this at the VFS layer, rather than within ext4? There is little that is ext4-specific about the patch, and doing the work within the VFS would make this feature available to all filesystems - at least those which support the relevant file flags. Mr. Joukov agrees that moving this feature up might be the right thing to do, so there may be a reworked version of this patch coming in the future.


Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds