Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.17-rc1, released by Linus on April 2. Patches merged since last week include the splice() and sync_file_range() system calls (see below), hotplug memory support for User-mode Linux, an LED subsystem, the conversion of local_t into a signed type, basic support for braille devices in the input layer, and the "ipath" driver for PathScale InfiniPath devices. See last week's and the previous week's summaries for detailed lists of changes in 2.6.17-rc1. For even more detail, see the short-form and long-format changelogs.

No patches have been merged into the mainline since 2.6.17-rc1 was released.

The current -mm tree is 2.6.17-rc1-mm1. Recent changes to -mm include a lot of fixes and a new version of the kgdb debugger, but little in the way of major changes.

Kernel development news

Quote of the week

Which part of "sysfs patches can be written by idiots and usually are" is too hard to understand? Oh, wait. I see... Well, nevermind, then...

-- Al Viro is back.

The LWN Kernel Page in Czech

Your editor, who has enough trouble putting together a Kernel Page in English, has never seriously thought about supporting other languages. It is, however, pleasant to see that this page can now be found translated into Czech, thanks to Robert Kratky. Hopefully this work will be useful to Czech readers.

Two new system calls: splice() and sync_file_range()

The 2.6.17 kernel will include two new system calls which expand the capabilities available to user-space programs in some interesting ways. This article contains a look at the current form of these new interfaces.


The splice() system call has a long history. First proposed by Larry McVoy in 1998, it was seen as a way of improving I/O performance on server systems. Despite being often mentioned in the following years, no splice() implementation was ever created for the mainline Linux kernel. That situation changed, however, just before the 2.6.17 merge window closed, when Jens Axboe's splice() patch, along with a number of modifications, was merged.

As of this writing, the splice() interface looks like this:

    long splice(int fdin, int fdout, size_t len, unsigned int flags);

A call to splice() will cause the kernel to move up to len bytes from the data source fdin to fdout. The data will move through kernel space only, with a minimum of copying. In the current implementation, at least one of the two file descriptors must refer to a pipe device. That requirement is a limitation of the current code, and it could be removed at some future time.

The flags argument modifies how the copy is done. Currently implemented flags are:

  • SPLICE_F_NONBLOCK: makes the splice() operations non-blocking. A call to splice() could still block, however, especially if either of the file descriptors has not been set for non-blocking I/O.

  • SPLICE_F_MORE: a hint to the kernel that more data will come in a subsequent splice() call.

  • SPLICE_F_MOVE: if the output is a file, this flag will cause the kernel to attempt to move pages directly from the input pipe buffer into the output address space, avoiding a copy operation.
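As a rough illustration of the pipe requirement, here is a minimal user-space sketch that copies data between two file descriptors by splicing through a temporary pipe. It assumes the six-argument splice() wrapper found in current glibc, which takes offset pointers and so differs from the four-argument prototype shown above; the helper name splice_copy() is hypothetical.

```c
#define _GNU_SOURCE          /* exposes splice() and SPLICE_F_* in glibc */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from fd_in to fd_out by
 * splicing through a temporary pipe, so that one side of every splice()
 * call is a pipe. Returns the number of bytes copied, or -1 on error.
 * Assumes the glibc wrapper, where a NULL offset means "use the file
 * descriptor's current position".
 */
ssize_t splice_copy(int fd_in, int fd_out, size_t len)
{
    int pipefd[2];
    ssize_t total = 0;

    if (pipe(pipefd) < 0)
        return -1;

    while (len > 0) {
        /* Source -> pipe: moves at most what the pipe can hold. */
        ssize_t in = splice(fd_in, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
        if (in < 0) {
            total = -1;
            break;
        }
        if (in == 0)
            break;                      /* end of input */
        len -= (size_t)in;

        /* Pipe -> destination: drain everything that was just queued. */
        while (in > 0) {
            ssize_t out = splice(pipefd[0], NULL, fd_out, NULL,
                                 (size_t)in, SPLICE_F_MOVE);
            if (out <= 0) {
                total = -1;
                goto done;
            }
            in -= out;
            total += out;
        }
    }
done:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```

The inner loop matters because a single call may move less than was requested; SPLICE_F_MOVE is only a request, and the kernel is free to copy instead.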

Internally, splice() works using the pipe buffer mechanism added by Linus in early 2005 - that is why one side of the operation is required to be a pipe for now. There are two additions to the ever-growing file_operations structure for devices and filesystems which wish to support splice():

    ssize_t (*splice_write)(struct inode *pipe, struct file *out, 
                            size_t len, unsigned int flags);
    ssize_t (*splice_read)(struct file *in, struct inode *pipe, 
                           size_t len, unsigned int flags);

The new operations should move len bytes between pipe and either in or out, respecting the given flags. For filesystems, there are generic implementations of these operations which can be used; there is also a generic_splice_sendpage() which is used to enable splicing to a socket. As of this writing, there are no splice() implementations for device drivers, but there is nothing preventing such implementations in the future, for char devices at least.

Discussions on the linux-kernel list suggest that the splice() interface could change before it is set in stone with the 2.6.17 release. Andrew Tridgell has requested that an offset argument be added to specify where copying should begin - either that, or a separate psplice() should be added. There is also some concern about error handling; if a splice() call returns an error, how does the application tell whether the problem is with the input or the output? Resolving those issues may require some interface changes over the next month or so.


Early in the 2.6.17 process, some changes to the posix_fadvise() system call were merged. The new, Linux-specific options were meant to give applications better control over how data written to files is flushed to the physical media. The capabilities provided are needed, but there were concerns about extending a POSIX-defined function in a Linux-specific way. So, after some discussions, Andrew Morton pulled that patch back out and replaced it with a new system call:

    long sync_file_range(int fd, loff_t offset, loff_t nbytes, int flags);

This call will synchronize a file's data to disk, starting at the given offset and proceeding for nbytes bytes (or to the end of the file if nbytes is zero). How the synchronization is done is controlled by flags:

  • SYNC_FILE_RANGE_WAIT_BEFORE blocks the calling process until any already in-progress writeout of pages (in the given range) completes.

  • SYNC_FILE_RANGE_WRITE starts writeout of any dirty pages in the given range which are not already under I/O.

  • SYNC_FILE_RANGE_WAIT_AFTER blocks the calling process until the newly-initiated writes complete.

An application which wants to initiate writeback of all dirty pages should provide the first two flags. Providing all three flags guarantees that those pages are actually on disk when the call returns.
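The two flag combinations just described might be wrapped as follows. This is only a sketch, assuming the sync_file_range() wrapper as declared by current glibc (with _GNU_SOURCE defined); the helper names are hypothetical. As the text notes, the call deals with file data only, so nothing here should be read as synchronizing metadata.

```c
#define _GNU_SOURCE          /* exposes sync_file_range() in glibc */
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: start writeback of every dirty page in the range
 * without waiting for the new writes to finish. WAIT_BEFORE ensures that
 * pages already under I/O are completed first, so they can be written
 * again if they were redirtied in the meantime.
 */
int start_range_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE);
}

/*
 * Hypothetical helper: write out the range and block until the newly
 * started writes have completed. An nbytes of zero means "to the end of
 * the file".
 */
int flush_range_and_wait(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

A program streaming a large file could call the first helper on each chunk as it is written and the second one only at the end, spreading the I/O out rather than paying for it all in a final fsync().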

The new implementation avoids distorting the posix_fadvise() system call. It also allows synchronization operations to be performed with a single call, instead of the multiple calls required by the previous attempt. In the future, it may also be possible to add other operations to the flags list; the ability to request metadata synchronization seems to be high on the list.

(Thanks to Michael Kerrisk - who agitated for this change - for providing some of the background information).

Priority inheritance in the kernel

Imagine a system with two processes running, one at high priority and the other at a much lower priority. These processes share resources which are protected by locks. At some point, the low-priority process manages to run and obtains a lock for one of those resources. If the high-priority process then attempts to obtain the same lock, it will have to wait. Essentially, the low-priority process has trumped the high-priority process, at least for as long as it holds the contended lock.

Now imagine a third process, one which uses a lot of processor time, and which has a priority between the other two. If that process starts to crank, it will push the low-priority process out of the CPU indefinitely. As a result, the third process can keep the highest-priority process out of the CPU indefinitely.

This situation, called "priority inversion," tends to be followed by system failure, upset users, and unemployed engineers. There are a number of approaches to avoiding priority inversion, including lockless designs, carefully thought-out locking scenarios, and a technique known as priority inheritance. The priority inheritance method is simple in concept: when a process holds a lock, it should run at (at least) the priority of the highest-priority process waiting for the lock. When a lock is taken by a low-priority process, the priority of that process might need to be boosted until the lock is released.

There are a number of approaches to priority inheritance. In effect, the kernel performs a very simple form of it by not allowing kernel code to be preempted while holding a spinlock. In some systems, each lock has a priority associated with it; whenever a process takes a lock, its priority is raised to the lock's priority. In others, a high-priority process will have its priority "inherited" by another process which holds a needed lock. Most priority inheritance schemes have shown a tendency to complicate and slow down the locking code, and they can be used to paper over poor application designs. So they are unpopular in many circles. Linus was reasonably clear about how he felt on the subject last December:

"Friends don't let friends use priority inheritance".

Just don't do it. If you really need it, your system is broken anyway.

Faced with this sort of opposition, many developers would quietly shelve their priority inheritance designs and go back to working on accounting code. The kernel development community, however, happens to have a member who has a track record of getting code merged in spite of this sort of objection: Ingo Molnar. History may well repeat itself, as Ingo (working with Thomas Gleixner) has posted a priority-inheriting futex implementation with a request that it be merged into the mainline. This approach, says Ingo, provides a useful functionality to user space (it is not meant to provide priority-inheriting kernel mutual exclusion primitives) while avoiding the pitfalls which have hit other implementations.

The PI-futex patch adds a couple of new operations to the futex() system call: FUTEX_LOCK_PI and FUTEX_UNLOCK_PI. In the uncontended case, a PI-futex can be taken without involving the kernel at all, just like an ordinary futex. When there is contention, however, the FUTEX_LOCK_PI operation is requested from the kernel. The requesting process is put into a special queue, and, if necessary, that process lends its priority to the process actually holding the contended futex. The priority inheritance is chained, so that, if the holding process is blocked on a second futex, the boosted priority will propagate to the holder of that second futex. As soon as a futex is released, any associated priority boost is removed.
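The chained boost can be illustrated with a toy user-space model. This is not the kernel's algorithm; the types and function names are invented for the example, a larger number means a more important task (a convention chosen for the model), and releasing a lock simply restores the owner's base priority, which ignores the case of an owner holding several contended locks.

```c
#include <stddef.h>

#define MODEL_MAX_CHAIN 16   /* arbitrary bound on the chain walk */

struct model_lock;

struct model_task {
    int base_prio;                    /* priority the task asked for */
    int prio;                         /* effective (possibly boosted) */
    struct model_lock *blocked_on;    /* lock being waited for, if any */
};

struct model_lock {
    struct model_task *owner;         /* current holder, NULL if free */
};

/*
 * The waiter blocks on the lock and lends its priority to the owner.
 * If the owner is itself blocked, the boost follows the chain to the
 * holder of that second lock, and so on.
 */
void model_block_on(struct model_task *waiter, struct model_lock *lock)
{
    struct model_task *t = lock->owner;
    int depth;

    waiter->blocked_on = lock;
    for (depth = 0; t != NULL && depth < MODEL_MAX_CHAIN; depth++) {
        if (t->prio >= waiter->prio)
            break;                    /* already important enough */
        t->prio = waiter->prio;
        t = t->blocked_on ? t->blocked_on->owner : NULL;
    }
}

/* Releasing the lock drops any boost the owner received (simplified). */
void model_unlock(struct model_lock *lock)
{
    if (lock->owner != NULL)
        lock->owner->prio = lock->owner->base_prio;
    lock->owner = NULL;
}
```

With a low-priority task holding one lock, a mid-priority task holding a second lock while waiting for the first, and a high-priority task then blocking on the second, both holders end up running at the high priority until they release.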

As with regular futexes, the kernel only needs to know about a PI-futex while it is being contended. So the number of futexes in the system can become quite large without serious overhead on the kernel side.
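A sketch of what the user-space side might look like follows. It assumes the protocol as later documented in futex(2), where an unlocked futex word is zero and a locked one holds the owner's thread ID; the patch under discussion could differ in detail, and the pi_lock() and pi_unlock() names are made up for the example.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical PI lock: try to swing the futex word from 0 to our thread
 * ID entirely in user space; only on contention ask the kernel to queue
 * us and boost the current owner.
 */
int pi_lock(_Atomic uint32_t *word)
{
    uint32_t expected = 0;
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    if (atomic_compare_exchange_strong(word, &expected, tid))
        return 0;                     /* uncontended: no system call */
    return (int)syscall(SYS_futex, word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

/*
 * Hypothetical PI unlock: if the word still holds just our thread ID,
 * nobody is waiting and it can be cleared directly; otherwise the kernel
 * must hand the lock to the highest-priority waiter.
 */
int pi_unlock(_Atomic uint32_t *word)
{
    uint32_t expected = (uint32_t)syscall(SYS_gettid);

    if (atomic_compare_exchange_strong(word, &expected, 0))
        return 0;                     /* no waiters recorded */
    return (int)syscall(SYS_futex, word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```

In the fast path the kernel never sees the futex at all, which is why large numbers of these locks cost nothing on the kernel side until they are contended.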

Within the kernel, the PI-futex type is implemented by way of a new primitive called an rt_mutex. The rt_mutex is superficially similar to regular mutexes, with the addition of the priority inheritance capability. They are, however, an entirely different type, with no code shared with the mutex implementation. The API will be familiar to mutex users, however; in brief, it is:

    #include <linux/rtmutex.h>

    void rt_mutex_init(struct rt_mutex *lock);
    void rt_mutex_destroy(struct rt_mutex *lock);

    void rt_mutex_lock(struct rt_mutex *lock);
    int rt_mutex_lock_interruptible(struct rt_mutex *lock, 
                                    int detect_deadlock);
    int rt_mutex_timed_lock(struct rt_mutex *lock,
                            struct hrtimer_sleeper *timeout,
                            int detect_deadlock);
    int rt_mutex_trylock(struct rt_mutex *lock);
    void rt_mutex_unlock(struct rt_mutex *lock);
    int rt_mutex_is_locked(struct rt_mutex *lock);

The alert reader may have noticed that this looks much like the realtime mutex type found in the realtime preemption patch. Ingo once said that the realtime patches would slowly trickle into the mainline, and that is what appears to be happening here. With this patch set, the PI-futex code is the only user of the new rt_mutex type, but that could certainly change over time.

The PI-futex patch also includes a new, priority-sorted list type which could find users elsewhere in the kernel.

There has been relatively little discussion of this patch so far; it has been included in recent -mm trees. It is too late for 2.6.17, but, if no real opposition develops, the PI-futex code might just find its way into a subsequent kernel.

On the safety of the sysfs interfaces

One of the patches in the upcoming stable kernel release is a fix for a security vulnerability designated as CVE-2006-1055. It makes a small change to the code which implements the ability to write to sysfs attributes; with this change, the maximum amount of data which can be written to an attribute is PAGE_SIZE-1 bytes, or 4095 on most systems. Since last June, the limit had simply been PAGE_SIZE, allowing a full page to be written.

Since the page is zeroed before being filled, this change ensures that the data coming from user space will be null-terminated when it is passed to the specific sysfs store() function. Without that assurance, that function might have proceeded merrily off the end of the one-page buffer, accessing data which did not come from user space and possibly overwriting buffers elsewhere. The possibility of this happening was enough to raise security fears and motivate a quick fix.
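The effect of the fix can be modeled in user space. The sketch below is not the kernel's code; the function name and the fixed 4096-byte page size are assumptions made for illustration. It shows why capping the copy at one byte less than the page guarantees a terminator in a freshly zeroed buffer.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096   /* assumption: the common page size */

/*
 * Hypothetical model of the write path: zero the page, then copy at most
 * MODEL_PAGE_SIZE - 1 bytes of user data into it. The last byte is never
 * overwritten, so the result is always a terminated string.
 * Returns the number of bytes accepted.
 */
size_t model_fill_write_buffer(char *page, const char *user, size_t count)
{
    if (count >= MODEL_PAGE_SIZE)
        count = MODEL_PAGE_SIZE - 1;
    memset(page, 0, MODEL_PAGE_SIZE);
    memcpy(page, user, count);
    return count;
}
```

With the older limit of a full page, a 4096-byte write would have filled the buffer completely and left no zero byte for string functions to stop at.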

The interesting thing is that the prototype for the store() function is:

    ssize_t (*store)(struct kobject *kobj, struct attribute *attr,
                     const char *buffer, size_t size);

The size parameter is the amount of user data being passed in. So, one might ask, why bother null-terminating the buffer, when its size has already been made available to the receiving code? Certain developers, whose code was receiving 4096-byte data via sysfs attributes, have, indeed, asked that question.

The question was answered, in one way, in the message featured in the quote of the week. More diplomatically, one might say that, regardless of how the interface was designed, a number of sysfs attribute implementations have been coded on the assumption that the incoming data will be null-terminated. So they do not bother to check the length of that data, and they will do bad things in the absence of the expected terminator.
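For contrast, a handler that honors the size argument does not depend on the terminator at all. The following user-space sketch parses an integer attribute that way; it only mirrors the buffer and size parameters of store(), and the function name, the 32-byte scratch buffer, and the error convention are assumptions for the example rather than anything taken from the kernel.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical store()-style parser: reads a decimal integer from the
 * first `size` bytes of `buffer` without assuming a terminator is
 * present. Returns the number of bytes consumed or a negative errno.
 */
ssize_t model_store_int(const char *buffer, size_t size, long *out)
{
    char tmp[32];
    char *end;
    long value;

    /* Reject empty input and anything that cannot fit with a terminator. */
    if (size == 0 || size >= sizeof(tmp))
        return -EINVAL;

    /* Work on a private copy that we terminate ourselves. */
    memcpy(tmp, buffer, size);
    tmp[size] = '\0';

    errno = 0;
    value = strtol(tmp, &end, 10);
    if (errno != 0 || end == tmp)
        return -EINVAL;
    if (*end == '\n')                 /* tolerate the newline from echo */
        end++;
    if (*end != '\0')
        return -EINVAL;               /* trailing garbage */

    *out = value;
    return (ssize_t)size;
}
```

The length check comes first, so nothing past the bytes that user space actually supplied is ever examined.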

With the patch, the situation will be fixed and those implementations made safe again. But it is hard not to be a little nervous about the situation. If there is carelessly-written code in the tree, there may be other issues with it as well, and the return of null-termination may not help much. It would be nicer if there were a way to verify that the interfaces were being used correctly. In the meantime, people writing sysfs interfaces - each of which is an interface to user space and a possible target of attack - may want to look a little more carefully at their code before submitting it.

Patches and updates

Kernel trees


  • Chris Leech: I/OAT. (March 30, 2006)

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds