User: Password:
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.32-rc6; there have been no 2.6.32 prepatches released over the last week.

The current stable kernel is, released (along with on November 9. Both kernels contain a long list of important fixes, including some which are security-related.

Comments (none posted)

Quotes of the week

I, Paul E. McKenney, maintainer of the RCU implementation embodied in the Linux kernel and co-inventor of RCU, being of sound mind and body, notwithstanding the wear and tear inherent in my numerous decades sojourn on this planet, hereby declare that the following usage of work queues constitutes a valid RCU implementation...
-- Acked-by: Paul McKenney style (thanks to Amit Shah)

Hugh's hypothesis: for every variable x initialized by a subfunction, there exists at least one version V of gcc, such that V reports that x may be used uninitialized.
-- Hugh Dickins

Comments (1 posted)

The block I/O controller

By Jonathan Corbet
November 7, 2009
I/O bandwidth allocation and control is a tricky problem; as was discussed here earlier this year, several groups have tried to solve this problem at different levels in the block I/O stack. At a minisummit in Tokyo, the developers behind various I/O controllers got together and decided on a multi-level approach to the problem. Since then, developer Vivek Goyal has remarked:

IO control is a huge problem and the moment we start addressing all the issues in one patchset, it bloats to unmanageable proportions and then nothing gets inside the kernel. So at io mini summit we agreed that lets take small steps and once a piece of code is inside the kernel and stabilized, take the next step. So this is the first step.

This first small step is a 20-part patch series implementing a proportional-weight I/O controller within the CFQ I/O scheduler. A new control group (cgroup) class (called "blkio") is created, with each group given an I/O weight between 100 and 1000. The available I/O bandwidth is then handed out to groups as determined by their configured weight.

Working within CFQ gives the I/O controller fine-grained control over how much bandwidth each group gets. On the other hand, CFQ is poorly positioned to identify block I/O traffic which is not directly generated by user space (writeback, for example) or to throttle activity at the virtual memory level. Achieving that will require the integration of a controller which hooks into the system at a much higher level. The plan is to write such a controller and have it work transparently within the same blkio cgroup. But a couple more small steps will probably be required before that is ready to go in.

Comments (1 posted)

Removing binary sysctl

By Jonathan Corbet
November 11, 2009
The "sysctl" mechanism is used by the kernel to export a wide variety of tuning options to user space. Sysctl is actually two interfaces which have been awkwardly joined together: the sysctl() system call and the /proc/sys directory hierarchy. Of the two, /proc/sys is much more widely used, to the point that developers rarely even think about the system call. But the sysctl() implementation is a significant amount of code which suffers from chronic neglect. It has thus been the target of removal attempts for years.

The problem with removing sysctl(), of course, is that it is part of the kernel ABI. As long as the possibility of broken applications exists, this ABI cannot be removed. So it continues to sit in the kernel, despite the fact that its absence would be noted by few people.

Eric Biederman has come up with a new approach to the problem. His patch set removes the current sysctl() implementation, getting rid of a few thousand lines of unloved code. He then adds back a new wrapper which emulates the sysctl() ABI by way of /proc/sys. So any applications using sysctl() should continue to work, but the code dedicated to making it work is much reduced from what was there before.

The patch set still concerns some developers. The compatibility wrapper has its own configuration option, leading some to worry that distributions might disable it and cause obscure things to break. Going through /proc/sys will make access to these variables much slower than it was before. That should not really be a problem: access to sysctl variables is not normally a performance-critical operation. So there does not appear to be any sort of real obstacle to the merging of these patches; maybe, someday, binary sysctl() will truly vanish into the past.

Comments (4 posted)

Next-generation Linux file systems: NiLFS(2) and exofs (developerWorks)

IBM developerWorks has posted an introduction to the NilFS2 and exofs filesystems. "An interesting aspect of NiLFS(2) is its technique of continuous snap-shotting. As NILFS is log structured, new data is written to the head of the log while old data still exists (until it's necessary to garbage-collect it). Because the old data is there, you can step back in time to inspect epochs of the file system. These epochs are called checkpoints in NiLFS(2) and are an integral part of the file system. NiLFS(2) creates these checkpoints as changes are made, but you can also force a checkpoint to occur."

Comments (8 posted)

Kernel development news

Supporting transactions in btrfs

By Jonathan Corbet
November 11, 2009
Much of the fuss involving fsync() and crash robustness over the last year was really about how applications can get transactional semantics out of filesystem operations. An application developer often wants to see a set of operations either succeed or fail as a unit, without the possibility of partially-completed operations. Providing ways for applications to get that behavior can be a challenge, though.

Btrfs has tried to make this capability available to user space by way of the BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END ioctl() calls. There are some real problems with this approach, though. They operate as a pair of system calls, with any operations between the two being treated as a transaction within the filesystem. But if something fails, or if the application never quite gets around to ending the transaction, things will eventually come to a halt. That is why the btrfs code includes this comment:

There are many ways the trans_start and trans_end ioctls can lead to deadlocks. They should only be used by applications that basically own the machine, and have a very in depth understanding of all the possible deadlocks and enospc problems.

It is, in other words, a dangerous capability which cannot be made generally available.

Sage Weil has posted a patch taking a rather different approach to the problem. The key idea is to avoid the problem of never-completed transactions by encapsulating the entire thing within a single system call. The result is a new BTRFS_IOC_USERTRANS command for ioctl(); chances are it will require a bit of work yet, but it could be the base for user-space transactions in the future.

This command takes a structure which looks something like the following:

    struct btrfs_ioctl_usertrans {
	__u64 num_ops;
	struct btrfs_ioctl_usertrans_op *ops_ptr;
	__u64 num_fds;
	__u64 data_bytes, metadata_ops;
	__u64 flags;
	__u64 ops_completed;

The ops_ptr argument points to an array of num_ops individual operations to complete:

    struct btrfs_ioctl_usertrans_op {
	__u64 op;
	__s64 args[5];
	__s64 rval;
	__u64 flags;

Here, op describes an individual operation. It can be BTRFS_IOC_UT_OP_OPEN (open a file), BTRFS_IOC_UT_OP_CLOSE (close a file), BTRFS_IOC_UT_OP_PWRITE (write to a file), BTRFS_IOC_UT_OP_UNLINK (unlink a file), BTRFS_IOC_UT_OP_LINK (make a link to a file), BTRFS_IOC_UT_OP_MKDIR (create a directory), BTRFS_IOC_UT_OP_RMDIR (remove a directory), BTRFS_IOC_UT_OP_TRUNCATE (truncate a file), BTRFS_IOC_UT_OP_SETXATTR (set an extended attribute), BTRFS_IOC_UT_OP_REMOVEXATTR (remove an extended attribute), or BTRFS_IOC_UT_OP_CLONERANGE (copy a range of data from one file to another). For each operation, the args field contains a set of arguments similar to what one would see for the corresponding system call. One interesting difference is that there are no hard-coded file descriptor numbers; instead, the transaction gets a new file descriptor table and all operations work with indexes into that table. Essentially, transactions work within a file descriptor space separated from that used by the calling process.

The flags field describes how the return value from each operation should be interpreted. It can be contain any of: BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE, BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ, BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT, BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT, BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE, and BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE. In each case, the flag causes the return value to be compared against the passed-in rval field; if the comparison is successful, the transaction will fail.

What happens if the transaction fails? The partially-completed transaction will not be rolled back; btrfs, not being a database, is not really set up to do that. Instead, the number of successfully-completed operations will be passed back to user space. Optionally, the application can provide the BTRFS_IOC_UT_FLAG_WEDGEONFAIL flag, causing btrfs to leave the transaction open, locking up the filesystem until the system is rebooted. This may seem like a rather antisocial approach to transaction atomicity, but, if failure is both highly unlikely and highly problematic, that might be the right thing to do.

A patch like this raises a lot of questions. The first obstacle may be the fact that it requires exporting a number of system call implementations to modules, a change which has been resisted in the past. Kernel code need not normally call functions like sys_mkdir(), but this patch does exactly that. Calling system call implementations directly can be a bit tricky on some architectures, and there are good reasons for not making these functions available to modules in general.

Another problem is that the filesystem has no real way of determining whether a transaction will succeed before jumping into it; the best it can do is reserve some metadata space in accordance with an estimate provided by the application. If transactions are allowed to complete partially, they are no longer transactions. But the alternative of locking up the system can only leave people wondering if there isn't a better way.

Then, there is a question which was raised on the list: btrfs provides cheap snapshots, why not use them to implement transactions? Using a snapshot would take advantage of existing functionality and should make proper rollback possible. The problem would appear to be performance: btrfs snapshots are not quite that cheap, especially when one considers the need to exclude other activity on the filesystem while the transaction is active. So Chris Mason has suggested that the old standby - write-ahead logging - would be preferable because it will perform much better. But, he thinks, the multi-operation ioctl() could maybe perform better yet.

Finally, there would appear to be some real similarities between this API and the rather more general syslets mechanism. Syslets have been on the back burner for some time now, but they could come back forward if they seemed like a good way to solve this problem.

Clearly, like much of btrfs, this new ioctl() is a work in progress. If it gets into the mainline, it will be likely to have changed quite a bit on the way. But the problem this patch is trying to solve is real; it's clearly an issue which is worth thinking about.

Comments (18 posted)

Another new ABI for fanotify

By Jonathan Corbet
November 11, 2009
"Fanotify" is a much-revised system for providing filesystem event notifications to user space, and, possibly, allowing user space to block open() operations on specific files. The intended use case is malware-scanning utilities, but there are others: hierarchical storage has been cited as one possibility. This code has had a long, hard path into the kernel for a couple of reasons: kernel developers are not big fans of malware scanning, and nailing down the user-space ABI has been challenging.

The first obstacle has been more-or-less overcome. Even developers who think that malware scanning is the worst sort of security snake oil can agree that having these utilities use a well-defined kernel interface is better than having them employ nasty tricks like hooking into the system call table. ABI difficulties can be harder to overcome, though. With the latest fanotify posting, developer Eric Paris may have resolved this issue for at least a portion of the fanotify functionality.

The new version does away with the novel interface using setsockopt() in favor of a couple of new system calls. The first of these is fanotify_init():

    int fanotify_init(unsigned int flags, unsigned int event_f_flags,
		      int priority);

This system call initializes the fanotify subsystem, returning a file descriptor which is used for further operations. There are two flags values implemented: FAN_NONBLOCK creates a nonblocking file descriptor, and FAN_CLOEXEC sets the close-on-exec flag. Currently, event_f_flags and priority are unused; they should be set to zero.

Management of notification events is then done with fanotify_mark():

    int fanotify_mark(int fanotify_fd, unsigned int flags,
		      int dfd, const char *pathname, u64 mask,
		      u64 ignored_mask);

This call is used to "mark" specific parts of the filesystem hierarchy, indicating an interest in events involving those files. fanotify_fd is the file descriptor returned by fanotify_init(). The flags parameter must be one of FAN_MARK_ADD or FAN_MARK_REMOVE, indicating whether this call adds new marks or removes existing ones; there are also a couple of flags to control following of symbolic links and the marking of directories (without their contents).

The file(s) to be marked are determined by dfd and pathname; these parameters work much like in any of the *at() system calls. If dfd is AT_FDCWD, the pathname is resolved using the current working directory. If, instead, dfd points to a directory, the pathname lookup starts at that directory. If pathname is null, though, then dfd is interpreted as the actual object to mark.

Finally, mask and ignored_mask control which events are reported. To generate a specific event, a file must have the appropriate flag set in mask and clear in ignored_mask. The flags are FAN_ACCESS (file access), FAN_MODIFY (a file is modified), FAN_CLOSE_WRITE (a writable file has been closed), FAN_CLOSE_NOWRITE (a read-only file has been closed), FAN_OPEN (a file has been opened), and FAN_EVENT_ON_CHILD (events on children of a directory). There is also a FAN_Q_OVERFLOW event for event queue overflows, but that is not currently implemented.

Once files have been marked, the application can simply read from the fanotify file descriptor to get events. The events look like:

    struct fanotify_event_metadata {
	__u32 event_len;
	__u32 vers;
	__s32 fd;
	__u64 mask;

Here, event_len is the length of the structure, vers indicates which version of fanotify generated the structure, fd is an open file descriptor for the object being accessed, and mask describes what is actually happening.

There is one crucial component missing in these patches: there is no way for the fanotify user to react to these events. In particular, the ability to block an open() call, a core part of the malware-scanning process, is missing. That, presumably, is to be added in a future revision. Meanwhile, Eric has requested permission to put the notification code into linux-next, presumably with a 2.6.33 merge in mind. As of this writing, objections have not been forthcoming.

Comments (11 posted)

A futex overview and update

November 11, 2009

This article was contributed by Darren Hart

The futex [PDF] mechanism, introduced in 2.5.7 by Rusty Russell, Hubertus Franke, and Mathew Kirkwood, is a fast, lightweight kernel-assisted locking primitive for user-space applications. It provides for very fast uncontended lock acquisition and release. The futex state is stored in a user-space variable (an unsigned 32-bit integer on all platforms). Atomic operations are used in order to change the state of the futex in the uncontended case without the overhead of a syscall. In the contended cases, the kernel is invoked to put tasks to sleep and wake them up.

Futexes are the basis of several mutual exclusion constructs commonly used in threaded programming. These include pthread mutexes, condvars, semaphores, rwlocks, and barriers. They have been through a lot of reconstructive and cosmetic surgery over the last several years, and are now more efficient, more functional, and better documented than ever before.


While few application developers will use futexes directly, a cursory knowledge of how to do so is necessary to appreciate the improvements presented a bit later. By way of a simple example, futexes can be used to store the state of a lock and provide a kernel waitqueue for tasks blocking on the lock. To minimize syscall overhead, this state should allow for atomic lock acquisition when the lock is uncontended. The state could be defined as:

  1. unlocked
  2. locked

In order to acquire the lock, an atomic test-and-set instruction (such as cmpxchg()) can be used to test for 0 and set to 1. In this case, the locking thread acquires the lock without involving the kernel (and the kernel has no knowledge that this futex exists). When the next thread attempts to acquire the lock, the test for zero will fail and the kernel needs to be involved. The blocking thread can then use the futex() system call with the FUTEX_WAIT opcode to put itself to sleep on the futex, passing the address of the futex state variable as an argument. To release the lock, the owner changes the lock state to zero (unlocked) and issues the FUTEX_WAKE opcode, which will wake the blocked thread to return to user space and try to acquire the lock (as described above). This is a an obviously trivial example with lots of room for optimization. Ulrich Drepper's "Futexes are Tricky" [PDF] is still the undisputed reference for using futexes to build locking primitives such as mutexes. It explores the many race conditions involved with using futexes as well as optimizations to improve on the example given here.

When the user threads call into the kernel with the futex() system call, they pass the address of the futex state (uaddr), the opcode to perform (op), and various other arguments. The uaddr is used by the kernel to generate a unique "futex_key" to reference the futex. When a thread requests to block on a futex, as with FUTEX_WAIT, a "futex_q" is created and queued in the "futex_queues" hash table. There is one futex_q for every task blocked on a futex, possibly many futex_q's per futex. The futex_queues themselves (the hash table lists, not the "futex_q's") are shared among futexes, since multiple futex_keys will hash to the same queue. These relationships are illustrated below:

[Futex diagram]

In most cases, there is no policy defining how the user space state variable is to be used (despite what the futex man pages may or may not say). The application (or a library such as glibc) uses this value to define the state of the locking construct being implemented. This can be as simple as a boolean variable (as in the example above), however optimized implementations and other locking mechanisms require more complex state values.

In addition to the simple FUTEX_WAIT and FUTEX_WAKE operations, the kernel also manages special operations that require more knowledge of the locking construct's state than can be had in user space, most notably the priority inheritance (PI) chains and robust lists. PI and robust futexes are exceptions to the user-defined-policy rule regarding the state variable. Their state depends not only on the locked state of the mutex, but also on the identity of the owner and whether or not there are waiters. As such, the futex value is defined as the thread identifier (TID) of the owner and a bit to indicate pending owners. This policy still allows for user space atomic operations to avoid calling into the kernel in the uncontended case.


Futexes have seen numerous improvements from a handful of developers since their debut appearance in 2.5.7. Some notable improvements include priority based wake-up for real-time tasks (by Pierre Peiffer) and robust and PI futexes (by Ingo Molnar and Thomas Gleixner). These latter features have been in use for some time and have been adequately covered here on as well as in the excellent discussions on the kernel mailing list. Your author's foray into futexes picks up here, about two and a half years ago. Aside from several fixes to address rare corner cases and race conditions, the futex code has seen several functional and performance improvements since those earlier contributions.

Significant effort went into reducing futex overhead. Eric Dumazet introduced private futexes as an optimization for PTHREAD_PROCESS_PRIVATE pthread mutexes. Private futexes can only be used by threads of the same process. They are distinguishable from each other simply by their virtual address, while shared futexes have different virtual addresses in each process, requiring the kernel to lookup their physical address for unique identification. This optimization eliminates the use of the mmap_sem semaphore for private futexes, reducing system-wide contention. It also eliminates the atomic operations used in reference counting for private futexes, resulting in less cache-line bouncing on SMP machines. Glibc now uses private futexes by default.

Peter Ziljstra further reduced the futex dependency on mmap_sem by using get_user_pages_fast() in the fast paths, making use of get_user_pages(), and pushing the mmap_sem locks down tightly around the slow paths (September 2008). These changes had the added benefit of removing much of the virtual memory related logic from kernel/futex.c, simplifying the code considerably. Due to their dependence on user space addresses, futexes are burdened with several possible fault points. Holding mmap_sem complicated the fault logic since it had to be released prior to calling get_user(). With mmap_sem usage reduced, your author greatly simplified the fault logic (March 2009), resulting in far more legible code.

Bitset conditional wakeup was added by Thomas Gleixner (February 2008) in order to enable optimized rwlock implementations in glibc. FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET allow the user to specify a bitmask which limits the woken tasks to those which specified the same bitset (or a superset, such as FUTEX_BITSET_MATCH_ANY) at wait time.

Since the introduction of PI futexes, the glibc condvar implementation of pthread_cond_broadcast() (with a PI mutex) has been forced to wake all waiters, rather than take advantage of FUTEX_REQUEUE, due to the lack of support for requeueing to PI futexes. This leads to a wake-up storm as all the waiters race back to user space to contend for the lock. It also fails to ensure that the highest priority task acquires the lock first. Recent kernels (2.6.31-rt* and 2.6.32-rc*) now have your author's FUTEX_CMP_REQUEUE_PI patches (April 2009) which provide the kernel-side support for requeueing waiters from a non-PI futex to a PI futex. With glibc patches in the works by Dinakar Guniguntala, real-time applications will soon be able to use pthread condition variables with guaranteed wake-up order and fewer overall wake-ups.

Now What?

While there are several items that futex developers may consider in the future, they are hopeful that kernel/futex.c and all its brain-bending, liver-killing insanity can be put to rest for at least a little while. However, since no article is complete without a list of next steps, the following items may receive some attention in the future:

Man pages: The current man pages do not include some of the new futex operations. They suggest a policy for the value of the futex which has led to some confusion regarding usage of futexes. Worst of all, the user space futex() definition has been removed from /usr/include/linux/futex.h, rendering the man pages not only incomplete, but also inaccurate. Users of futexes must use the syscall interface directly.

Adaptive futexes: It is possible that some of the scheduling overhead of futexes can be reduced by some optional amount of spinning prior to going to sleep in the kernel. However, as futexes expose their state to user space, this spinning can also be done in user space, as is done with adaptive mutexes in glibc now, albeit without the knowledge of whether the owner is running, so spinning is reduced to a simple maximum-retries loop.

Interruptible futexes: There is some interest in interruptible blocking lock operations from large proprietary software projects. Futex operations currently restart themselves in the event of a signal, rather than returning -EINTR to user space. Futexes could be flagged with FUTEX_INTERRUPTIBLE which would be checked on signal-induced wakeup to determine if the syscall should be restarted or if -ECANCELED should be returned to user space. Exposing such a feature in the pthread locking primitives would involve non-POSIX compliant changes to the pthread library, but this is not without precedent.

Scalability enhancements: There has been some discussion on LKML regarding private as well as NUMA-optimized hash tables. The futex hash table is shared across all processes and is protected by spinlocks which can lead to real overhead, especially on large systems. This overhead is not serving any useful purpose if these systems are partitioned on NUMA nodes, or even for processes that use private futexes exclusively.

Futex test suite: Your author has been compiling a list of requirements for an exhaustive test suite to validate futex functionality. This test suite would serve as a regression suite for future development. The many corner cases and misuse cases possible with futexes complicate the test suite and present a challenge to its design.


I would like to extend my thanks to John Stultz, Will Schmidt, Paul McKenney, Nivedita Singhvi, and, of course, Jon Corbet, whose reviews have made this article far more legible and complete than it would have been otherwise.

Comments (6 posted)

Patches and updates

Kernel trees


Build system

  • nconfig v5 . (November 6, 2009)

Core kernel code

Development tools

Device drivers

Filesystems and block I/O


Memory management



Virtualization and containers


Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds