User: Password:
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.21-rc4, released by Linus on March 16. It consists mostly of fixes, but there is also a patch adding device_schedule_callback(), which lets device-oriented code request a callback (from process context) in the near future. See the long-format changelog for more details on 2.6.21-rc4.

The current -mm tree is 2.6.21-rc4-mm1. Recent changes to -mm include a new version of the lumpy reclaim patch, some anti-fragmentation work, an updates RSDL scheduler, and the revoke() system call.

There is a stable kernel update in the works as this is written; it may well be released by the time you read it.

For older kernels: was released on March 20 with a fair number of fixes, a couple of which are security-related.

Comments (none posted)

Kernel development news

Quote of the week

Quite frankly, I was *planning* on merging RSDL very early after 2.6.21, but there is one thing that has turned me completely off the whole thing:

  • the people involved seem to be totally unwilling to even admit there might be a problem.

This is like alcoholism. If you cannot admit that you might have a problem, you'll never get anywhere. And quite frankly, the RSDL proponents seem to be in denial ("we're always better", "it's your problem if the old scheduler works better", "just one report of old scheduler being better").

-- Linus Torvalds

Comments (none posted)

Toward improved page replacement

When memory gets tight (a situation which usually comes about shortly after starting an application like tomboy), the kernel must find a way to free up some pages. To an extent, the kernel can free memory by cleaning up its own internal data structures - reducing the size of the inode and dentry caches, for example. But, on most systems, the bulk of memory will be occupied by user pages - that is what the system is there for in the first place, after all. So the kernel, in order to accommodate current demands for user pages, must find some existing pages to toss out.

To help in the choice of pages to remove, the kernel maintains two big linked lists for each memory zone. The "active" list contains pages which have been recently accessed, while the "inactive" list has those which have not been used in the recent past. When the kernel looks for pages to evict, it will scan through the inactive list, in the theory that the pages least likely to be needed soon are to be found there.

There is an additional complication, though: there are two fundamental types of pages to be found on these lists. "Anonymous" pages are those which are not associated with any files on disk; they are process memory pages. "Page cache" pages, instead, are an in-memory representation of (portions of) files on the disks. A proper balance between anonymous and page cache pages must be maintained, or the system will not perform well. If either type of page is allowed to predominate at the expense of the other, thrashing will result.

The kernel offers a knob called swappiness which controls how this balance is struck. If the system administrator sets a higher value of swappiness, the kernel will allow the page cache to occupy a larger portion of memory. Setting swappiness to a very low value is a way to tell the kernel to keep anonymous pages around at the expense of the page cache. In general, the system can be expected to perform better if page cache pages are reclaimed first; they can often be reclaimed without needing to be written back to disk, and their layout on the disk can make recovery faster should they be needed again. For this reason, the default value for swappiness favors the eviction of page cache pages; anonymous pages will only be targeted when memory pressure becomes relatively severe.

Swappiness clearly affects how the process of scanning pages for eviction candidates is done. If swappiness is low, anonymous pages will simply be passed over. As it turns out, this behavior can lead to performance problems; there may be a lot of anonymous pages which must be scanned over before the kernel finds any page cache pages, which are the ones it was looking for in the first place. It would be nice to avoid all of that extra work, especially since it comes at a time when the system is already under stress.

Rik van Riel has posted a patch which tries to improve this situation. The approach taken is quite simple: the active and inactive lists are each split into two new lists: one pair (active and inactive) for anonymous pages and one pair for page cache pages. With separate lists for the page cache, the kernel can go after those pages without having to iterate over a bunch of uninteresting anonymous pages on the way. The result should be better scalability on larger systems.

The idea is simple, but the patch is reasonably large. Any code which puts pages onto one of the lists must be changed to specify which list is to be used; that requires a number of small changes throughout the memory management and filesystem code. Beyond that, the current patch does not really change how the page reclamation code works, though Rik does note:

For now the swappiness parameter can be used to tweak swap aggressiveness up and down as desired, but in the long run we may want to simply measure IO cost of page cache and anonymous memory and auto-adjust.

There tends to be a lot of sympathy for changes which remove tuning knobs in favor of automatic adaptation within the kernel itself. So if this approach could be made to work, it might well be adopted. Getting system tuning right is hard; it's often better if the computer can figure it out by itself.

Meanwhile, the list-splitting patch, so far, lacks widespread testing or benchmarking. So, at this point, it is difficult to say when (or in what form) this patch will find its way into the mainline.

Comments (17 posted)


Applications do not normally worry about the allocation of blocks for files they create; instead, they simply write the data and assume the the kernel will do a proper job of finding a home for that data. There are times when it is useful to take a more active role in block allocation, though. If an application knows how much data it will be writing, it can request the needed blocks ahead of time, enabling the kernel to allocate them all at once, contiguously on the disk. Application developers concerned about reliability may also want to know that the needed disk space has already been procured before beginning a critical operation.

Unix systems have not traditionally provided a way for applications to control block allocation. An application on a current Linux kernel has only one way to force allocation: write a stream of data to the relevant portion of the file. This technique works, but it loses one of the advantages of preallocation: letting the kernel do all the work at once and ensure that the blocks are contiguous on disk if possible. Writing useless data to the disk solely for the purpose of forcing block allocation is also wasteful.

The POSIX way of preallocating disk space is the posix_fallocate() system call, defined as:

     int posix_fallocate(int fd, off_t offset, off_t len);

On success, this call will ensure that the application can write up to len bytes to fd starting at the given offset and know that the disk space is there for it.

Linux does not currently have an implementation of posix_fallocate() in the kernel. This patch by Amit Arora may change that situation, however. Amit's patch has been through a couple of rounds of review which have changed the interface considerably; the current form of the proposed system call is:

    long fallocate(int fd, int mode, loff_t offset, loff_t len);

The fd, offset, and len arguments have the same meaning as with posix_fallocate(), making it easy for the C library to implement the standard interface. The additional mode argument changes the way the call operates; normal usage will be to specify FA_ALLOCATE, which causes the requested blocks to be allocated. If, instead, FA_DEALLOCATE is given, the requested block range will be deallocated, allowing an application to punch a hole in the file.

Internally, the system call does not do much of the work; instead, it calls the new fallocate() inode operation. Thus, each filesystem must implement its own fallocate() support. The future plans call for a possible generic implementation for filesystems which lack fallocate() support, but the generic version would almost certainly have to rely on writing zeroes to the file. By pushing the operation into the filesystem itself, the kernel gives the filesystem the opportunity to satisfy the allocation in a more efficient way, without the need to write filler data. Filesystems do need to be sure that applications cannot use fallocate() to read old data from the allocated blocks, though.

For now, filesystem-level support is scarce. There are patches circulating which add fallocate() support to ext4. The XFS filesystem has supported preallocation (through a special ioctl() call) for some time, but will need to be modified to do preallocation through the new inode operation. It's not clear when other filesystems may get native support; the tracking of allocated but unwritten blocks is a significant addition. So, for the near future, the efficiency benefits of fallocate() may be unavailable for most users.

Comments (7 posted)

The 2007 Linux Storage and File Systems Workshop

March 19, 2007

This article was contributed by Brandon Philips

Fifty members of the Linux storage and file system communities met February 12 and 13 in San Jose, California to give status updates, present new ideas and discuss issues during the 2007 Linux Storage and File Systems Workshop. The workshop was chaired by Ric Wheeler and sponsored by EMC, NetApp, Panasas, Seagate and Oracle.

Day 1: Joint Session

Ric Wheeler opened with an explanation of the basic contract that storage systems make with the user: the complete set of data will be stored, bytes are correct and in order, and raw capacity is utilized as completely as possible. It is so simple that it seems that there should be no open issues, right?

Today, this contract is met most of the time but Ric posed a number of questions. How do we validate that no files have been lost? How do we verify that the bytes are correctly stored? How can we utilize disks efficiently for small files? How do errors get communicated between the layers?

Through the course of the next two days some of these questions were discussed, others were raised and a few ideas proposed. Continue reading for the details.

Ext4 Status Update

Mingming Cao gave a status update on ext4, the recent fork of the ext3 file system. The primary goal of the fork was the move to 48-bit block numbers; this change allows the file system to support up to 1024 petabytes of storage. This feature was originally designed to be merged into ext3, but was seen as too disruptive. The patch is also built on top of the new extents system. Support for greater than 32K directory entries will also be merged into ext4.

On top of these changes a number of ext3 options will be enabled by default including: directory indexing which improves file access for large directories, "resize inodes" which reserve space in the block group descriptor for online growing, and 256-byte inodes. Ext3 users can use these features today with a command like:

    mkfs.ext3 -I 256 -O resize_inode dir_index /dev/device

A number of other features are also being considered for inclusion into ext4 and have been posted on the list as RFCs. This includes a patch that will add nanosecond timestamps and the creation of persistent file allocations, which will be similar to posix_fallocate() but won't waste time writing zeros to the disk.

Ext4 currently stores a limited number of extended attributes in-inode and has space for one additional block of extended attribute data, but this may not be enough to satisfy xattr-hungry applications. For example, Samba needs additional space to support Vista's heavy use of ACLs, and eCryptFS can store arbitrarily large keys in extended attributes. This led everyone to the conclusion that data needs to be collected on how extended attributes are being used to help developers decide how to best implement them. Until larger extended attributes are supported, application developers need to pay attention to the limits that exist on current file systems e.g. one block on ext3 and 64K on XFS.

Online shrinking and growing was briefly discussed and it was suggested that online defragmentation, which is a planned feature, will be the first step toward online shrinking. A bigger issue however is storage management and Ted Ts'o suggested that the Linux file system community can learn from ZFS on how to create easy to manage systems. Christoph Hellwig sees the disk management issue as being a user space problem that can be solved with kernel hooks and sees ZFS as a layering violation. Either way it is clear that disk management should be improved.

The fsck Problem

Zach Brown and Valerie Henson were slated to speak on the topic of file system repair. While Val booted her laptop, she introduced us to the latest fashion: laptop rhinestones, a great discussion piece if you are waiting on a fsck. If Val's estimates for fsck time in 2013 come true, having a way to pass the time will become very important.

Val presented an estimate of 2013 fsck times. She first measured a fsck of her 37GB /home partition (with 21GB in use) which took 7.5 minutes and read 1.3GB of file system data. Next, she used projections of disk technology from Seagate to estimate the time to fsck a circa-2013 home partition, which will be 16 times larger. Although 2013 disks will have a five-fold bandwidth increase, seek times will only improve about 1.2 times (to 10ms) leading to an increase in fsck time from about 8 minutes to 80 minutes! The primary reason for long fscks is seek latency, since fsck spends most of its time seeking over the disk discovering and fetching dynamic file system data like directory entries, indirect blocks and extents.

Reducing seeks and avoiding the seek latency punishment is key to reducing fsck times. Val suggested one solution would be keeping a bitmap on disk that tracks the blocks that contain file system metadata; this would allow for reading all data in a single arm sweep. This optimization, in the best case, would make a single sequential sweep over the disk and, on the future disk, reading all file system metadata would only take around 134 seconds, a large improvement over 80 minutes. A full explanation of the findings and possible solutions can be found in the paper Repair-Driven File System Design [PDF]. Also, Val announced that she is working full time on a file system called chunkfs [PDF] that will make speed and ease of repair a primary design goal.

Zach Brown presented some blktrace output from e2fsck. The outcome of the trace is that, while the disk can stream data at 26 Mb/s, fsck is achieving only 12 Mb/s. This situation could be improved to some degree without on-disk layout changes if the developers had a vectorized I/O call. Zach explained that in many cases you know the block locations that you need, but with the current API you can only read one at a time.

A vectorized read would take a number of buffers and a list of blocks to read as arguments. Then the application could submit all of the reads at once. Such a system call could save a significant amount of time since the I/O scheduler can reorder requests to minimize seeks and merge requests that are nearby. Also, reads to blocks that are located on different disks could be parallelized. Although a vectorized read could speed up the fsck eventually file system layout changes will be needed to make fsck faster.

libata: bringing the ATA community together

Jeff Garzik gave an update on the progress of libata, the in-kernel library to support ATA hosts and devices. He first presented the ATAPI/SATA features that libata now supports including: PATA+C/H/S, NCQ, FUA, SCSI SAT, and CompactFlash. The growing support for parallel ATA (PATA) drives in libata will eventually deprecate the IDE driver; Fedora developers are helping to accelerate testing and adoption of the libata PATA code by disabling the IDE driver in Fedora 7 test 1.

Native Command Queuing (NCQ) is a new command protocol introduced in the SATA II extensions and is now supported under libata. With NCQ the host can have multiple outstanding requests on the drive at once. The drive can reorder and reschedule these requests to improve disk performance. A useful feature of NCQ drives is the force unit access (FUA) bit which will ensure the data, in write commands with this bit set, will be written to disk before returning success. This has the potential of enabling the kernel to have both synchronous and non-synchronous commands in flight. There was a recent discussion about both NCQ FUA and SATA FUA in libata on the linux-ide mailing list.

Jeff briefly discussed libata's support for SCSI ATA translation (SAT) which lets an ATA device appear to be a SCSI device to the system. The motivation for this translation is the reuse of error handling and support for distribution installers which already know how to handle SCSI devices.

There are also a number of items slated as future work for libata. Many drivers need better suspend/resume support and the driver API is due for a sane initialization model using a allocate/register/unallocate/free system and "Greg blessed" kobjects. Currently libata is written under the SCSI layer and debate continues on how to restructure libata to minimize or eliminate its SCSI dependence. Error handling has been substantially improved by Tejun Heo and his changes are now in mainline. If you have had issues with SATA or libata error handling, try an updated kernel to see if those issues have been resolved. Tejun and others continue to add features and tune the libata stack.

Communication Breakdown: I/O and File Systems

During the morning a number of conversations sprung up about communication between I/O and file systems. A hot topic was getting information from the block layer about non-retryable errors that affect an entire range of bytes and passing that data up to user space. There are situations when retries are happening on a large range of bytes even when the I/O layer knows that an entire range of blocks are missing or bad.

A "pipe" abstraction was discussed to communicate data on byte ranges that are currently in error, under performance strain (because of a RAID5 disk failure), or temporarily unplugged. If a file system were aware of ranges that are currently handling a recoverable error, have unrecoverable errors or are temporarily slow, it may be able to handle the situation more gracefully.

File systems currently do not receive unplug events and handling unplug situations can be tricky. For example, if a fibre channel disk is pulled for a moment and plugged back in it may be down for only 30 seconds but how should the file system handle the situation? Ext3 currently remounts the entire file system as read only. XFS has a configurable timeout for fibre channel disks that must be reached before it sends an EIO error. And what should be done with USB drives that are unplugged? Should the file system save state and hope the device gets plugged back in? How long should it wait and should it still work if it is plugged into a different hub? All of these questions were raised but there are no clear answers.

The Filesystems Track

The workshop split into different tracks; your author decided to follow the one dedicated to filesystems.

Security Attributes

Michael Halcrow, eCryptFS developer, presented an idea to use SELinux to make file encryption/decryption dependent on application execution. For example, a policy could be defined so that the data would be unencrypted when OpenOffice is using the file but encrypted when the user copies the file to a USB key. After presenting the mechanism and mark-up language for this idea Michael opened the floor to the audience. The general feeling was that SELinux is often disabled by users and that per-mount-point encryption may be a more useful and easy to understand user interface.

Why Linux Sucks for Stacking

Josef Sipek, Unionfs maintainer, went over some of the issues involved with stacking file systems under Linux. A stacking file system, like Unionfs, provides an alternative view of a lower file system. For example, Unionfs takes a number of mounted directories, which could be NFS/ext3/etc, as arguments at mount time and merges their name space.

The big unsolved issue with stacking file systems is handling modifications to the lower file systems in the stack. Several people suggested that leaving the lower file system available to the user is just broken and that by default the lower layers should only be mounted internally.

The new fs/stack.c file was discussed too. This file currently contains a simple inode copy routines that is used by Unionfs and eCryptfs, but in the future more stackable file system routines should be pushed to this file.

Future work for Unionfs includes getting it working under lockdep and additional experimentation with an on-disk format. The on-disk format for Unionfs is currently under development; it will store white-out files (representing files which have been deleted by a user but which still exist on the lower-level filesystems) and persistent Unionfs inode data.

B-trees for a Shadowed FS

Many file systems use B-trees to represent files and directories. These structures keep data sorted, are balanced, and allow for insertion and deletion in logarithmic time. However, there are difficulties in using them with shadowing. Ohad Rodeh presented his approach to using b-trees and shadowing in an object storage device, but the methods are general and useful for any application.

Shadowing may also be called copy-on-write (COW); the basic idea is that when a write is made the block is read into memory, modified, and written to a new location on disk. Then the tree is recursively updated starting at the child and using COW until the root node is atomically updated. In this way the data is never in an inconsistent state; if the system crashes before the root node is updated then the write is lost but the previous contents remain intact.

Replicating the details of his presentation would be a wasted effort as his paper, B-trees, Shadowing and Clones [PDF], is well written and easy to read. Enjoy!

eXplode the code

Storage systems have a simple and important contract to keep: given user data they must save that data to disk without loss or corruption even in the face of system crashes. Can Sar gave an overview of eXplode [PDF], a systematic approach to finding bugs in storage systems.

eXplode systematically explores all possible choices that can be made at each choice point in the code to make low-probability events, or corner cases, just as probable as the main running path. And it does this exploration on a real running system with minimal modifications.

This system has the advantage of being conceptually simple and very effective. Bugs were found in every major Linux file system, including a fsync bug that can cause data corruption on ext2. This bug can be produced by doing the following: create a new file, B, which recycles an indirect block from a recently truncated file, A, then call fsync on file B and crash the system before file A's truncate gets to disk. There is now inconsistent data on disk and when e2fsck tries to fix the inconsistency it corrupts file B's data. A discussion of the bug has been started on the linux-fsdevel mailing list.


The second day of the file systems track started with a discussion of an NFS race. The race appears when a client opens up a file between two writes that occur during the same second. The client that just opened the file will be unaware of the second write and will keep an out-of-date version of the file in cache. To fix the problem, a "change" attribute was suggested. This number would be consistent across reboots, unit-less and would increment on every write.

In general everyone agreed that a change attribute is the right solution, however Val Henson pointed out that implementing this on legacy file systems will be expensive and will require on disk format changes.

Discussion then turned to NSFv4 access control lists (ACLs). Trond Myklebust said they are becoming a standard and Linux should support them. Andreas Gruenbacher is working on patches to add NFSv4 support to Linux but currently only ext3 is supported; more information can be found on the Native NFSv4 ACLs on Linux page. A possibly difficult issue will be mapping current POSIX ACLs to NFSv4 ACLs, but a draft document, Mapping Between NFSv4 and Posix Draft ACLs, lays out a mapping scheme.

GFS Updates

Steven Whitehouse gave an overview of the recent changes in the Global File System 2 (GFS2), a cluster file system where a number of peers share access to the storage device. The important changes include a new journal layout that can support mmap(), splice() and other system calls on journaled files, page cache level locking, readpages() and partial writepages() support, and ext3 standard ioctls lsattr and chattr.

readdir() was discussed at some length, particularly the ways in which it is broken. A directory insert on GFS2 may cause a reorder of the extensible hash structure GFS2 uses for directories. In order to support readdir() every hash chain must be sorted. The audience generally agreed that readdir() is difficult to implement and Ted Ts'o suggested that someone should try to go through committee to get telldir/seekdir/readdir fixed or eliminated.


A brief OCFS2 status report was given by Mark Fasheh. Like GFS2, OCFS2 is a cluster file system, designed to share a file system across nodes in a cluster. The current development focus is on adding features, as the basic file system features are working well.

After the status update the audience asked a few questions. The most requested OCFS2 feature is forced unmount and several people suggested that this should be a future virtual file system (vfs) feature. Mark also said that users really enjoy the easy setup of OCFS2 and the ability to use it as a local file system. A performance hot button for OCFS2 are the large inodes and occupy an entire block.

In the future Mark would like to mix extent and extended attribute data in-inode to utilize all of the available space. However, as the audience pointed out, this optimization can lead to some complex code. In the future Mark would also like to move to GFS's distribute lock manager.

DualFS: A New Journaling File System for Linux

DualFS is a file system by Juan Piernas that separates data and meta data into separate file systems. The on-disk format for the data disk is similar to ext2 without meta-data blocks. The meta data file system is a log file system, a design that allows for very fast writes since they are always made at the head of the log which reduces expensive seeks. A few performance numbers were presented; under a number of micro- and macro-benchmarks DualFS performs better than other Linux journaling file systems. In its current form, DualFS uses separate partitions for data and metadata, forcing the user to answer a difficult question: how much metadata do I expect to have?

More information, including performance comparisons, can be found on the DualFS LKML announcement and the project homepage. The currently available code is a patch on top of 2.4.19 and can be found on SourceForge.

pNFS Object Storage Driver

Benny Halevy gave an overview of pNFS (parallel NFS), which is part of the IETF NFSv4.1 draft and tries to solve the single server performance bottleneck of NFS storage systems. pNFS is a mechanism for an NFS client to talk directly to a disk device without sending requests through the NFS server, fanning the storage system out to the number of SAN devices. There are many proprietary systems that do a similar thing including EMC's High Road, IBM's TotalStorage SAN, SGI's CXFS and Sun's QFS. Having an open protocol would be a good thing.

However, Jeff Garzik was skeptical of including pNFS in the NFSv4.1 draft particularly because to support pNFS the kernel will need to provide implementations of all three access protocols: file storage, object storage and block storage. This will add significant complexity to the Linux NFSv4 implementation.

Benny explained that the pNFS implementation in Linux is modular to support multiple layout-type specific drivers which are optional. Each layout driver dynamically registers itself using its layout type and the NFS client calls it across a well-defined API. Support for specific layout types is optional. In the absence of a layout driver for some specific layout type the NFS client falls back to doing I/O through the server.

After this overview Benny turned to the topic of OSDs, or object based storage devices. These devices provide a more abstract view of the disk than the classic "array of blocks" abstraction seen in todays disks. Instead of blocks, objects are the basic unit of an OSD, and each object contains both meta-data and data. The disk manages the allocation of the bytes on disk and presents the object data as a contiguous array to the system. Having this abstraction in hardware would make file system implementation much simpler. To support OSDs in Linux Benny and others are working to get bi-directional SCSI command support into the Kernel and support for variable length command descriptor blocks (CDBs).

Hybrid Disks

Hybrid disks with an NVCache (flash memory) will be in consumers' hands soon. Timothy Bisson gave an overview of this new technology. The NVCache will have 128-256Mb of non-volatile flash memory that the disk can manage as a cache (unpinned) or the operating system can manage by pinning specified blocks to the non-volatile memory. This technology can reduce power consumption or increase disk performance.

To reduce power consumption the block layer can enable the NVCache Power Mode, which tells the disk to redirect writes to the NVCache, reducing disk spin-up operations. In this mode the 10 minute writeback threshold of Linux laptop mode can be removed. Another strategy is to pin all file system metadata in the NVCache, but spin-ups will still occur on non-metadata reads. An open question is how this pinning should be managed when two or more file systems are using the same disk.

Performance can be increased by using the NVCache as a cache for writes requiring a long seek. In this mode the block layer would pin the target blocks ensuring a write to the cache instead of incurring the expensive seek. Also, a file system can use the NVCache to store its journal and boot files for additional performance and reduced system start-up time.

If Linux developers decide to manage the NVCache there are many open questions. Which layer should manage the NVCache? The file system or block layer? And what type of API should be created to leverage the cache? Another big question is how much punishment can these caches take? According to Timothy it takes about a year (using a desktop workload) to a fry the cache if you are using it as a write cache.

Scaling Linux to Petabytes

Sage Weil presented Ceph, a network file system that is designed to scale to petabytes of storage. Ceph is based on a network of object based storage devices and complete copies of each object is distributed across multiple nodes using an algorithm called CRUSH. This distribution makes it possible for nodes to be added and removed from the system dynamically. More information on the design and implementation can be found on the Ceph homepage


The workshop concluded with the general consensus that bringing together SATA, SCSI and file system people was a good idea and that the status updates and conversations were useful. However, the workshop was a bit too large for code discussion and more targeted workshops will need to be held to workout the details of some of the issues discussed at LSF'07. Topics for future workshops include virtual memory and file system issues and extensions that are needed to the VFS.

Comments (52 posted)

Patches and updates

Kernel trees


Core kernel code

Development tools

Device drivers

Filesystems and block I/O


Memory management




Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds