
Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.28-rc7, released on December 1. Linus says: "I was gone for a week, and it wasn't quite as quiet as I was hoping for, but there's a new -rc out there now with the merges of the fallout." Along with the usual fixes, 2.6.28-rc7 includes a new set of resource limits intended to prevent excessive kernel memory usage via the epoll_wait() system call.

Details, as usual, can be found in the long-format changelog.

There have been no stable kernel updates over the last week. One is in the review process as of this writing; this 104-patch monster can be expected sometime on or after December 5.


Kernel development news

Quotes of the week

Too many people seem to think that documentation is the "final" argument. It's not. Not even close. It's a hint and a help, but it's _secondary_ to code. Anybody who doesn't understand that should never be allowed to write code (or documentation, for that matter).
-- Linus Torvalds


	unsigned i;

is an act of insane vandalism, punishable by spending five additional years coding in fortran.

-- Andrew Morton

Meanwhile, 10 years and counting, the Linux kernel still generates a stupid write IO for every file read that apps do. Fortunately hardware designers will get rid of rotating disks faster than we can fix our glaring process problems in this space - but it's still a bit sad.
-- Ingo Molnar


Sam Leffler releases HAL source

Sam Leffler, maintainer of the Atheros hardware abstraction layer (HAL), has released the source under an ISC license. Previously, this piece of the MadWifi driver for Atheros wireless chipsets was only available in binary form. In his announcement, Sam states: "Coincident with the release of this code I have concluded my agreement with Atheros whereby I had access to information about their devices. This means that in the future all fixes, updates for new chips, etc. will need to be a community effort. Atheros states the Linux platform will be the reference public code base so folks wanting to add support for other platforms will have to scrape the information from there."


Tux3: the other next-generation filesystem

By Jonathan Corbet
December 2, 2008
There is a great deal of activity around Linux filesystems currently. Of the many ongoing efforts, two receive the most attention: ext4, the extension of ext3 expected to keep that filesystem design going for a few more years, and btrfs, which is seen by many as the long-term filesystem of the future. But there is another project out there which is moving quickly and is worth a look: Daniel Phillips's Tux3 filesystem.

Daniel is not a newcomer to filesystem development. His Tux2 filesystem was announced in 2000; it attracted a fair amount of interest until it turned out that Network Appliance, Inc. held patents on a number of techniques used in Tux2. There was some talk of filing for defensive patents, and Jeff Merkey popped up for long enough to claim to have hired a patent attorney to help with the situation. What really happened is that Tux2 simply faded from view. Tux3 is built on some of the same ideas as Tux2, but many of those ideas have evolved over the eight intervening years. The new filesystem, one hopes, has changed enough to avoid the attention of NetApp, which has shown a willingness to use software patents to defend its filesystem turf.

Like any self-respecting contemporary filesystem, Tux3 is based on B-trees. The inode table is such a tree; each file stored within is also a B-tree of blocks. Blocks are mapped using extents, of course - another obligatory feature for new filesystems. Most of the expected features are present. In many ways, Tux3 looks like yet another POSIX-style filesystem, but there are some interesting differences.

Tux3 implements transactions through a forward-logging mechanism. A set of changes to the filesystem will be batched together into a "phase," which is then written to the journal. Once the phase is committed to the journal, the transaction is considered to be safely completed. At some future time, the filesystem code will "roll up" the journal changes and write them back to the static version of the filesystem.

The logging implementation is interesting. Tux3 uses a variant of the copy-on-write mechanism employed by Btrfs; it will not allow any filesystem block to be overwritten in place. So writing to a block within a file will cause a new block to be allocated, with the new data written there. That, in turn, requires that the filesystem data structure which maps file-logical blocks to physical blocks (the extent) be changed to reflect the new block location. Tux3 handles this by writing the new blocks directly to their final location, then putting a "promise" to update the metadata block into the log. At roll-up time, that promise will be fulfilled through the allocation of a new block and, if necessary, the logging of a promise to change the next-higher block in the tree. In this way, changes to files propagate up through the filesystem one step at a time, without the need to make a recursive, all-at-once change.
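As a rough illustration, the write path and the deferred metadata "promise" might be sketched as follows. All names here are invented, and the single-level extent map is a drastic simplification - the real design propagates promises up a multi-level B-tree:

```python
# Hypothetical sketch of Tux3-style copy-on-write with "promises":
# data goes straight to freshly allocated blocks, while the metadata
# update is deferred to the log and applied only at roll-up time.

class Tux3Sketch:
    def __init__(self):
        self.disk = {}          # physical block -> data
        self.extent_map = {}    # logical block -> physical block
        self.log = []           # deferred metadata updates ("promises")
        self.next_free = 0

    def allocate(self):
        blk = self.next_free
        self.next_free += 1
        return blk

    def write(self, logical, data):
        # Never overwrite in place: always pick a fresh physical block.
        phys = self.allocate()
        self.disk[phys] = data
        # Promise to fix up the extent map later, at roll-up time.
        self.log.append((logical, phys))

    def read(self, logical):
        # The log is part of the metadata: consult it before the map.
        for lg, phys in reversed(self.log):
            if lg == logical:
                return self.disk[phys]
        return self.disk[self.extent_map[logical]]

    def roll_up(self):
        # Fulfill the promises: apply logged updates to the extent map.
        for logical, phys in self.log:
            self.extent_map[logical] = phys
        self.log.clear()
```

Note how reads must consult the log first; this is the sense in which the log is an integral part of the metadata, and why skipping the roll-up at unmount time still leaves a consistent filesystem.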

The end result is that a specific change can remain in the log for some time. In Tux3, the log can be thought of as an integral part of the filesystem's metadata. This is true to the point that Tux3 doesn't even bother to roll up the log when the filesystem is unmounted; it just initializes its state from the log when the next mount happens. Among other things, Daniel says, this approach ensures that the journal recovery code will be well-tested and robust - it will be exercised at every filesystem mount.

In most filesystems, on-disk inodes are fixed-size objects. In Tux3, instead, their size will be variable. Inodes are essentially containers for attributes; in Tux3, normal filesystem data and extended attributes are treated in almost the same way. So an inode with more attributes will be larger. Extended attributes are compressed through the use of an "atom table" which remaps attribute names onto small integers. Filesystems with extended attributes tend to have large numbers of files using attributes with a small number of names, so the space savings across an entire filesystem could be significant.
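The atom-table idea is simple string interning; a toy version (names and layout invented here) might look like:

```python
# Hypothetical sketch of an "atom table": extended attribute names are
# interned to small integers, so each inode stores a compact atom
# number instead of a full name string.

class AtomTable:
    def __init__(self):
        self.atoms = {}     # attribute name -> small integer
        self.names = []     # atom number -> attribute name

    def intern(self, name):
        # Return the existing atom, or assign the next small integer.
        if name not in self.atoms:
            self.atoms[name] = len(self.names)
            self.names.append(name)
        return self.atoms[name]

    def name_of(self, atom):
        return self.names[atom]

table = AtomTable()
# Many files tend to use the same few attribute names...
a = table.intern("user.mime_type")
b = table.intern("security.selinux")
c = table.intern("user.mime_type")
# ...so every inode stores one small atom rather than the full string.
assert a == c == 0 and b == 1
assert table.name_of(1) == "security.selinux"
```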

Also counted among a file's attributes are the blocks where the data is stored. The Tux3 design envisions a number of different ways in which file blocks can be tracked. A B-tree of extents is a common solution to this problem, but its benefits are generally seen with larger files. For smaller files - still the majority of files on a typical Linux system - data can be stored either directly in the inode or at the other end of a simple block pointer. Those representations are more compact for small files, and they provide quicker data access as well. For the moment, though, only extents are implemented.

Another interesting - but unimplemented - idea for Tux3 is the concept of versioned pointers. The btrfs filesystem implements snapshots by retaining a copy of the entire filesystem tree; one of these copies exists for every snapshot. The copy-on-write mechanism in btrfs ensures that those snapshots share data which has not been changed, so it is not as bad as it sounds. Tux3 plans to take a different approach to the problem; it will keep a single copy of the filesystem tree, but keep track of different versions of blocks (or extents, really) within that tree. So the versioning information is stored in the leaves of the tree, rather than at the top. But the versioned extents idea has been deferred for now, in favor of getting a working filesystem together.
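The versioned-pointer idea can be illustrated with a toy lookup; the flat list of (version, block) pairs below is an invented simplification of what would really live in the leaves of the B-tree:

```python
# Hedged sketch of versioned pointers: rather than one tree copy per
# snapshot, each leaf keeps (version, physical_block) pairs, and a
# lookup for snapshot v picks the newest entry with version <= v.

def lookup(versioned_leaf, snapshot):
    """versioned_leaf: (version, phys) pairs, sorted by version."""
    best = None
    for version, phys in versioned_leaf:
        if version <= snapshot:
            best = phys     # newest block visible to this snapshot
    return best

leaf = [(1, 100), (3, 205), (7, 340)]   # block rewritten in versions 1, 3, 7
assert lookup(leaf, 2) == 100   # snapshot 2 still sees the version-1 block
assert lookup(leaf, 5) == 205
assert lookup(leaf, 9) == 340
```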

Also removed from the initial feature list is support for subvolumes. This feature initially seemed like an easy thing to do, but interaction with fsync() proved hard. So Daniel finally concluded that volume management was best left to volume managers and dropped the subvolume feature from Tux3.

One feature which has never been on the list is checksumming of data. Daniel once commented:

Having been checksumming filesystem data during continuous replication for two years now on multiple machines, and having caught exactly zero blocks of bad data passed as good in that time, I consider the spectre of disks passing bad data as good to be largely vendor FUD. That said, checksumming will likely appear in the feature list at some point, I just consider it a decoration, not an essential feature.

Tux3 development is far from the point where the developers can worry about "decorations"; it remains, at this point, an embryonic project being pushed by a developer with a bit of a reputation for bright ideas which never quite reach completion. The code, thus far, has been developed in user space using FUSE. There is, however, an in-kernel version which is now ready for further development. According to Daniel:

The functionality we have today is roughly like a buggy Ext2 with missing features. While it is very definitely not something you want to store your files on, this undeniably is Tux3 and demonstrates a lot of new design elements that I have described in some detail over the last few months. The variable length inodes, the attribute packing, the btree design, the compact extent encoding and deduplication of extended attribute names are all working out really well.

The potential user community for a stripped-down ext2 with bugs is likely to be relatively small. But the Tux3 design just might have enough to offer to make it a contender eventually.

First, though, there are a few little problems to solve. At the top of the list, arguably, is the complete lack of locking - locking being the rocks upon which other filesystem projects have run badly aground. The code needs some cleanups - little problems like the almost complete lack of comments and the use of macros as formal function parameters are likely to raise red flags on wider review. Work on an fsck utility does not appear to have begun. There has been no real benchmarking work done; it will be interesting to see how Daniel can manage the "never overwrite a block" policy in a way which does not fragment files (and thus hurt performance) over time. And so on.

That said, a lot of these problems could end up being resolved rather quickly. Daniel has put the code out there and appears to have attracted an energetic (if small) community of contributors. Tux3 represents the core of a new filesystem with some interesting ideas. Code comments may be scarce, but Daniel - never known as a tight-lipped developer - has posted a wealth of information which can be found in the Tux3 mailing list archives. Potential contributors should be aware of Daniel's licensing scheme - GPLv3 with a reserved unilateral right to relicense the code to anything else - but developers who are comfortable with that are likely to find an interesting and fast-moving project to play in.


Debugfs and the making of a stable ABI

By Jonathan Corbet
December 3, 2008
Remi Colinet recently proposed the addition of a new virtual file, /proc/mempool, which would display the usage of memory pools within the kernel. Nobody really disagreed with the idea of making this information available, but there were some grumbles about putting it into /proc. Once upon a time, just about anything could go into that directory, but, in recent years, there has been a real attempt to confine /proc to its original intent: providing information about processes. /proc/mempool is not about processes, so it was considered procfile-non-grata. It was suggested that another home should be found for this file.

Where that other home should be is not obvious, though. Somewhere like /sys/kernel might seem to make sense, but sysfs has rules of its own. In particular, the one-value-per-file rule makes it hard to create a single file where developers can simply query the state of a kernel subsystem, so sysfs is not a suitable home for this file either.

The next option is debugfs, which was created in December 2004. Debugfs is meant to be an aid for kernel developers; it explicitly disclaims any rules on the types of files that can be put there, with one exception: debugfs is not a mandatory part of any kernel installation, and nothing found therein should be considered to be a part of the stable user-space ABI. It is, instead, a dumping ground where kernel developers can quickly export information which is useful to them.

Since debugfs is not a part of the user-space ABI, it seems like a poor place to put things that users might depend on. When this was pointed out, it became clear that the non-ABI status of debugfs is not as well established as one might think. Quoting Matt Mackall:

The problem with debugfs is that it claims to not be an ABI but it is lying. Distributions ship tools that depend on portions of debugfs. And they also ship debugfs in their kernel. So it is effectively the same as /proc, except with the 1.0-era everything-goes attitude rather than the 2.6-era we-should-really-think-about-this one.

Pushing stuff from procfs to debugfs is thus just setting us up for pain down the road. Don't do it. In five years, we'll discover we can't turn debugfs off or even clean it up because too much relies on it.

As an example, Matt pointed out the extensively-documented usbmon interface which provides a great deal of information about what's happening on a USB bus. If it is not an ABI, he says, nobody should be upset if he submits a patch which breaks it.

That is a perennial problem with interfaces between the kernel and user space; changing them causes pain for users. That is why incompatible changes to user-space interfaces are almost never allowed; an important goal for the kernel development process is to avoid breaking user-space programs. One might think that this problem could be avoided for a specific interface by explicitly documenting it as an unstable interface. The files in Documentation/ABI/testing are meant to serve that role; anything found there should be considered to be unstable. But, as soon as people start using programs which depend on a specific interface, it has, for all practical purposes, hardened into part of the kernel ABI.

Linus put it this way:

The fact that something is documented (whether correctly or not) has absolutely _zero_ impact on anything at all. What makes something an ABI is that it's useful and available. The only way something isn't an ABI is by _explicitly_ making sure that it's not available even by mistake in a stable form for binary use.

Example: kernel internal data structures and function calls. We make sure that you simply _cannot_ make a binary that works across kernel versions. That is the only way for an ABI to not form.

So a given kernel interface can be kept away from ABI status if it is so hard to get to, and so unstable, that nothing ever comes to depend on it. The kernel module interface certainly fits this bill. Modules must generally be built for the exact kernel they are intended to work with, and they must often be built with the same configuration options and the same compiler. Anybody who has gotten into the dark business of distributing binary-only modules has learned what a challenge it can be.

Debugfs is different, though. It is enabled in a number of distributor kernels, even if, perhaps, it is not mounted by default. Once a set of files gets placed there, their format tends to change rarely. So it is possible for people to write programs which depend on debugfs files. And the end result of that is that debugfs files can become part of the stable kernel ABI. That is generally not a result that was intended by anybody involved, but it happens anyway. The only way to avoid it would be to deliberately shake up debugfs every kernel cycle - and few developers have much desire to do that.

This is a discussion without a whole lot in the way of useful conclusions; it leaves /proc/mempool without a home. ABI design, it turns out, is still hard. In the longer term, dealing with an ABI which was never really designed, but which just sort of settled into being, is even harder. There does not appear to be any substitute for thinking seriously about every interface between kernel and user space, even if it's just for a developer's debugging tool.


Variations on fair I/O schedulers

December 3, 2008

This article was contributed by Goldwyn Rodrigues

An I/O scheduler is a subsystem of the kernel which schedules I/O operations to the various storage devices to get the best possible throughput from those devices. The algorithm is often reminiscent of the one used by elevators when dealing with requests coming from different floors to go up or down, which is why I/O scheduling algorithms are also called "elevators." I/O requests are submitted in an order designed to minimize disk head movement (thus minimizing disk seek times) while still guaranteeing good I/O rates. The next request chosen depends on the current disk head position, the goal being to service requests quickly and spend less time seeking, or moving the disk head. Algorithms may also consider other aspects such as fairness or time guarantees.

The Completely Fair Queuing (CFQ) I/O scheduler is one of the most popular I/O scheduling algorithms; it is used as the default scheduler in most distributions. As the name suggests, the CFQ scheduler tries to maintain fairness in its distribution of bandwidth to processes without compromising much on throughput. The elevator's fairness is accomplished by servicing all processes and not penalizing those whose requests are far from the current disk head position. It grants a time slice to every process; once a task has consumed its slice, the slice is recomputed and the task is added to the end of the queue. The I/O priority is used to compute the time slice granted and the offset in the request queue.
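The time-slice round robin can be sketched roughly as follows; the slice formula and the 100ms base are invented for illustration and are not the kernel's actual values:

```python
# Rough, invented sketch of CFQ-style time slicing: each process
# receives a slice scaled by its I/O priority, is serviced for that
# slice, and is then requeued at the tail of the queue.

from collections import deque

BASE_SLICE = 100  # ms, hypothetical base time slice

def slice_for(ioprio):
    # Lower ioprio number means higher priority, hence a longer slice.
    return BASE_SLICE * (8 - ioprio) // 8   # ioprio in 0..7

# Three processes with different I/O priorities share the elevator.
queue = deque([("reader", 0), ("backup", 7), ("editor", 4)])
schedule = []
for _ in range(6):
    name, prio = queue.popleft()
    schedule.append((name, slice_for(prio)))  # dispatch for this slice
    queue.append((name, prio))                # requeue at the tail

assert schedule[:3] == [("reader", 100), ("backup", 12), ("editor", 50)]
```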

The Budget Fair Queuing scheduler

The time-based allocation of the disk service in CFQ, while having the desirable effect of implicitly charging each application for the seek time it incurs, still suffers from fairness problems, especially towards processes which make the best possible use of the disk bandwidth. If the same time slice is assigned to two processes, they may each get different throughput, as a function of the positions on the disk of their requests. Moreover, due to its round robin policy, CFQ is characterized by an O(N) worst-case delay (jitter) in request completion time, where N is the number of tasks competing for the disk.

The Budget Fair Queuing (BFQ) scheduler, developed by Fabio Checconi and Paolo Valente, changes the CFQ round-robin scheduling policy based on time slices into a fair queuing policy based on sector budgets. Each task is assigned a budget measured in sectors instead of an amount of time, and budgets are scheduled using a slightly modified version of the Worst-case Fair Weighted Fair Queuing+ (WF2Q+) algorithm (described in this paper [compressed PS]), which guarantees a worst-case complexity of O(log N) and boils down to O(1) in most cases. The budget assigned to each task varies over time as a function of its behavior. However, one can set the maximum value of the budget that BFQ can assign to any task.
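A heavily simplified sketch of the budget idea, leaving out the eligibility tracking and per-queue weights of the real B-WF2Q+ algorithm:

```python
# Invented, minimal sketch of budget-based fair queuing in the spirit
# of BFQ: queues are charged in sectors, and the next queue picked is
# the one with the smallest virtual finish time.

import heapq

class BudgetScheduler:
    def __init__(self, max_budget=64):
        self.max_budget = max_budget  # budget in sectors, not time
        self.vtime = 0                # scheduler's virtual time
        self.heap = []                # (virtual finish time, queue name)

    def add_queue(self, name):
        heapq.heappush(self.heap, (self.vtime + self.max_budget, name))

    def next_queue(self):
        # Serve the queue whose budget finishes earliest in virtual time.
        finish, name = heapq.heappop(self.heap)
        self.vtime = finish
        # Requeue with a fresh budget once this one is exhausted.
        heapq.heappush(self.heap, (self.vtime + self.max_budget, name))
        return name
```

With equal budgets the queues simply alternate; a weighted variant would divide each queue's budget charge by its weight before computing the finish time, giving heavier queues a larger share of the sectors served.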

BFQ can provide strong guarantees on bandwidth distribution because the assigned budgets are measured in sectors. There are limits, though: a process spending too much time to exhaust its budget is penalized, and the scheduler moves on to select the next process to dispatch I/O. The next budget is calculated from the feedback provided by the requests serviced.

BFQ also introduces I/O scheduling within control groups. Queues are collected into a tree of groups, and there is a distinct B-WF2Q+ scheduler on each non-leaf node. Leaf nodes are request queues as in the non-hierarchical case. BFQ supports I/O priority classes at each hierarchy level, enforcing a strict priority ordering among classes. This means that idle queues or groups are served only if there are no best-effort queues or groups in the same control group, and best-effort queues and groups are served only if there are no real-time queues or groups. As compared to cfq-cgroups (explained later), it lacks per-device priorities. The developers, however, claim that this feature can be incorporated easily.


Requests coming to an I/O scheduler fall into two categories, synchronous and asynchronous. Synchronous requests are those for which the application must wait before continuing to send further requests - typically read requests. On the other hand, asynchronous requests - typically writes - do not block the application's progress while they are executed. In BFQ, as in CFQ, synchronous requests are collected in per-task queues, while asynchronous requests are collected in per-device (or, in the case of hierarchical scheduling, per group) queues.

When the underlying device driver asks for the next request to serve and there is no queue being served, BFQ uses B-WF2Q+, a modified version of WF2Q+, to choose a queue. It then selects the first request from that queue in C-LOOK order and returns it to the driver. C-LOOK is a disk scheduling algorithm in which the next request picked is the one at the next-highest disk sector relative to the current position of the disk head. Once the disk has serviced the request with the highest sector number in the queue, it moves the head back to the request with the lowest sector number.
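C-LOOK request selection is easy to sketch; the sector numbers below are made up:

```python
# Hypothetical sketch of C-LOOK: serve the request at the next-highest
# sector after the head; once past the highest pending sector, jump
# back to the lowest pending one.

def clook_next(pending, head):
    """pending: set of sector numbers with outstanding requests."""
    ahead = [s for s in pending if s >= head]
    if ahead:
        return min(ahead)    # next-highest sector from the head
    return min(pending)      # wrap around to the lowest sector

order, head = [], 50
pending = {10, 40, 55, 90}
while pending:
    nxt = clook_next(pending, head)
    order.append(nxt)
    pending.remove(nxt)
    head = nxt               # the head now sits at the served sector

assert order == [55, 90, 10, 40]
```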

When a new queue is selected, it is assigned a budget in disk sector units, which is decremented each time a request from that queue is served. When the device driver asks for new requests and there is a queue under service, requests are chosen from that queue until one of the following conditions is met: (1) the queue exhausts its budget, (2) the queue spends too much time consuming its budget, or (3) the queue has no more requests to serve.

On termination of a request, the scheduler recalculates the budget allocated to the process depending on the feedback it gets. For example, a greedy process which has exhausted its budget sees its budget increased, whereas a process which has been idle for long sees its budget decreased. The maximum budget a process can get is a configurable system parameter (max_budget). Two other parameters, timeout_sync and timeout_async, control the timeout for consuming the budget of the synchronous and asynchronous queues, respectively. In addition, max_budget_async_rq limits the maximum number of requests serviced from an asynchronous queue.
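The feedback loop might look something like this; the doubling and halving factors, and the max_budget default, are assumptions for illustration rather than BFQ's actual tuning:

```python
# Illustrative sketch of BFQ-style budget feedback: a queue that
# consumed its whole budget gets a larger one next time, while a
# queue that went idle gets a smaller one.

def next_budget(budget, exhausted, idled, max_budget=8192):
    if exhausted:
        return min(budget * 2, max_budget)  # greedy: grow toward the cap
    if idled:
        return max(budget // 2, 1)          # idle: shrink the budget
    return budget                           # otherwise leave it alone
```

Because each new budget depends on how the previous one was spent, the scheduler adapts to each process's behavior over time; this is the "self-learning" aspect noted below.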

If a synchronous queue has no more requests to serve but has some budget left, the scheduler idles (i.e., it tells the device driver that it has no requests to serve even if there are other active queues) for a short period, in anticipation of a new request from the task owning the queue.

Test Results

The developers compared six different I/O scheduling algorithms: BFQ, YFQ, SCAN-EDF, CFQ, the Linux anticipatory scheduler, and C-LOOK. They ran a multitude of test scenarios analogous to real-life workloads, measuring throughput, bandwidth distribution, latency, and short-term time guarantees. With respect to bandwidth distribution, BFQ comes out best, and it is a good algorithm for most scenarios. There were also extensive tests comparing BFQ against CFQ, and the results are available here. The throughput of BFQ is more or less the same as CFQ's, but it scores well in distributing I/O bandwidth fairly among processes, and displays lower latency with streaming data.

Using sector budgets, instead of time, as the basis for granting slices for fair bandwidth distribution is an interesting concept. The algorithm also employs timeouts to terminate the requests of "seeky" processes taking too much time to consume their budgets, and penalizes them. The feedback from current requests helps determine future budgets, making the algorithm self-learning. Such tight bandwidth distribution would be a requirement for systems running virtual machines or container classes. Adoption, however, depends on how BFQ stands the test of time against the tried-and-tested, stable CFQ.

See the BFQ technical report [PDF] for (much) more information.

Expanded CFQ

Control Groups provide a mechanism for aggregating sets of tasks, and all their future children, into hierarchical groups. These groups can be allocated dedicated portions of the available resources, or resource sharing can be prioritized within them. Control groups are managed through the cgroups pseudo-filesystem. Once it is mounted, the top-level directory shows the complete set of existing control groups. Each directory created under the cgroup root creates a new group, and resources can be allocated to the tasks listed in the tasks file in each group's directory.

Control groups can be used to regulate access to CPU time, memory, and more. There are also several projects working toward the creation of I/O bandwidth controllers for control groups. One of those is the expanded CFQ scheduler patch for cgroups by Satoshi Uchida. This patch set introduces a new I/O scheduler called cfq-cgroups, which introduces cgroups for the I/O scheduling subsystem.

This scheduler, as the name suggests, is based on the Completely Fair Queuing I/O scheduler. It schedules processes hierarchically with respect to the cgroup they belong to, each cgroup having its own CFQ scheduler. I/O devices in a control group can be prioritized; the time slice given to each hierarchical group per device is a function of the device priority. This allows shaping of I/O bandwidth per group, per device.


To use cfq-cgroups, select it as the default scheduler at boot by passing elevator=cfq-cgroups as a boot parameter. The scheduler can also be changed dynamically for individual devices by writing cfq-cgroups to /sys/block/<device>/queue/scheduler. There are two levels of control: through the cgroups filesystem, for individual groups, and through sysfs, for individual devices.

Like any other control group, cfq-cgroup is managed through the cgroup pseudo-filesystem. To access the cgroups, mount the pseudo cgroups filesystem:

    # mount -t cgroup -o cfq cfq /mnt/cgroup

The cgroup directory, by default, will have a file called cfq.ioprio, which contains the individual priority on a per-device basis. The time slice received per device per group is a function of the I/O priority listed in cfq.ioprio. The tasks file represents the list of tasks in the particular group. To make more groups, create a directory in the mounted cgroup directory:

    # mkdir /mnt/cgroup/group1

The new directories are automatically populated with files (cfq.ioprio, tasks, etc.) which are used to control the resources in this group. To add a task to a group, write the process ID of the task to the tasks file:

    # echo <pid> > /mnt/cgroup/group1/tasks

The cfq.ioprio file contains the list of devices and their respective priorities. Each device in the cgroup has a default I/O priority of 3; the valid values are 0 to 7. To change the default priority for the cgroup group1, run:

    # echo 2 > /mnt/cgroup/group1/cfq.ioprio

This would change the priority of the entire group. To change the I/O priority of a specific device:

    # echo 2 sda > /mnt/cgroup/group1/cfq.ioprio

To change the default priority while keeping the priority of the devices unchanged:

    # echo 4 defaults > /mnt/cgroup/group1/cfq.ioprio

The device view shows the list of cgroups and their respective priorities on a per-group basis. This can be changed by:

    # echo 2 group1 > /sys/block/sda/queue/iosched/ioprio

The device view contains other parameters similar to those of the CFQ scheduler, such as back_seek_max or back_seek_penalty, which are specific to the control of the individual device, just as in traditional CFQ.


The patch introduces a new data structure called cfq_driver_data for the control of I/O bandwidth for cgroups. All driver-related data has been moved from the traditional cfq_data structure to the cfq_driver_data structure. Similarly, cfq_cgroups is a new data structure to control the cgroup parameters. The organization of the data can be viewed as a matrix with cfq_cgroups as rows and cfq_driver_data as columns, as shown in the diagram below.

[cfq-cgroup diagram]

At each intersection, there is a cfq_data structure which is responsible for all CFQ-related queue handling, so that each cfq_data corresponds to one cfq_cgroup and cfq_driver_data combination.

When a new cgroup is created, the cfq_data from the parent cgroup is copied into the new group. While inserting new nodes of cfq_data into the cgroup, the cfq_data structure is initialized with the priority of the cfq_cgroup. This way all data of the parent is inherited by the child cgroup, and shows up in the respective files per group in the cgroup filesystem.

Scheduling of cfq_data within the CFQ scheduler is similar to that of the native CFQ scheduler. Each node is assigned a time slice. This slice is calculated according to the I/O priority of the device, using the per-device base time slice. The time slice offset forms the key of the red-black node to be inserted in the service tree. One cfq_data entry is picked from the start of the red-black tree and scheduled. Once its time slice expires it is added to the tree again, after recalculation of its time slice offset. So, each cfq_data structure acts as a queue node per device, and, within each CFQ data structure, requests are queued as with a regular CFQ queue.

Both BFQ and cfq-cgroups are attempts to bring a higher degree of fairness to I/O scheduling, with "fairness" being tempered by the desire to add more administrative control via the control groups mechanism. They both appear to be useful solutions, but they must contend with the wealth of other I/O bandwidth control implementations out there. Coming to some sort of consensus on which approach is the right one could prove to be a rather longer process than simply implementing these algorithms in the first place.


Page editor: Jake Edge

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds