
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.36-rc3, which was released on August 29. "Nothing in particular stands out that I can recall. As usual, it's mostly driver updates (65%), of which a large piece (by line count) is just the removal of a staging driver that isn't really ready nor making any progress. But on the 'somewhat more likely to cause excitement' front, there's some radeon/nouveau drm updates too." See the full changelog for all the details.

Stable updates: the latest set of stable kernels was released on August 27. As usual, they contain fixes throughout the kernel tree. Note that this is the last stable update for the 2.6.34 kernel series.

Comments (none posted)

Quotes of the week

SIGEV_THREAD is the best proof that the whole posix timer interface was comitte[e]d under the influence of not to be revealed mind-altering substances.
-- Thomas Gleixner

Reaching into the first bag, quite worn out with a faded "27" on the outside of it, he grabbed one of the remaining kernels in there and tossed it into the group. A few small people at the back of the crowd caught the kernel, and slowly walked off toward the village. Og wondered about these people, constantly relying on the old kernels to save them for another day, while resisting the change to move on, harboring some kind of strange reverence for this specific brand that Og just could not understand.
-- Greg Kroah-Hartman

Comments (9 posted)

Kernel development news

Some numbers and thoughts on the stable kernels

By Jonathan Corbet
August 27, 2010
Much attention goes toward mainline kernel releases, but relatively few users actually run those kernels. Instead, they run kernels provided by their distributors, and those kernels, in turn, are based on the stable kernel series. The practice of releasing stable kernels has been going on for well over five years now, so perhaps it's time to look back at how it has been going.

Back in March 2005, the community was discussing ways of getting important fixes out to users of mainline releases. There was talk of maintaining a separate tree containing nothing but fixes; Linus, at the time, thought that any such attempt was doomed to failure:

I'll tell you what the problem is: I don't think you'll find anybody to do the parallel "only trivial patches" tree. They'll go crazy in a couple of weeks. Why? Because it's a _damn_ hard problem. Where do you draw the line? What's an acceptable patch? And if you get it wrong, people will complain _very_ loudly, since by now you've "promised" them a kernel that is better than the mainline. In other words: there's almost zero glory, there are no interesting problems, and there will absolutely be people who claim that you're a dick-head and worse, probably on a weekly basis.

With such strong words of encouragement, somebody clearly had to step up to the job; that somebody turned out to be Greg Kroah-Hartman and Chris Wright. They released the first stable update on March 4, 2005, with all of three fixes. More than five years later, Greg (a.k.a. "Og") is still at it (Chris has not been active with stable updates for a while). During that time, the stable release history has looked like this:

Kernel    Updates    Changes (total)    Changes per release
2.6.27       53          1553                  29
2.6.32       21          1793                  85
2.6.35        4           228                  57

In the table above, 2.6.27, 2.6.32, and 2.6.35 are still receiving stable updates as of this writing (though 2.6.27 is clearly reaching the end of the line).
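
For readers who want to check the arithmetic, the per-release figures are simply the total change count divided by the number of updates; the values below are taken straight from the table:

```python
# "Per release" in the table above is just total changes divided by the
# number of updates; these figures come from the table itself.
table = {
    "2.6.27": (53, 1553),
    "2.6.32": (21, 1793),
    "2.6.35": (4, 228),
}
for kernel, (updates, total) in table.items():
    print(f"{kernel}: {total / updates:.0f} changes per release")
# prints 29, 85, and 57 for the three series
```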

A couple of conclusions immediately jump out of the table above. The first is that the number of fixes going into stable updates has clearly increased over time. From this one might conclude that our kernel releases have steadily been getting buggier. That is hard to measure, but one should bear in mind that there is another important factor at work here: the kernel developers are simply directing more fixes toward the stable tree. Far more developers are looking at patches with stable updates in mind, and suggestions that a patch should be sent in that direction are quite common. So far fewer patches fall through the cracks than they did in the early days.

There is another factor at work here as well. The initial definition of what constituted an appropriate stable tree patch was severely restrictive; if a bug did not cause a demonstrable oops or vulnerability, the fix was not considered for the stable tree. By the time we get to the update, though, we see a wider variety of "fixes," including Acer rv620 laptop support, typo fixes, tracepoint improvements to make powertop work better, the optimistic spinning mutex scalability work, a new emu10k1 sound driver module parameter, and oprofile support for a new Intel processor. These enhancements are, arguably, all things that stable kernel users would like to have. But they definitely go beyond the original charter for this tree.

Your editor has also, recently, seen an occasional complaint about regressions finding their way into stable updates; given the volume of patches going into stable updates now, a regression every now and then should not be surprising. Regressions in the stable tree are a worrisome prospect; one can only hope that the problem does not get worse.

Another noteworthy fact is that the number of stable updates for most kernels appears to be falling slowly; the five updates over the entire 2.6.34 history are the lowest ever, matched only by the 2.6.13 series. Even then, 2.6.34 got one more update than had originally been planned as the result of a security issue. It should be obvious that handling this kind of patch flow for as many as four kernels simultaneously is a lot of work; Greg, who has a few other things on his plate as well, may be running a little short on time.

Who is actually contributing patches to stable kernels? Your editor decided to do a bit of git data mining. Two kernels were chosen: 2.6.32, which is being maintained for an extended period as the result of its use in "enterprise" distributions, and 2.6.34, being the most recent kernel which has seen its final stable update. Here are the top contributors for both:

Most active stable contributors

2.6.32:
  Greg Kroah-Hartman           36   2.0%
  Daniel T Chen                32   1.8%
  Linus Torvalds               23   1.3%
  Trond Myklebust              23   1.3%
  Borislav Petkov              23   1.3%
  Ben Hutchings                21   1.2%
  David S. Miller              20   1.1%
  Theodore Ts'o                20   1.1%
  Tejun Heo                    20   1.1%
  Dmitry Monakhov              20   1.1%
  Takashi Iwai                 18   1.0%
  Ian Campbell                 18   1.0%
  Jean Delvare                 17   0.9%
  Henrique de Moraes Holschuh  17   0.9%
  Yan, Zheng                   17   0.9%
  Zhao Yakui                   17   0.9%
  Alan Stern                   17   0.9%
  Al Viro                      16   0.9%
  Alex Deucher                 15   0.8%
  Dan Carpenter                15   0.8%

2.6.34:
  Alex Deucher                 14   2.8%
  Joerg Roedel                 14   2.8%
  Tejun Heo                    10   2.0%
  Daniel T Chen                 9   1.8%
  Neil Brown                    8   1.6%
  Rafael J. Wysocki             8   1.6%
  Linus Torvalds                7   1.4%
  Greg Kroah-Hartman            7   1.4%
  Alan Stern                    7   1.4%
  Jesse Barnes                  7   1.4%
  Trond Myklebust               7   1.4%
  Ben Hutchings                 7   1.4%
  Tilman Schmidt                7   1.4%
  Avi Kivity                    7   1.4%
  Sarah Sharp                   7   1.4%
  Ian Campbell                  6   1.2%
  Johannes Berg                 6   1.2%
  Jean Delvare                  6   1.2%
  Johan Hovold                  6   1.2%
  Will Deacon                   5   1.0%

Some names on this list will be familiar. Linus never shows up on the list of top mainline contributors anymore, but he does generate a fair number of stable fixes. Other names are seen less often in the kernel context: Daniel Chen, for example, is an Ubuntu community contributor; his contributions are mostly in the welcome area of making audio devices actually work. Some of the people are in the list above because they introduced the bugs that their patches fix - appearing in that role is not necessarily an honor. But - admittedly without having done any sort of rigorous study - your editor suspects that most of the people listed above are fixing bugs introduced by others. They are performing an important and underappreciated service, turning mainline releases into kernels that the rest of the world actually wants to run.

We can also look at who is supporting this work:

Most active stable contributors by employer

2.6.32:
  Red Hat            267   14.9%
  Linux Foundation    24    1.3%

2.6.34:
  Red Hat             61   12.0%
  Linux Foundation     7    1.4%

These numbers quite closely match those for mainline kernel contributions, especially at the upper end. Fixing bugs is said to be boring and unglamorous work, but volunteers are still our leading source of fixes.

We did without a stable tree for the first ten 2.6.x releases, though, at this point, it's hard to imagine just how. In an ideal world, a mainline kernel release would not happen until there were no bugs left; the history of (among others) the 2.3 and 2.5 kernel development cycles shows that this approach does not work in the real world. There comes a point where the community has to create a stable release and go on to the next cycle; the stable tree allows that fork to happen without ending the flow of fixes into the released kernel.

The tables above suggest that the stable kernel process is working well, with large numbers of fixes being directed into stable updates and with participation from across the community. There may come a point, though, where that community needs to revisit the standards for patches going into stable updates. At some point, it may also become clear that the job of maintaining these kernels is too big for one person to manage. For now, though, the stable tree is clearly doing what it is intended to do; Greg deserves a lot of credit for making it work so well for so long.

Comments (41 posted)

Another union filesystem approach

By Jake Edge
September 1, 2010

Creating a union of two (or more) filesystems is a commonly requested feature for Linux that has never made it into the mainline. Various implementations have been tried (part 1 and part 2 of Valerie Aurora's look from early 2009), but none has crossed the threshold for inclusion. Of late, union mounts have been making some progress, but there is still work to do there. A hybrid approach—incorporating both filesystem- and VFS-based techniques—has recently been posted in an RFC patchset by Miklos Szeredi.

The idea behind unioning filesystems is quite simple, but the devil is in the details. In a union, one filesystem is mounted "atop" another, with the contents of both filesystems appearing to be in a single filesystem encompassing both. Changes made to the filesystem are reflected in the "upper" filesystem, and the "lower" filesystem is treated as read-only. One common use case is to have a filesystem on read-only media (e.g. CD) but allow users to make changes by writing to the upper filesystem stored on read-write media (e.g. flash or disk).

There are a number of details that bedevil developers of unions, however, including various problems with namespace handling, dealing with deleted files and directories, the POSIX definition of readdir(), and so on. None of them are insurmountable, but they are difficult, and it is even harder to implement them in a way that doesn't run afoul of the technical complaints of the VFS maintainers.

Szeredi's approach blends the filesystem-based implementations, like unionfs and aufs, with the VFS-based implementation of union mounts. For file objects, an open() is forwarded to whichever of the two underlying filesystems contains it, while directories are handled by the union filesystem layer. Neil Brown's very helpful first cut at documentation for the patches lumped directory handling in with files, but Szeredi called that a bug. Directory access is never forwarded to the other filesystems and directories need to "come from the union itself for various reasons", he said.

As outlined in Brown's document, most of the action for unions takes place in directories. For one thing, it is more accurate to look at the feature as unioning directory trees, rather than filesystems, as there is no requirement that the two trees reside in separate filesystems. In theory, the lower tree could even be another union, but the current implementation precludes that.

The filesystem used by the upper tree needs to support "trusted" extended attributes (xattrs), and it must also provide a valid d_type (file type) in readdir() responses, which rules out NFS. Whiteouts—that is, files that exist in the lower tree but have been removed in the upper—are handled using the "trusted.union.whiteout" xattr. Similarly, opaque directories, which do not allow entries in the lower tree to "show through", are handled with the "trusted.union.opaque" xattr.
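
As a rough illustration of how the two markers affect name lookup, consider the following sketch. It is a simplified model, not the kernel code: the dict-based layers and flag sets are invented for the example.

```python
# Simplified model of whiteout/opaque handling; each layer is a dict from
# name to entry, and the xattr markers are modeled as per-name flag sets.
WHITEOUT = "trusted.union.whiteout"  # marks a name deleted in the upper layer
OPAQUE = "trusted.union.opaque"      # marks a directory hiding the lower layer

def union_lookup(name, upper, lower, upper_xattrs):
    if name in upper:
        if WHITEOUT in upper_xattrs.get(name, set()):
            return None              # whiteout: the lower entry must not show
        return upper[name]
    if OPAQUE in upper_xattrs.get(".", set()):
        return None                  # opaque directory: nothing shows through
    return lower.get(name)           # otherwise fall through to the lower layer
```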

Directory entries are merged with fairly straightforward rules: if there are entries in both the upper and lower layers with the same name, the upper always takes precedence unless both are directories. In that case, a directory in the union is created that merges the entries from each. The initial mount creates a merged directory of the roots of the upper and lower directory trees and subsequent lookups follow the rules, creating merged directories that get cached in the union dentry as needed.
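
The merge rule above can be sketched as follows; this is a simplified model using nested dicts, not the actual implementation:

```python
# Entries map a name to ("dir", entries) or ("file", data). On a collision
# the upper entry wins, unless both sides are directories, in which case
# the merged directory unions the two sets of entries (recursively).
def merge_dir(upper, lower):
    merged = dict(lower)
    for name, entry in upper.items():
        if name in lower and entry[0] == "dir" and lower[name][0] == "dir":
            merged[name] = ("dir", merge_dir(entry[1], lower[name][1]))
        else:
            merged[name] = entry     # upper takes precedence
    return merged
```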

Write access to lower-layer files is handled by the traditional "copy up" approach: opening a lower file for write, or changing its metadata, causes the file to be copied to the upper tree. That may require creating intervening directories if the file is several levels down in the lower tree. Once that is done, though, the hybrid union filesystem has little further interaction with the file, at least directly, because operations are handed off to the upper filesystem.
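
A user-space sketch of the copy-up step, assuming plain directory trees; copy_up and its path handling are illustrative, not the kernel's code:

```python
import os
import shutil

# Illustrative copy-up: before a lower file is written, copy it to the same
# relative path in the upper tree, creating intervening directories first.
def copy_up(lower_root, upper_root, relpath):
    src = os.path.join(lower_root, relpath)
    dst = os.path.join(upper_root, relpath)
    if not os.path.exists(dst):
        os.makedirs(os.path.dirname(dst), exist_ok=True)  # intervening dirs
        shutil.copy2(src, dst)       # copies data and (some) metadata
    return dst                       # subsequent operations go to the upper file
```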

The patchset is relatively small and makes only a few small changes to the VFS—except for a change to struct inode_operations that ripples through the filesystem tree. The permission() member of that structure currently takes a struct inode *, but the hybrid union filesystem needs to be able to access the filesystem-specific data (d_fsdata) stored in the dentry, so it was changed to take a struct dentry * instead. David P. Quigley questioned the need for the change, noting that unionfs and aufs did not require it. Aurora pointed out that union mounts would require something similar and that, along with Brown's documentation, seemed to put the matter to rest.

The rest of the patches make minor changes. The first adds a new struct file_operations member called open_other() that is used to forward open() calls to the upper or lower layers as appropriate. Another allows filesystems to set a FS_RENAME_SELF_ALLOW flag so that rename() will still process renames on the identical dummy inodes that the filesystem uses for non-directories. The bulk of the code (modulo the permission() change) is the new fs/union filesystem itself.

While "union" tends to be used for these kinds of filesystems (or mounts), Brown noted that it is confusing and not really accurate, suggesting that "overlay" be used in its place. Szeredi is not opposed to that, saying that "overlayfs" might make more sense. Aurora more or less concurred, saying that union mounts were called "writable overlays" for one release. The confusion stemming from multiple uses of "union" in existing patches (unionfs, union mounts) may provide additional reason to rename the hybrid union filesystem to overlayfs.

The readdir() semantics are a bit different for the hybrid union as well. Changes to merged directories while they are being read will not appear in the entries returned by readdir(), and offsets returned from telldir() may not return to the same location in a merged directory on subsequent directory opens. The lists of directory entries in merged directories are created and cached on the first readdir() call, with offsets assigned sequentially as they are read. For the most part, these changes are "unlikely to be noticed by many programs", as Brown's documentation says.
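
The snapshot-on-first-read behavior can be modeled like this; MergedDir is a toy class invented for the illustration, not part of the patchset:

```python
# Toy model of the semantics: the merged entry list is snapshotted on the
# first readdir() and offsets are sequential positions in that cached list.
class MergedDir:
    def __init__(self, entries):
        self.live = entries          # the "real" merged directory
        self.snapshot = None
        self.pos = 0                 # telldir()-style offset

    def readdir(self):
        if self.snapshot is None:    # cache is built on the first call
            self.snapshot = list(self.live)
        if self.pos >= len(self.snapshot):
            return None
        entry = self.snapshot[self.pos]
        self.pos += 1
        return entry
```

Changes made after the first readdir() do not appear through the same open, just as the documentation describes.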

A bigger issue is one that all union implementations struggle with: how to handle changes to either layer that are done outside of the context of the union. If users or administrators directly change the underlying filesystems, there are a number of ugly corner cases. Making the lower filesystem be read-only is an attractive solution, but it is non-trivial to enforce, especially for filesystems like NFS.

Szeredi would like to define the problem away or find some way to enforce the requirements that unioning imposes:

The easiest way out of this mess might simply be to enforce exclusive modification to the underlying filesystems on a local level, same as the union mount strategy. For NFS and other remote filesystems we either

a) add some way to enforce it,
b) live with the consequences if not enforced on the system level, or
c) disallow them to be part of the union.

There was some discussion of the problem, without much in the way of conclusions other than a requirement that changing the trees out from under the union filesystem not cause deadlocks or panics.

In some ways, the hybrid union seems a simpler approach than union mounts. Whether it can pass muster with Al Viro and the other filesystem maintainers remains to be seen, however. One way or another, though, some kind of solution to the lack of an overlay/union filesystem in the mainline seems to be getting closer.

Comments (6 posted)

A look inside the OCFS2 filesystem

September 1, 2010

This article was contributed by Goldwyn Rodrigues

The Oracle Cluster Filesystem (ocfs2) is a filesystem for clustered systems accessing a shared device, such as a Storage Area Network (SAN). It enables all nodes on the cluster to see the same files; changes to any files are visible immediately on other nodes. It is the filesystem's responsibility to ensure that nodes do not corrupt data by writing into each other's files. To guarantee integrity, ocfs2 uses the Linux Distributed Lock Manager (DLM) to serialize events. However, a major goal of a clustered filesystem is to reduce cluster locking overhead in order to improve overall performance. This article will provide an overview of ocfs2 and how it is structured internally.

A brief history

Version 1 of the ocfs filesystem was an early effort by Oracle to create a filesystem for clustered environments. It was a basic filesystem targeted at supporting Oracle database storage and lacked most POSIX features due to its limited disk format. Ocfs2 was a development effort to convert this basic filesystem into a general-purpose filesystem. The ocfs2 source code was merged into the Linux kernel with 2.6.16; since that merger, a lot of features have been added to the filesystem which improve data storage efficiency and access times.


Cluster management

Ocfs2 needs a cluster management system to handle cluster operations such as node membership and fencing. All nodes must have the same configuration, so that each node knows about the other nodes in the cluster. There are currently two ways to handle cluster management for ocfs2:

  • Ocfs2 Cluster Base (O2CB) - the in-kernel implementation of cluster configuration management; it provides only the basic services needed to get a clustered filesystem running. Each node writes to a heartbeat file to let the others know that it is alive. More information on running the ocfs2 filesystem using o2cb can be found in the ocfs2 user guide [PDF]. This mode cannot remove nodes from a live cluster and cannot be used for cluster-wide POSIX locks.

  • Linux High Availability - uses user-space tools, such as heartbeat and pacemaker, to perform cluster management. These packages are complete cluster management suites which can be used for advanced activities such as fail-over, STONITH (Shoot The Other Node In The Head - yes, computing terms can be barbaric), and migration of dependent services. This approach can also remove nodes from a live cluster, and it supports cluster-wide POSIX locks, as opposed to node-local locks. More information can be found at the respective projects' sites.

Disk format

ocfs2 separates the way data and metadata are stored on disk. To facilitate this, it has two types of blocks:
  • Metadata blocks, or simply "blocks" - the smallest addressable units. These blocks contain the metadata of the filesystem, such as the inodes, extent blocks, group descriptors, etc. Valid sizes range from 512 bytes to 4KB (in powers of two). Each metadata block contains a signature saying what the block contains; this signature is cross-checked when reading that specific data type.

  • Data Clusters - data storage for regular files. Valid cluster sizes range from 4KB to 1MB (in powers of two). A larger data cluster reduces the size of filesystem metadata such as allocation bitmaps, making filesystem activities such as data allocation or filesystem checks faster. On the other hand, a large cluster size increases internal fragmentation. A large cluster size is recommended for filesystems storing large files such as virtual machine images, while a small data cluster size is recommended for a filesystem which holds lots of small files, such as a mail directory.
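
To put rough numbers on that tradeoff, assume one allocation-bitmap bit per cluster; this is an illustration of the principle, not the exact ocfs2 bitmap layout:

```python
# One bitmap bit per cluster (an illustration, not the exact ocfs2 layout):
def bitmap_bytes(device_bytes, cluster_bytes):
    return (device_bytes // cluster_bytes) // 8

TB = 1 << 40
small = bitmap_bytes(TB, 4 * 1024)   # 4KB clusters on 1TB: 32MB of bitmap
large = bitmap_bytes(TB, 1 << 20)    # 1MB clusters on 1TB: 128KB of bitmap
# The price of large clusters is internal fragmentation: a 100-byte file
# still consumes an entire cluster.
```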


An inode occupies an entire block. The block number (with respect to the filesystem block size) doubles as the inode number. This organization may result in high disk space usage for a filesystem with a lot of small files. To minimize that problem, ocfs2 packs the data files into the inode itself if they are small enough to fit. This feature is known as "inline data." Inode numbers are 64 bits, which gives enough room for inode numbers to be addressed on large storage devices.

[Figure: ocfs2 inode layout]

Data in a regular file is maintained in a B-tree of extents; the root of this B-tree is the inode. The inode holds a list of extent records which may either point to data extents, or point to extent blocks (the intermediate nodes in the tree). A special field called l_tree_depth contains the depth of the tree; a value of zero indicates that the blocks pointed to by the extent records are data extents. Each extent record contains the offset from the start of the file, in clusters, which determines the path to take while traversing the B-tree to find the block to be accessed.
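
The traversal might be sketched as follows; the structures and find_extent are simplifications invented for the example:

```python
# Simplified extent tree: a node is (depth, records), where records is a
# list of (start_cluster, child) pairs sorted by start. At depth zero the
# "child" is a data extent; otherwise it is a lower extent block.
def find_extent(node, cluster_offset):
    depth, records = node
    chosen = None
    for start, child in records:     # last record starting at or before offset
        if start <= cluster_offset:
            chosen = child
        else:
            break
    if depth == 0:
        return chosen                # a data extent
    return find_extent(chosen, cluster_offset)
```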

The basic unit of locking is the inode. Locks are granted on a special DLM data structure known as a lock resource. For any access to the file, the process must request a DLM lock on the lock resource over the cluster. DLM offers six lock modes to differentiate between the type of operation. Out of these, ocfs2 uses only three: exclusive, protected read, and null locks. The inode maintains three types of lock resources for different operations:

  • read-write lock resource: serializes writes when multiple nodes perform I/O on the same file at the same time.

  • inode lock resource: used for metadata inode operations.

  • open lock resource: used to detect deletes of a file. When a file is open, this lock resource is held in protected-read mode. A node which intends to delete the file requests an exclusive lock; if that succeeds, no other node is using the file and it can be safely deleted. If it fails, the inode is treated as an orphan file (discussed later).
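
The open-lock protocol for deletes can be modeled crudely as follows; OpenLock is a toy lock manager invented for the illustration, not the DLM API:

```python
# Toy model: every open holds the resource in protected-read (PR); a delete
# succeeds only if the exclusive upgrade would face no other holders.
class OpenLock:
    def __init__(self):
        self.holders = {}            # node name -> lock mode

    def open_file(self, node):
        self.holders[node] = "PR"    # protected read while the file is open

    def close_file(self, node):
        self.holders.pop(node, None)

    def try_delete(self, node):
        if any(n != node for n in self.holders):
            return "orphan"          # exclusive lock denied: file in use
        return "deleted"             # exclusive lock granted: safe to remove
```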


Directory entries are stored as name-inode pairs in blocks known as directory blocks. Directory blocks are stored and accessed in the same way as regular file data; however, they are allocated as data clusters. Since a directory block is considered to be a metadata block, the first allocation uses only part of the cluster; as the directory expands, the remaining unused blocks of the cluster are filled until it is fully used.

[Figure: ocfs2 directory layout on disk]

A relatively new feature is the indexing of directory entries for faster retrieval and improved lookup times. Ocfs2 maintains a separate index tree based on a hash of the directory entry names; the hash index points to the directory block where the entry can be found. Once the directory block is read, it is searched linearly for the entry.
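
In outline, the indexed lookup works like this; the hash function and structures here are stand-ins, not the ones ocfs2 uses:

```python
# Simplified model: the index maps a hash bucket to the directory blocks
# that may hold the entry; each candidate block is then scanned linearly.
def dir_lookup(name, nbuckets, index, blocks):
    for block_no in index.get(hash(name) % nbuckets, []):
        for entry_name, inode in blocks[block_no]:
            if entry_name == name:   # linear scan inside the block
                return inode
    return None
```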

A special directory trailer is placed at the end of each directory block; it contains additional data about that block. It keeps track of the free space in the block, for faster free-space lookup when inserting directory entries. The trailer also contains the checksum for the directory block, which is used by the metaecc feature (discussed later).

Filesystem Metadata

A special system directory (//) contains all the metadata files for the filesystem; this directory is not accessible from a regular mount. (The // notation is used only by the debugfs.ocfs2 tool.) Files in the system directory, known as system files, differ from regular files both in how they store information and in what they store.

An example of a system file is the slotmap, which defines the mapping of a node in the cluster. A node joins a cluster by providing its unique DLM name. The slot map provides it with a slot number, and the node inherits all system files associated with the slot number. The slot number assignment is not persistent across boots, so a node may inherit the system files of another node. All node-associated files are suffixed by the slot number to differentiate the files of different slots.

A global bitmap file in the system directory keeps a record of the allocated blocks on the device. Each node also maintains a "local allocations" system file, which manages chunks of blocks obtained from the global bitmap. Maintaining local allocations decreases contention over global allocations.

The allocation units are divided into the following:

  • inode_alloc: allocates inodes for the local node.

  • extent_alloc: allocates extent blocks for the local node. Extent blocks are the intermediate (non-leaf) nodes in the B-tree storage of files.

  • local_alloc: allocates data in data cluster sizes for the use of regular file data.

[Figure: allocator layout]

Each allocator is associated with an inode; it maintains allocations in units known as "block groups." Each block group is preceded by a group descriptor which contains details about the group, such as free units, allocation bitmaps, etc. The allocator inode contains a chain of group descriptor block pointers. If this chain is exhausted, further group descriptors are linked onto the existing ones; think of the result as an array of linked lists. A new group descriptor is added to the smallest chain, so that the number of hops required to reach any allocation unit stays small.
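
The insert-into-smallest-chain rule can be sketched as follows; plain lists stand in for the linked lists of group descriptors, and add_group is invented for the illustration:

```python
# A new group descriptor becomes the head of the currently shortest chain,
# keeping the hop count to any group descriptor small.
def add_group(chains, descriptor):
    shortest = min(chains, key=len)  # ties resolve to the first chain
    shortest.insert(0, descriptor)   # link the new group at the chain head
    return chains
```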

Things get complicated when allocated data blocks are freed because those blocks could belong to the allocation map of another node. To resolve this problem, a "truncate log" maintains the blocks which have been freed locally, but not yet returned to the global bitmap. Once the node gets a lock on the global bitmap, the blocks in the local truncate log are freed.

A file is not physically deleted until all processes accessing it have closed it. Filesystems such as ext3 maintain an orphan list containing files which have been unlinked but are still in use by the system. Ocfs2 also maintains such a list to handle orphan inodes, but things are a bit more complex because a node must check that a file to be deleted is not in use anywhere in the cluster. This check is coordinated using the inode lock resource: whenever the last link to a file is removed, the node checks whether another node is using the file by requesting an exclusive lock on the inode lock resource. If the file is in use, it is moved to the orphan directory and marked with an OCFS2_ORPHANED_FL flag. The orphan directory is later scanned for files no longer being accessed by any node, so that they can be physically removed from the storage device.

Ocfs2 maintains a journal to deal with unexpected crashes. It uses the Linux JBD2 layer for journaling. The journal files are maintained, per node, for all I/O performed locally. If a node dies, it is the responsibility of the other nodes in the cluster to replay the dead node's journal before proceeding with any operations.

Additional Features

Ocfs2 has a couple of other distinctive features that it can boast about. They include:

  • Reflink is a feature supporting snapshotting of files using copy-on-write (COW). Currently, a system call interface, to be called reflink() or copyfile(), is being discussed upstream. Until the system call is finalized, users can access this feature via the reflink tool, which uses an ioctl() call to perform the snapshotting.

  • Metaecc is an error-correcting feature for metadata using a cyclic redundancy check (CRC32). The code warns if the calculated error-correcting code differs from the one stored, and remounts the filesystem read-only to avoid further corruption. It is also capable of correcting single-bit errors on the fly. A special data structure, ocfs2_block_check, is embedded in most metadata structures to hold the CRC32 values associated with the structure.

Ocfs2 developers continue to add features to keep it up to par with other new filesystems. Some features to expect in the near future are delayed allocation, online filesystem checks, and defragmentation. Since one of the main goals of ocfs2 is to support a database, file I/O performance is considered a priority, making it one of the best filesystems for the clustered environment.

[Thanks to Mark Fasheh for reviewing this article.]

Comments (7 posted)


Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds