The current development kernel is 2.6.36-rc3, which was released on August 29. "Nothing in particular stands out that I can recall. As usual, it's mostly driver updates (65%), of which a large piece (by line count) is just the removal of a staging driver that isn't really ready nor making any progress. But on the 'somewhat more likely to cause excitement' front, there's some radeon/nouveau drm updates too." See the full changelog for all the details.
Stable updates: The latest 2.6.27, 2.6.32, 2.6.34, and 2.6.35 stable kernels were released on August 27. As usual, they contain fixes throughout the kernel tree. Note that this is the last stable update for the 2.6.34 kernel series.
Kernel development news
Back in March 2005, the community was discussing ways of getting important fixes out to users of mainline releases. There was talk of maintaining a separate tree containing nothing but fixes; Linus, at the time, thought that any such attempt was doomed to failure.
With such strong words of encouragement, somebody clearly had to step up to the job; that somebody turned out to be Greg Kroah-Hartman and Chris Wright. They released 2.6.11.1 on March 4, 2005 with all of three fixes. More than five years later, Greg (a.k.a. "Og") is still at it (Chris has not been active with stable updates for a while). During that time, the stable release history has looked like this:
    Kernel   Updates   Total changes   Per release
    2.6.11        12              79             7
    2.6.12         6              53             9
    2.6.13         5              44             9
    2.6.14         7              96            14
    2.6.15         7             110            16
    2.6.16        62            1053            17
    2.6.17        14             191            14
    2.6.18         8             240            30
    2.6.19         7             189            27
    2.6.20        21             447            21
    2.6.21         7             162            23
    2.6.22        19             379            20
    2.6.23        16             302            19
    2.6.24         7             246            35
    2.6.25        20             492            25
    2.6.26         8             321            40
    2.6.27        53            1553            29
    2.6.28        10             613            61
    2.6.29         6             383            64
    2.6.30        10             419            42
    2.6.31        14             826            59
    2.6.32        21            1793            85
    2.6.33         7             883           126
    2.6.34         5             599           120
    2.6.35         4             228            57
In the table above, the kernels still receiving stable updates as of this writing are 2.6.27, 2.6.32, 2.6.34, and 2.6.35 (though 2.6.27 is clearly reaching the end of the line).
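The "per release" column is simply the total number of changes divided by the number of updates, rounded to the nearest integer; a quick Python spot-check of a few rows from the table:

```python
# Spot-check the "per release" column: total changes / number of updates.
data = {
    "2.6.16": (62, 1053),   # (updates, total changes)
    "2.6.27": (53, 1553),
    "2.6.32": (21, 1793),
    "2.6.33": (7, 883),
}

for kernel, (updates, changes) in data.items():
    print(f"{kernel}: {round(changes / updates)} changes per update")
```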
A couple of conclusions immediately jump out of the table above. The first is that the number of fixes going into stable updates has clearly increased over time. From this one might conclude that our kernel releases have steadily been getting buggier. That is hard to measure, but one should bear in mind that there is another important factor at work here: the kernel developers are simply directing more fixes toward the stable tree. Far more developers are looking at patches with stable updates in mind, and suggestions that a patch should be sent in that direction are quite common. So far fewer patches fall through the cracks than they did in the early days.
There is another factor at work here as well. The initial definition of what constituted an appropriate stable tree patch was severely restrictive; if a bug did not cause a demonstrable oops or vulnerability, the fix was not considered for the stable tree. By the time we get to a recent 2.6.32 update, though, we see a wider variety of "fixes," including Acer rv620 laptop support, typo fixes, tracepoint improvements to make powertop work better, the optimistic spinning mutex scalability work, a new emu10k1 sound driver module parameter, and oprofile support for a new Intel processor. These enhancements are, arguably, all things that stable kernel users would like to have. But they definitely go beyond the original charter for this tree.
Your editor has also, recently, seen an occasional complaint about regressions finding their way into stable updates; given the volume of patches going into stable updates now, a regression every now and then should not be surprising. Regressions in the stable tree are a worrisome prospect; one can only hope that the problem does not get worse.
Another noteworthy fact is that the number of stable updates for most kernels appears to be falling slowly; the five updates in the entire 2.6.34 history are the fewest ever, a low matched only by the 2.6.13 series. Even then, 2.6.34 got one more update than had been originally planned as the result of a security issue. It should seem obvious that handling this kind of patch flow for as many as four kernels simultaneously is a lot of work; Greg, who has a few other things on his plate as well, may be running a little short on time.
Who is actually contributing patches to stable kernels? Your editor decided to do a bit of git data mining. Two kernels were chosen: 2.6.32, which is being maintained for an extended period as the result of its use in "enterprise" distributions, and 2.6.34, being the most recent kernel which has seen its final stable update. Here are the top contributors for both:
Most active stable contributors
    2.6.32
    Greg Kroah-Hartman            36   2.0%
    Daniel T Chen                 32   1.8%
    Linus Torvalds                23   1.3%
    Trond Myklebust               23   1.3%
    Borislav Petkov               23   1.3%
    Ben Hutchings                 21   1.2%
    David S. Miller               20   1.1%
    Theodore Ts'o                 20   1.1%
    Tejun Heo                     20   1.1%
    Dmitry Monakhov               20   1.1%
    Takashi Iwai                  18   1.0%
    Ian Campbell                  18   1.0%
    Jean Delvare                  17   0.9%
    Henrique de Moraes Holschuh   17   0.9%
    Yan, Zheng                    17   0.9%
    Zhao Yakui                    17   0.9%
    Alan Stern                    17   0.9%
    Al Viro                       16   0.9%
    Alex Deucher                  15   0.8%
    Dan Carpenter                 15   0.8%
    2.6.34
    Alex Deucher                  14   2.8%
    Joerg Roedel                  14   2.8%
    Tejun Heo                     10   2.0%
    Daniel T Chen                  9   1.8%
    Neil Brown                     8   1.6%
    Rafael J. Wysocki              8   1.6%
    Linus Torvalds                 7   1.4%
    Greg Kroah-Hartman             7   1.4%
    Alan Stern                     7   1.4%
    Jesse Barnes                   7   1.4%
    Trond Myklebust                7   1.4%
    Ben Hutchings                  7   1.4%
    Tilman Schmidt                 7   1.4%
    Avi Kivity                     7   1.4%
    Sarah Sharp                    7   1.4%
    Ian Campbell                   6   1.2%
    Johannes Berg                  6   1.2%
    Jean Delvare                   6   1.2%
    Johan Hovold                   6   1.2%
    Will Deacon                    5   1.0%
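Rankings like those above can be reproduced with a bit of scripting over a stable tree's git history. A minimal sketch follows; the sample log text is stand-in data, and against a real tree one would instead feed in the output of something like `git log --no-merges --format=%an vX..vY` (the exact range being whatever stable series is under study):

```python
from collections import Counter

# Count patch authors from "git log --no-merges --format=%an" style output.
# The sample below is stand-in data, not real stable-tree history.
sample_log = """\
Greg Kroah-Hartman
Alex Deucher
Greg Kroah-Hartman
Tejun Heo
Alex Deucher
Alex Deucher
"""

counts = Counter(sample_log.splitlines())
total = sum(counts.values())
for author, n in counts.most_common(3):
    print(f"{author:22} {n:3} {100 * n / total:5.1f}%")
```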
Some names on this list will be familiar. Linus never shows up on the list of top mainline contributors anymore, but he does generate a fair number of stable fixes. Other names are seen less often in the kernel context: Daniel Chen, for example, is an Ubuntu community contributor; his contributions are mostly in the welcome area of making audio devices actually work. Some of the people are in the list above because they introduced the bugs that their patches fix - appearing in that role is not necessarily an honor. But - admittedly without having done any sort of rigorous study - your editor suspects that most of the people listed above are fixing bugs introduced by others. They are performing an important and underappreciated service, turning mainline releases into kernels that the rest of the world actually wants to run.
We can also look at who is supporting this work:
Most active stable contributors by employer
    2.6.32
    (None)              275  15.3%
    Red Hat             267  14.9%
    Intel               194  10.8%
    (Unknown)           175   9.8%
    Novell              166   9.3%
    IBM                  95   5.3%
    AMD                  60   3.3%
    Oracle               40   2.2%
    Fujitsu              33   1.8%
    Atheros              30   1.7%
    Parallels            29   1.6%
    Citrix               27   1.5%
    (Academia)           26   1.5%
    Linux Foundation     24   1.3%
    NetApp               23   1.3%
                         23   1.3%
    (Consultant)         20   1.1%
    NEC                  18   1.0%
    Canonical            15   0.8%
    Nokia                14   0.8%
    2.6.34
    (None)               95  18.7%
    Red Hat              61  12.0%
    (Unknown)            58  11.4%
    Novell               45   8.9%
    Intel                43   8.5%
    AMD                  35   6.9%
    IBM                  17   3.3%
    (Academia)           16   3.1%
    MontaVista            9   1.8%
    Fujitsu               9   1.8%
    ARM                   8   1.6%
    Citrix                8   1.6%
    NetApp                7   1.4%
    Oracle                7   1.4%
    (Consultant)          7   1.4%
    Linux Foundation      7   1.4%
                          6   1.2%
    Nokia                 6   1.2%
    NTT                   5   1.0%
    VMWare                5   1.0%
These numbers quite closely match those for mainline kernel contributions, especially at the upper end. Fixing bugs is said to be boring and unglamorous work, but volunteers are still our leading source of fixes.
We did without a stable tree for the first ten 2.6.x releases; at this point, it is hard to imagine how we managed. In an ideal world, a mainline kernel release would not happen until there were no bugs left; the history of (among others) the 2.3 and 2.5 kernel development cycles shows that this approach does not work in the real world. There comes a point where the community has to create a stable release and go on to the next cycle; the stable tree allows that fork to happen without ending the flow of fixes into the released kernel.
The tables above suggest that the stable kernel process is working well, with large numbers of fixes being directed into stable updates and with participation from across the community. There may come a point, though, where that community needs to revisit the standards for patches going into stable updates. At some point, it may also become clear that the job of maintaining these kernels is too big for one person to manage. For now, though, the stable tree is clearly doing what it is intended to do; Greg deserves a lot of credit for making it work so well for so long.
Creating a union of two (or more) filesystems is a commonly requested feature for Linux that has never made it into the mainline. Various implementations have been tried (part 1 and part 2 of Valerie Aurora's look from early 2009), but none has crossed the threshold for inclusion. Of late, union mounts have been making some progress, but there is still work to do there. A hybrid approach—incorporating both filesystem- and VFS-based techniques—has recently been posted in an RFC patchset by Miklos Szeredi.
The idea behind unioning filesystems is quite simple, but the devil is in the details. In a union, one filesystem is mounted "atop" another, with the contents of both filesystems appearing to be in a single filesystem encompassing both. Changes made to the filesystem are reflected in the "upper" filesystem, and the "lower" filesystem is treated as read-only. One common use case is to have a filesystem on read-only media (e.g. CD) but allow users to make changes by writing to the upper filesystem stored on read-write media (e.g. flash or disk).
There are a number of details that bedevil developers of unions, however, including various problems with namespace handling, dealing with deleted files and directories, the POSIX definition of readdir(), and so on. None of them are insurmountable, but they are difficult, and it is even harder to implement them in a way that doesn't run afoul of the technical complaints of the VFS maintainers.
Szeredi's approach blends the filesystem-based implementations, like unionfs and aufs, with the VFS-based implementation of union mounts. For file objects, an open() is forwarded to whichever of the two underlying filesystems contains it, while directories are handled by the union filesystem layer. Neil Brown's very helpful first cut at documentation for the patches lumped directory handling in with files, but Szeredi called that a bug. Directory access is never forwarded to the other filesystems and directories need to "come from the union itself for various reasons", he said.
As outlined in Brown's document, most of the action for unions takes place in directories. For one thing, it is more accurate to look at the feature as unioning directory trees, rather than filesystems, as there is no requirement that the two trees reside in separate filesystems. In theory, the lower tree could even be another union, but the current implementation precludes that.
The filesystem used by the upper tree needs to support the "trusted" extended attributes (xattrs) and it must also provide valid d_type (file type) for readdir() responses, which precludes NFS. Whiteouts—that is files that exist in the lower tree, but have been removed in the upper—are handled using the "trusted.union.whiteout" xattr. Similarly, opaque directories, which do not allow entries in the lower tree to "show through", are handled with the "trusted.union.opaque" xattr.
Directory entries are merged with fairly straightforward rules: if there are entries in both the upper and lower layers with the same name, the upper always takes precedence unless both are directories. In that case, a directory in the union is created that merges the entries from each. The initial mount creates a merged directory of the roots of the upper and lower directory trees and subsequent lookups follow the rules, creating merged directories that get cached in the union dentry as needed.
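The merge rules described above are easy to model. The sketch below is a toy model, not the kernel code: each layer is a mapping from entry name to type, a set of names stands in for the "trusted.union.whiteout" xattrs, and a boolean stands in for the "trusted.union.opaque" xattr on the upper directory:

```python
# Toy model of the directory-merge rules.  Each layer maps
# entry name -> "dir" or "file"; whiteouts stand in for the
# "trusted.union.whiteout" xattr, and an opaque upper directory
# (the "trusted.union.opaque" xattr) hides all lower entries.
def merge_dir(upper, lower, whiteouts=(), opaque=False):
    merged = dict(upper)
    if opaque:
        return merged                      # nothing shows through
    for name, kind in lower.items():
        if name in whiteouts:
            continue                       # deleted in the upper layer
        if name not in merged:
            merged[name] = kind
        elif merged[name] == "dir" and kind == "dir":
            merged[name] = "merged-dir"    # both are directories: union them
        # otherwise the upper entry simply takes precedence
    return merged

print(merge_dir(upper={"a": "file", "etc": "dir"},
                lower={"b": "file", "etc": "dir", "gone": "file"},
                whiteouts={"gone"}))
```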
Write access to lower layer files is handled by the traditional "copy up" approach. So, opening a lower file for write or changing its metadata will cause the file to be copied to the upper tree. That may require creating any intervening directories if the file is several levels down in a directory tree on the lower layer. Once that's done, though, the hybrid union filesystem has little further interaction with the file, at least directly, because operations are handed off to the upper filesystem.
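The copy-up step can be sketched in user-space terms; this is an illustration of the idea (copy the file, creating intervening directories first), not the kernel implementation:

```python
import os, shutil, tempfile

# Sketch of "copy up": before a lower-layer file is modified, it is
# copied into the upper tree, creating intervening directories as needed.
def copy_up(lower_root, upper_root, relpath):
    src = os.path.join(lower_root, relpath)
    dst = os.path.join(upper_root, relpath)
    os.makedirs(os.path.dirname(dst), exist_ok=True)  # intervening dirs
    shutil.copy2(src, dst)                            # data + metadata
    return dst

lower = tempfile.mkdtemp()
upper = tempfile.mkdtemp()
os.makedirs(os.path.join(lower, "a/b"))
with open(os.path.join(lower, "a/b/f.txt"), "w") as f:
    f.write("hello")

copy_up(lower, upper, "a/b/f.txt")
print(open(os.path.join(upper, "a/b/f.txt")).read())
```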
The patchset is relatively small, and makes very few small changes to the VFS—except for a change to struct inode_operations that ripples through the filesystem tree. The permission() member of that structure currently takes a struct inode *, but the hybrid union filesystem needs to be able to access the filesystem-specific data (d_fsdata) that is stored in the dentry, so it was changed to take a struct dentry * instead. David P. Quigley questioned the need for the change, noting that unionfs and aufs did not require it. Aurora pointed out that union mounts would require something similar and that, along with Brown's documentation, seemed to put the matter to rest.
The rest of the patches make minor changes. The first adds a new struct file_operations member called open_other() that is used to forward open() calls to the upper or lower layers as appropriate. Another allows filesystems to set a FS_RENAME_SELF_ALLOW flag so that rename() will still process renames on the identical dummy inodes that the filesystem uses for non-directories. The bulk of the code (modulo the permission() change) is the new fs/union filesystem itself.
While "union" tends to be used for these kinds of filesystems (or mounts), Brown noted that it is confusing and not really accurate, suggesting that "overlay" be used in its place. Szeredi is not opposed to that, saying that "overlayfs" might make more sense. Aurora more or less concurred, saying that union mounts were called "writable overlays" for one release. The confusion stemming from multiple uses of "union" in existing patches (unionfs, union mounts) may provide additional reason to rename the hybrid union filesystem to overlayfs.
The readdir() semantics are a bit different for the hybrid union as well. Changes to merged directories while they are being read will not appear in the entries returned by readdir(), and offsets returned from telldir() may not return to the same location in a merged directory on subsequent directory opens. The lists of directory entries in merged directories are created and cached on the first readdir() call, with offsets assigned sequentially as they are read. For the most part, these changes are "unlikely to be noticed by many programs", as Brown's documentation says.
A bigger issue is one that all union implementations struggle with: how to handle changes to either layer that are done outside of the context of the union. If users or administrators directly change the underlying filesystems, there are a number of ugly corner cases. Making the lower filesystem be read-only is an attractive solution, but it is non-trivial to enforce, especially for filesystems like NFS.
Szeredi would like to define the problem away or find some way to enforce the requirements that unioning imposes:
a) add some way to enforce it,
b) live with the consequences if not enforced on the system level, or
c) disallow them to be part of the union.
There was some discussion of the problem, without much in the way of conclusions other than a requirement that changing the trees out from under the union filesystem not cause deadlocks or panics.
In some ways, the hybrid union seems a simpler approach than union mounts. Whether it can pass muster with Al Viro and the other filesystem maintainers remains to be seen, however. One way or another, though, some kind of solution to the lack of an overlay/union filesystem in the mainline seems to be getting closer.
The Oracle Cluster Filesystem (ocfs2) is a filesystem for clustered systems accessing a shared device, such as a Storage Area Network (SAN). It enables all nodes on the cluster to see the same files; changes to any files are visible immediately on other nodes. It is the filesystem's responsibility to ensure that nodes do not corrupt data by writing into each other's files. To guarantee integrity, ocfs2 uses the Linux Distributed Lock Manager (DLM) to serialize events. However, a major goal of a clustered filesystem is to reduce cluster locking overhead in order to improve overall performance. This article will provide an overview of ocfs2 and how it is structured internally.
Version 1 of the ocfs filesystem was an early effort by Oracle to create a filesystem for the clustered environment. It was a basic filesystem targeted at supporting Oracle database storage and lacked most POSIX features due to its limited disk format. Ocfs2 was a development effort to convert this basic filesystem into a general-purpose filesystem. The ocfs2 source code was merged into the Linux kernel for 2.6.16; since this merger (in early 2006), a lot of features have been added to the filesystem which improve data storage efficiency and access times.
Ocfs2 needs a cluster management system to handle cluster operations such as node membership and fencing. All nodes must have the same configuration, so each node knows about the other nodes in the cluster. There are currently two ways to handle cluster management for ocfs2: the native O2CB cluster stack, or a user-space cluster stack supported by way of the kernel's generic distributed lock manager.
Data in a regular file is maintained in a B-tree of extents; the root of this B-tree is the inode. The inode holds a list of extent records which may either point to data extents, or point to extent blocks (which are the intermediate nodes in the tree). A special field called l_tree_depth contains the depth of the tree. A value of zero indicates that the blocks pointed to by extent records are data extents. The extent records contain the offset from the start of the file in terms of cluster blocks, which helps in determining the path to take while traversing the B-tree to find the block to be accessed.
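The traversal described above can be sketched in a few lines. The structures below are illustrative stand-ins for the on-disk format: each node carries its l_tree_depth and a list of extent records sorted by starting cluster offset (cpos), and the lookup descends into the last record starting at or before the requested offset:

```python
# Toy lookup in an ocfs2-style extent B-tree.  Each record carries the
# file offset (in clusters) where it starts; at tree depth 0 the records
# point at data extents, otherwise at intermediate extent blocks.
def find_extent(node, offset):
    if node["l_tree_depth"] == 0:
        # Leaf level: records point directly at data extents.
        for rec in reversed(node["records"]):
            if rec["cpos"] <= offset:
                return rec["blkno"]
        raise KeyError("offset not mapped")
    # Interior level: descend into the last child starting at or
    # before the requested offset.
    for rec in reversed(node["records"]):
        if rec["cpos"] <= offset:
            return find_extent(rec["child"], offset)
    raise KeyError("offset not mapped")

leaf1 = {"l_tree_depth": 0,
         "records": [{"cpos": 0, "blkno": 100}, {"cpos": 8, "blkno": 200}]}
leaf2 = {"l_tree_depth": 0,
         "records": [{"cpos": 16, "blkno": 300}]}
root = {"l_tree_depth": 1,
        "records": [{"cpos": 0, "child": leaf1}, {"cpos": 16, "child": leaf2}]}

print(find_extent(root, 9))   # offset 9 falls in the extent starting at cpos 8
```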
The basic unit of locking is the inode. Locks are granted on a special DLM data structure known as a lock resource. For any access to the file, the process must request a DLM lock on the lock resource over the cluster. The DLM offers six lock modes to differentiate between the types of operation; of these, ocfs2 uses only three: exclusive, protected read, and null locks. The inode maintains three types of lock resources for different operations: a read/write lock resource to serialize file data operations, an inode lock resource to protect the inode's metadata, and an open lock resource used to track, cluster-wide, whether the inode is still in use.
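The compatibility rules for the three modes ocfs2 uses follow the usual DLM conventions: a null (NL) lock is compatible with everything, protected read (PR) locks are compatible with other readers, and an exclusive (EX) lock is compatible only with NL. A small sketch of a grant check under those rules:

```python
# Compatibility of the three DLM modes ocfs2 uses: exclusive (EX),
# protected read (PR), and null (NL).  NL is compatible with everything,
# PR with other readers, EX with nothing but NL.
COMPATIBLE = {
    ("EX", "EX"): False, ("EX", "PR"): False, ("EX", "NL"): True,
    ("PR", "EX"): False, ("PR", "PR"): True,  ("PR", "NL"): True,
    ("NL", "EX"): True,  ("NL", "PR"): True,  ("NL", "NL"): True,
}

def can_grant(requested, held):
    """A mode can be granted only if it is compatible with all held modes."""
    return all(COMPATIBLE[(requested, h)] for h in held)

print(can_grant("PR", ["PR", "NL"]))   # concurrent readers are fine
print(can_grant("EX", ["PR"]))         # a writer must wait for readers
```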
A relatively new feature is the indexing of directory entries for faster lookups. Ocfs2 maintains a separate indexed tree based on a hash of the directory entry names; the hash index points to the directory block where the directory entry can be found. Once the directory block is read, the directory entry is searched for linearly.
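The two-step lookup (hash the name to find candidate blocks, then scan the block linearly) can be sketched as follows. The hash function and layout here are stand-ins, not the on-disk ocfs2 format:

```python
import zlib
from collections import defaultdict

blocks = {           # block number -> list of (name, inode) entries
    1: [("README", 17), ("src", 18)],
    2: [("Makefile", 19)],
}

def bucket(name, nbuckets=8):
    # Stand-in hash; the real filesystem uses its own hash of the name.
    return zlib.crc32(name.encode()) % nbuckets

# Build the index: hash bucket -> set of blocks holding matching entries.
index = defaultdict(set)
for blkno, entries in blocks.items():
    for name, _ in entries:
        index[bucket(name)].add(blkno)

def lookup(name):
    for blkno in index[bucket(name)]:
        for entry_name, inode in blocks[blkno]:   # linear scan in the block
            if entry_name == name:
                return inode
    raise FileNotFoundError(name)

print(lookup("Makefile"))
```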
A special directory trailer is placed at the end of a directory block which contains additional data about that block. It keeps track of the free space in the directory block for faster free-space lookup during directory entry insert operations. The trailer also contains the checksum for the directory block, which is used by the metaecc feature (discussed later).
A special system directory (//) contains all the metadata files for the filesystem; this directory is not accessible from a regular mount. Note that the // notation is used only by the debugfs.ocfs2 tool. Files in the system directory, known as system files, are different from regular files, both in terms of how they store information and of what they store.
An example of a system file is the slot map, which maps each node in the cluster to a slot. A node joins the cluster by providing its unique DLM name; the slot map gives it a slot number, and the node inherits all of the system files associated with that slot number. The slot number assignment is not persistent across boots, so a node may inherit the system files previously used by another node. All node-associated files are suffixed with the slot number to distinguish the files of different slots.
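The slot-assignment behavior can be modeled with a small, hypothetical SlotMap class; note how a node joining after a restart can land in a slot (and thereby inherit the system files) previously used by another node:

```python
# Toy model of slot-map assignment: a joining node takes the first free
# slot; slots are not persistent, so a later joiner may reuse a slot
# (and inherit the per-slot system files) of a node that has left.
class SlotMap:
    def __init__(self, nslots):
        self.slots = [None] * nslots

    def join(self, node_name):
        for i, owner in enumerate(self.slots):
            if owner is None:
                self.slots[i] = node_name
                return i
        raise RuntimeError("cluster full")

    def leave(self, node_name):
        self.slots[self.slots.index(node_name)] = None

smap = SlotMap(nslots=2)
print(smap.join("nodeA"))   # gets slot 0
print(smap.join("nodeB"))   # gets slot 1
smap.leave("nodeA")
print(smap.join("nodeC"))   # reuses slot 0, inheriting its system files
```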
A global bitmap file in the system directory keeps a record of the allocated blocks on the device. Each node also maintains a "local allocations" system file, which manages chunks of blocks obtained from the global bitmap. Maintaining local allocations decreases contention over global allocations.
The allocation units are divided into the following: inode allocators, from which new inodes are allocated; extent allocators, which provide blocks for filesystem metadata; and data allocations, which are served from the global bitmap (usually by way of a node's local allocations file).
Each allocator is associated with an inode; it maintains allocations in units known as "block groups." Each block group is preceded by a group descriptor which contains details about the group, such as free units, allocation bitmaps, etc. The allocator inode contains an array of group descriptor block pointers; when that array is exhausted, new group descriptors are linked onto the existing ones, so the structure is best thought of as an array of linked lists. A new group descriptor is added to the smallest chain, so that the number of hops required to reach any allocation unit stays small.
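The "array of linked lists" arrangement can be sketched as follows; the class is illustrative only, but it shows why adding each new group to the shortest chain keeps lookup paths short:

```python
# Sketch of chained group descriptors: the allocator inode holds an
# array of chains, and each new block group is linked onto the head of
# the shortest chain so that no chain grows much longer than the others.
class Allocator:
    def __init__(self, nchains):
        self.chains = [[] for _ in range(nchains)]  # lists of group descriptors

    def add_group(self, descriptor):
        shortest = min(self.chains, key=len)
        shortest.insert(0, descriptor)              # link at the head

alloc = Allocator(nchains=3)
for group in ["g1", "g2", "g3", "g4"]:
    alloc.add_group(group)

print([len(c) for c in alloc.chains])
```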
Things get complicated when allocated data blocks are freed because those blocks could belong to the allocation map of another node. To resolve this problem, a "truncate log" maintains the blocks which have been freed locally, but not yet returned to the global bitmap. Once the node gets a lock on the global bitmap, the blocks in the local truncate log are freed.
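The deferred-free idea behind the truncate log can be modeled simply; the classes below are stand-ins for the real structures, but they capture the batching: frees are parked locally, then returned to the global bitmap in one pass once the node holds the global-bitmap lock:

```python
# Sketch of the truncate log: blocks freed locally are parked until the
# node takes the lock on the global bitmap, then returned in one batch.
class Node:
    def __init__(self, global_bitmap):
        self.global_bitmap = global_bitmap   # set of allocated block numbers
        self.truncate_log = []

    def free_block(self, blkno):
        self.truncate_log.append(blkno)      # deferred: no global lock needed

    def flush_truncate_log(self):
        # Called once the node holds the lock on the global bitmap.
        for blkno in self.truncate_log:
            self.global_bitmap.discard(blkno)
        self.truncate_log.clear()

bitmap = {10, 11, 12, 13}
node = Node(bitmap)
node.free_block(11)
node.free_block(13)
node.flush_truncate_log()
print(sorted(bitmap))
```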
A file is not physically deleted until all processes accessing the file close it. Filesystems such as ext3 maintain an orphan list containing files which have been unlinked but are still in use by the system. Ocfs2 also maintains such a list to handle orphan inodes. Things are a bit more complex, however, because a node must check that a file to be deleted is not being used anywhere in the cluster. This check is coordinated using the inode lock resource. Whenever a file is unlinked, and the removed link happens to be the last link to the file, a check is made to determine whether another node is using the file by requesting an exclusive lock on the inode lock resource. If the file is being used, it will be moved to the orphan directory and marked with an OCFS2_ORPHANED_FL flag. The orphan directory is later scanned for files no longer being accessed by any of the nodes so that they can be physically removed from the storage device.
Ocfs2 maintains a journal to deal with unexpected crashes. It uses the Linux JBD2 layer for journaling. The journal files are maintained, per node, for all I/O performed locally. If a node dies, it is the responsibility of the other nodes in the cluster to replay the dead node's journal before proceeding with any operations.
Ocfs2 has a couple of other distinctive features that it can boast about, including metadata checksumming (the metaecc feature mentioned above) and reflink(), which creates copy-on-write clones of files.
Ocfs2 developers continue to add features to keep it up to par with other new filesystems. Some features to expect in the near future are delayed allocation, online filesystem checks, and defragmentation. Since one of the main goals of ocfs2 is to support a database, file I/O performance is considered a priority, making it one of the best filesystems for the clustered environment.
[Thanks to Mark Fasheh for reviewing this article.]
Patches and updates
Core kernel code
Filesystems and block I/O
Benchmarks and bugs
Page editor: Jonathan Corbet
Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds