The current development kernel is 2.6.36-rc3, which was released on August 29. "Nothing in particular stands out that I can recall. As usual, it's
mostly driver updates (65%), of which a large piece (by line count) is just
the removal of a staging driver that isn't really ready nor making any
progress. But on the 'somewhat more likely to cause excitement' front,
there's some radeon/nouveau drm updates too." See the
full changelog for all the details.
Stable updates: The 2.6.27, 2.6.32, 2.6.34, and 2.6.35 stable kernel updates were released on
August 27. As usual, they contain fixes throughout the kernel tree.
Note that this is the last stable update for the 2.6.34 kernel series.
SIGEV_THREAD is the best proof that the whole posix timer interface
was comitte[e]d under the influence of not to be revealed
-- Thomas Gleixner
Reaching into the first bag, quite worn out with a faded "27" on
the outside of it, he grabbed one of the remaining kernels in there
and tossed it into the group. A few small people at the back of
the crowd caught the kernel, and slowly walked off toward the
village. Og wondered about these people, constantly relying on the
old kernels to save them for another day, while resisting the
change to move on, harboring some kind of strange reverence for
this specific brand that Og just could not understand.
-- Greg Kroah-Hartman
Kernel development news
Much attention goes toward mainline kernel releases, but relatively few
users are actually running those kernels. Instead, they run kernels
provided by their distributors, and those kernels, in turn, are based off
the stable kernel series. The practice of releasing stable kernels has
been going for well over five years now, so perhaps it's time to look back
at how it has been going.
Back in March 2005, the community was discussing ways of getting important
fixes out to users of mainline releases. There was talk of maintaining a
separate tree containing nothing but fixes; Linus, at the time, thought that any
such attempt was doomed to failure:
I'll tell you what the problem is: I don't think you'll find
anybody to do the parallel "only trivial patches" tree. They'll go
crazy in a couple of weeks. Why? Because it's a _damn_ hard
problem. Where do you draw the line? What's an acceptable patch?
And if you get it wrong, people will complain _very_ loudly, since
by now you've "promised" them a kernel that is better than the
mainline. In other words: there's almost zero glory, there are no
interesting problems, and there will absolutely be people who claim
that you're a dick-head and worse, probably on a weekly basis.
With such strong words of encouragement, somebody clearly had to step up to
the job; that somebody turned out to be Greg Kroah-Hartman and Chris
Wright. They released 2.6.11.1 on March 4, 2005
with all of three fixes. More than five years later, Greg (a.k.a. "Og") is still at it
(Chris has not been active with stable updates for a while). During that
time, the stable release history has looked like this:
In the table above, the kernels appearing in bold are those which
are still receiving stable updates as of this writing (though 2.6.27 is
clearly reaching the end of the line).
A couple of conclusions immediately jump out of the table above. The first
is that the number of fixes going into stable updates has clearly increased
over time. From this one might conclude that our kernel releases have
steadily been getting buggier. That is hard to measure, but one should
bear in mind that there is another important factor at work here: the
kernel developers are simply directing more fixes toward the stable tree.
Far more developers are looking at patches with stable updates in mind, and
suggestions that a patch should be sent in that direction are quite
common. So far fewer patches fall through the cracks than they did in the
early days of the stable tree.
There is another factor at work here as well. The initial definition of
what constituted an appropriate stable tree patch was severely restrictive;
if a bug did not cause a demonstrable oops or vulnerability, the fix was
not considered for the stable tree. By the time we get to recent stable updates, though, we see a wider
variety of "fixes," including Acer rv620 laptop support, typo fixes,
tracepoint improvements to make powertop work better, the optimistic spinning mutex
scalability work, a new emu10k1 sound driver module parameter, and
oprofile support for a new Intel processor. These enhancements are,
arguably, all things that stable kernel users would like to have. But they
definitely go beyond the original charter for this tree.
Your editor has
also, recently, seen an occasional complaint about regressions finding
their way into stable updates; given the volume of patches going into
stable updates now, a regression every now and then should not be
surprising. Regressions in the stable tree are a worrisome prospect; one
can only hope that the problem does not get worse.
Another noteworthy fact is that the number of stable updates for most
kernels appears to be falling slowly; the five updates for the entire
2.6.34 update history is the lowest ever, matched only by the 2.6.13
series. Even then, 2.6.34 got one more update than had been originally
planned as the result of a security issue. It should seem obvious that
handling this kind of patch flow for as many as four kernels simultaneously
will be a lot of work; Greg, who has a few other things on his plate as
well, may be running a little short on time.
Who is actually contributing patches to stable kernels? Your editor
decided to do a bit of git data mining. Two kernels were chosen: 2.6.32,
which is being maintained for an extended period as the result of its use
in "enterprise" distributions, and 2.6.34, being the most recent kernel
which has seen its final stable update. Here are the top contributors for
each:
|Most active stable contributors|
2.6.32:
|Daniel T Chen||32||1.8%|
|David S. Miller||20||1.1%|
|Henrique de Moraes Holschuh||17||0.9%|
2.6.34:
|Daniel T Chen||9||1.8%|
|Rafael J. Wysocki||8||1.6%|
Some names on this list will be familiar. Linus never shows up on the list
of top mainline contributors anymore, but he does generate a fair number of
stable fixes. Other names are seen less often in the kernel context:
Daniel Chen, for example, is an Ubuntu community contributor; his
contributions are mostly in the welcome area of making audio devices
actually work. Some of the people are in the list above because they
introduced the bugs that their patches fix - appearing in that role is not
necessarily an honor. But - admittedly without having done any sort of
rigorous study - your editor suspects that most of the people listed above
are fixing bugs introduced by others. They are performing an important and
underappreciated service, turning mainline releases into kernels that the
rest of the world actually wants to run.
We can also look at who is supporting this work:
|Most active stable contributors by employer|
These numbers quite closely match those for mainline kernel contributions,
especially at the upper end. Fixing bugs is said to be boring and
unglamorous work, but volunteers are still our leading source of fixes.
We did without a stable tree for the first ten 2.6.x releases, though, at
this point, it's hard to imagine just how. In an ideal world, a mainline
kernel release would not happen until there were no bugs left; the history
of (among others) the 2.3 and 2.5 kernel development cycles shows that this approach does
not work in the real world. There comes a point where the community has to
create a stable release and go on to the next cycle; the stable tree allows
that fork to happen without ending the flow of fixes into the released
kernel.
The tables above suggest that the stable kernel process is working well,
with large numbers of fixes being directed into stable updates and with
participation from across the community. There may come a point, though,
where that community needs to revisit the standards for patches going into
stable updates. At some point, it may also become clear that the job of
maintaining these kernels is too big for one person to manage. For now,
though, the stable tree is clearly doing what it is intended to do; Greg
deserves a lot of credit for making it work so well for so long.
Creating a union of two (or more) filesystems is a commonly requested
feature for Linux that has never made it into the mainline. Various implementations
have been tried (part 1 and
part 2 of Valerie Aurora's
look from early 2009), but none has crossed the threshold for inclusion.
Of late, union mounts have
been making some progress, but there is still work to do there. A hybrid
approach—incorporating both filesystem- and VFS-based
techniques—has recently been posted in an RFC patchset by Miklos Szeredi.
The idea behind unioning filesystems is quite simple, but the devil is in
the details. In a union, one filesystem is mounted "atop" another, with
the contents of both filesystems appearing to be in a single filesystem
encompassing both. Changes made to the filesystem are reflected in the
"upper" filesystem, and the "lower" filesystem is treated as read-only.
One common use case is to have a filesystem on read-only media (e.g. CD)
but allow users to make changes by writing to the upper filesystem stored
on read-write media (e.g. flash or disk).
There are a number of details
that bedevil developers of unions, however, including various problems with
namespace handling, dealing with deleted files and directories, the
POSIX definition of readdir(), and so on. None of them are insurmountable,
but they are difficult, and it is even harder to implement them in a way
that doesn't run afoul of the technical complaints of the VFS maintainers.
Szeredi's approach blends the filesystem-based implementations, like
unionfs and aufs, with the VFS-based implementation of union mounts.
For file objects, an open() is forwarded to whichever of the two
underlying filesystems contains it, while directories are handled by the
union filesystem layer.
Neil Brown's very helpful first cut at documentation for the patches lumped directory
handling in with files, but Szeredi called
that a bug. Directory access is never forwarded to the other
filesystems and directories need to "come from the union itself
for various reasons", he said.
As outlined in Brown's document, most of the action for unions takes place
in directories. For one thing, it is more accurate to look at the feature
as unioning directory trees, rather than filesystems, as there is no
requirement that the two trees reside in separate filesystems. In theory,
the lower tree could even be another union, but the current implementation
does not support that.
The filesystem used by the upper tree needs to support the "trusted"
extended attributes (xattrs) and
it must also provide valid d_type (file type)
for readdir() responses, which precludes NFS.
Whiteouts—that is files that exist in the lower tree, but have been
removed in the upper—are handled using the "trusted.union.whiteout"
xattr. Similarly, opaque directories, which do not allow entries in the
lower tree to "show through", are handled with the "trusted.union.opaque"
xattr.
Directory entries are merged with fairly straightforward rules: if entries
exist in both the upper and lower layers with the same name, the upper always
takes precedence unless both are directories. In that case, a directory in
the union is created that merges the entries from each. The initial mount
creates a merged directory of the roots of the upper and lower directory
trees and subsequent lookups follow the rules, creating merged
directories that get cached in the union dentry as needed.
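The merging rules described above can be sketched in a few lines. The following toy Python model (the names merge_dir and WHITEOUT are illustrative inventions, not the patchset's code) treats directory listings as dictionaries, with a sentinel standing in for the "trusted.union.whiteout" xattr:

```python
# Hypothetical sketch of the directory-merging rules; dictionaries stand
# in for directory listings, and WHITEOUT marks a whited-out entry.
WHITEOUT = object()  # stands in for the "trusted.union.whiteout" xattr

def merge_dir(upper, lower, opaque=False):
    """Merge two directory listings: upper entries always win; a name
    present as a directory in both layers yields a merged directory."""
    merged = {}
    if not opaque:  # an opaque upper dir hides the lower layer entirely
        for name, entry in lower.items():
            merged[name] = entry
    for name, entry in upper.items():
        if entry is WHITEOUT:
            merged.pop(name, None)  # whiteout hides the lower entry
        elif isinstance(entry, dict) and isinstance(merged.get(name), dict):
            merged[name] = merge_dir(entry, merged[name])  # dir + dir: merge
        else:
            merged[name] = entry    # upper takes precedence
    return merged
```

A name deleted in the upper layer thus disappears from the merged view even though it still exists below, and only directory-on-directory collisions recurse.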
Write access to lower layer files is handled by the traditional "copy up"
approach. So, opening a lower file for write or changing its metadata will
cause the file to be copied to the upper tree. That may require creating
any intervening directories if the file is several levels down in a
directory tree on the lower layer. Once that's done, though, the hybrid
union filesystem has little further interaction with the file, at least
directly, because operations are handed off to the upper filesystem.
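The copy-up step can be illustrated with ordinary file operations on two directory trees; this is a user-space toy (the function name copy_up is an invention for illustration), not the kernel implementation:

```python
# Toy illustration of "copy up": on first write access to a lower-layer
# file, recreate its parent directories in the upper layer and copy the
# file there; subsequent operations then go to the upper copy.
import os
import shutil

def copy_up(lower_root, upper_root, relpath):
    src = os.path.join(lower_root, relpath)
    dst = os.path.join(upper_root, relpath)
    # Create any intervening directories in the upper tree.
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if not os.path.exists(dst):
        shutil.copy2(src, dst)  # copy data and metadata to the upper tree
    return dst
```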
The patchset is relatively small, and makes very few small changes to
VFS—except for a change to struct inode_operations that ripples
through the filesystem tree. The permission() member of that
structure currently takes a struct inode *, but the hybrid union
filesystem needs to be able to access the filesystem-specific data
(d_fsdata) that is stored in the dentry, so it was changed to take
a struct dentry * instead. David P. Quigley questioned the need for the change, noting
that unionfs and aufs did not require it. Aurora pointed out that union mounts would require
something similar and that, along with Brown's documentation, seemed to put
the matter to rest.
The rest of the patches make minor changes. The first adds a new
struct file_operations member called open_other()
that is used to forward open() calls to the upper or lower layers
as appropriate. Another allows filesystems to set a
FS_RENAME_SELF_ALLOW flag so that rename() will still process
renames on the identical dummy inodes that the filesystem uses for
non-directories. The bulk of the code (modulo the permission()
change) is the new fs/union directory.
While "union" tends to be used for these kinds of filesystems (or mounts),
Brown noted that it is confusing and not
really accurate, suggesting that "overlay" be used in its place. Szeredi
is not opposed to that, saying that
"overlayfs" might make more sense. Aurora more or less concurred, saying that union mounts were
called "writable overlays" for one release. The confusion stemming from
uses of "union" in existing patches (unionfs, union mounts) may provide
additional reason to rename the hybrid union filesystem to overlayfs.
The readdir() semantics are a bit different for the hybrid union as
well. Changes to merged directories while they are being read will not
appear in the entries returned by readdir(), and offsets returned
from telldir() may not return to the same location in a merged
directory on subsequent directory opens. The lists of directory entries in
merged directories are created and cached on the first readdir()
call, with offsets assigned sequentially as they are read. For the most
part, these changes are "unlikely to be noticed by many
programs", as Brown's documentation says.
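A minimal model of that caching behavior (the class and method names here are hypothetical) shows why concurrent changes are invisible and why offsets stay stable within one cached listing:

```python
# Sketch of the described readdir() semantics: the merged entry list is
# built and cached on the first read, with offsets assigned sequentially,
# so later changes to the directory are not reflected in that listing.
class MergedDir:
    def __init__(self, get_entries):
        self._get_entries = get_entries  # callable returning current names
        self._cache = None

    def readdir(self, offset=0):
        if self._cache is None:          # first readdir(): snapshot and cache
            self._cache = list(self._get_entries())
        return list(enumerate(self._cache))[offset:]
```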
A bigger issue is one that all union implementations struggle with: how to
handle changes to either layer that are done outside of the context of the
union. If users or administrators directly change the underlying
filesystems, there are a number of ugly corner cases. Making the lower
layer read-only is an
attractive solution, but it is non-trivial to enforce, especially for
filesystems like NFS.
Szeredi would like to define the problem
away or find some way to enforce the requirements that unioning imposes:
The easiest way out of this mess might simply be to enforce exclusive
modification to the underlying filesystems on a local level, same as
the union mount strategy. For NFS and other remote filesystems we could:
a) add some way to enforce it,
b) live with the consequences if not enforced on the system level, or
c) disallow them to be part of the union.
There was some discussion of the problem, without much in the way of
conclusions other than a requirement that changing the trees out from under
the union filesystem not cause deadlocks or panics.
In some ways, hybrid union seems a simpler approach than union mounts.
Whether it can pass muster with Al Viro and other filesystem maintainers
remains to be seen, however. One way or another, though, some kind of
solution to the lack of an overlay/union filesystem in the
mainline seems to be getting closer.
The Oracle Cluster
Filesystem (ocfs2) is a filesystem for clustered
systems accessing a shared device, such as a Storage Area Network (SAN).
It enables all nodes on the cluster to see the same files; changes to any
files are visible immediately on other nodes. It is the filesystem's
responsibility to ensure that nodes do not corrupt data by writing into each other's
files. To guarantee integrity, ocfs2 uses the Linux Distributed Lock
Manager (DLM) to serialize events. However, a major goal of a clustered
filesystem is to reduce cluster locking overhead in order to improve overall
performance. This article will provide an overview of ocfs2 and how it works.
A brief history
Version 1 of the ocfs filesystem was an early effort by Oracle to create a
filesystem for the clustered environment. It was a basic filesystem targeted to
support Oracle database storage and did not have most of the POSIX features due
to its limited disk format. Ocfs2 was a development effort to convert this basic
filesystem into a general-purpose filesystem. The ocfs2 source code was merged
in the Linux kernel with 2.6.16; since this merger (in early 2006), a lot of
features have been added to the filesystem which improve data storage
efficiency and access times.
Ocfs2 needs a cluster management system to handle cluster
operations such as node membership and fencing. All nodes must have the same
configuration, so each node knows about the other nodes in the
cluster. There are currently two ways to handle cluster management for ocfs2:
- Ocfs2 Cluster Base (O2CB) - This is the in-kernel implementation of cluster
configuration management; it provides only the basic services needed to have a
clustered filesystem running. Each node writes to a heartbeat file to
let others know the node is alive. More information on running the ocfs2
filesystem using o2cb can be found in the
ocfs2 user guide.
This mode does not have the capability of removing nodes from a live cluster and
cannot be used for cluster-wide POSIX locks.
- Linux High Availability - uses user-space tools, such as heartbeat
and pacemaker, to perform cluster management. These packages are complete cluster
management suites which can be used for advanced cluster activities such as
fail-over, STONITH (Shoot The Other Node In The Head - yes, computing
terms can be barbaric), migration of dependent services, etc. It is also
capable of removing nodes from a live cluster. It supports cluster-wide
POSIX locks, as
opposed to node-local locks. More information about cluster management tools
can be found at clusterlabs.org and linux-ha.org
Ocfs2 separates the way data and metadata are stored on disk. To facilitate this,
it has two types of blocks:
- Metadata or simply "blocks" - the smallest addressable unit. These blocks
contain the metadata of the filesystem, such as the inodes, extent blocks,
group descriptors etc. The valid sizes are 512 bytes to 4KB (incremented in
powers of two). Each metadata block contains a signature that says what
the block contains. This signature is cross-checked while reading that specific
block.
- Data Clusters - data storage for regular files. Valid cluster sizes range
from 4KB to 1MB (in powers of two). A larger data cluster reduces the size
of filesystem metadata such as allocation bitmaps, making filesystem
activities such as data allocation or filesystem checks faster. On the
other hand, a large cluster size increases internal
fragmentation. A large cluster size is recommended for filesystems storing
large files such as virtual machine images, while a small data cluster size is
recommended for a filesystem which holds lots of small files, such as a mail
server.
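The cluster-size tradeoff is easy to quantify with back-of-the-envelope arithmetic: on average, each file wastes half a cluster to internal fragmentation, so small files combined with large clusters waste the most space. A quick sketch (the function avg_waste is an illustration, not part of any tool):

```python
# Back-of-the-envelope internal fragmentation: each file wastes, on
# average, half a cluster of slack space at its tail.
def avg_waste(cluster_size, nfiles):
    """Expected total slack in bytes for nfiles files."""
    return nfiles * cluster_size // 2

# For 100,000 small files, 1MB clusters waste roughly 50GB of slack,
# while 4KB clusters waste only a few hundred megabytes.
```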
An inode occupies an entire block. The block number (with respect to the
filesystem block size) doubles as the inode number. This organization may result
in high disk space usage for a filesystem with a lot of small files. To
minimize that problem, ocfs2 packs the data
files into the inode itself if they are small enough to fit. This
feature is known as "inline data." Inode numbers are 64 bits, which gives enough
room for inode numbers to be addressed on large storage devices.
Data in a regular file is maintained in a B-tree of extents; the root of this
B-tree is the inode. The inode holds a list of extent records which may either
point to data extents, or point to extent blocks (which are the
intermediate nodes in the tree). A special field called
l_tree_depth contains the depth of the tree. A value of
zero indicates that the blocks pointed to by extent records are data extents.
The extent records contain the offset from the start of the file
in terms of cluster blocks, which helps in determining the path to take while
traversing the B-tree to find the block to be accessed.
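That traversal can be modeled compactly: at each level, pick the last extent record whose starting cluster offset is at or below the target offset, and descend until l_tree_depth reaches zero. The following sketch (with invented dictionary-based node and record layouts) shows the idea:

```python
# Simplified model of the extent B-tree walk: each extent record carries
# the file offset (in clusters) where it starts; at depth > 0 records
# point to child extent blocks, at depth 0 they point to data extents.
import bisect

def find_extent(node, cluster):
    """node = {'depth': int, 'records': [(start_cluster, target), ...]},
    with records sorted by start_cluster; returns the data-extent target."""
    while True:
        starts = [start for start, _ in node['records']]
        i = bisect.bisect_right(starts, cluster) - 1  # last record <= offset
        target = node['records'][i][1]
        if node['depth'] == 0:
            return target        # reached a data extent
        node = target            # descend into an intermediate extent block
```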
The basic unit of locking is the inode. Locks are granted on a special DLM data
structure known as a lock
resource. For any access to the file, the process must
request a DLM lock on the lock resource over the cluster. DLM offers six lock
modes to differentiate between the types of operation. Out of these, ocfs2 uses only
three: exclusive, protected read, and null locks. The inode
maintains three types of lock resources for different operations:
- read-write lock resource: is used to serialize writes if
multiple nodes perform I/O at the same time on a file.
- inode lock resource: is used for metadata inode operations.
- open lock resource: is used to identify deletes of a file. When a
file is open, this lock resource is opened in protected-read mode. If the node
intends to delete it, it will request an exclusive lock. If successful, it
means that no other node is using the file and it can be safely deleted. If
unsuccessful, the inode is treated as an orphan file (discussed later).
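The open-lock protocol amounts to a lock-compatibility check: protected-read locks coexist, but an exclusive lock is granted only when no other node holds one. A toy model (the class and method names are invented for illustration):

```python
# Toy model of the open lock resource: every node holding the file open
# takes a protected-read (PR) lock; a deleting node tries for exclusive
# (EX), which is granted only when no other node holds PR.
class OpenLockResource:
    def __init__(self):
        self.pr_holders = set()

    def open_file(self, node):
        self.pr_holders.add(node)      # PR is compatible with other PRs

    def close_file(self, node):
        self.pr_holders.discard(node)

    def try_delete(self, node):
        """EX request: succeeds only if no *other* node holds PR."""
        return not (self.pr_holders - {node})
```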
Directory entries are stored in name-inode pairs in blocks known
as directory blocks. Access to the storage pattern of directory blocks is the
same as for a regular file. However, directory blocks are allocated
as cluster blocks. Since a directory block is considered to be a metadata
block, the first allocation uses only a part of the cluster block. As the
directory expands, the remaining unused blocks of the data cluster are filled
until the data cluster block is fully used.
A relatively new feature is indexing the directory entries for faster
retrievals and improved lookup times. Ocfs2 maintains a
separate indexed tree based on the hash of the directory names; the hash
index points to the directory block where the directory entry can be found. Once
the directory block is read, the directory entry is searched linearly.
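The two-step indexed lookup can be sketched as a hash-to-block map followed by a linear scan; this illustration uses CRC32 merely as a stand-in hash (the actual on-disk hash and the function names here are not ocfs2's):

```python
# Illustration of indexed directory lookup: a hash of the name selects
# the directory block, then that block is searched linearly.
import zlib

def lookup(index, blocks, name):
    """index: hash -> block number; blocks: list of {name: inode} dicts."""
    blkno = index.get(zlib.crc32(name.encode()))
    if blkno is None:
        return None                     # name not present in the index
    return blocks[blkno].get(name)      # linear search within one block
```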
A special directory trailer is placed at the end of a directory block which
contains additional data about that block. It keeps a track of
the free space in the directory block for faster free space lookup during
directory entry insert operations. The trailer also contains the
checksum for the directory block, which is used by the metaecc feature (discussed
below).
A special system directory (//) contains all the metadata files
for the filesystem; this directory is not accessible from a
regular mount. Note that the // notation is used only to refer to the
system directory.
Files in the system directory, known as system files, are different from regular
files, both in terms of how they store information and what they store.
An example of a system file is the slotmap, which defines the mapping of a node
in the cluster. A node joins a cluster by providing its unique DLM name. The
slot map provides it with a slot number, and the node inherits all system files
associated with the slot number. The slot number assignment is not persistent
across boots, so a node may inherit the system files of another node. All
node-associated files are suffixed by the slot number to differentiate the files
of different slots.
A global bitmap file in the system directory keeps a record of the allocated
blocks on the device. Each node also maintains a "local allocations" system
file, which manages chunks of blocks obtained from the global
bitmap. Maintaining local allocations decreases contention over the global
bitmap.
The allocation units are divided into the following:
- inode_alloc: allocates inodes for the local node.
- extent_alloc: allocates extent blocks for the local node. Extent
blocks are intermediate leaf nodes in the B-tree storage of the files.
- local_alloc: allocates data in data cluster sizes for the use of
regular file data.
Each allocator is associated with an inode; it maintains allocations in
units known as "block groups." The
allocation groups are preceded by a group descriptor which contains details
about the block group, such as free units, allocation bitmaps etc. The
allocator inode contains a chain of group descriptor block pointers. If this
chain is exhausted, group descriptors are added to the existing ones in the
linked list. Think of it as an array of linked lists. The new group
descriptor is added to the smallest chain so that the number of hops required
to reach an allocation unit stays small.
Things get complicated when allocated data blocks are freed because those
blocks could belong to the allocation map of another node. To
resolve this problem, a "truncate log" maintains the blocks which have been
freed locally, but
not yet returned to the global bitmap. Once the node gets a lock on the global
bitmap, the blocks in the local truncate log are freed.
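The truncate log is essentially a deferred-free queue flushed under the global lock; a toy model (the class name and the set-based bitmap are illustrative assumptions):

```python
# Toy version of the truncate log: blocks freed locally are queued, and
# only returned to the global bitmap once the global lock is held.
class TruncateLog:
    def __init__(self):
        self.pending = []

    def free_local(self, block):
        self.pending.append(block)      # freed locally, not yet globally

    def flush(self, allocated_blocks):
        """Call with the global-bitmap lock held; returns queued blocks
        to the free pool by clearing them from the allocated set."""
        for blk in self.pending:
            allocated_blocks.discard(blk)
        self.pending.clear()
```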
A file is not physically deleted until all processes accessing the file close
it. Filesystems such as ext3 maintain an orphan list which contains a list of
files which have been unlinked but still are in use by the system. Ocfs2 also
maintains such a list to handle orphan inodes. Things are a bit more
complex, however, because a node must check
that a file to be deleted is not being used anywhere in the cluster. This
check is coordinated using the inode lock resource.
Whenever a file is unlinked, and the removed link happens to be the last
link to the file, a check is made to determine whether another node is
using the file by requesting
an exclusive lock over the inode lock resource. If the file is being used, it is
moved to the orphan directory and marked with an OCFS2_ORPHANED_FL flag. The
orphan directory is later scanned to check for files not being accessed by any
of the nodes in order to physically remove them from the storage device.
Ocfs2 maintains a journal to deal with unexpected crashes. It uses the Linux
JBD2 layer for journaling. The journal files are maintained, per node, for all
I/O performed locally. If a node dies, it is the responsibility of the other
nodes in the cluster to replay the dead node's journal before proceeding
with any operations.
Ocfs2 has a couple of other distinctive features that it can boast
about. They include:
- Reflinks is a feature to support snapshotting of files using
copy-on-write (COW). Currently, a system call interface, to be called reflink() or copyfile(),
is being discussed upstream. Until the system call is finalized, users can access this
feature via the reflink system tool which uses an ioctl()
call to perform the snapshotting.
- Metaecc is an error correcting feature for the metadata using Cyclic
Redundancy Check (CRC32). The code warns if the calculated
error-correcting code is different from the one stored, and re-mounts the
filesystem read-only in order to avoid further corruption. It is also capable of
correcting single-bit errors on the fly. A special data structure,
ocfs2_block_check, is embedded in most metadata structures to hold the CRC32
values associated with the structure.
Ocfs2 developers continue to add features to keep it
up to par with other new filesystems. Some features to expect in
the near future are delayed allocation, online filesystem checks, and
defragmentation. Since one of the main goals of ocfs2 is to support a
database, file I/O performance is considered a priority, making it one of the
best filesystems for the clustered environment.
[Thanks to Mark Fasheh for reviewing this article.]
Patches and updates
Core kernel code
Filesystems and block I/O
Benchmarks and bugs
Page editor: Jonathan Corbet