User: Password:
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

The current stable 2.6 release remains The process is underway, however, and that version (with a number of x86_64 patches and a few other fixes) may be out by the time you read this.

The current 2.6 prepatch is 2.6.12-rc5, released (without an announcement) on May 25. It includes a few security patches, some architecture updates, and a number of fixes; see the long-format changelog for the details.

The current -mm release is 2.6.12-rc5-mm1. Recent additions to -mm include the OCFS2 filesystem (see below), the dynamic scheduling domains patch (smarter scheduling on large SMP systems), the Red Hat distributed lock manager (covered here last week), a number of KProbes enhancements, a new try_to_del_timer_sync() function, the execute in place patches, Tensilica Xtensa architecture support, the voluntary preemption patch, and lots of fixes.

The current 2.4 prepatch is 2.4.31-rc1, released by Marcelo on May 25. A couple dozen new patches have been merged; most of them are networking fixes and a new bcm5752 driver.

Comments (none posted)

Kernel development news

Quote of the week

As for different trees, I'm afraid you've written something that is _too useful_ to be used in that manner.

Git has brought with it a _major_ increase in my productivity because I can now easily share ~50 branches with 50 different kernel hackers, without spending all day running rsync. Suddenly my kernel development is a whole lot more _open_ to the world, with a single "./push". And it's awesome.

-- Jeff Garzik

Comments (4 posted)

The OCFS2 filesystem

The second version of Oracle's cluster filesystem has been in the works for some time. There has been a recent increase in cluster-related code proposed for inclusion into the mainline, so it was not entirely surprising to see an OCFS2 patch set join the crowd. These patches have found their way directly into the -mm tree for those wishing to try them out.

As a cluster filesystem, OCFS2 carries rather more baggage than a single-node filesystem like ext3. It does have, at its core, an on-disk filesystem implementation which is heavily inspired by ext3. There are some differences, though: it is an extent-based filesystem, meaning that files are represented on-disk in large, contiguous chunks. Inode numbers are 64 bits. OCFS2 does use the Linux JBD layer for journaling, however, so it does not need to bring along much of its own journaling code.

To actually function in a clustered mode, OCFS2 must have information about the cluster in which it is operating. To that end, it includes a simple node information layer which holds a description of the systems which make up the cluster. This data structure is managed from user space via configfs; the user-space tools, in turn, take the relevant information from a single configuration file (/etc/ocfs2/cluster.conf). It is not enough to know which nodes should be part of a cluster, however: these nodes can come and go, and the filesystem must be able to respond to these events. So OCFS2 also includes a simple heartbeat implementation for monitoring which nodes are actually alive. This code works by setting aside a special file; each node must write a block to that file (with an updated time stamp) every so often. If a particular block stops changing, its associated node is deemed to have left the cluster.

Another important component is the distributed lock manager. OCFS2 includes a lock manager which, like the implementation covered last week, is called "dlm" and implements a VMS-like interface. Oracle's implementation is simpler, however (its core locking function only has eight parameters...), and it lacks many of the fancier lock types and functions of Red Hat's implementation. There is also a virtual filesystem interface ("dlmfs") making locking functionality available to user space.

There is a simple, TCP-based messaging system which is used by OCFS2 to talk between nodes in a cluster.

The remaining code is the filesystem implementation itself. It has all of the complications that one would expect of a high-performance filesystem implementation. OCFS2, however, is meant to operate with a disk which is, itself, shared across the cluster (perhaps via some sort of storage-area network or multipath scheme). So each node on the cluster manipulates the filesystem directly, but they must do so in a way which avoids creating chaos. The lock manager code handles much of this - nodes must take out locks on on-disk data structures before working with them.

There is more to it than that, however. There is, for example, a separate "allocation area" set aside for each node in the cluster; when a node needs to add an extent to a file, it can take it directly from its own allocation area and avoid contending with the other nodes for a global lock. There are also certain operations (deleting and renaming files, for example) which cannot be done by a node in isolation. It would not do for one node to delete a file and recycle its blocks if that file remains open on another node. So there is a voting mechanism for operations of this type; a node wanting to delete a file first requests a vote. If another node vetoes the operation, the file will remain for the time being. Either way, all nodes in the cluster can note that the file is being deleted and adjust their local data structures accordingly.

The code base as a whole was clearly written with an eye toward easing the path into the mainline kernel. It adheres to the kernel's coding standards and avoids the use of glue layers between the core filesystem code and the kernel. There are no changes to the kernel's VFS layer. Oracle's developers also appear to understand the current level of sensitivity about the merging of cluster support code (node and lock managers, heartbeat code) into the kernel. So they have kept their implementation of these functionalities small and separate from the filesystem itself. OCFS2 needs a lock manager now, for example, so it provides one. But, should a different implementation be chosen for merging at some future point, making the switch should not be too hard.

One assumes that OCFS2 will be merged at some point; adding a filesystem is not usually controversial if it is implemented properly and does not drag along intrusive VFS-layer changes. It is only one of many cluster filesystems, however, so it is unlikely to be alone. The competition in the cluster area, it seems, is just beginning.

Comments (5 posted)

The Integrity Measurement Architecture

One of the many new features in the 2.6.11 kernel was a driver for "trusted platform module" (TPM) chips. This driver made the low-level capabilities of TPM chips available, but gave no indication of what sort of applications were envisioned for those capabilities. Reiner Sailer of IBM has now taken the next step with a set of patches implementing the "Integrity Measurement Architecture" (IMA) for Linux using TPM.

IMA is a remote attestation mechanism, designed to be able to convince a remote party that a system is running (nothing but) a set of known and approved executables. It is set up as a security module, and works by hooking into the mmap() operation. Whenever a file is mapped in an executable mode (which is what happens when a program is run or a sharable library is mapped), the IMA hook will first perform and save an SHA1 hash of the file. On request, the IMA module can produce a list of all programs run and their corresponding hash values. This list can be examined by a (possibly remote) program to ensure that no unknown or known-vulnerable applications have been run.

If a hostile application has managed to take over the system, however, it will be in a position to corrupt the list from the IMA module, rendering that list useless. This is where the TPM chip comes in. The TPM contains a set of "platform configuration registers" (PCRs) which are accessible to the the rest of the system only in very specific ways. The PCRs can be reset to zero only when the system hardware itself is reset. The host system can pass data to the TPM which is to be included in a given PCR; the TPM then computes a hash with the new information and stores the value in the PCR. A given set of values, if sent to a PCR in any order, will, at the end, yield the same final hash value. The TPM can provide that value on request; it can also be made to sign the hash value using a top-secret key hidden deeply within its tamper-proof packaging.

The IMA module works by sending each hash it computes to a PCR on the TPM chip. When it provides the list of executables and hash values, it can also obtain and hand over a signed hash from the TPM. A remote party can then recompute the hash, compare it to what the TPM produced, and verify that the provided list is accurate. It is still possible for an intruder to corrupt the list, but it will then fail to match the hash from the TPM. It thus should be possible to remotely detect a compromised system.

Of course, if an attacker can gain control of the kernel at boot time, before the IMA module has been initialized, the entire battle has been lost. The TPM designers have thought of this possibility, however; it is possible to set up hardware so that it will not boot a system in the first place unless the TPM approves of the code to be booted.

There are numerous possible applications of this sort of capability. In a highly secured network, systems could refuse to talk to each other until each proves that it is running only approved software. Financial web sites could, if given access to this information, refuse access from systems running browsers with known security problems. The less flexible sort of Linux support provider could refuse to work on systems which have run programs which are not on The List Of Supported Applications. Corporate IT departments could get verifiable lists of which programs have run on each system. DRM-enabled software could refuse to unlock its valuable intellectual property if the system looks suspicious. And so on.

In the short term, however, this code looks like it will need some work before it will be considered seriously for inclusion. James Morris has questioned the security module implementation, arguing that this functionality should be implemented directly in the kernel. Loading the IMA module also makes it impossible to use any other security module (such as SELinux), which may not enhance the overall security of the system. And Greg Kroah-Hartman was unimpressed with the quality of the code in general:

Wow, for such a small file, every single function was incorrect. And you abused sysfs in a new and interesting way that I didn't think was even possible. I think this is two new records you have set here, congratulations.

The IMA authors have now gone off to rework things. At some point, however, it seems likely that this sort of functionality will be available in Linux. Whether it will then be used to increase or restrict the freedom of Linux users remains to be seen.

(For more information, see the IBM tcgLinux and Trusted Computing Group pages).

Comments (15 posted)

A filesystem from Plan 9 space

Plan 9 started as Ken Thompson and Rob Pike's attempt to address a number of perceived shortcomings in the Unix model. Among other things, Plan 9 takes the "everything is a file" approach rather further than Unix does, and tries to do so in a distributed manner. Plan 9 never took off the way Unix did, but it remains an interesting project; it has been free software since 2003.

One of the core components of Plan 9 is the 9P filesystem. 9P is a networked filesystem, somewhat equivalent to NFS or CIFS, but with its own particular approach. 9P is not as much a way of sharing files as a protocol definition aimed at the sharing of resources in a networked environment. There is a draft RFC available which describes this protocol in detail.

The protocol is intentionally simple. It works in a connection-oriented, single-user mode, much like CIFS; each user on a Plan 9 system is expected to make one or more connections to the server(s) of interest. Plan 9 operates with per-user namespaces by design, so each user ends up with a unique view of the network. There is a small set of operations supported by 9P servers; a client can create file descriptors, use them to navigate around the filesystem, read and write files, create, rename and delete files, and close things down; that's about it.

The protocol is intentionally independent of the underlying transport mechanism. Typically, a TCP connection is used, but that is not required. A 9P client can, with a proper implementation, communicate with a server over named pipes, zero-copy memory transports, RDMA, RFC1149 avian links, etc. The protocol also puts most of the intelligence on the server side; clients, for example, perform no caching of data. An implication of all these choices is that there is no real reason why 9P servers have to be exporting filesystems at all. A server can just as easily offer a virtual filesystem (along the lines of /proc or sysfs), transparent remote access to devices, connections to remote processes, or just about anything else. The 9P protocol is the implementation of the "everything really is a file" concept. It could thus be used in a similar way as the filesystems in user space (FUSE) mechanism currently being considered for merging. 9P also holds potential as a way of sharing resources between virtualized systems running on the same host.

There is a 9P implementation for Linux, called "v9fs"; Eric Van Hensbergen has recently posted a v9fs patch set for review with an eye toward eventual inclusion. v9fs is a full 9P client implementation; there is also a user-space server available via the v9fs web site.

Linux and Plan 9 have different ideas of how a filesystem should work, so a fair amount of impedance matching is required. Unix-like systems prefer filesystems to be mounted in a global namespace for all users, while Plan 9 filesystems are a per-user resource. A v9fs filesystem can be used in either mode, though the most natural way is to use Linux namespaces to allow each user to set up independently authenticated connections. The lack of client-side caching does not mix well with the Linux VFS, which wants to cache heavily. The current v9fs implementation disables all of this caching. In some areas, especially write performance, this lack of caching makes itself felt. In others, however, v9fs claims better performance than NFS as a result of its simpler protocol. Plan 9 also lacks certain Unix concepts - such as symbolic links. To ease interoperability with Unix systems, a set of protocol extensions has been provided; v9fs uses those extensions where indicated.

The current release is described as "reasonably stable." The basic set of file operations has been implemented, with the exception of mmap(), which is hard to do in a way which does not pose the risk of system deadlocks. Future plans include "a more complete security model" and some thought toward implementing limited client-side caching, perhaps by using the CacheFS layer. See the patch introduction for pointers to more information, mailing lists, etc.

Comments (2 posted)

Patches and updates

Kernel trees


Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management




Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds