Kernel release status
The current stable 2.6 release remains 2.6.11.10. The 2.6.11.11
process is underway, however, and that version (with a number of x86_64
patches and a few other fixes) may be out by the time you read this.
The current 2.6 prepatch is 2.6.12-rc5, released (without an announcement)
on May 25. It includes a few security patches, some architecture
updates, and a number of fixes; see the long-format changelog for
the details.
The current -mm release is 2.6.12-rc5-mm1.
Recent additions to -mm include
the OCFS2 filesystem (see below), the dynamic scheduling domains
patch (smarter scheduling on large SMP systems),
the Red Hat distributed lock manager (covered here last week),
a number of KProbes
enhancements, a new try_to_del_timer_sync() function, the execute in place patches, Tensilica Xtensa
architecture support, the voluntary preemption patch, and lots of fixes.
The current 2.4 prepatch is 2.4.31-rc1, released by Marcelo on
May 25. A couple dozen new patches have been merged, mostly
networking fixes along with a new bcm5752 driver.
Kernel development news
Quote of the week
As for different trees, I'm afraid you've written something that is
_too useful_ to be used in that manner.
Git has brought with it a _major_ increase in my productivity because
I can now easily share ~50 branches with 50 different kernel
hackers, without spending all day running rsync. Suddenly my
kernel development is a whole lot more _open_ to the world, with a
single "./push". And it's awesome.
-- Jeff Garzik
The OCFS2 filesystem
The second version of Oracle's cluster filesystem has been in the works for
some time. There
has been a recent increase in cluster-related code proposed for inclusion
into the mainline, so it was not entirely surprising to see
an OCFS2 patch set join the crowd. These
patches have found their way directly into the -mm tree for those wishing
to try them out.
As a cluster filesystem, OCFS2 carries rather more baggage than a
single-node filesystem like ext3. It does have, at its core, an on-disk
filesystem implementation which is heavily inspired by ext3. There are
some differences, though: it is an extent-based filesystem, meaning that
files are represented on-disk in large, contiguous chunks. Inode numbers
are 64 bits. OCFS2 does use the Linux JBD layer for journaling, however,
so it does not need to bring along much of its own journaling code.
To actually function in a clustered mode, OCFS2 must have information about
the cluster in which it is operating. To that end, it includes a simple
node information layer which holds a description of the systems which make
up the cluster. This data structure is managed from user space via configfs; the user-space tools, in turn, take
the relevant information from a single configuration file
(/etc/ocfs2/cluster.conf). It is not enough to know which nodes
should be part of a cluster, however: these nodes can come and go, and the
filesystem must be able to respond to these events. So OCFS2 also includes
a simple heartbeat implementation for monitoring which nodes are actually
alive. This code works by setting aside a special file; each node must
write a block to that file (with an updated time stamp) every so often. If
a particular block stops changing, its associated node is deemed to have
left the cluster.
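The heartbeat scheme described above can be sketched in a few lines. This is only an illustration of the idea, not OCFS2's code: the `HeartbeatRegion` and `HeartbeatMonitor` names, the in-memory dictionary standing in for the shared file, and the `dead_after` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatRegion:
    """Hypothetical stand-in for the shared heartbeat file: one slot per node."""
    slots: dict = field(default_factory=dict)

    def beat(self, node, stamp):
        # Each node periodically rewrites its own slot with a fresh time stamp.
        self.slots[node] = stamp


class HeartbeatMonitor:
    """Declares a node gone once its slot has stopped changing for too long."""

    def __init__(self, region, dead_after):
        self.region = region
        self.dead_after = dead_after
        self.last_seen = {}  # node -> (last stamp read, local time it changed)

    def poll(self, now):
        """Return the set of nodes whose slot changed recently enough."""
        alive = set()
        for node, stamp in self.region.slots.items():
            prev = self.last_seen.get(node)
            if prev is None or prev[0] != stamp:
                # The slot changed since the last poll: the node is writing.
                self.last_seen[node] = (stamp, now)
            changed_at = self.last_seen[node][1]
            if now - changed_at < self.dead_after:
                alive.add(node)
        return alive
```

Note that the monitor only cares whether a slot keeps changing, measured against its own clock; it never compares time stamps between nodes, so the nodes' clocks need not agree.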
Another important component is the distributed lock manager. OCFS2
includes a lock manager which, like the implementation covered last week, is called "dlm" and implements a
VMS-like interface. Oracle's implementation is simpler, however (its core
locking function only has eight parameters...), and it lacks many of the
fancier lock types and functions of Red Hat's implementation. There is
also a virtual filesystem interface ("dlmfs") making locking functionality
available to user space.
There is a simple, TCP-based messaging system which is used by OCFS2 to
talk between nodes in a cluster.
The remaining code is the filesystem implementation itself. It has all of
the complications that one would expect of a high-performance filesystem
implementation. OCFS2, however, is meant to operate with a disk which is,
itself, shared across the cluster (perhaps via some sort of storage-area
network or multipath scheme). So each node on the cluster manipulates the
filesystem directly, but they must do so in a way which avoids creating
chaos. The lock manager code handles much of this - nodes must take out
locks on on-disk data structures before working with them.
There is more to it than that, however. There is, for example, a separate
"allocation area" set aside for each node in the cluster; when a node needs
to add an extent to a file, it can take it directly from its own allocation
area and avoid contending with the other nodes for a global lock. There
are also certain operations (deleting and renaming files, for example)
which cannot be done by a node in isolation. It would not do for one node
to delete a file and recycle its blocks if that file remains open on
another node. So there is a voting mechanism for operations of this type;
a node wanting to delete a file first requests a vote. If another node
vetoes the operation, the file will remain for the time being. Either way,
all nodes in the cluster can note that the file is being deleted and adjust
their local data structures accordingly.
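A toy model of the delete vote might look like the following. The `Node` class, its fields, and `request_delete()` are invented for illustration; the real filesystem carries these requests over its messaging layer rather than through direct method calls.

```python
class Node:
    """Hypothetical cluster member tracking its open files."""

    def __init__(self, name):
        self.name = name
        self.open_files = set()
        self.pending_delete = set()

    def vote_on_delete(self, path):
        # Every node notes the pending deletion, whatever its vote is.
        self.pending_delete.add(path)
        # Veto (return False) if the file is still open locally.
        return path not in self.open_files


def request_delete(requester, cluster, path):
    """Ask every other node to vote; True means no node vetoed the delete."""
    # Build the full list first so that every peer is consulted, even
    # after a veto has already been cast.
    votes = [peer.vote_on_delete(path) for peer in cluster if peer is not requester]
    return all(votes)
```

If any peer still holds the file open, `request_delete()` returns False and the requester would leave the file in place for the time being, to be cleaned up later.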
The code base as a whole was clearly written with an eye toward easing the
path into the mainline kernel. It adheres to the kernel's coding standards
and avoids the use of glue layers between the core filesystem code and the
kernel. There are no changes to the kernel's VFS layer.
Oracle's developers also appear to understand the current level of
sensitivity about the merging of cluster support code (node and lock
managers, heartbeat code) into the kernel. So they have kept their
implementation of these functionalities small and separate from the
filesystem itself. OCFS2 needs a lock manager now, for example, so it
provides one. But, should a different implementation be chosen for merging
at some future point, making the switch should not be too hard.
One assumes that OCFS2 will be merged at some point; adding a filesystem is
not usually controversial if it is implemented properly and does not drag
along intrusive VFS-layer changes. It is only one of many cluster
filesystems, however, so it is unlikely to be alone. The competition in
the cluster area, it seems, is just beginning.
The Integrity Measurement Architecture
One of the many new features in the 2.6.11 kernel was a driver for "trusted
platform module" (TPM) chips. This driver made the low-level capabilities of TPM
chips available, but gave no indication of what sort of applications were
envisioned for those capabilities. Reiner Sailer of IBM has now taken the
next step with
a set of patches
implementing the "Integrity Measurement Architecture" (IMA) for Linux using
TPM.
IMA is a remote attestation mechanism, designed to be able to convince a
remote party that a system is running (nothing but) a set of known and
approved executables. It is set up as a security module, and works by
hooking into the mmap() operation. Whenever a file is mapped in
an executable mode (which is what happens when a program is run or a
shared library is mapped), the IMA hook will first compute and save an
SHA1 hash of the file. On request, the IMA module can produce a list of
all programs run and their corresponding hash values. This list can be
examined by a (possibly remote) program to ensure that no unknown or
known-vulnerable applications have been run.
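The measurement list can be pictured with a short user-space sketch. It is not the IMA module's code: `MeasurementList`, `measure()`, and `unknown()` are hypothetical names, and file contents are passed in as bytes rather than being captured by an mmap() hook.

```python
import hashlib


class MeasurementList:
    """Hypothetical record of every executable mapping and its SHA1 hash."""

    def __init__(self):
        self.entries = []  # (name, hex digest), in the order measured

    def measure(self, name, content):
        # Hash the file contents before the mapping would be allowed to run.
        digest = hashlib.sha1(content).hexdigest()
        self.entries.append((name, digest))
        return digest

    def unknown(self, approved):
        """Return names of measured files whose hash is not in `approved`."""
        return [name for name, digest in self.entries if digest not in approved]
```

A verifier holding a set of approved digests would call `unknown()` on the reported list and treat any non-empty result as a reason for suspicion.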
If a hostile application has managed to take over the system, however, it
will be in a position to corrupt the list from the IMA module, rendering
that list useless. This is where the TPM chip comes in. The TPM contains
a set of "platform configuration registers" (PCRs) which are accessible to
the rest of the system only in very specific ways. The PCRs can be
reset to zero only when the system hardware itself is reset. The host
system can pass data to the TPM which is to be included in a given PCR; the
TPM then computes a hash with the new information and stores the value in
the PCR. Because each new value is hashed together with the register's
previous contents, a given set of values, if sent to a PCR in the same order,
will, at the end, yield the same final hash value. The TPM can provide that value
on request; it can also be made to sign the hash value using a top-secret
key hidden deeply within its tamper-proof packaging.
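The PCR update can be modeled in software as follows. This sketch assumes the SHA1-based extend operation of TPM 1.2-era chips, in which the new register value is the hash of the old value concatenated with the incoming data; the function names are made up for the example, and a real PCR lives inside the chip, not in host memory.

```python
import hashlib

PCR_SIZE = 20  # a SHA1 digest is 20 bytes


def extend(pcr, measurement):
    """Fold a measurement into a register: new = SHA1(old || measurement)."""
    return hashlib.sha1(pcr + measurement).digest()


def replay(measurements):
    """Start from the all-zero reset value and extend with each measurement."""
    pcr = bytes(PCR_SIZE)
    for m in measurements:
        pcr = extend(pcr, m)
    return pcr
```

Replaying the same measurements in the same sequence always reproduces the same final value, which is what lets a third party check a reported register against a list of measurements.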
The IMA module works by sending each hash it computes to a PCR on the TPM
chip. When it provides the list of executables and hash values, it can
also obtain and hand over a signed hash from the TPM. A remote party can
then recompute the hash, compare it to what the TPM produced, and verify
that the provided list is accurate. It is still possible for an intruder
to corrupt the list, but it will then fail to match the hash from the TPM.
It thus should be possible to remotely detect a compromised system.
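The remote party's check might be sketched like this. The sketch assumes a SHA1-based extend, as in the TPM 1.2 generation of chips, and it leaves out the verification of the TPM's signature entirely; `aggregate()` and `list_matches_pcr()` are hypothetical helpers, not part of the posted patches.

```python
import hashlib
import hmac


def aggregate(digests):
    """Recompute the register value implied by a sequence of digests."""
    pcr = bytes(20)  # registers start at zero after a hardware reset
    for d in digests:
        pcr = hashlib.sha1(pcr + d).digest()
    return pcr


def list_matches_pcr(entries, reported_pcr):
    """Check a reported (name, hex digest) list against the TPM's value.

    Verifying the TPM's signature over reported_pcr is a separate step
    that this sketch does not attempt.
    """
    recomputed = aggregate(bytes.fromhex(h) for _, h in entries)
    return hmac.compare_digest(recomputed, reported_pcr)
```

An intruder who drops or alters an entry in the list changes the recomputed value, so the comparison fails unless the register inside the TPM could be changed to match.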
Of course, if an attacker can gain control of the kernel at boot time,
before the IMA module has been initialized, the entire battle has been
lost. The TPM designers have thought of this possibility, however; it is
possible to set up hardware so that it will not boot a system in the first
place unless the TPM approves of the code to be booted.
There are numerous possible applications of this sort of capability. In a
highly secured network, systems could refuse to talk to each other until
each proves that it is running only approved software. Financial web sites
could, if given access to this information, refuse access from systems
running browsers with known security problems. The less flexible sort of
Linux support provider could refuse to work on systems which have run
programs which are not on The List Of Supported Applications. Corporate IT
departments could get verifiable lists of which programs have run on each
system. DRM-enabled software could refuse to unlock its valuable
intellectual property if the system looks suspicious. And so on.
In the short term, however, this code looks like it will need some work
before it will be considered seriously for inclusion. James Morris has questioned the security module implementation,
arguing that this functionality should be implemented directly in the
kernel. Loading the IMA module also makes it impossible to use any
other security module (such as SELinux), which may not enhance the overall
security of the system. And Greg Kroah-Hartman was unimpressed with the quality of the code
in general:
Wow, for such a small file, every single function was incorrect.
And you abused sysfs in a new and interesting way that I didn't
think was even possible. I think this is two new records you have
set here, congratulations.
The IMA authors have now gone off to rework things. At some point,
however, it seems likely that this sort of functionality will be available
in Linux. Whether it will then be used to increase or restrict the freedom
of Linux users remains to be seen.
(For more information, see the IBM tcgLinux and Trusted Computing
Group pages.)
A filesystem from Plan 9 space
Plan 9 started as Ken
Thompson and Rob Pike's attempt to address a number of perceived
shortcomings in the Unix model. Among other things, Plan 9 takes the
"everything is a file" approach rather further than Unix does, and tries to
do so in a distributed manner. Plan 9 never took off the way Unix
did, but it remains an interesting project; it has been free software since
2003.
One of the core components of Plan 9 is the 9P filesystem. 9P is a
networked filesystem, somewhat equivalent to NFS or CIFS, but with its own
particular approach. 9P is not as much a way of sharing files as a
protocol definition aimed at the sharing of resources in a networked
environment. There is a draft
RFC available which describes this protocol in detail.
The protocol is intentionally simple. It works in a connection-oriented,
single-user mode, much like CIFS; each user on a Plan 9 system is
expected to make one or more connections to the server(s) of interest.
Plan 9 operates with per-user namespaces by design, so each user ends
up with a unique view of the network. There is a small set of operations
supported by 9P servers; a client can create file descriptors, use them to
navigate around the filesystem, read and write files, create, rename and
delete files, and close things down; that's about it.
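To give a feel for how simple the protocol is on the wire, here is a sketch that builds the version-negotiation request which opens a 9P session. It follows the 9P2000 framing as documented in the Plan 9 manual (a four-byte size, a one-byte type, and a two-byte tag, all little-endian, with strings carrying a two-byte length prefix); the helper names and the default `msize` of 8192 are choices made for this example, and the layout should be checked against the protocol documentation before being relied upon.

```python
import struct

TVERSION = 100   # message type of the version request in 9P2000
NOTAG = 0xFFFF   # tag conventionally used for the version exchange


def pack_string(s):
    """Encode a string as a two-byte length followed by UTF-8 data."""
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data


def pack_message(mtype, tag, body):
    """Prefix a body with size[4] type[1] tag[2]; size counts itself."""
    size = 4 + 1 + 2 + len(body)
    return struct.pack("<IBH", size, mtype, tag) + body


def tversion(msize=8192, version="9P2000"):
    """Build the request proposing a maximum message size and a version."""
    return pack_message(TVERSION, NOTAG, struct.pack("<I", msize) + pack_string(version))


def unpack_header(msg):
    """Return (size, type, tag) from the first seven bytes of a message."""
    return struct.unpack_from("<IBH", msg)
```

The other requests (walk, open, read, write, and so on) use the same seven-byte header with different type values and bodies.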
The protocol is intentionally independent of the underlying transport
mechanism. Typically, a TCP connection is used, but that is not required.
A 9P client can, with a proper implementation, communicate with a server
over named pipes, zero-copy memory transports, RDMA, RFC1149 avian links,
etc. The protocol also puts most of the intelligence on the server side;
clients, for example, perform no caching of data. An implication of all
these choices is that there is no real reason why 9P servers have to be
exporting filesystems at all. A server can just as easily offer a virtual
filesystem (along the lines of /proc or sysfs), transparent remote
access to devices, connections to remote processes, or just about anything
else. The 9P protocol is the implementation of the "everything really is a
file" concept. It could thus be used in much the same way as the filesystems
in user space (FUSE) mechanism currently being considered for merging.
9P also holds potential as a way of sharing resources between virtualized
systems running on the same host.
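The notion of a server whose "files" are not files at all can be illustrated with a toy model. This is not the 9P protocol: the `SyntheticServer` class and its tuple-based requests are invented here, simply to show file-style operations being answered by arbitrary code, independent of how the request arrived.

```python
class SyntheticServer:
    """Hypothetical server whose files are backed by functions, not disk."""

    def __init__(self):
        self.files = {}  # path -> callable returning the current contents

    def export(self, path, reader):
        self.files[path] = reader

    def handle(self, request):
        """Answer an (operation, path) request; the transport is irrelevant."""
        op, path = request
        if path not in self.files:
            return ("error", "file not found")
        if op == "read":
            # The contents are generated at read time, /proc style.
            return ("data", self.files[path]().encode())
        return ("error", "unsupported operation")
```

Since `handle()` only maps a request to a reply, the same server object could sit behind a TCP socket, a pipe, or a shared-memory channel without changes.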
There is a 9P implementation for Linux, called "v9fs"; Eric Van Hensbergen
has recently posted a v9fs patch set for
review with an eye toward eventual inclusion. v9fs is a full 9P client
implementation; there is also a user-space server available via the v9fs web site.
Linux and Plan 9 have different ideas of how a filesystem should work, so a
fair amount of impedance matching is required. Unix-like systems prefer
filesystems to be mounted in a global namespace for all users, while
Plan 9 filesystems are a per-user resource. A v9fs filesystem can be
used in either mode, though the most natural way is to use Linux namespaces
to allow each user to set up independently authenticated connections. The
lack of client-side caching does not mix well with the Linux VFS, which
wants to cache heavily. The current v9fs implementation disables all of
this caching. In some areas, especially write performance, this lack of
caching makes itself felt. In others, however, v9fs claims better
performance than NFS as a result of its simpler protocol. Plan 9 also
lacks certain Unix concepts - such as symbolic links. To ease
interoperability with Unix systems, a set of protocol
extensions has been provided; v9fs uses those extensions where
indicated.
The current release is described as "reasonably stable." The basic set of
file operations has been implemented, with the exception of
mmap(), which is hard to do in a way which does not pose the risk
of system deadlocks. Future plans include "a more complete security
model" and some thought toward implementing limited client-side caching,
perhaps by using the CacheFS layer.
See the patch introduction for pointers to
more information, mailing lists, etc.
Page editor: Jonathan Corbet