The current 2.6 prepatch is 2.6.19-rc6, released on November 15.
It contains a fair number of fixes, but one hopes that most of the problems
have been taken care of by now (though the
latest version of the 2.6.19 known regressions list
(November 21) still contains nine entries). See the long-format changelog
for the details.
Almost 90 fixes have gone into the mainline git repository since -rc6, as
of this writing. There has been no word on whether that will be enough to
force the release of a 2.6.19-rc7 before this cycle comes to an end.
There have been no -mm releases over the last week.
The current stable 2.6 kernel is 2.6.18.3, released on November 18.
It contains a fair number of fixes, at least one of which is security-related.
Adrian Bunk has released 2.6.16.32 with quite a few
fixes. 2.6.16.33-rc1 is also out.
On the 2.4 front, Willy Tarreau has released 2.4.33.4 with a couple of
security patches and a number of other fixes. 2.4.34-pre6 is also out, adding
a relatively small number of patches.
Kernel development news
Hm, I've never heard the driver model be called a "complete design
paradigm" in the past. I've heard it called a lot of real nasty
things, though.
-- Greg Kroah-Hartman
So don't fall for the classic "second system syndrome". The classic
reason for getting the second system wrong is because you focus on
the issues people complain about, and not on the issues that work
well (because the issues that work fine are obviously not getting a
lot of attention).
-- Linus Torvalds
The 2.6.19 kernel cycle has brought the usual pile of changes visible to
kernel developers. Here is a quick summary of the most significant API
modifications in 2.6.19.
As always, API changes are tracked on the LWN 2.6 API changes page.
Filesystems, especially remote filesystems, may require some authentication
or a key to enable access; the kernel key management interface provides
hooks to store and manage this kind of information. The hooks come in two
flavors: one used by the kernel to find keys for subsystems that
require them, and one used by userspace programs to manage keys. The
intent is to provide a fast mechanism for the kernel to access the keys
that it needs, while pushing the add, modify, and delete operations out to
userspace.
"Key" is the term used, but what is stored need not be a key in the
traditional, cryptographic sense. Any kind of authentication or
access information can be stored as a key; it is essentially an opaque
chunk of data that is only interpreted by the kernel subsystem that is
interested in it. While filesystems are the main target of the API, any
kernel subsystem that requires this kind of information could use it.
At the core, keys are stored in the aptly named struct key, which
has the following kinds of fields:
- a unique serial number
- a key type that can identify the filesystem that the key belongs to
- a description string that is used for searching for the key
- a payload that contains the actual key data
- user and group information, including permissions
- an expiration time
- a key state that tracks instantiation, revocation, deletion, etc.
The key types provide a way for a filesystem to configure its own set of
key operations. The operations that a key type can specify are:
- instantiate: create a key of that type
- update: modify a key
- match: match a key to a description, which is used in the key search
- revoke: clear some key data and change the state to KEY_FLAG_REVOKED
- destroy: clear all key data
- describe: summarize the key's description and payload as text
- read: read the key data
- request_key: called when the key is not available, in order to retrieve
  the key from elsewhere
Two standard key types, keyring and user, are defined.
Key types can be registered by filesystems using:
int register_key_type(struct key_type *type);
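As a sketch (not taken from the kernel source), a filesystem might define and register its own key type roughly as follows. The "myfs" name, the payload size limit, and the helper names are all hypothetical; the callback signatures follow struct key_type as described above.

```c
#include <linux/key.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/string.h>

/* Hypothetical "myfs" key type: the payload is kept as an opaque blob. */
static int myfs_instantiate(struct key *key, const void *data, size_t datalen)
{
	void *blob;

	if (datalen == 0 || datalen > 4096)
		return -EINVAL;
	blob = kmalloc(datalen, GFP_KERNEL);
	if (!blob)
		return -ENOMEM;
	memcpy(blob, data, datalen);
	key->payload.data = blob;
	return 0;
}

/* Exact-match on the description string, used during key searches. */
static int myfs_match(const struct key *key, const void *description)
{
	return strcmp(key->description, description) == 0;
}

static void myfs_destroy(struct key *key)
{
	kfree(key->payload.data);
}

static struct key_type key_type_myfs = {
	.name        = "myfs",
	.instantiate = myfs_instantiate,
	.match       = myfs_match,
	.destroy     = myfs_destroy,
};

static int __init myfs_keys_init(void)
{
	return register_key_type(&key_type_myfs);
}

static void __exit myfs_keys_exit(void)
{
	unregister_key_type(&key_type_myfs);
}
```

Any key type left out of the structure (update, revoke, read, and so on) simply falls back to default behavior.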
When the kernel needs to find a key, it calls:
struct key *request_key(const struct key_type *type,
const char *description,
const char *callout_string);
The kernel passes in the type and description; the match
function from the struct key_type is used to find a matching
key. If no key is found and callout_string is not NULL,
the kernel invokes /sbin/request-key, which
attempts to obtain the necessary key from userspace.
The payload field of a key can be accessed once the key has been
found, but if it is more complex than a simple integer, some arrangement
must be made to prevent simultaneous reads and writes. Support for semaphore
locking and read-copy-update (RCU) is present in the key structure and must
be used unless the key type has no modification methods. Once the
filesystem is done with the key, it should be released with:
void key_put(struct key *key);
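Putting the kernel-side pieces together, a filesystem's use of a key might look roughly like this sketch. It is not taken from any real filesystem: the key type and function name are hypothetical, and the payload handling is abbreviated.

```c
/* Hypothetical kernel-side sketch: look up a key, read its payload
 * safely, and release the reference. */
static int myfs_get_secret(const char *desc)
{
	struct key *key;

	/* Search the caller's keyrings; with a non-NULL callout string
	 * the kernel could fall back to /sbin/request-key. */
	key = request_key(&key_type_myfs, desc, NULL);
	if (IS_ERR(key))
		return PTR_ERR(key);

	/* The payload may be updated concurrently; take the key's
	 * semaphore (or use RCU) around accesses to it. */
	down_read(&key->sem);
	/* ... use key->payload.data ... */
	up_read(&key->sem);

	key_put(key);	/* drop the reference when done */
	return 0;
}
```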
Keyrings are, as the name implies, collections of related keys and there are
various calls to manipulate them. Each process is associated with three
keyrings: a thread-specific keyring, a process-specific keyring and a
session-specific keyring. These are the keyrings searched when a
request_key is issued. Each user on the system is associated with
a user-specific keyring and a default user session keyring; the latter is
used to initialize the session-specific keyring when a process changes its
real user ID.
Permissions for keys are stored in a bit field, much like Linux file
permissions, but are more extensive. Each key has a user and group id
and a permissions mask for each of four potential accessors: possessor,
user, group, and other. The mask consists of six bits:
- view: allows a key or keyring's attributes to be viewed
- read: allows a key's payload or a keyring's list of keys to be viewed
- write: allows creating or modifying a key's payload or a keyring's list
  of keys
- search: allows keys to be found and keyrings to be searched
- link: allows the key or keyring to be linked into another keyring
- set attribute: allows the key's user ID, group ID, and permissions mask
  to be changed
The userspace API consists of three main system calls:
key_serial_t add_key(const char *type, const char *description,
                     const void *payload, size_t plen,
                     key_serial_t keyring);

key_serial_t request_key(const char *type, const char *description,
                         const char *callout_info,
                         key_serial_t dest_keyring);

long keyctl(int cmd, ...);
add_key() adds a key to the specified keyring. request_key(),
much like its kernel-side counterpart,
searches for a key based on the type and description, possibly calling
out to userspace if callout_info is non-NULL; it can also attach
the key to a specified destination keyring if it is found. keyctl()
is an ioctl()-like interface that provides for the management of keys;
it offers 17 separate commands for
updating keys, changing permissions, searching, linking, reading, and the like.
The keyctl command-line
utility, part of the keyutils package, provides an easy interface to
these system calls to
facilitate working with keys from userspace. Also, the /proc/keys
and /proc/key-users entries in procfs enable a user to view
the keys and key users currently managed by the kernel.
The only filesystem in the current 2.6 tree that uses the key
management API is eCryptfs, a
stacked filesystem that encrypts its data using a password and optional
salt. It uses the user key type rather than creating its own type and
does not directly support userspace callbacks. Instead it uses the
mount.ecryptfs command to prompt the user for the password
and stores that as the key.
According to slides from Dave Howells' talk at the 2006 Ottawa Linux
Symposium, several other filesystems (including CIFS, NFSv4 and AFS) are
planning to use
the API in the future. For more information, extensive documentation can
be found in the kernel tree in Documentation/keys.txt and
Documentation/keys-request-key.txt.
Overall, this looks to be a useful interface for kernel subsystems that
require keys and, in keeping with kernel tradition, most of the policy and
management pieces are pushed out to userspace. It provides all of the
capabilities that one would expect and hopefully more kernel subsystems
will be using it in the future.
If you're a programmer, you've reviewed a lot of code - at minimum,
your own code (or at least we hope so). It doesn't take a lot of code
reviewing before you start recognizing familiar bugs - failure to drop
a lock on the error exit path, dereferencing a pointer just after it's
been proven to be null, forgetting to mark a buffer dirty. Before
long, the sense of deja vu is overpowering. You might even begin to
entertain the sneaking suspicion that half of code review work could
be done by a trained chimpanzee or a 10-line script.
(Some of) The Solutions
As it turns out, that suspicion is correct. A lot of software errors
can be found automatically, in fact, surprisingly automatically. The
automatic checking we'll discuss falls into two main categories:
static and dynamic. Static checking runs on the source code and
doesn't require integration with a running system. It is often better
at exploring all execution paths, but often explores impossible
execution paths (resulting in false positives), and usually can't deal
with things like function pointers. Dynamic checking runs on a live
system, which produces more accurate results but requires more
invasive techniques and usually can't explore as many execution paths
(though fusion with model-checking techniques can work around this;
see eXplode later in this article). The good news is that automatic
error checking techniques are compatible; we can use them all and get
the best of all worlds.
In this article, we'll review several papers describing some of the
most practical and promising approaches, all from Dawson Engler's
research group at Stanford. Many LWN readers will already be familiar
with metacompilation (known as "the Stanford checker" in kernel
circles) at a high level, but the approach rewards deeper study.
Another approach, named EXE, uses symbolic execution, in which an
instrumented program self-generates complex error-triggering inputs.
We'll also look at eXplode, a light-weight system inspired by
model-checking which quickly and efficiently checks file systems and
other software for correct crash recovery. All of these approaches
are compatible with the Linux kernel (requiring more or less code
modification but generally less) and have found many real-world bugs
resulting in system panic, data corruption, or security holes.
Finally, we'll quickly review a variety of existing open source tools
for automatically error-checking programs. With any luck, in a few
years' time we'll have scripts doing the trained chimpanzee code
review work instead of Linux kernel maintainers.
We'll start with one of the most intellectually intriguing approaches,
using code instrumentation and symbolic execution to automatically
generate complex test inputs that trigger serious bugs. The paper is
Automatically Generating Malicious Disks Using Symbolic Execution, by
Junfeng Yang, Can Sar, Paul Twohey, Cristian Cadar, and Dawson Engler,
which appeared in IEEE Security and Privacy 2006. (Another, longer and
more detailed paper on the topic is EXE: Automatically Generating Inputs
of Death, by Cristian Cadar, Vijay Ganesh, Peter Pawlowski, David Dill,
and Dawson Engler, which appeared in ACM Computer and Communications
Security 2006.) The basic idea is that you
begin executing the program with a "symbolic" input. As the program
runs, the EXE system uses compiled-in instrumentation to keep track of
the tests done on the input data. These tests create constraints on
what the input data can be. Once the system has a set of constraints,
it tries to solve them and come up with a set of allowed inputs. It
then checks the allowed inputs to figure out if they will cause one of
a known set of errors, such as dividing by zero, allowing access to
arbitrary memory locations, triggering an assertion, etc.
In this paper, the authors apply the system to the Linux file system
mount code for ext2, ext3, and JFS. In this case, the system starts
out with a symbolic representation of all possible disk images
("inputs"), and gradually whittles away allowed disk images at each
point in the mount code, based on actions such as:
if (sbi->s_frag_size == 0)
It then checks all disk images allowed at any particular point to see
if any of them causes one of the bugs the system can detect. For
example, the statement:
sbi->s_frags_per_block = sb->s_blocksize / sbi->s_frag_size;
would be flagged as triggering a divide-by-zero error without the
prior check pruning out all inputs with sbi->s_frag_size equal to zero.
The advantage of this approach over simply generating random inputs is
that random error generation can't go very deep in testing code paths
because the random input will nearly always fail during the first few
input checks. For example, random input testing for the file system
mount code would almost always fail out at the check of the superblock
magic number. Another pleasant quality of this approach is that it
generates test inputs that trigger the bug detected by the system.
Many other automatic error checkers are plagued by false positives;
this system hands you the exact input that triggers the supposed bug. It can
be accurately described as an error test case generation system in
addition to an error checking system. The prospect is enough to make
a systems programmer salivate.
The next paper is eXplode: a
Lightweight, General System for Finding Serious Storage System
Errors, by Junfeng Yang, Can Sar, and Dawson Engler, which
appeared in OSDI
2006. eXplode tests file systems (and more complex storage
software stacks) by generating all possible disks that could be the
result of a crash, and then automatically checking them using
verification programs, such as fsck and programs that check for
"correct" file system topology (e.g., the existence of the path
"/a/b/" after creating and properly syncing it). The sequence of
events leading up to an incorrect disk is recorded through some
minor, not terribly intrusive instrumentation. Some minor
modifications to Linux are needed to deterministically replay a
sequence of events; mainly, the execution order of threads must be
maintained, which they approximate using thread priorities. They also
modify Linux to make certain error cases (such as memory allocation
failure) more common.
eXplode works for more than just file systems: it also works for
databases on top of file systems, file systems on top of RAID,
software configuration systems, or any combination of the above. This
is due to the stackable, modular nature of the routines for creating,
mutating, and checking disks. Each layer in the storage stack fills
out the following routines:
- init: one-time initialization, such as formatting a file system partition or creating a fresh database.
- mount: set up the storage system so that operations can be performed on it.
- unmount: tear down the storage system; used by eXplode to clear
the storage system's state so it can explore a different one.
- recover: repair the storage system after an eXplode-simulated crash.
- threads: return the thread IDs for the storage system's kernel threads (to help control non-determinism).
The client code must also provide routines that mutate the storage
system (such as by creating a file) and that check the file system for
correctness, above and beyond the recover routine. When running,
eXplode (1) calls all the init() routines for each element in the
stack in order, (2) calls all the mount() routines, (3) runs the mutate
routine, forking children at "choice points" - places where execution
could go in one direction or another, (4) at appropriate points,
generates all possible crash disks (due to incomplete and/or reordered
writes), runs the recover routines, and then runs the checker routine,
and (5) repeats steps 3 and 4 until the user gets bored.
A lot of hard work is needed to make this execute quickly and explore
"interesting" parts of the state space, but the results are quite
good, and a big improvement over their earlier system, FiSC.
Sections 7 through 9 of the eXplode paper describe many of the
interesting (and sometimes amusing) bugs eXplode found in Linux and
various software running on Linux, such as Berkeley DB and Subversion.
One of the least pleasant is a bug in the way ext2 handles fsync().
From the paper:
The ext2 bug is a case where an implementation error points out a
deeper design problem. The bug occurs when we: (1) shrink a file "A"
with truncate and (2) subsequently creat, write, and fsync a second
file "B." If file B reuses the indirect blocks of A freed via
truncate, then following a crash e2fsck notices that A's indirect
blocks are corrupt and clears them, destroying the contents of B. (For
good measure it then notices that A and B share blocks and "repairs" B
by duplicating blocks from A). Because ext2 makes no guarantees about
what is written to disk, fundamentally one cannot use fsync to safely
force a file to disk, since the file can still have implicit
dependencies on other file system state (in our case if it reuses an
indirect blocks for a file whose inode has been cleared in memory but
not on disk).
While it is well known that ext2 makes very few guarantees on the
state of the file system, it is surprising that even an fsync() does
not make any guarantees about the state of the file system on disk.
eXplode also found an error in JFS, which does make fairly strong
guarantees, in which an fsync()'d file could lose all its data when a
directory inode is reused as a file inode.
One of the primary goals of eXplode is ease of use and extension to
new systems with only minor effort. The eXplode system runs on a
live, running Linux kernel instance with only minor modifications.
These modifications could be trivially rewritten to be configurable as
a compile-time option (CONFIG_EXPLODE?), making them a reasonable
candidate for integration in the mainline kernel. The checking
interface allows programmers to check new systems (pretty much
anything that runs on Linux and stores data on disks) by writing only
a few lines of code. While the current interface uses C++, it seems
relatively easy to add other front ends using C or shell scripts. The
authors are considering open sourcing the code and are very interested
in hearing more from kernel developers about how to make eXplode more
attractive for everyday use.
Our final paper is Bugs as
Deviant Behavior: A General Approach to Inferring Errors in Systems
Code, by Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and
Benjamin Chelf, which appeared in SOSP
2001. The basic idea is to create a framework for static code
analysis which allows programmers to write extremely simple
descriptions of rules that code should follow. Most readers will be
familiar with the simpler applications of this work from the many bug
reports produced by the Stanford checker and reviewed on the
linux-kernel mailing list. This paper goes above and beyond that
level of code analysis, describing a statistical approach to
inferring relationships between functions and variables, looking for
deviations from the norm, and then ranking and ordering the results so
that the deviations most likely to yield bugs are near the top of the
list. For example, the system can infer relationships such as "only
modify variable X in between calls to spin_lock(Y) and spin_unlock(Y)"
- without writing a rule that explicitly lays out this relationship.
It could almost be described as meta-meta-compilation - the system not
only checks the rules automatically, it infers the rules as well.
A more recent paper, From
Uncertainty to Belief: Inferring the Specification Within, by Ted
Kremenek, Paul Twohey, Godmar Back, Andrew Y. Ng, and Dawson Engler,
which appeared in OSDI
2006, pushes these ideas even further with a technique that is
capable of inferring more complex rules using a combination of
statistical inference, compiler analysis, and machine learning. For a
system such as Linux where lines of code far outweigh lines of
documentation, this approach has great merit. I find myself doing a
human version of this statistical analysis every time I attempt to use
an undocumented network driver framework function.
A fun footnote is the slides from a talk entitled Weird
things that surprise academics trying to commercialize a static
checking tool. Check out the slides entitled "Myth: the C (or
C++) language exists" or "No, your tool is broken: that's not a bug."
What does this mean for Linux?
A lot of great, practical ideas for automatically finding errors are
coming out of research these days. The existing implementations may
not be practical or available for Linux (for example, the
metacompilation work has been commercialized and will remain closed
source for the indefinite future), but this work can often inspire
useful (though usually not as complete) open source implementations.
On the static code analysis side, both sparse and smatch implement some useful
checks. sparse is already integrated into the kernel build system; smatch,
unfortunately, appears to have stalled.
Annotations like __must_check are producing voluminous (and sometimes
mystifying) compiler warnings. A lot of checks are integrated
directly into gcc, but this requires a programmer with knowledge of
gcc and a fairly long release cycle turnaround time before the check
becomes available. The general-purpose nature of these checks also
means that they sometimes generate many false positives, especially on
systems software, and have to be explicitly turned off again. A
framework that allows gcc to be extended with metacompilation style
checks without requiring recompilation of gcc might be more helpful.
When it comes to dynamic code analysis, Linux has quite a few special
purpose error checkers which can be configured in or out of the
kernel, or turned on and off at boot time. One of the most exciting
is lockdep, the lock
correctness validator written by Ingo Molnar and Arjan van de Ven. It
observes lock acquisition during runtime, and looks for invalid or
inconsistent use of locks (such as reacquiring locks or acquiring
locks in a different order). Even nicer would be a generic framework
for implementing dynamic code checkers.
stress tester that does a bunch of known-stressful operations to a
file system and checks the results. fsfuzz
is one of many useful tools for randomly altering file systems in order
to expose bugs in the code that handles corrupted data.
There are many other useful automatic testing/chimpanzee-replacement
tools; I encourage you to describe your favorites in the comments.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet