
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.19-rc6, released on November 15. It contains a fair number of fixes, but one hopes that most of the problems have been taken care of by now (though the latest version of the 2.6.19 known regressions list (November 21) still contains nine entries). See the long-format changelog for the details.

Almost 90 fixes have gone into the mainline git repository since -rc6 as of this writing; there has been no word on whether that will be enough to force the release of a 2.6.19-rc7 before this cycle comes to an end.

There have been no -mm releases over the last week.

The current stable 2.6 kernel is 2.6.18.3, released on November 18. It contains a fair number of fixes, at least one of which is security-related.

Adrian Bunk has released 2.6.16.32 with quite a few fixes. 2.6.16.33-rc1 is also available.

On the 2.4 front, Willy Tarreau has released 2.4.33.4 with a couple of security patches and a number of other fixes. 2.4.34-pre6 is also out, adding a relatively small number of patches.

Comments (none posted)

Kernel development news

Quotes of the week

Hm, I've never heard the driver model be called a "complete design paradigm" in the past. I've heard it called a lot of real nasty things though.

-- Greg Kroah-Hartman

So don't fall for the classic "second system syndrome". The classic reason for getting the second system wrong is because you focus on the issues people complain about, and not on the issues that work well (because the issues that work fine are obviously not getting a lot of attention).

-- Linus Torvalds

Comments (3 posted)

A summary of 2.6.19 API changes

The 2.6.19 kernel cycle has brought the usual pile of changes visible to kernel developers. Here is a quick summary of the most significant API modifications in 2.6.19.

  • The prototype for interrupt handler functions has changed. In short, the regs argument has been removed, since almost nobody used it. Any interrupt handler which needs the pre-interrupt register state can use get_irq_regs() to obtain it.

  • The latency tracking infrastructure patch has been merged.

  • The readv() and writev() methods have been removed from the file_operations structure; vectored I/O is now handled by the aio_read() and aio_write() methods, whose prototypes have been changed to take an iovec. See this article for more information.

  • The no_pfn() address space operation has been added.

  • SRCU - a version of read-copy-update which allows read-side blocking - has been merged. See this article by Paul McKenney for lots of details.

  • The CHECKSUM_HW value has long been used in the networking subsystem to support hardware checksumming. That value has been replaced with CHECKSUM_PARTIAL (intended for outgoing packets where the job must be completed by the hardware) and CHECKSUM_COMPLETE (for incoming packets which have been completely checksummed by the hardware).

  • A number of memory management changes have been merged, including tracking of dirty pages in shared memory mappings, making the DMA32 and HIGHMEM zones optional, and an architecture-independent mechanism for tracking memory ranges (and the holes between them).

  • The pud_page() and pgd_page() macros now return a struct page pointer, rather than a kernel virtual address. Code needing the latter should use pud_page_vaddr() or pgd_page_vaddr() instead.

  • A number of driver core changes have been merged, including experimental parallel device probing and some improvements to the suspend/resume process.

  • There is now a notifier chain for out-of-memory situations; the idea here is to set up functions which might be able to free some memory when things get very tight.

  • The semantics of the kmap() API have been changed a bit: on architectures with complicated memory coherence issues, kmap() and kunmap() are expected to manage coherency for the mapped pages, thus eliminating the need to explicitly flush pages from cache.

  • PCI Express Advanced Error Reporting is now supported in the PCI layer.

  • A number of changes have been made to the inode structure in an effort to make it smaller.

  • Much improved suspend and resume support for the USB layer.

  • A new set of functions has been added to allow USB drivers to quickly check the direction and transfer mode of an endpoint.

  • A somewhat reduced version of Wireless Extensions version 21. Most of the original functionality has been removed with the idea that the wireless extensions will soon be superseded by something else.

  • Vast numbers of annotations enabling the sparse utility to detect big/little endian errors.

  • The flags field of struct request has been split into two new fields: cmd_type and cmd_flags. The former contains a value describing the type of request (filesystem request, sense, power management, etc.) while the latter has the flags which modify the way the command works (read/write, barriers, etc.).

  • The block layer can be disabled entirely at kernel configuration time; this option can be useful in some embedded situations.

  • The kernel now has a generic boolean type, called bool; it replaces a number of homebrewed boolean types found in various parts of the kernel.

  • There is a new function for allocating a copy of a block of memory:

        void *kmemdup(const void *src, size_t len, gfp_t gfp);
    
    A number of allocate-then-copy code sequences have been updated to use kmemdup() instead.
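    To see what this replaces, here is a minimal userspace model of kmemdup(); plain malloc() stands in for kmalloc() and the gfp_t flags argument, so this is a sketch of the behavior rather than the kernel implementation:

        #include <stdlib.h>
        #include <string.h>

        /* Userspace model of kmemdup(): allocate a buffer and copy into
         * it in one step.  The real kernel version passes its gfp flags
         * through to kmalloc(); malloc() takes no such argument. */
        static void *kmemdup_model(const void *src, size_t len)
        {
            void *p = malloc(len);

            if (p)
                memcpy(p, src, len);
            return p;
        }

    In kernel code, a kmalloc(len, GFP_KERNEL) followed by a memcpy() of len bytes collapses to a single kmemdup(src, len, GFP_KERNEL) call.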

As always, API changes are tracked on the LWN 2.6 API changes page.
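To make the most visible of those changes concrete, here is a sketch of the new interrupt handler prototype. The kernel types are stubbed out so the example stands alone; real drivers get them from <linux/interrupt.h>:

    /* Stub of the kernel's return type for interrupt handlers. */
    typedef int irqreturn_t;
    #define IRQ_NONE    0   /* interrupt was not from this device */
    #define IRQ_HANDLED 1   /* interrupt was serviced */

    /* The 2.6.19 prototype: the old third argument (struct pt_regs
     * *regs) is gone.  A handler that needs the pre-interrupt register
     * state calls get_irq_regs() instead. */
    static irqreturn_t sample_interrupt(int irq, void *dev_id)
    {
        if (!dev_id)            /* shared line, not our device */
            return IRQ_NONE;
        /* ... acknowledge and service the hardware here ... */
        return IRQ_HANDLED;
    }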

Comments (2 posted)

Kernel key management

November 21, 2006

This article was contributed by Jake Edge.

Filesystems, especially remote filesystems, may require some authentication or a key to enable access; the kernel key management interface provides hooks to store and manage this kind of information. The hooks come in two flavors: one used by the kernel to find keys for subsystems that require them, and one used by userspace programs to manage keys. The intent is to provide a fast mechanism for the kernel to access the keys that it needs and to push the add, modify and delete operations into userspace.

'Key' is the term used, but what is stored need not be a key in the traditional, cryptographic sense. Any kind of authentication or access information can be stored as a key; it is essentially an opaque chunk of data that is interpreted only by the kernel subsystem interested in it. While filesystems are the main target of the API, any kernel subsystem that requires this kind of information could use it.

At the core, keys are stored in the aptly named struct key which has the following kinds of fields:

  • a unique serial number
  • a key type that can identify the filesystem that the key belongs to
  • a description string that is used for searching for the key
  • a payload that contains the actual key data
  • user and group information including permissions
  • an expiration time
  • a key state that tracks instantiation, revocation, deletion, etc.
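A condensed sketch of those fields, with hypothetical names and types; the real structure lives in <linux/key.h> and has more members and a different layout:

    #include <time.h>

    struct key_type;    /* forward declaration; see below */

    struct key_sketch {
        int serial;                   /* unique serial number */
        const struct key_type *type;  /* e.g. a filesystem's key type */
        char *description;            /* matched during key searches */
        void *payload;                /* opaque data for the owner */
        unsigned int uid, gid, perm;  /* ownership and permissions */
        time_t expiry;                /* expiration time, 0 = never */
        unsigned long flags;          /* instantiated, revoked, ... */
    };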

The key types provide a way for a filesystem to configure its own set of key operations. The operations that a key type can specify are:

  • instantiate - create a key of that type
  • update - modify a key
  • match - match a key to a description, which is used in the key search
  • revoke - clear some key data and change the state to KEY_FLAG_REVOKED
  • destroy - clear all key data
  • describe - summarize the key's description and payload as text
  • read - read the key data
  • request_key - called when the key is not available in order to retrieve the key from elsewhere
Two standard key types are defined: key_type_user and key_type_keyring. New key types can be registered by filesystems using:

    int register_key_type(struct key_type *type);
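To make the registration model concrete, here is a userspace toy mirroring the shape of struct key_type and register_key_type(). The names follow the article; the fixed-size table and return values are purely illustrative:

    #include <string.h>

    struct key_type {
        const char *name;  /* e.g. "user", "keyring", or a filesystem's own */
        /* the real structure carries the instantiate, update, match,
         * revoke, destroy, describe, read and request_key operations */
    };

    #define MAX_KEY_TYPES 8
    static struct key_type *key_types[MAX_KEY_TYPES];

    /* Returns 0 on success, -1 if the table is full or the name is
     * already taken (the kernel reports the duplicate case as -EEXIST). */
    static int register_key_type_model(struct key_type *type)
    {
        int i;

        for (i = 0; i < MAX_KEY_TYPES; i++) {
            if (key_types[i] && strcmp(key_types[i]->name, type->name) == 0)
                return -1;
            if (!key_types[i]) {
                key_types[i] = type;
                return 0;
            }
        }
        return -1;
    }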

When the kernel needs to find a key, it calls:

    struct key *request_key(const struct key_type *type,
                            const char *description,
                            const char *callout_string);

It passes the type and description, and the match function from the struct key_type is used to try to find a matching key. If no key is found and callout_string is not NULL, the kernel will invoke /sbin/request-key, which attempts to obtain the necessary key from userspace.

The payload field of a key can be accessed once the key has been found, but if it is more complex than a simple integer, some arrangement must be made to prevent simultaneous reads and writes. Support for semaphore locking or Read-Copy-Update (RCU) are present in the key structure and must be used unless the key type has no modification methods. Once the filesystem is done with the key, it should be released with:

    void key_put(struct key *key);
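The resulting find/use/release pattern can be modeled in a few lines of userspace C. A single hard-wired key stands in for the keyring search, and the description string is hypothetical; the real functions live in the kernel's security/keys code:

    #include <string.h>

    struct key_model { int refcount; const char *description; };

    static struct key_model the_key = { 0, "afs:example.com" };

    static struct key_model *request_key_model(const char *description)
    {
        /* the real code searches the thread, process and session
         * keyrings, and may call out to /sbin/request-key on a miss */
        if (strcmp(description, the_key.description) == 0) {
            the_key.refcount++;
            return &the_key;
        }
        return 0;
    }

    static void key_put_model(struct key_model *key)
    {
        if (key)
            key->refcount--;  /* the kernel frees the key once unused */
    }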

Keyrings are, as the name implies, collections of related keys, and there are various calls to manipulate them. Each process is associated with three keyrings: a thread-specific keyring, a process-specific keyring and a session-specific keyring. These are the keyrings searched when a request_key() is issued. Each user on the system is associated with a user-specific keyring and a default user session keyring; the latter is used to initialize the session-specific keyring when a process changes its real user ID.

Permissions for keys are stored in a bit field, much like Linux file permissions, but are more extensive. Each key has a user and group id and a permissions mask for each of four potential accessors: possessor, user, group, and other. The mask consists of six bits:

  • view - allows a key or keyring's attributes to be viewed
  • read - allows a key's payload or a keyring's list of keys to be viewed
  • write - allows creating or modifying a key's payload or keyring's list of keys
  • search - allows keys to be found and keyrings to be searched
  • link - allows the key or keyring to be linked into another keyring
  • set attribute - allows the key's user id, group id, and permissions mask to be changed
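In code, building a mask means OR-ing the per-category bits together. The values below follow the layout exposed to userspace by the keyutils headers (six bits per category, with the possessor bits in the top byte), but treat the exact numbers as illustrative:

    /* The six permission bits for the "other" category occupy the low
     * byte; group, user and possessor repeat them one byte up each. */
    #define KEY_OTH_VIEW    0x01
    #define KEY_OTH_READ    0x02
    #define KEY_OTH_WRITE   0x04
    #define KEY_OTH_SEARCH  0x08
    #define KEY_OTH_LINK    0x10
    #define KEY_OTH_SETATTR 0x20
    #define KEY_OTH_ALL     0x3f

    #define KEY_GRP_ALL  (KEY_OTH_ALL << 8)
    #define KEY_USR_VIEW (KEY_OTH_VIEW << 16)
    #define KEY_USR_READ (KEY_OTH_READ << 16)
    #define KEY_POS_ALL  (KEY_OTH_ALL << 24)

    /* Possessor may do anything; the owning user may view and read;
     * group and other get nothing: */
    static const unsigned int example_perm =
            KEY_POS_ALL | KEY_USR_VIEW | KEY_USR_READ;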

The userspace API consists of three main system calls:

    key_serial_t add_key(const char *type, const char *desc,
                         const void *payload, size_t plen,
                         key_serial_t keyring);

    key_serial_t request_key(const char *type, const char *description,
                             const char *callout_info,
                             key_serial_t dest_keyring);

    long keyctl(int cmd, ...);

add_key() adds a key to the specified keyring. request_key(), much like its kernel-side counterpart, searches for the key based on the type and description, possibly calling out to userspace if callout_info is non-NULL; if the key is found, it can also be attached to the specified destination keyring. keyctl() is an ioctl()-like interface that provides for the management of keys; <linux/keyctl.h> contains 17 separate commands for updating, changing permissions, searching, linking, reading and the like.

The /bin/keyctl command-line utility, part of the keyutils package, provides an easy interface to these system calls for working with keys from the command line. Also, the /proc/keys and /proc/key-users entries in procfs enable a user to view the keys and key users currently managed by the kernel.

The only filesystem in the current 2.6 tree that uses the key management API is eCryptfs, a stacked filesystem that encrypts its data using a password and optional salt. It uses the user key type rather than creating its own type and does not directly support userspace callbacks. Instead it uses the mount.ecryptfs command to prompt the user for the password and stores that as the key.

According to slides from Dave Howells' talk at the 2006 Ottawa Linux Symposium (available here), several other filesystems (including CIFS, NFSv4 and AFS) are planning to use the API in the future. For more information, extensive documentation can be found in the kernel tree in Documentation/keys.txt and Documentation/keys-request-key.txt.

Overall, this looks to be a useful interface for kernel subsystems that require keys and, in keeping with kernel tradition, most of the policy and management pieces are pushed out to userspace. It provides all of the capabilities that one would expect and hopefully more kernel subsystems will be using it in the future.

Comments (11 posted)

KHB: Automating bug hunting

November 21, 2006

This article was contributed by Valerie Aurora

If you're a programmer, you've reviewed a lot of code - at minimum, your own code (or at least we hope so). It doesn't take a lot of code reviewing before you start recognizing familiar bugs - failure to drop a lock on the error exit path, dereferencing a pointer just after it's been proven to be null, forgetting to mark a buffer dirty. Before long, the sense of deja vu is overpowering. You might even begin to entertain the sneaking suspicion that half of code review work could be done by a trained chimpanzee, a 10-line script, or someone from marketing.

(Some of) The Solutions

As it turns out, that suspicion is correct. A lot of software errors can be found automatically, in fact, surprisingly automatically. The automatic checking we'll discuss falls into two main categories: static and dynamic. Static checking runs on the source code and doesn't require integration with a running system. It is often better at exploring all execution paths, but often explores impossible execution paths (resulting in false positives), and usually can't deal with things like function pointers. Dynamic checking runs on a live system, which produces more accurate results but requires more invasive techniques and usually can't explore as many execution paths (though fusion with model-checking techniques can work around this; see eXplode later in this article). The good news is that automatic error checking techniques are compatible; we can use them all and get the best of all worlds.

In this article, we'll review several papers describing some of the most practical and promising approaches, all from Dawson Engler's research group at Stanford. Many LWN readers will already be familiar with metacompilation (known as "the Stanford checker" in kernel circles) at a high level, but the approach rewards deeper study. Another approach, named EXE, uses symbolic execution, in which an instrumented program self-generates complex error-triggering inputs. We'll also look at eXplode, a light-weight system inspired by model-checking which quickly and efficiently checks file systems and other software for correct crash recovery. All of these approaches are compatible with the Linux kernel (requiring more or less code modification but generally less) and have found many real-world bugs resulting in system panic, data corruption, or security holes.

Finally, we'll quickly review a variety of existing open source tools for automatically error-checking programs. With any luck, in a few years' time we'll have scripts doing the trained chimpanzee code review work instead of Linux kernel maintainers.

The Papers

We'll start with one of the most intellectually intriguing approaches, using code instrumentation and symbolic execution to automatically generate complex test inputs that trigger serious bugs. The paper is Automatically generating malicious disks using symbolic execution, by Junfeng Yang, Can Sar, Paul Twohey, Cristian Cadar, and Dawson Engler, which appeared in IEEE Security and Privacy 2006. (A longer, more detailed paper on the topic is EXE: Automatically Generating Inputs of Death, by Cristian Cadar, Vijay Ganesh, Peter Pawlowski, David Dill, and Dawson Engler, which appeared in ACM Computer and Communications Security 2006.) The basic idea is that you begin executing the program with a "symbolic" input. As the program runs, the EXE system uses compiled-in instrumentation to keep track of the tests done on the input data. These tests create constraints on what the input data can be. Once the system has a set of constraints, it tries to solve them and come up with a set of allowed inputs. It then checks the allowed inputs to figure out if they will cause one of a known set of errors, such as dividing by zero, allowing access to arbitrary memory locations, triggering an assertion, etc.

In this paper, the authors apply the system to the Linux file system mount code for ext2, ext3, and JFS. In this case, the system starts out with a symbolic representation of all possible disk images ("inputs"), and gradually whittles away allowed disk images at each point in the mount code, based on actions such as:

    if (sbi->s_frag_size == 0)
	goto cantfind_ext2;

It then checks all disk images allowed at any particular point to see if any of them causes one of the bugs the system can detect. For example, the statement:

    sbi->s_frags_per_block = sb->s_blocksize / sbi->s_frag_size;

would be flagged as triggering a divide-by-zero error without the prior check pruning out all inputs with sbi->s_frag_size equal to zero.

The advantage of this approach over simply generating random inputs is that random testing can't go very deep into the code paths, because a random input will nearly always fail during the first few input checks. For example, random input testing of the file system mount code would almost always fail out at the check of the superblock magic number. Another pleasant quality of this approach is that it generates test inputs that trigger the bugs it detects. Many other automatic error checkers are plagued by false positives; this system hands you the exact input that triggers the supposed bug. It can accurately be described as an error test case generator in addition to an error checker. The prospect is enough to make a systems programmer salivate.

The next paper is eXplode: a Lightweight, General System for Finding Serious Storage System Errors, by Junfeng Yang, Can Sar, and Dawson Engler, which appeared in OSDI 2006. eXplode tests file systems (and more complex storage software stacks) by generating all possible disks that could be the result of a crash, and then automatically checking them using verification programs, such as fsck and programs that check for "correct" file system topology (e.g., the existence of the path "/a/b/" after creating and properly syncing it). The sequence of events leading up to an incorrect disk is recorded through some minor, not terribly intrusive instrumentation. Some minor modifications to Linux are needed to deterministically replay a sequence of events; mainly, the execution order of threads must be maintained, which they approximate using thread priorities. They also modify Linux to make certain error cases (such as memory allocation failure) more common.

eXplode works for more than just file systems: it also works for databases on top of file systems, file systems on top of RAID, software configuration systems, or any combination of the above. This is due to the stackable, modular nature of the routines for creating, mutating, and checking disks. Each layer in the storage stack fills out the following routines:

  • init: one-time initialization, such as formatting a file system partition or creating a fresh database.
  • mount: set up the storage system so that operations can be performed on it.
  • unmount: tear down the storage system; used by eXplode to clear the storage system's state so it can explore a different one.
  • recover: repair the storage system after an eXplode-simulated crash.
  • threads: return the thread IDs for the storage system's kernel threads (to help control non-determinism).

The client code must also provide routines that mutate the storage system (such as by creating a file) and that check the file system for correctness, above and beyond the recover routine. When running, eXplode (1) calls all the init() routines for each element in the stack, in order, (2) calls all the mount() routines, (3) runs the mutate routine, forking children at "choice points" - places where execution could go in one direction or another, (4) at appropriate points, generates all possible crash disks (due to incomplete and/or reordered writes), runs the recover routines, and then runs the checker routine, and (5) repeats steps 3 and 4 until the user gets bored.
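The per-layer interface described above maps naturally onto a table of function pointers. The real eXplode interface is C++; this C-style sketch, with a dummy layer whose routines just count their calls, is purely illustrative:

    /* One layer in an eXplode-style storage stack.  The real interface
     * also exposes the storage system's kernel thread IDs. */
    struct ex_component {
        const char *name;
        int (*init)(void);     /* one-time setup: mkfs, createdb, ... */
        int (*mount)(void);    /* make the layer usable */
        int (*unmount)(void);  /* clear state between explorations */
        int (*recover)(void);  /* run after a simulated crash: fsck, ... */
    };

    static int calls;
    static int trace(void) { calls++; return 0; }

    static struct ex_component ext3_layer =
            { "ext3", trace, trace, trace, trace };

    /* Bring the stack up bottom-to-top: init() then mount() each layer. */
    static int bring_up_stack(struct ex_component *stack[], int n)
    {
        int i;

        for (i = 0; i < n; i++)
            if (stack[i]->init() != 0 || stack[i]->mount() != 0)
                return -1;
        return 0;
    }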

A lot of hard work is needed to make this execute quickly and explore "interesting" parts of the state space, but the results are quite good, and a big improvement over their earlier system, FiSC. Sections 7 through 9 of the eXplode paper describe many of the interesting (and sometimes amusing) bugs eXplode found in Linux and various software running on Linux, such as Berkeley DB and Subversion. One of the least pleasant is a bug in the way ext2 handles fsync(). From the paper:

The ext2 bug is a case where an implementation error points out a deeper design problem. The bug occurs when we: (1) shrink a file "A" with truncate and (2) subsequently creat, write, and fsync a second file "B." If file B reuses the indirect blocks of A freed via truncate, then following a crash e2fsck notices that A's indirect blocks are corrupt and clears them, destroying the contents of B. (For good measure it then notices that A and B share blocks and "repairs" B by duplicating blocks from A). Because ext2 makes no guarantees about what is written to disk, fundamentally one cannot use fsync to safely force a file to disk, since the file can still have implicit dependencies on other file system state (in our case if it reuses an indirect blocks for a file whose inode has been cleared in memory but not on disk).

While it is well known that ext2 makes very few guarantees on the state of the file system, it is surprising that even an fsync() call does not make any guarantees about the state of the file system on disk. eXplode also found an error in JFS, which does make fairly strong guarantees, in which an fsync()'d file could lose all its data when a directory inode is reused as a file inode.

One of the primary goals of eXplode is ease of use and extension to new systems with only minor effort. The eXplode system runs on a live, running Linux kernel instance with only minor modifications. These modifications could be trivially rewritten to be configurable as a compile-time option (CONFIG_EXPLODE?), making them a reasonable candidate for integration in the mainline kernel. The checking interface allows programmers to check new systems (pretty much anything that runs on Linux and stores data on disks) by writing only a few lines of code. While the current interface uses C++, it seems relatively easy to add other front ends using C or shell scripts. The authors are considering open sourcing the code and are very interested in hearing more from kernel developers about how to make eXplode more attractive for everyday use.

Our final paper is Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code, by Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf, which appeared in SOSP 2001. The basic idea is to create a framework for static code analysis which allows programmers to write extremely simple descriptions of rules that code should follow. Most readers will be familiar with the simpler applications of this work from the many bug reports produced by the Stanford checker and reviewed on the linux-kernel mailing list. This paper goes above and beyond that level of code analysis, describing a statistical approach to inferring relationships between functions and variables, looking for deviations from the norm, and then ranking and ordering the results so that the deviations most likely to yield bugs are near the top of the list. For example, the system can infer relationships such as "only modify variable X in between calls to spin_lock(Y) and spin_unlock(Y)" - without writing a rule that explicitly lays out this relationship. It could almost be described as meta-meta-compilation - the system not only checks the rules automatically, it infers the rules automatically.

A more recent paper, From Uncertainty to Belief: Inferring the Specification Within, by Ted Kremenek, Paul Twohey, Godmar Back, Andrew Y. Ng, and Dawson Engler, which appeared in OSDI 2006, pushes these ideas even further with a technique that is capable of inferring more complex rules using a combination of statistical inference, compiler analysis, and machine learning. For a system such as Linux where lines of code far outweigh lines of documentation, this approach has great merit. I find myself doing a human version of this statistical analysis every time I attempt to use an undocumented network driver framework function.

A fun footnote is the slides from a talk entitled Weird things that surprise academics trying to commercialize a static checking tool. Check out the slides entitled "Myth: the C (or C++) language exists" or "No, your tool is broken: that's not a bug."

What does this mean for Linux?

A lot of great, practical ideas for automatically finding errors are coming out of research these days. The existing implementations may not be practical or available for Linux (for example, the metacompilation work has been commercialized and will remain closed source for the indefinite future), but this work can often inspire useful (though usually not as complete) open source implementations.

On the static code analysis side, both sparse and smatch implement some useful checks. sparse is already integrated into the kernel build system; smatch, unfortunately, appears to have stalled. Annotations like __must_check are producing voluminous (and sometimes mystifying) compiler warnings. A lot of checks are integrated directly into gcc, but this requires a programmer with knowledge of gcc and a fairly long release cycle turnaround time before the check becomes available. The general-purpose nature of these checks also means that they sometimes generate many false positives, especially on systems software, and have to be explicitly turned off again. A framework that allows gcc to be extended with metacompilation style checks without requiring recompilation of gcc might be more helpful.

When it comes to dynamic code analysis, Linux has quite a few special purpose error checkers which can be configured in or out of the kernel, or turned on and off at boot time. One of the most exciting is lockdep, the lock correctness validator written by Ingo Molnar and Arjan van de Ven. It observes lock acquisition during runtime, and looks for invalid or inconsistent use of locks (such as reacquiring locks or acquiring locks in a different order). Even nicer would be a generic framework for implementing dynamic code checkers, perhaps using part of the SystemTap framework.

File system testers are coming back into vogue. fsx is a file system stress tester that does a bunch of known-stressful operations to a file system and checks the results. fsfuzz is one of many useful tools for randomly altering file systems to expose bugs.

There are many other useful automatic testing/chimpanzee-replacement tools; I encourage you to describe your favorites in the comments. Happy debugging!

Comments (19 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.19-rc6
Ingo Molnar 2.6.19-rc6-rt5
Chris Wright Linux 2.6.18.3
Adrian Bunk Linux 2.6.16.33-rc1
Adrian Bunk Linux 2.6.16.32
Willy Tarreau Linux 2.4.34-pre6
Willy Tarreau Linux 2.4.33.4

Architecture-specific

Core kernel code

Development tools

Petr Baudis Cogito-0.18.2

Device drivers

Documentation

linux@horizon.com Branching and merging with git
Oskar Andreasson iptables-tutorial 1.2.2 book

Filesystems and block I/O

Networking

Virtualization and containers

Rusty Russell lhype progress...

Page editor: Jonathan Corbet


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds