Kernel development

Brief items

Kernel release status

The current development kernel is 3.8-rc5, released on January 25. The only announcement appears to be this Google+ posting. Just over 250 fixes were merged since -rc4 came out; see the short-form changelog for details.

Stable updates: 3.7.5, 3.4.28 and 3.0.61 were released on January 27.

Quotes of the week

People really ought to be forced to read their code aloud over the phone - that would rapidly improve the choice of identifiers
Al Viro

Besides, wouldn't it be cool to see a group of rovers chasing each other across Mars, jockeying for the best positioning to reduce speed-of-light delays?
Paul McKenney

The real problem is, Moore's Law just does not work for spinning disks. Nobody really wants their disk spinning faster than [7200] rpm, or they don't want to pay for it. But density goes up as the square of feature size. So media transfer rate goes up linearly while disk size goes up quadratically. Today, it takes a couple of hours to read each terabyte of disk. Fsck is normally faster than that, because it only reads a portion of the disk, but over time, it breaks in the same way. The bottom line is, full fsck just isn't a viable thing to do on your system as a standard, periodic procedure. There is really not a lot of choice but to move on to incremental and online fsck.
Daniel Phillips

Kernel development news

Asynchronous block loop I/O

By Jonathan Corbet
January 30, 2013
The kernel's block loop driver has a conceptually simple job: take a file located in a filesystem somewhere and present it as a block device that can contain a filesystem of its own. It can be used to manipulate filesystem images; it is also useful for the management of filesystems for virtualized guests. Despite having had some optimization effort applied to it, the loop driver in current kernels is not as fast as some would like it to be. But that situation may be about to change, thanks to an old patch set that has been revived and prepared for merging in a near-future development cycle.

As a block driver, the loop driver accepts I/O requests described by struct bio (or "BIO") structures; it then maps each request to a suitable block offset in the file serving as backing store and issues I/O requests to perform the desired operations on that file. Each loop device has its own thread, which, at its core, runs a loop like this:

    while (1) {
        wait_for_work();
        bio = dequeue_a_request();
        execute_request(bio);
    }

(The actual code can be seen in drivers/block/loop.c.) This code certainly works, but it has an important shortcoming: it performs I/O in a synchronous, single-threaded manner. Block I/O is normally done asynchronously when possible; write operations, in particular, can be done in parallel with other work. In the loop above, though, a single, slow read operation can hold up many other requests, and there is no ability for the block layer or the I/O device itself to optimize the ordering of requests. As a result, the performance of loop I/O traffic is not what it could be.

In 2009, Zach Brown set out to fix this problem by changing the loop driver to execute multiple, asynchronous requests at the same time. That work fell by the wayside when other priorities took over Zach's time, so his patches were never merged. More recently, Dave Kleikamp has taken over this patch set, ported it to current kernels, and added support to more filesystems. As a result, this patch set may be getting close to being ready to go into the mainline.

At the highest level, the goal of this patch set is to use the kernel's existing asynchronous I/O (AIO) mechanism in the loop driver. Getting there takes a surprising amount of work, though; the AIO subsystem was written to manage user-space requests and is not an easy fit for kernel-generated operations. To make these subsystems work together, the 30-part patch set takes a bottom-up approach to the problem.

The AIO code is based around a couple of structures, one of which is struct iovec:

    struct iovec {
	void __user *iov_base;
	__kernel_size_t iov_len;
    };

This structure is used by user-space programs to describe a segment of an I/O operation; it is part of the user-space API and cannot be changed. Associated with this structure is the internal iov_iter structure:

    struct iov_iter {
	const struct iovec *iov;
	unsigned long nr_segs;
	size_t iov_offset;
	size_t count;
    };

This structure (defined in <linux/fs.h>) is used by the kernel to track progress working through an array of iovec structures.

Any kernel code wanting to submit asynchronous I/O must express it in terms of these structures. The problem, from the perspective of the loop driver, is that struct iovec deals with user-space addresses, while the BIO structures representing block I/O operations deal with physical addresses in the form of struct page pointers. That impedance mismatch between the two subsystems makes AIO, as it stands, unusable by the loop driver.

Fixing that involves changing the way struct iov_iter works. The iov pointer becomes a generic pointer called data that can point to an array of iovec structures (as before) or, instead, an array of kernel-supplied BIO structures. Direct access to structure members by kernel code is discouraged in favor of a set of defined accessor operations; the iov_iter structure itself gains a pointer to an operations structure that can be changed depending on whether iovec or bio structures are in use. The end result is an enhanced iov_iter structure and surrounding support code that allows AIO operations to be expressed in either user-space (struct iovec) or kernel-space (struct bio) terms. Quite a bit of code using this structure must be adapted to use the new accessor functions; at the higher levels, code that worked directly with iovec structures is changed to work with the iov_iter interface instead.
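
Conceptually, the reworked structure ends up looking something like the sketch below; the operation names here are placeholders chosen for illustration and need not match the names used in the patches themselves:

    /* Illustrative sketch only: these names need not match the actual patches. */
    struct iov_iter_ops {
        size_t (*copy_to_user)(struct page *page, struct iov_iter *i,
                               unsigned long offset, size_t bytes);
        size_t (*copy_from_user)(struct page *page, struct iov_iter *i,
                                 unsigned long offset, size_t bytes);
        void (*advance)(struct iov_iter *i, size_t bytes);
        /* ... */
    };

    struct iov_iter {
        const struct iov_iter_ops *ops; /* iovec-backed or BIO-backed behavior */
        void *data;                     /* array of iovecs or kernel-supplied BIOs */
        unsigned long nr_segs;
        size_t iov_offset;
        size_t count;
    };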

The next step is to make it possible to pass iov_iter structures directly into filesystem code. That is done by adding two more functions to the (already large) file_operations structure:

    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, loff_t);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, loff_t);

These functions are meant to work much like the existing aio_read() and aio_write() functions, except that they work with iov_iter structures rather than with iovec structures directly. A filesystem supporting the new operations must be able to cope with I/O requests expressed directly in BIO structures — usually just a matter of bypassing the page-locking and mapping operations required for user-space addresses. If these new operations are provided, the aio_*() functions will never be called and can be removed.
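
A filesystem providing the new operations would wire them up in its file_operations structure in roughly the following manner; the myfs_* names are placeholders rather than code taken from the conversion patches:

    /* Sketch of how a filesystem might advertise the new operations. */
    const struct file_operations myfs_file_operations = {
        .llseek     = generic_file_llseek,
        .read_iter  = myfs_file_read_iter,   /* works with struct iov_iter */
        .write_iter = myfs_file_write_iter,  /* works with struct iov_iter */
        .mmap       = generic_file_mmap,
        /* with these in place, .aio_read/.aio_write are no longer needed */
    };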

After that, the patch set adds a new interface to make it easy for kernel code to submit asynchronous I/O operations. In short, it's a matter of allocating an I/O control block with:

    struct kiocb *aio_kernel_alloc(gfp_t gfp);

That block is filled in with the relevant information describing the desired operation and a pointer to a completion callback, then handed off to the AIO subsystem with:

    int aio_kernel_submit(struct kiocb *iocb);

Once the operation is complete, the completion function is called to inform the submitter of the final status.
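
Putting the pieces together, kernel code might submit an asynchronous write along the lines of the sketch below. Only aio_kernel_alloc() and aio_kernel_submit() come from the interface as described above; the initialization helpers and the callback signature are assumptions made for illustration:

    /* Sketch only: the aio_kernel_init_*() helpers and the completion
       callback signature are assumed here, not taken from the patch set. */
    static void my_aio_complete(struct kiocb *iocb, long res)
    {
        /* hand the result back to whoever queued the operation */
    }

    static int my_submit_async_write(struct file *filp, struct iov_iter *iter,
                                     loff_t pos)
    {
        struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO);

        if (!iocb)
                return -ENOMEM;

        aio_kernel_init_rw(iocb, filp, iter, pos);        /* hypothetical */
        aio_kernel_init_callback(iocb, my_aio_complete);  /* hypothetical */

        return aio_kernel_submit(iocb);
    }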

A substantial portion of the patch set is dedicated to converting filesystems to provide read_iter() and write_iter() functions. In most cases the patches are relatively small; most of the real work is done in generic code, so it is mostly a matter of changing declared types and making use of some of the new iov_iter accessor functions. See the ext4 patch for an example of what needs to be done.

With all that infrastructural work done, actually speeding up the loop driver becomes straightforward. If the backing store for a given loop device implements the new operations, the loop driver will use aio_kernel_submit() for each incoming I/O request. As a result, requests can be run in parallel with, one hopes, a significant improvement in performance.

The patch set has been through several rounds of review, and most of the concerns raised would appear to have been addressed. Dave is now asking that it be included in the linux-next tree, suggesting that he intends to push it into the mainline during the 3.9 or 3.10 development cycle. Quite a bit of kernel code will be changed in the process, but almost no differences should be visible from user space — except that block loop devices will run a lot faster than they used to.

Rethinking optimization for size

By Jonathan Corbet
January 30, 2013
Contemporary compilers are capable of performing a wide variety of optimizations on the code they produce. Quite a bit of effort goes into these optimization passes, with different compiler projects competing to produce the best results for common code patterns. But the nature of current hardware is such that some optimizations can have surprising results; that is doubly true when kernel code is involved, since kernel code is often highly performance-sensitive and provides an upper bound on the performance of the system as a whole. A recent discussion on the best optimization approach for the kernel shows how complicated the situation can be.

Compiler optimizations are often aimed at making frequently-executed code (such as that found in inner loops) run more quickly. As an artificially simple example, consider a loop like the following:

    for (i = 0; i < 4; i++)
	do_something_with(i);

Much of the computational cost of a loop like this may well be found in the loop structure itself — incrementing the counter, comparing against the maximum, and jumping back to the beginning. A compiler that performs loop unrolling might try to reduce that cost by transforming the code into something like:

    do_something_with(0);
    do_something_with(1);
    do_something_with(2);
    do_something_with(3);

The loop overhead is now absent, so one would expect this code to execute more quickly. But there is a cost: the generated code may well be larger than it was before the optimization was applied. In many situations, the performance improvement may outweigh the cost, but that may not always be the case.

GCC provides an optimization option (-Os) with a different objective: it instructs the compiler to produce more compact code, even if there is some resulting performance cost. Such an option has obvious value if one is compiling for a space-constrained environment like a small device. But it turns out that, in some situations, optimizing for space can also produce faster code. In a sense, we are all running space-constrained systems, in that the performance of our CPUs depends heavily on how well those CPUs are using their cache space. Space-optimized code can make better use of scarce instruction cache space, and, as a result, perform better overall. With this in mind, compilation with -Os was made generally available for the 2.6.15 kernel in 2005 and made non-experimental for 2.6.26 in 2008.

Unfortunately, -Os has not always lived up to its promise in the real world. The problem is not necessarily with the idea of creating compact code; it has more to do with how GCC interprets the -Os option. In space-optimization mode, the compiler tends to choose some painfully slow instructions, especially on older processors. It also discards the branch prediction information provided by kernel developers in the form of the likely() and unlikely() macros. That, in turn, can cause rarely executed code to share cache space with hot code, effectively wasting a portion of the cache and wiping out the benefits that optimizing for space was meant to provide.
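
For reference, those annotations are thin wrappers around GCC's __builtin_expect() built-in, defined in include/linux/compiler.h along these lines:

    #define likely(x)       __builtin_expect(!!(x), 1)
    #define unlikely(x)     __builtin_expect(!!(x), 0)

Under -O2, the compiler uses the hint to keep the expected path in a straight line and move the unexpected branch out of the way; under -Os, as described above, the hint is simply discarded.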

Because -Os did not produce the desired results, Linus disabled it by default in 2011, effectively ending experimentation with this option. Recently, though, Ling Ma posted some results suggesting that the situation might have changed. Recent Intel processors, it seems, have a new cache for decoded instructions, increasing the benefit obtained by having code fit into the cache. The performance of the repeated "move" instructions used by GCC for memory copies in -Os mode has also been improved in newer processors. The posted results claim a 4.8% performance improvement for the netperf benchmark and 2.7% for the volano benchmark when -Os is used on a newer CPU. Thus, it was suggested, maybe it is time to reconsider -Os, at least for some target processors.

Naturally, the situation is not quite that simple. Valdis Kletnieks complained that the benchmark results may not be showing an actual increase in real-world performance. Distributors hate shipping multiple kernels, so an optimization mode that only works for some portion of a processor family is unlikely to be enabled in distributor kernels. And there is still the problem of the loss of branch prediction information which, as Linus verified, still happens when -Os is used.

What is really needed, it seems, is a kernel-specific optimization mode that is more focused on instruction-cache performance than code size in its own right. This mode would take some behaviors from -Os while retaining others from the default -O2 mode. Peter Anvin noted that the GCC developers are receptive to the idea of implementing such a mode, but there is nobody who has the time and inclination to work on that project at the moment. It would be nice to have a developer who is familiar with both the kernel and the compiler and who could work to make GCC produce better code for the kernel environment. Until somebody steps up to do that work, though, we will likely have to stick with -O2, even knowing that the resulting code is not as good as it could be.

Glibc and the kernel user-space API

By Michael Kerrisk
January 30, 2013

We are accustomed to thinking of a system call as being a direct service request to the kernel. However, in reality, most system call invocations are mediated by wrapper functions in the GNU C library (glibc). These wrapper functions eliminate work that the programmer would otherwise need to do in order to employ a system call. But it turns out that glibc does not provide wrapper functions for all system calls, including a few that see somewhat frequent use. The question of what (if anything) to do about this situation has arisen a few times in the last few months on the libc-alpha mailing list, and has recently surfaced once more.

A system call allows a program to request a service—for example, open a file or create a new process—from the kernel. At the assembler level, making a system call requires the caller to assign the unique system call number and the argument values to particular registers, and then execute a special instruction (e.g., SYSENTER on modern x86 architectures) that switches the processor to kernel mode to execute the system-call handling code. Upon return, the kernel places the system call's result status into a particular register and executes a special instruction (e.g., SYSEXIT on x86) that returns the processor to user mode. The usual convention for the result status is that a non-negative value means success, while a negative value means failure. A negative result status is the negated error number (errno) that indicates the cause of the failure.

All of the details of making a system call are normally hidden from the user by the C library, which provides a corresponding wrapper function and header file definitions for most system calls. The wrapper function accepts the system call arguments as function arguments on the stack, initializes registers using those arguments, and executes the assembler instruction that switches to kernel mode. When the kernel returns control to user mode, the wrapper function examines the result status, assigns the (negated) error number to errno in the case of a negative result, and returns either -1 to indicate an error or the non-negative result status as the return value of the wrapper function. In many cases, the wrapper function is quite simple, performing only the steps just described. (In those cases, the wrapper is actually autogenerated from syscalls.list files in the glibc source that tabulate the types of each system call's return value and arguments.) However, in a few cases the wrapper function may do some extra work such as repackaging arguments or maintaining some state information inside the C library.
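
In schematic form, a simple wrapper behaves like the code below. This is not glibc's actual source; raw_syscall() is a stand-in for the architecture-specific code that loads the registers and traps into the kernel:

    /* Schematic only: raw_syscall() is hypothetical and stands in for the
       architecture-specific trap, returning the raw (possibly negative)
       result status from the kernel. */
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/syscall.h>

    extern long raw_syscall(long number, ...);      /* hypothetical */

    ssize_t my_read(int fd, void *buf, size_t count)
    {
        long res = raw_syscall(SYS_read, fd, buf, count);

        if (res < 0) {
                errno = -res;   /* negated error number from the kernel */
                return -1;
        }
        return res;             /* non-negative result status */
    }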

The C library thus acts as a kind of gatekeeper on the API that the kernel presents to user space. Until the C library provides a wrapper function, along with suitable header files that define the calling signature and any constant and structure definitions used by the system call, users must do some manual work to make a system call.

That manual work includes defining the structures and constants needed by the system call and then invoking the syscall() library function, which handles the details of making the system call—copying arguments to registers, switching to kernel mode, and then setting errno once the kernel returns control to user space. Any system call can be invoked in this manner, including those for which the C library already provides a wrapper. Thus for example, one can bypass the wrapper function for read() and invoke the system call directly by writing:

    nread = syscall(SYS_read, fd, buf, len);

The first argument to syscall() is the number of the system call to be invoked; SYS_read is a constant whose definition is provided by including <sys/syscall.h>, while syscall() itself is declared in <unistd.h>.
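
A minimal, complete program using that call might look like this:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>         /* declares syscall() */
    #include <sys/syscall.h>    /* defines SYS_read */

    int main(void)
    {
        char buf[4096];
        ssize_t nread;

        /* Read from standard input by invoking the system call directly. */
        nread = syscall(SYS_read, 0, buf, sizeof(buf));
        if (nread == -1) {
                perror("read");
                return 1;
        }
        printf("read %zd bytes\n", nread);
        return 0;
    }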

The C library used by most Linux developers is of course the GNU C library. Normally, glibc tracks kernel system call changes quite closely, adding wrapper functions and suitable header file definitions to the library as new system calls are added to the kernel. Thus, manually coding system calls is normally only needed when trying to use the latest system calls that have not yet appeared in the most recent iteration of glibc's six-month release cycle or when using a recent kernel on a system that has a significantly older version of glibc.

However, for some system calls, glibc support never appears. The question of how the decision is made on whether to support a particular system call in glibc has once again become a topic of discussion on the libc-alpha mailing list. The most recent discussion started when Kees Cook, the implementer of the recently added finit_module() system call, submitted a rudimentary patch to add glibc support for the system call. In response, Joseph Myers and Mike Frysinger noted various pieces that were missing from the patch, with Joseph adding that "in the kexec_load discussion last May / June, doubts were expressed about whether some existing module-related syscalls really should have had functions in glibc."

The module-related system calls—init_module(), delete_module(), and so on—are among those for which glibc does not provide support. The situation is in fact slightly more complex in the case of these system calls: glibc does not provide any header file support for these system calls but does, through an accident of history, export a wrapper function ABI for the calls.

The earlier discussion that Joseph referred to took place when Maximilian Attems attempted to add a header file to glibc to provide support for the kexec_load() system call, stating that his aim was "to axe the syscall maze in kexec-tools itself and have this syscall supported in glibc." One of the primary glibc maintainers, Roland McGrath, had a rather different take on the necessity of such a change, stating "I'm not really convinced this is worthwhile. Calling 'syscall' seems quite sufficient for such arcane and rarely-used calls." In other words, adding support for these system calls clutters the glibc ABI and requires (a small amount of) extra code in order to satisfy the needs of a handful of users who could just use the syscall() mechanism.

Andreas Jaeger, who had reviewed earlier versions of Maximilian's patch, noted that "linux/syscalls.list already [has] similar esoteric syscalls like create_module without any header support. I wouldn't object to do this for kexec_load as well". Roland agreed that the kexec_load() system call is a similar case, but felt that this point wasn't quite germane, since adding the module system calls to the glibc ABI was a "dubious" historical step that can't be reversed for compatibility reasons.

But in the recent discussion of finit_module(), Mike Frysinger spoke in favor of adding full glibc support for module-related system calls such as init_module(). Dave Miller made a similar argument even more succinctly:

It makes no sense for every tool that wants to support doing things with kernel modules to do the syscall() thing, propagating potential errors in argument signatures into more than one location instead of getting it right in one canonical place, libc.

In other words, employing syscall() can be error prone: there is no checking of argument types nor even checking that sufficient arguments have been passed.

Joseph Myers felt that the earlier kexec_load() discussions hadn't fully settled the issue, and was interested in having some concrete data on how many system calls don't have glibc wrappers. Your editor subsequently donned his man-pages maintainer hat and grepped the man pages in section 2 to determine which system calls do not have full glibc support in the form of a wrapper function and header files. The resulting list turns out to be quite long, running to nearly 40 Linux system calls. However, the story is not quite so simple, since some of those system calls are obsolete (e.g., tkill(), sysctl(), and query_module()) and others are intended for use only by the kernel or glibc (e.g., restart_syscall()). Yet others have wrappers in the C library, although the wrappers have significantly different names and provide some extra functionality on top of the system call (e.g., rt_sigqueueinfo() has a wrapper in the form of the sigqueue() library function). Clearly, no wrapper is required for those system calls, and once they are excluded there remain perhaps 15 to 20 system calls that might be candidates to have glibc support added.

Motohiro Kosaki considered that the remaining system calls could be separated into two categories: those used by only one or a few applications and those that seemed to him to have more widespread application use. Motohiro was agnostic about whether the former category (which includes the module-related system calls, kcmp(), and kexec_load()) required a wrapper. However, in his opinion the system calls in the latter category (which includes system calls such as ioprio_set(), ioprio_get(), and gettid()) clearly merited having full glibc support.

The lack of glibc support for gettid(), which returns the caller's kernel thread ID, is an especially noteworthy case. A long-standing glibc bug report requesting that glibc add support for this system call gained little traction with the previous glibc maintainer. However, excluding that system call seems rather anomalous: it is quite frequently used, the kernel exposes thread IDs via various /proc interfaces, and glibc itself exposes various kernel APIs that can employ kernel thread IDs (for example, sched_setaffinity(), fcntl(), and the SIGEV_THREAD_ID notification mode for POSIX timers).
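
In the meantime, programs needing the thread ID typically supply the wrapper themselves:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>

    /* The usual stand-in for the missing glibc wrapper. */
    static pid_t gettid(void)
    {
        return syscall(SYS_gettid);
    }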

The discussion has petered out in the last few days, despite Mike Frysinger's attempt to further push the debate along by reading and summarizing the various pro and contra arguments in a single email. As noted by various participants in the discussion, adding glibc wrappers for some currently unsupported system calls would seem to have some worthwhile benefits. It would also help to avoid the confusing situation where programmers sometimes end up searching for a glibc wrapper function and header file definitions that don't exist. It remains to be seen whether these arguments will be sufficient to persuade Roland in the face of his concerns about cluttering the glibc ABI and adding extra code to the library for the benefit of what he believes is a relatively small number of users.

Page editor: Jonathan Corbet

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds