Brief items
The current development kernel is 3.8-rc5, released on
January 25. The only announcement appears to be
this Google+
posting. Just over 250 fixes were merged since -rc4 came out; see
the short-form changelog for details.
Stable updates: 3.7.5,
3.4.28 and 3.0.61 were released on January 27.
People really ought to be forced to read their code aloud over the
phone - that would rapidly improve the choice of identifiers
— Al Viro
Besides, wouldn't it be cool to see a group of rovers chasing each
other across Mars, jockeying for the best positioning to reduce
speed-of-light delays?
— Paul McKenney
The real problem is, Moore's Law just does not work for spinning
disks. Nobody really wants their disk spinning faster than [7200]
rpm, or they don't want to pay for it. But density goes up as the
square of feature size. So media transfer rate goes up linearly
while disk size goes up quadratically. Today, it takes a couple of
hours to read each terabyte of disk. Fsck is normally faster than
that, because it only reads a portion of the disk, but over time,
it breaks in the same way. The bottom line is, full fsck just isn't
a viable thing to do on your system as a standard, periodic
procedure. There is really not a lot of choice but to move on to
incremental and online fsck.
— Daniel Phillips
Kernel development news
By Jonathan Corbet
January 30, 2013
The kernel's block loop driver has a conceptually simple job: take a file
located in a filesystem somewhere and present it as a block device that can
contain a filesystem of its own. It can be used to manipulate filesystem
images; it is also useful for the management of filesystems for virtualized
guests. Despite having had some optimization effort applied to it, the
loop driver in current kernels is not as fast as some would like it to be.
But that situation may be about to change, thanks to an old patch set that
has been revived and prepared for merging in a near-future development
cycle.
As a block driver, the loop driver accepts I/O requests described by
struct bio (or "BIO")
structures; it then maps each request to a suitable block offset in the
file serving as backing store and issues I/O requests to perform the
desired operations on that file. Each loop device has its own thread,
which, at its core, runs a loop like this:
while (1) {
    wait_for_work();
    bio = dequeue_a_request();
    execute_request(bio);
}
(The actual code can be seen in drivers/block/loop.c.) This code
certainly works, but it has an important shortcoming: it performs I/O in a
synchronous, single-threaded manner. Block I/O is normally done
asynchronously when possible; write operations, in particular, can be done
in parallel with other work. In the loop above, though, a single, slow
read operation can hold up many other requests, and there is no
ability for the block layer or the I/O device itself to optimize the
ordering of requests. As a result, the performance of loop I/O traffic is
not what it could be.
In 2009, Zach Brown set out to fix this problem by changing the loop driver
to execute multiple, asynchronous requests at the same time. That
work fell by the wayside when other priorities took over Zach's time, so
his patches were never merged. More recently, Dave Kleikamp has
taken over this patch set, ported it to current kernels, and added support to
more filesystems. As a result, this patch set may be getting close to
being ready to go into the mainline.
At the highest level, the goal of this patch set is to use the kernel's
existing asynchronous I/O (AIO) mechanism in the loop driver. Getting
there takes a surprising amount of work, though; the AIO subsystem was
written to manage user-space requests and is not an easy fit for
kernel-generated operations. To make these subsystems work together, the
30-part patch set takes a bottom-up
approach to the problem.
The AIO code is based around a couple of structures, one of which is
struct iovec:
struct iovec {
    void __user *iov_base;
    __kernel_size_t iov_len;
};
This structure is used by user-space programs to describe a segment of an
I/O operation; it is part of the user-space API and cannot be changed.
Associated with this structure is the internal iov_iter structure:
struct iov_iter {
    const struct iovec *iov;
    unsigned long nr_segs;
    size_t iov_offset;
    size_t count;
};
This structure (defined in <linux/fs.h>) is used by the
kernel to track its progress as it works through an
array of iovec structures.
Any kernel code needing to submit asynchronous I/O needs to express it in
terms of these structures. The problem, from the perspective of the loop
driver, is that struct iovec deals with user-space addresses. But
the BIO structures representing block I/O operations deal with physical
addresses in the form of struct page pointers. So there is an
impedance mismatch between the two subsystems that makes AIO unusable for
the loop driver.
Fixing that involves changing the way struct iov_iter works. The
iov pointer becomes a generic pointer called data that
can point to an array of iovec structures (as before) or, instead,
an array of kernel-supplied BIO structures. Direct access to structure
members by kernel code is discouraged in favor of a set of defined
accessor operations; the iov_iter structure itself gains a pointer
to an operations structure
that can be changed depending on whether iovec or bio
structures are in use. The
end result is an enhanced iov_iter structure and surrounding
support code that allows AIO operations to be expressed in either
user-space (struct iovec) or kernel-space (struct bio)
terms. Quite a bit of code using this structure must be adapted to use the
new accessor functions; at the higher levels, code that worked directly
with iovec structures is changed to work with the
iov_iter interface instead.
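The patches' actual names may differ, but the reworked structure takes
roughly the following shape (an illustrative sketch, not code from the
patch set):

struct iov_iter_ops {
    /* Copy data between a page and the iterator's current position */
    size_t (*copy_to_iter)(struct page *page, size_t offset, size_t bytes,
                           struct iov_iter *i);
    size_t (*copy_from_iter)(struct page *page, size_t offset, size_t bytes,
                             struct iov_iter *i);
    /* ... plus operations for advancing, shortening, and counting segments */
};

struct iov_iter {
    const struct iov_iter_ops *ops; /* iovec- or bio-specific methods */
    void *data;                     /* struct iovec array or BIO list */
    unsigned long nr_segs;
    size_t iov_offset;
    size_t count;
};

Filesystem-level code then calls accessor functions that dispatch through
ops, so it need not care what kind of memory sits behind the iterator.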
The next step is to make it possible to pass iov_iter structures
directly into filesystem code. That is done by adding two more functions
to the (already large) file_operations structure:
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, loff_t);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, loff_t);
These functions are meant to work much like the existing
aio_read() and aio_write() functions, except that they
work with iov_iter structures rather than with iovec
structures directly. A filesystem supporting the new operations must be
able to cope with I/O requests expressed directly in BIO structures —
usually just a matter of bypassing the page-locking and mapping operations
required for user-space addresses. If these new operations are provided,
the aio_*() functions will never be called and can be removed.
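For a filesystem whose AIO paths already go through generic code, wiring up
the new operations can be a small change, along the lines of the following
sketch; the generic_file_*_iter() names here are placeholders for whatever
iter-aware helpers the patch set provides:

const struct file_operations examplefs_file_operations = {
    .read       = do_sync_read,
    .write      = do_sync_write,
    /* New entry points taking an iov_iter; the helper names are
     * stand-ins for the patch set's generic implementations. */
    .read_iter  = generic_file_read_iter,
    .write_iter = generic_file_write_iter,
    .mmap       = generic_file_mmap,
};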
After that, the patch set adds a new interface to make it easy for kernel
code to submit asynchronous I/O operations. In short, it's a matter of
allocating an I/O control block with:
struct kiocb *aio_kernel_alloc(gfp_t gfp);
That block is filled in with the relevant information describing the
desired operation and a pointer to a completion callback, then handed off
to the AIO subsystem with:
int aio_kernel_submit(struct kiocb *iocb);
Once the operation is complete, the completion function is called to
inform the submitter of the final status.
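Putting the pieces together, a kernel-side user might submit a write along
the lines of the sketch below; the two initialization helpers are
hypothetical stand-ins for whatever the patch set provides to fill in the
control block:

static void my_write_done(struct kiocb *iocb, long res)
{
    /* Completion callback: res is the final status of the operation */
    note_request_complete(iocb, res);   /* hypothetical bookkeeping */
}

static int submit_kernel_write(struct file *file, struct iov_iter *iter,
                               loff_t pos)
{
    struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO);

    if (!iocb)
        return -ENOMEM;
    /* Describe the operation: target file, data, offset, and callback.
     * These two helpers are illustrative, not the patch set's exact API. */
    aio_kernel_init_write(iocb, file, iter, pos);
    aio_kernel_init_callback(iocb, my_write_done);
    return aio_kernel_submit(iocb);
}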
A substantial portion of the patch set is dedicated to converting
filesystems to provide read_iter() and write_iter()
functions. In
most cases the patches are relatively small; most of the real work is done
in generic code, so it is mostly a matter of changing declared types and
making use of some of the new iov_iter accessor functions. See the ext4 patch for an example of what needs to
be done.
With all that infrastructural work done, actually speeding up the loop
driver becomes straightforward. If the backing store for a given loop
device implements the new operations, the loop driver will use
aio_kernel_submit() for each incoming I/O request. As a result,
requests can be run in parallel with, one hopes, a significant improvement
in performance.
The patch set has been through several rounds of review, and most of the
concerns raised would appear to have been addressed. Dave is now asking
that it be included in the linux-next tree, suggesting that he intends to
push it into the mainline during the 3.9 or 3.10 development cycle. Quite
a bit of kernel code will be changed in the process, but almost no
differences should be visible from user space — except that block loop
devices will run a lot faster than they used to.
By Jonathan Corbet
January 30, 2013
Contemporary compilers are capable of performing a wide variety of
optimizations on the code they produce. Quite a bit of effort goes into
these optimization passes, with different compiler projects competing to
produce the best results for common code patterns. But the nature of
current hardware is such that some optimizations can have surprising
results; that is doubly true when kernel code is involved, since kernel
code is often highly performance-sensitive and provides an upper bound on
the performance of the system as a whole. A recent discussion on the
best optimization approach for the kernel shows how complicated the
situation can be.
Compiler optimizations are often aimed at making frequently-executed code
(such as that found in inner loops)
run more quickly. As an artificially simple example, consider a loop like
the following:
for (i = 0; i < 4; i++)
    do_something_with(i);
Much of the computational cost of a loop like this may well be found in the
loop structure itself — incrementing the counter, comparing against the
maximum, and jumping back to the beginning. A compiler that performs loop
unrolling might try to reduce that cost by transforming the code into something like:
do_something_with(0);
do_something_with(1);
do_something_with(2);
do_something_with(3);
The loop overhead is now absent, so one would expect this code to execute
more quickly. But there is a cost: the generated code may well be larger
than it was before the optimization was applied. In many situations, the
performance improvement may outweigh the cost, but that may not always be
the case.
GCC provides an optimization option (-Os) with a different
objective: it instructs the compiler to produce more compact code, even if
there is some resulting performance cost. Such an option has obvious value
if one is compiling for a space-constrained environment like a small
device. But it turns out that, in some situations, optimizing for space
can also produce faster code. In a sense, we are all running
space-constrained systems, in that the performance of our CPUs depends
heavily on how well those CPUs are using their cache space.
Space-optimized code can make better use of scarce instruction cache space,
and, as a result, perform better overall. With this in mind, compilation
with -Os was made
generally available for the 2.6.15 kernel in 2005 and made
non-experimental for 2.6.26 in 2008.
Unfortunately, -Os has not always lived up to its promise in the real world.
The problem is not necessarily with the idea of
creating compact code; it has more to do with how GCC interprets the
-Os option. In the space-optimization mode, the compiler tends to
choose some painfully slow instructions, especially on older processors. It
also discards the branch prediction information provided by kernel
developers in the form of the likely() and unlikely()
macros. That, in turn, can cause rarely executed code to share cache space
with hot code, effectively wasting a portion of the cache and wiping out
the benefits that optimizing for space was meant to provide.
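Those macros are thin wrappers around GCC's __builtin_expect(); a typical
use looks like this:

    ptr = kmalloc(len, GFP_KERNEL);
    if (unlikely(!ptr))
        return -ENOMEM;     /* rarely taken: allocation failure */

At -O2, the hint lets the compiler keep the expected path as straight-line,
cache-hot code and push the error path out of the way; under -Os, that
information is simply discarded.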
Because -Os did not produce the desired results, Linus disabled
it by default in 2011, effectively ending experimentation with this
option. Recently, though, Ling Ma posted some
results suggesting that the situation might have changed. Recent Intel
processors, it seems, have a new cache for decoded instructions, increasing
the benefit obtained by having code fit into the cache. The performance of
the repeated "move" instructions used by GCC for memory copies in
-Os mode has also been improved in newer processors. The posted
results claim a 4.8% performance improvement for the netperf benchmark and
2.7% for the volano benchmark when -Os is used on a newer CPU. Thus, it was
suggested, maybe it is time to reconsider -Os, at least for some
target processors.
Naturally, the situation is not quite that simple. Valdis Kletnieks complained that the benchmark results may not
be showing an actual increase in real-world performance. Distributors hate
shipping multiple kernels, so an optimization mode that only works for some
portion of a processor family is unlikely to be enabled in distributor
kernels. And there is
still the problem of the loss of branch prediction information which, as
Linus verified, still happens when
-Os is used.
What is really needed, it seems, is a kernel-specific optimization mode
that is more focused on instruction-cache performance than code size in its
own right. This mode would take some behaviors from -Os while
retaining others from the default -O2 mode. Peter Anvin noted that the GCC developers are receptive to
the idea of implementing such a mode, but there is nobody who has the time
and inclination to work on that project at the moment. It would be nice to
have a developer who is familiar with both the kernel and the compiler and
who could work to make GCC produce better code for the kernel environment.
Until somebody steps up to do that work, though, we will likely have to
stick with -O2, even knowing that the resulting code is not as
good as it could be.
By Michael Kerrisk
January 30, 2013
We are accustomed to thinking of a system call as
being a direct service request to the kernel. However, in reality, most
system call invocations are mediated by wrapper functions in the GNU C
library (glibc). These wrapper functions eliminate work that the programmer
would otherwise need to do in order to employ a system call. But it turns
out that glibc does not provide wrapper functions for all system calls,
including a few that see somewhat frequent use. The question of what (if
anything) to do about this situation has arisen a few times in the last few
months on the libc-alpha mailing list, and has recently surfaced once more.
A system call allows a program to request a service—for example,
open a file or create a new process—from the kernel. At the assembler
level, making a system call requires the caller to
assign the unique system call number and the argument values to particular
registers, and then execute a special instruction (e.g., SYSENTER on modern
x86 architectures) that switches the processor to kernel mode to execute
the system-call handling code. Upon return, the kernel places the system
call's result status into a particular register and executes a special
instruction (e.g., SYSEXIT on x86) that returns the processor to user
mode. The usual convention for the result status is that a non-negative
value means success, while a negative value means failure. A negative
result status is the negated error number (errno) that indicates
the cause of the failure.
All of the details of making a system call are normally hidden from the
user by the C library, which provides a corresponding wrapper function and
header file definitions for most system calls. The wrapper function accepts
the system call arguments as function arguments on the stack, initializes
registers using those arguments, and executes the assembler instruction
that switches to kernel mode. When the kernel returns control to user mode,
the wrapper function examines the result status, assigns the (negated)
error number to errno in the case of a negative result, and
returns either -1 to indicate an error or the non-negative result status as
the return value of the wrapper function. In many cases, the wrapper
function is quite simple, performing only the steps just described. (In
those cases, the wrapper is actually autogenerated from
syscalls.list files in the glibc source that tabulate the types
of each system call's return value and arguments.) However, in a few cases
the wrapper function may do some extra work such as repackaging arguments
or maintaining some state information inside the C library.
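Conceptually, the error-handling step at the end of a wrapper amounts to
something like the following sketch (the real glibc code is generated from
macros, but this is the gist):

#include <errno.h>

/* Convert a raw kernel return value to the C library convention: errors
 * come back as small negative numbers (the negated errno value), which
 * the wrapper turns into -1 with errno set. */
static inline long syscall_return(long raw)
{
    if (raw < 0 && raw >= -4095) {  /* error values occupy -1..-4095 */
        errno = -raw;
        return -1;
    }
    return raw;                     /* success: pass the result through */
}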
The C library thus acts as a kind of gatekeeper on the API that the kernel
presents to user space. Until the C library provides a wrapper function,
along with suitable header files that define the calling signature and any
constant and structure definitions used by the system call, users must
do some manual work to make a system call.
That manual work includes defining the structures and constants needed
by the system call and then invoking the syscall() library
function, which handles the details of making the system call—copying
arguments to registers, switching to kernel mode, and then setting
errno once the kernel returns control to user space. Any system
call can be invoked in this manner, including those for which the C library
already provides a wrapper. Thus for example, one can bypass the wrapper
function for read() and invoke the system call directly by
writing:
nread = syscall(SYS_read, fd, buf, len);
The first argument to syscall() is the number of the system
call to be invoked; SYS_read is a constant whose
definition is provided by including <sys/syscall.h>.
The C library used by most Linux developers is of course the GNU C
library. Normally, glibc tracks kernel system call changes quite
closely, adding wrapper functions and suitable header file definitions to
the library as new system calls are added to the kernel. Thus, manually
coding system calls is normally only needed when trying to use the
latest system calls that have not yet appeared in the most recent iteration
of glibc's six-month release cycle or when using a recent kernel on a
system that has a significantly older version of glibc.
However, for some system calls, glibc support never appears. The
question of how the decision is made on whether to support a particular
system call in glibc has once again become a topic of discussion on the
libc-alpha mailing list. The most recent discussion started when Kees Cook,
the implementer of the recently added
finit_module() system call, submitted a rudimentary patch to add glibc
support for the system call. In response, Joseph Myers and Mike Frysinger
noted various pieces that were missing from the patch, with Joseph
adding that "in the
kexec_load discussion last May / June, doubts were expressed about whether
some existing module-related syscalls really should have had functions in
glibc."
The module-related system calls—init_module(),
delete_module(), and so on—are among those for which glibc
does not provide support. The situation is in fact slightly more complex
in the case of these system calls: glibc does not provide any header file
support for these system calls but does, through an accident of history,
export a wrapper function ABI for the calls.
The earlier discussion that Joseph referred to took place when
Maximilian Attems attempted to add a header file to glibc to provide
support for the kexec_load() system call, stating that his aim was "to axe the
syscall maze in kexec-tools itself and have this syscall supported in
glibc." One of the primary glibc maintainers, Roland McGrath, had a rather different take on the
necessity of such a change, stating "I'm not really convinced this
is worthwhile. Calling 'syscall' seems quite sufficient for such arcane
and rarely-used calls." In other words, adding support for these
system calls clutters the glibc ABI and requires (a small amount of) extra
code in order to satisfy the needs of a handful of users who could just use
the syscall() mechanism.
Andreas Jaeger, who had reviewed earlier versions of Maximilian's
patch, noted that
"linux/syscalls.list already [has] similar esoteric syscalls like
create_module without any header support. I wouldn't object to do this for
kexec_load as well". Roland agreed
that the kexec_load() system call is a similar case, but felt that
this point wasn't quite germane, since adding the module system calls to
the glibc ABI was a "dubious" historical step that can't be reversed for
compatibility reasons.
But in the recent discussion of finit_module(), Mike Frysinger
spoke in favor of adding full glibc support
for module-related system calls such as init_module(). Dave
Miller made a similar argument even more
succinctly:
It makes no sense for every tool that wants to support
doing things with kernel modules to do the syscall()
thing, propagating potential errors in argument signatures
into more than one location instead of getting it right in
one canonical place, libc.
In other words, employing syscall() can be error prone: there is
no checking of argument types nor even checking that sufficient arguments
have been passed.
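A small example shows the difference (the wrapper prototype at the end is
shown for illustration only):

/* With syscall(), neither argument types nor the argument count are
 * checked; this compiles silently even though init_module() takes three
 * arguments (image, length, and a module parameter string): */
syscall(SYS_init_module, image, len);

/* A prototyped wrapper would catch the missing argument at compile time:
 *     int init_module(void *image, unsigned long len, const char *params);
 */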
Joseph Myers felt that the earlier
kexec_load() discussions hadn't fully settled the issue, and was
interested in having some concrete data on how many system calls don't have
glibc wrappers. Your editor subsequently donned his man-pages maintainer
hat and grepped the man pages in section 2 to determine which system calls
do not have full glibc support in the form of a wrapper function and header
files. The resulting list turns out to be
quite long, running to nearly 40 Linux system calls. However, the
story is not quite so simple, since some of those system calls are obsolete
(e.g., tkill(), sysctl(), and query_module())
and others are intended for use only by the kernel or glibc (e.g.,
restart_syscall()). Yet others have wrappers in the C library,
although the wrappers have significantly different names and provide some
extra functionality on top of the system call (e.g.,
rt_sigqueueinfo() has a wrapper in the form of the sigqueue()
library function). Clearly, no wrapper is required for those system calls,
and once they are excluded there remain perhaps 15 to 20 system calls
that might be candidates to have glibc support added.
Motohiro Kosaki considered that the
remaining system calls could be separated into two categories: those used
by only one or a few applications, and those that seemed to him to have
more widespread use. Motohiro was agnostic about whether
the former category (which includes the module-related system calls,
kcmp(), and kexec_load()) required a wrapper. However, in
his opinion the system calls in the latter category (which includes system
calls such as ioprio_set(), ioprio_get(), and
gettid()) clearly merited having full glibc support.
The lack of glibc support for gettid(), which returns the
caller's kernel thread ID, is an especially noteworthy case. A
long-standing glibc bug report
requesting that glibc add support for this system call gained little
traction with the previous glibc maintainer. However, excluding that system
call seems rather anomalous: it is quite frequently used, the kernel
exposes thread IDs via various /proc interfaces, and glibc itself exposes
kernel APIs that employ thread IDs (for example,
sched_setaffinity(), fcntl(), and the
SIGEV_THREAD_ID notification mode for POSIX timers).
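In the absence of a wrapper, programs that need the thread ID have to roll
their own; the function name below is arbitrary:

#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no gettid() wrapper, so invoke the system call directly */
static pid_t my_gettid(void)
{
    return (pid_t) syscall(SYS_gettid);
}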
The discussion has petered out in the last few days, despite Mike
Frysinger's attempt to push the debate along by reading and
summarizing the various pro and con arguments in a single email. As noted by various
participants in the discussion, adding glibc wrappers for some currently
unsupported system calls would seem to have some worthwhile benefits. It
would also help to avoid the confusing situation where programmers
sometimes end up searching for a glibc wrapper function and header file
definitions that don't exist. It remains to be seen whether these arguments
will be sufficient to persuade Roland in the face of his concerns about
cluttering the glibc ABI and adding extra code to the library for the
benefit of what he believes is a relatively small number of users.