Brief items
The current 2.6 prepatch is 2.6.17-rc1,
released by Linus on
April 2. Patches merged since last week include the
splice() and
sync_file_range() system calls (see below), hotplug memory support
for User-mode Linux, an LED subsystem, the conversion of
local_t
into a signed type, basic support for braille devices in the input layer, and
the "ipath" driver for PathScale InfiniPath devices.
See
last week's and
the previous week's summaries
for detailed lists of changes in 2.6.17-rc1.
For even more detail, see the
short-form and
long-format
changelogs.
No patches have been merged into the mainline since 2.6.17-rc1 was
released.
The current -mm tree is 2.6.17-rc1-mm1. Recent changes
to -mm include a lot of fixes and a new version of the kgdb debugger, but
little in the way of major changes.
Kernel development news
Which part of "sysfs patches can be written by idiots and usually
are" is too hard to understand? Oh, wait. I see... Well,
nevermind, then...
-- Al Viro is back.
Your editor, who has enough trouble putting together a Kernel Page in
English, has never seriously thought about supporting other languages. It
is, however, pleasant to see that this page can now be found
translated into
Czech, thanks to Robert Kratky. Hopefully this work will be useful to
Czech readers.
The 2.6.17 kernel will include two new system calls which expand the
capabilities available to user-space programs in some interesting ways.
This article contains a
look at the current form of these new interfaces.
splice()
The splice() system call has a long history. First
proposed by Larry McVoy in 1998, it was seen as a way of improving I/O
performance on server systems. Despite being often mentioned in the
following years, no splice() implementation was ever created for
the mainline Linux kernel. That situation changed, however, just before
the 2.6.17 merge window was closed when Jens Axboe's splice()
patch, along with a number of modifications, was merged.
As of this writing, the splice() interface looks like this:
long splice(int fdin, int fdout, size_t len, unsigned int flags);
A call to splice() will cause the kernel to move up to
len bytes from the data source fdin to fdout.
The data will move through kernel space only, with a minimum of copying. In
the current implementation, at least one of the two file descriptors must
refer to a pipe device. That requirement is a limitation of the current
code, and it could be removed at some future time.
The flags argument modifies how the copy is done. Currently
implemented flags are:
- SPLICE_F_NONBLOCK: makes the splice() operations
non-blocking. A call to splice() could still block, however,
especially if either of the file descriptors has not been set for
non-blocking I/O.
- SPLICE_F_MORE: a hint to the kernel that more data will come
in a subsequent splice() call.
- SPLICE_F_MOVE: if the output is a file, this flag will cause
the kernel to attempt to move pages directly from the input pipe
buffer into the output address space, avoiding a copy operation.
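As an illustration of the "one side must be a pipe" rule, here is a minimal sketch of a file-to-file copy which routes the data through a pipe. It assumes the splice() wrapper found in later glibc releases, which takes two additional offset pointers (passed as NULL here) and thus differs from the four-argument prototype shown above. The names copy_via_pipe() and splice_demo() are hypothetical, and the 64KB chunk size is simply an assumed pipe capacity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd by way of a pipe.
 * Returns the number of bytes copied, or -1 on error. */
static long copy_via_pipe(int in_fd, int out_fd)
{
    int pipefd[2];
    long total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(pipefd) < 0)
        return -1;

    for (;;) {
        /* Source -> pipe; SPLICE_F_MORE hints that more data will follow. */
        ssize_t in = splice(in_fd, NULL, pipefd[1], NULL, 65536,
                            SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in <= 0) {
            if (in < 0)
                total = -1;
            break;                      /* zero means end of input */
        }
        /* Pipe -> destination; loop because splice() may move less. */
        while (in > 0) {
            ssize_t out = splice(pipefd[0], NULL, out_fd, NULL, (size_t)in,
                                 SPLICE_F_MOVE);
            if (out <= 0) {
                total = -1;
                goto done;
            }
            in -= out;
            total += out;
        }
    }
done:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}

/* Round trip through two temporary files; returns 0 if the copy matched. */
static int splice_demo(void)
{
    char src[] = "/tmp/splice_src_XXXXXX";
    char dst[] = "/tmp/splice_dst_XXXXXX";
    const char msg[] = "hello, splice";
    const ssize_t len = (ssize_t)strlen(msg);
    char buf[sizeof(msg)] = { 0 };
    int in_fd = mkstemp(src);
    int out_fd = mkstemp(dst);
    int ok = -1;

    if (in_fd >= 0 && out_fd >= 0 &&
        write(in_fd, msg, (size_t)len) == len &&
        lseek(in_fd, 0, SEEK_SET) == 0 &&
        copy_via_pipe(in_fd, out_fd) == (long)len &&
        pread(out_fd, buf, (size_t)len, 0) == len &&
        memcmp(buf, msg, (size_t)len) == 0)
        ok = 0;

    if (in_fd >= 0) {
        close(in_fd);
        unlink(src);
    }
    if (out_fd >= 0) {
        close(out_fd);
        unlink(dst);
    }
    return ok;
}
```

Since neither file is a pipe, the copy takes two splice() calls per chunk. Whether SPLICE_F_MOVE actually avoids a copy depends on the kernel and the filesystem; the flag is only a request.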
Internally, splice() works using the pipe buffer mechanism added by
Linus in early 2005 - that is why one side of the operation is required to
be a pipe for now. There are two additions to the ever-growing
file_operations structure for devices and filesystems which wish
to support splice():
ssize_t (*splice_write)(struct inode *pipe, struct file *out,
size_t len, unsigned int flags);
ssize_t (*splice_read)(struct file *in, struct inode *pipe,
size_t len, unsigned int flags);
The new operations should move len bytes between pipe and
either in or out, respecting the given flags.
For filesystems, there are generic implementations of these operations
which can be used; there is also a generic_splice_sendpage() which
is used to enable splicing to a socket. As of this writing, there are no
splice() implementations for device drivers, but there is nothing
preventing such implementations in the future, for char devices at least.
Discussions on the linux-kernel list suggest that the splice()
interface could change before it is set in stone with the 2.6.17 release.
Andrew Tridgell has requested that an
offset argument be added to specify where copying should begin - either
that, or a separate psplice() should be added. There is also some
concern about error handling; if a splice() call returns an error,
how does the application tell whether the problem is with the input or the
output? Resolving those issues may require some interface changes over the
next month or so.
sync_file_range()
Early in the 2.6.17 process, some changes to the
posix_fadvise() system call were merged. The new,
Linux-specific options were meant to give applications better control over
how data written to files is flushed to the physical media. The
capabilities provided are needed, but there were concerns about extending a
POSIX-defined function in a Linux-specific way. So, after some
discussions, Andrew Morton pulled that patch back out and replaced it with
a new system call:
long sync_file_range(int fd, loff_t offset, loff_t nbytes, int flags);
This call will synchronize a file's data to disk, starting at the given
offset and proceeding for nbytes bytes (or to the end of
the file if nbytes is zero). How the synchronization is done is
controlled by flags:
- SYNC_FILE_RANGE_WAIT_BEFORE blocks the calling process until
any already in-progress writeout of pages (in the given range)
completes.
- SYNC_FILE_RANGE_WRITE starts writeout of any dirty pages in
the given range which are not already under I/O.
- SYNC_FILE_RANGE_WAIT_AFTER blocks the calling process until
the newly-initiated writes complete.
An application which wants to initiate writeback of all dirty pages should
provide the first two flags. Providing all three flags guarantees that
those pages are actually on disk when the call returns.
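The two flag combinations described here can be sketched as a small wrapper. This is only an illustration: it assumes the sync_file_range() wrapper and flag definitions which later glibc versions provide in <fcntl.h> under _GNU_SOURCE, and the names range_sync_mode, range_sync_flags(), and range_sync() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>

enum range_sync_mode {
    RANGE_SYNC_START,   /* start writeback, do not wait for completion */
    RANGE_SYNC_FULL     /* start writeback and wait until it is done */
};

/* Build the flags value for the two cases described in the text. */
static unsigned int range_sync_flags(enum range_sync_mode mode)
{
    /* Wait for writeout already in progress, then start new writeout. */
    unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE;

    /* All three flags: also wait for the newly started writes. */
    if (mode == RANGE_SYNC_FULL)
        flags |= SYNC_FILE_RANGE_WAIT_AFTER;
    return flags;
}

/* Returns 0 on success, -1 with errno set on failure. */
static int range_sync(int fd, off64_t offset, off64_t nbytes,
                      enum range_sync_mode mode)
{
    assert(offset >= 0 && nbytes >= 0);
    return sync_file_range(fd, offset, nbytes, range_sync_flags(mode));
}
```

A call like range_sync(fd, 0, 0, RANGE_SYNC_FULL) would cover the whole file, since an nbytes value of zero means "to the end of the file."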
The new implementation avoids distorting the posix_fadvise()
system call. It also allows synchronization operations to be performed
with a single call, instead of the multiple calls required by the previous
attempt. In
the future, it may also be possible to add other operations to the
flags list; the ability to request metadata synchronization seems
to be high on the list.
(Thanks to Michael Kerrisk - who agitated for this change - for providing
some of the background information.)
Imagine a system with two processes running, one at high priority and the
other at a much lower priority. These processes share resources which are
protected by locks. At some point, the low-priority process manages to run
and obtains a lock for one of those resources. If the high-priority
process then attempts to obtain the same lock, it will have to wait.
Essentially, the low-priority process has trumped the high-priority
process, at least for as long as it holds the contended lock.
Now imagine a third process, one which uses a lot of processor time, and
which has a priority between the other two. If that process starts to
crank, it will push the low-priority process out of the CPU indefinitely.
As a result, the third process can keep the highest-priority process out of
the CPU indefinitely.
This situation, called "priority inversion," tends to be followed by system
failure, upset users, and unemployed engineers. There are a number of
approaches to avoiding priority inversion, including lockless designs,
carefully thought-out locking scenarios, and a technique known as priority
inheritance. The priority inheritance method is simple in concept: when a
process holds a lock, it should run at (at least) the priority of the
highest-priority process waiting for the lock. When a lock is taken by a
low-priority process, the priority of that process might need to be boosted
until the lock is released.
There are a number of approaches to priority inheritance. In effect, the
kernel performs a very simple form of it by not allowing kernel code to be
preempted while holding a spinlock. In some systems, each lock has a
priority associated with it; whenever a process takes a lock, its priority
is raised to the lock's priority. In others, a high-priority process will
have its priority "inherited" by another process which holds a needed
lock. Most priority inheritance schemes have shown a tendency to
complicate and slow down the locking code, and they can be used to paper
over poor application designs. So they are unpopular in many circles.
Linus was reasonably clear about how he
felt on the subject last December:
"Friends don't let friends use priority inheritance".
Just don't do it. If you really need it, your system is broken
anyway.
Faced with this sort of opposition, many developers would quietly shelve
their priority inheritance designs and go back to working on accounting
code.
The kernel development community, however, happens to have a member who has a
track record of getting code merged in spite of this sort of objection:
Ingo Molnar. History may well repeat itself, as Ingo (working with Thomas
Gleixner) has posted a priority-inheriting futex
implementation with a request that it be merged into the mainline.
This approach, says Ingo, provides useful functionality to user space (it
is not meant to provide priority-inheriting kernel mutual exclusion
primitives) while avoiding the pitfalls which have hit other
implementations.
The PI-futex patch adds a couple of new operations to the futex()
system call: FUTEX_LOCK_PI and FUTEX_UNLOCK_PI. In the
uncontended case, a PI-futex can be taken without involving the kernel at
all, just like an ordinary futex. When there is contention, however, the
FUTEX_LOCK_PI operation is requested from the kernel. The
requesting process is put into a special queue, and, if necessary, that
process lends its priority to the process actually holding the contended
futex. The priority inheritance is chained, so that, if the holding process
is blocked on a second futex, the boosted priority will propagate to the
holder of that second futex. As soon as a futex is released, any
associated priority boost is removed.
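The chained lending of priority can be modeled in a few lines of user-space C. This is a toy model, not the kernel's code: the pi_task and pi_lock structures and all function names are hypothetical, larger numbers are assumed to mean higher priority, and releasing a lock simply drops the owner's entire boost (ignoring any other locks that owner might hold).

```c
#include <assert.h>
#include <stddef.h>

struct pi_lock;

struct pi_task {
    int base_prio;               /* larger number = higher priority (assumed) */
    int lent_prio;               /* highest priority lent by waiters, 0 if none */
    struct pi_lock *blocked_on;  /* lock this task waits for, NULL if runnable */
};

struct pi_lock {
    struct pi_task *owner;       /* current holder, NULL if free */
};

/* A task runs at the higher of its own and its borrowed priority. */
static int effective_prio(const struct pi_task *t)
{
    return t->lent_prio > t->base_prio ? t->lent_prio : t->base_prio;
}

/* Record that waiter blocks on lock and lend its priority down the chain. */
static void pi_block_on(struct pi_task *waiter, struct pi_lock *lock)
{
    int prio, depth;

    assert(waiter != NULL && lock != NULL);
    waiter->blocked_on = lock;
    prio = effective_prio(waiter);

    /* Bounded walk so that a lock cycle (a deadlock) cannot loop forever. */
    for (depth = 0; lock && lock->owner && depth < 16; depth++) {
        struct pi_task *owner = lock->owner;

        if (owner->lent_prio < prio)
            owner->lent_prio = prio;
        lock = owner->blocked_on;    /* follow the chain, if any */
    }
}

/* Release: the owner gives up the lock and loses its boost. */
static void pi_release(struct pi_lock *lock)
{
    assert(lock != NULL);
    if (lock->owner)
        lock->owner->lent_prio = 0;
    lock->owner = NULL;
}
```

With three tasks at priorities 1, 5, and 10, where the middle task holds one lock while waiting for a second lock held by the lowest, blocking the highest task on the first lock raises both of the others to priority 10 in this model.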
As with regular futexes, the kernel only needs to know about a PI-futex
while it is being contended. So the number of futexes in the system can
become quite large without serious overhead on the kernel side.
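A rough sketch of the user-space side might look like the following. The convention used here (a futex word holding zero when free and the owner's thread ID when held) is taken from the PI-futex code as it was later merged and is an assumption relative to the text above; pi_futex_lock() and pi_futex_unlock() are hypothetical names, and the sketch relies on the GCC/Clang __atomic builtins.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Take the lock: try the uncontended path first, then ask the kernel. */
static long pi_futex_lock(uint32_t *word)
{
    uint32_t expected = 0;
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    assert(word != NULL);
    /* Uncontended case: 0 -> TID with one atomic operation, no system call. */
    if (__atomic_compare_exchange_n(word, &expected, tid, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return 0;
    /* Contended case: queue in the kernel, lending priority to the holder. */
    return syscall(SYS_futex, word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

/* Drop the lock: TID -> 0 if nobody is waiting, else let the kernel do it. */
static long pi_futex_unlock(uint32_t *word)
{
    uint32_t expected = (uint32_t)syscall(SYS_gettid);

    assert(word != NULL);
    if (__atomic_compare_exchange_n(word, &expected, 0, 0,
                                    __ATOMIC_RELEASE, __ATOMIC_RELAXED))
        return 0;
    /* Waiters are present: the kernel hands the lock over and removes
     * any priority boost. */
    return syscall(SYS_futex, word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```

In the single-threaded case both functions complete without entering the kernel's futex code at all, which is why uncontended PI-futexes cost the kernel nothing.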
Within the kernel, the PI-futex type is implemented by way of a new
primitive called an rt_mutex. The rt_mutex is
superficially similar to the regular mutex, with the addition of the priority
inheritance capability. It is, however, an entirely different type,
with no code shared with the mutex implementation. The API will be
familiar to mutex users, however; in brief, it is:
#include <linux/rtmutex.h>
void rt_mutex_init(struct rt_mutex *lock);
void rt_mutex_destroy(struct rt_mutex *lock);
void rt_mutex_lock(struct rt_mutex *lock);
int rt_mutex_lock_interruptible(struct rt_mutex *lock,
int detect_deadlock);
int rt_mutex_timed_lock(struct rt_mutex *lock,
struct hrtimer_sleeper *timeout,
int detect_deadlock);
int rt_mutex_trylock(struct rt_mutex *lock);
void rt_mutex_unlock(struct rt_mutex *lock);
int rt_mutex_is_locked(struct rt_mutex *lock);
The alert reader may have noticed that this looks much like the realtime
mutex type found in the realtime preemption patch. Ingo once said that the
realtime patches would slowly trickle into the mainline, and that is what
appears to be happening here. With this patch set, the PI-futex code is
the only user of the new rt_mutex type, but that could certainly
change over time.
The PI-futex patch also includes a new, priority-sorted list type which
could find users elsewhere in the kernel.
There has been relatively little discussion of this patch so far; it has
been included in recent -mm trees. It is too late for 2.6.17, but, if no
real opposition develops, the PI-futex code might just find its way into a
subsequent kernel.
One of the
patches in the upcoming 2.6.16.2
stable kernel release is a fix for a security vulnerability designated as
CVE-2006-1055. It makes a small change to the code which implements the
ability to write to sysfs attributes; with this change, the maximum amount
of data which can be written to an attribute is
PAGE_SIZE-1 bytes,
or 4095 on most systems. Since last June, the limit had simply been
PAGE_SIZE, allowing a full page to be written.
Since the page is zeroed before being filled, this change ensures that the
data coming from user space will be null-terminated when it is passed to
the specific sysfs store() function. Without that assurance, that
function might have proceeded merrily off the end of the one-page buffer,
accessing data which did not come from user space and possibly overwriting
buffers elsewhere. The possibility of this happening was enough to raise
security fears and motivate a quick fix.
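The effect of the new limit can be modeled in user space. The sketch below is not the kernel's code: MODEL_PAGE_SIZE and fill_attr_buffer() are hypothetical stand-ins, and the page size is assumed to be 4096 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size */

/* Copy user data into a zeroed, page-sized buffer.
 * Returns the number of bytes actually accepted. */
static size_t fill_attr_buffer(char page[MODEL_PAGE_SIZE],
                               const char *user_data, size_t count)
{
    assert(page != NULL && user_data != NULL);

    /* The fix: accept at most PAGE_SIZE-1 bytes... */
    if (count >= MODEL_PAGE_SIZE)
        count = MODEL_PAGE_SIZE - 1;

    /* ...so that the zeroed final byte always survives as a terminator. */
    memset(page, 0, MODEL_PAGE_SIZE);
    memcpy(page, user_data, count);
    return count;
}
```

With the older limit of a full page, a 4096-byte write would have overwritten that last zero byte, leaving nothing to stop a string function from running past the end of the buffer.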
The interesting thing is that the prototype for the store()
function is:
ssize_t (*store)(struct kobject *kobj, struct attribute *attr,
const char *buffer, size_t size);
The size parameter is the amount of user data being passed in.
So, one might ask, why bother null-terminating the buffer, when its size
has already been made available to the receiving code? Certain developers,
whose code was receiving 4096-byte data via sysfs attributes, have, indeed,
asked that question.
The question was answered, in one way, in the message featured in the quote of the week. More
diplomatically, one might say that, regardless of how the interface was
designed, a number of sysfs attribute implementations have been coded on the
assumption that the incoming data will be null-terminated. So they do not
bother to check the length of that data, and they will do bad things in the
absence of the expected terminator.
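For contrast, a handler which trusts only the size parameter might be sketched as follows. This is a user-space illustration rather than a real sysfs store() method; parse_attr_value() is a hypothetical helper which expects a decimal number, optionally followed by a newline.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Parse a decimal value from the first size bytes of buffer, without
 * assuming that buffer is null-terminated.
 * Returns size on success, -EINVAL on malformed input. */
static ssize_t parse_attr_value(const char *buffer, size_t size, long *value)
{
    char tmp[32];
    char *end;
    long parsed;

    assert(buffer != NULL && value != NULL);

    /* Reject empty input and anything too long for the local copy. */
    if (size == 0 || size >= sizeof(tmp))
        return -EINVAL;

    /* Copy exactly size bytes and add a terminator ourselves. */
    memcpy(tmp, buffer, size);
    tmp[size] = '\0';

    errno = 0;
    parsed = strtol(tmp, &end, 10);
    if (errno != 0 || end == tmp)
        return -EINVAL;
    if (*end == '\n')           /* tolerate the newline added by echo */
        end++;
    if (*end != '\0')
        return -EINVAL;

    *value = parsed;
    return (ssize_t)size;
}
```

Code written this way never reads past the bytes it was given, so it behaves the same whether or not the caller supplied a terminator.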
With the 2.6.16.2 patch, the situation will be fixed and those
implementations made safe again. But it is hard not to be a little nervous
about the situation. If there is carelessly-written code in the tree,
there may be other issues with it as well, and the return of
null-termination may not help much. It would be nicer if there were a way
to verify that the interfaces were being used correctly. In the meantime,
people writing sysfs interfaces - each of which is an interface to user
space and a possible target of attack - may want to look a little more
carefully at their code before submitting it.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Architecture-specific
- Chris Leech: I/OAT.
(March 30, 2006)
Security-related
Miscellaneous
Page editor: Jonathan Corbet