The current 2.6 prepatch is 2.6.8-rc4, which was announced
by Linus on August 9. There was,
he says, just a little too much new stuff in there for him to have been
comfortable putting it out directly as 2.6.8. That new stuff includes a
replaced 586-optimized AES implementation, a new internal infrastructure
for handling file positioning and seekability (see below),
a sysctl API change, and some architecture updates. See
the long-format changelog
for the details.
Linus's BitKeeper tree contains a big Prism54 driver update and various
fixes. Things are stabilizing for an official 2.6.8 release which may have
happened by the time you read this.
The current prepatch from Andrew Morton is 2.6.8-rc4-mm1. Recent additions to -mm include
a mechanism for gathering CPU scheduler statistics, the "mlock as user"
patch (covered briefly last week), some
asynchronous I/O fixes, version 17 of the wireless extensions API,
some read-copy-update enhancements, resident set size ulimit support (see
below), in-kernel cryptographic keyring management, a number of
architecture updates, and lots of fixes. The staircase scheduler has been
dropped from -mm for now ("it used up its time slice") in favor of a
simpler patch which simply disables the use of the expired array. The
quest for the best way to improve the scheduler continues.
The current 2.4 kernel is 2.4.27, released by Marcelo on August 7. 2.4.27
contains fixes for a handful of security
problems, some new crypto algorithms, a big serial
ATA update, TCP Vegas and BIC backports from 2.6, and vast numbers of
Kernel development news
The lseek() system call allows user space to move the current
read/write position within a file. It is not an operation which normally
attracts attention, since its full effect is, usually, to change an
internal integer index. It turns out, however, that lseek() has
been poorly implemented in many parts of the kernel. The recent vulnerability
discovered by Paul
Starzetz has highlighted the problem, with the result that the internal
handling of lseek()
is changing significantly for 2.6.8.
Seeking within a file is straightforward; it is just a matter of changing
the current position index inside the kernel. The situation gets a little
murkier, however, when dealing with things that are not regular files.
Virtual files implemented by the kernel can often be seeked in a meaningful
way, if it's done carefully; the same is true of a very small number of
physical devices. For most devices, however, along with objects like
network connections, seeking makes no sense at all.
The default behavior for lseek() is to change the internal offset
pointer and return success; if code for the underlying object (device,
network connection, file, etc.) has not provided its own llseek()
method, the call appears to succeed. Implementation of a non-seekable
device requires an explicit action, instead, to ensure that user space is
given the proper error.
The traditional way of handling lseek() within a device driver is
to include a simple llseek() method which looks like this:
    loff_t my_llseek(struct file *file, loff_t offset, int whence)
    {
        return -ESPIPE; /* Not seekable */
    }
More recent kernels (2.4 and beyond) also provide a no_llseek()
helper which looks like the above.
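From user space, the effect of a correctly non-seekable object can be observed with a pipe, which refuses seeks. What follows is a minimal sketch assuming a POSIX system; the helper name lseek_errno_on_pipe() is invented for this illustration and is not part of any kernel or library API.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a pipe, try to seek on its read end,
 * and return the errno value left by the failed lseek(). Returns 0
 * if the seek unexpectedly succeeded, -1 if no pipe could be created.
 */
int lseek_errno_on_pipe(void)
{
    int fds[2];
    int err = 0;

    if (pipe(fds) != 0)
        return -1;

    /* A pipe has no file position, so this call is expected to fail. */
    if (lseek(fds[0], 0, SEEK_SET) == (off_t)-1)
        err = errno;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

On Linux the returned value should be ESPIPE, the same error code that the llseek() method above hands back.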
This technique works, as long as the author bothers to do things this way.
In some cases, this little step gets skipped, and the resulting object
appears seekable even though it is not. Even when this method is provided,
however, it is not a complete solution; the pread() and pwrite() system
calls, which supply an explicit offset for each operation, also involve
seeks. Objects
within the kernel do not see these calls directly; they just look like
regular read() and write() calls. This works because the
internal methods for these calls are always passed the offset to use.
What this means is that, for a non-seekable object, every read()
or write() method should include a test like this:
    ssize_t my_read(struct file *filp, char *buf, size_t count,
                    loff_t *ppos)
    {
        /* ... */
        if (ppos != &filp->f_pos)
            return -ESPIPE;
        /* ... */
    }
This test works because, for normal read() and write()
calls, the ppos pointer goes directly to the offset
(f_pos) stored in the file structure. If ppos
points elsewhere, it means that a pread() or pwrite()
call has been made, and an error should be returned. These tests are
simple, but they are bits of boilerplate code which must be added to the
implementation of all non-seekable objects, and not all authors bother.
After all, for most uses, the code works just fine without.
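The pointer comparison can be tried outside the kernel with a small user-space model. This is only a sketch of the dispatch described above: the sketch_ types and functions are simplified stand-ins invented here, not the kernel's real definitions.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Simplified stand-ins for the kernel types; not the real definitions. */
typedef long long sketch_loff_t;

struct sketch_file {
    sketch_loff_t f_pos;
};

/* A read method for a non-seekable object, using the pointer test. */
ssize_t sketch_read(struct sketch_file *filp, char *buf, size_t count,
                    sketch_loff_t *ppos)
{
    size_t i;

    if (ppos != &filp->f_pos)
        return -ESPIPE;         /* positional read: refuse it */

    for (i = 0; i < count; i++)
        buf[i] = 'x';           /* pretend the device produced data */
    *ppos += (sketch_loff_t)count;
    return (ssize_t)count;
}

/* Model of read(): the method is handed the file's own position. */
ssize_t sketch_sys_read(struct sketch_file *filp, char *buf, size_t count)
{
    return sketch_read(filp, buf, count, &filp->f_pos);
}

/* Model of pread(): the method is handed a caller-supplied position. */
ssize_t sketch_sys_pread(struct sketch_file *filp, char *buf, size_t count,
                         sketch_loff_t pos)
{
    return sketch_read(filp, buf, count, &pos);
}
```

In this model an ordinary read advances f_pos, while the positional variant fails with -ESPIPE and leaves f_pos untouched.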
The above code also forces widespread knowledge of the contents of the
file structure and how position information is passed to
read() and write() methods. For sysctl methods,
things are even worse: there is no position passed in, so there is no
alternative to getting it from the file structure.
Finally, there are some interesting race conditions associated with the
handling of file offsets. Often a device driver will test a position for
validity, sleep (while waiting for device operations or user-space copies),
then change the offset. But that offset could have changed in other ways
during the sleep, leaving its final value in an indeterminate state.
In response to all this, Linus has thrown together a set of patches
changing the way seeks are handled inside the kernel. These patches have
found their way into 2.6.8-rc4, but they were not posted
separately on any open mailing lists first. The
first patch adds a new FMODE_LSEEK bit to the file
structure, so that the virtual filesystem (VFS) code knows which files are
seekable and which are not. The idea is to move all tests for illegal
seeks to the core VFS
code. A second patch adds separate mode
bits for pread() and pwrite(); as it turns out, files
implemented with the seq_file interface are
seekable, but do not support those two calls.
A pair of patches then followed to make use of the new tests in the VFS
core. The nonseekable_open()
helper was added to enable drivers (and other code) to clear the new bits
and mark an object as not being seekable. It is meant to be called in the
corresponding open() method. Then came changes to a large number of drivers making
them use the new infrastructure; the net result was the removal of quite a
bit of code.
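The idea of central mode-bit checks can be illustrated with a user-space model. This is a sketch under stated assumptions: the sk_ names are invented, the bit values are arbitrary, and the real VFS code is more involved than what is shown.

```c
#include <errno.h>

/* Illustrative bit values only; the kernel's real values may differ. */
#define SK_FMODE_LSEEK  0x04u
#define SK_FMODE_PREAD  0x08u
#define SK_FMODE_PWRITE 0x10u

struct sk_file {
    unsigned int f_mode;
    long long f_pos;
};

/* Model of the generic open path: files start out fully seekable. */
void sk_default_open(struct sk_file *filp)
{
    filp->f_mode = SK_FMODE_LSEEK | SK_FMODE_PREAD | SK_FMODE_PWRITE;
    filp->f_pos = 0;
}

/* Model of the helper a driver's open() method would call. */
int sk_nonseekable_open(struct sk_file *filp)
{
    filp->f_mode &= ~(SK_FMODE_LSEEK | SK_FMODE_PREAD | SK_FMODE_PWRITE);
    return 0;
}

/* Model of the central check: the driver never sees an illegal seek. */
long long sk_vfs_lseek(struct sk_file *filp, long long offset)
{
    if (!(filp->f_mode & SK_FMODE_LSEEK))
        return -ESPIPE;
    if (offset < 0)
        return -EINVAL;
    filp->f_pos = offset;
    return offset;
}

/* Model of the separate test applied to positional reads. */
int sk_pread_allowed(const struct sk_file *filp)
{
    return (filp->f_mode & SK_FMODE_PREAD) != 0;
}
```

Separate bits make the seq_file case expressible: such a file would keep the seek bit while having the positional read and write bits cleared.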
It's worth noting that this patch represents a change in how device drivers
should be written, but the actual API has not been changed in any
incompatible ways. Unmodified drivers will still work - at least, as well
as they did before.
The sysctl change does involve an API
change, however. All sysctl methods now have the offset passed in
explicitly as a parameter; they should no longer go digging through the
file structure for that information. Unmodified sysctl
implementations will no longer compile.
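As a rough illustration of a handler that takes its offset as a parameter, consider the user-space sketch below. The function name and its reduced parameter list are hypothetical; it does not reproduce the kernel's real sysctl handler prototype and only shows the offset arriving through ppos rather than being fetched from a file structure.

```c
#include <stddef.h>
#include <string.h>

typedef long long sc_loff_t;    /* stand-in for the kernel's loff_t */

/*
 * Hypothetical read-only handler: copy a string value out to the
 * caller. On entry *lenp is the buffer size; on return it holds the
 * number of bytes produced. The offset is passed in through ppos.
 */
int sc_string_handler(const char *value, char *buffer, size_t *lenp,
                      sc_loff_t *ppos)
{
    size_t len;

    /* A nonzero offset means the value was already handed out. */
    if (*ppos != 0) {
        *lenp = 0;
        return 0;
    }

    len = strlen(value);
    if (len > *lenp)
        len = *lenp;            /* never overrun the caller's buffer */
    memcpy(buffer, value, len);
    *lenp = len;
    *ppos += (sc_loff_t)len;
    return 0;
}
```

A second call with the updated offset produces zero bytes, which is how the caller learns that the value has been fully read.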
The final step is to change how the
read() and write() system calls are implemented. They
now create a copy of the f_pos field and pass that to the
appropriate methods, and copy the result back afterward. So those methods
never work with f_pos directly, regardless of how they are
invoked. As a result of all this work, the handling of seeking has become
simpler and more robust.
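The copy-in, copy-out scheme can be modeled in user space as follows. This is a sketch only: the cp_ names are invented for the example, and the saw_f_pos_directly field exists purely to show that the method is never handed the address of f_pos.

```c
#include <stddef.h>
#include <sys/types.h>

typedef long long cp_loff_t;    /* stand-in for the kernel's loff_t */

struct cp_file;
typedef ssize_t (*cp_read_fn)(struct cp_file *, char *, size_t,
                              cp_loff_t *);

struct cp_file {
    cp_loff_t f_pos;
    cp_read_fn read;
    int saw_f_pos_directly;     /* set if a method is handed &f_pos */
};

/* Model of a driver read method; it only ever touches *ppos. */
ssize_t cp_driver_read(struct cp_file *filp, char *buf, size_t count,
                       cp_loff_t *ppos)
{
    size_t i;

    if (ppos == &filp->f_pos)
        filp->saw_f_pos_directly = 1;
    for (i = 0; i < count; i++)
        buf[i] = 'x';           /* pretend the device produced data */
    *ppos += (cp_loff_t)count;
    return (ssize_t)count;
}

/* Model of read(): work on a local copy, then store the result. */
ssize_t cp_sys_read(struct cp_file *filp, char *buf, size_t count)
{
    cp_loff_t pos = filp->f_pos;            /* copy in */
    ssize_t ret = filp->read(filp, buf, count, &pos);

    filp->f_pos = pos;                      /* copy back */
    return ret;
}
```

One consequence visible in this model is that a method can no longer tell read() from pread() by comparing ppos against the address of f_pos, so that distinction has to be made elsewhere.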
One of the problems which can afflict any virtual memory system is a
process which expands to fill all of memory. All it takes is, say, a quick
OpenOffice session, and everything else running on the system finds itself
shoved into a corner of memory and pushed out onto swap. Avoiding this
problem is a simple matter of limiting the amount of physical memory that
any given process can occupy, but Linux lacks such limits.
Rik van Riel seems to have started off on a series of relatively simple
patches which address immediate VM issues. His latest patch implements resident set size limits for
Linux processes. Once this patch is applied, a bit of appropriate limit
setting could do a lot to keep those memory hog processes in their place.
The core of the patch comes down to two lines:
    if (mm->rss > mm->rlimit_rss)
        referenced = 0;
This code appears in the function page_referenced_one(), which
tries to decide whether a process has actually made use of one of its
in-core pages. If the page has not been referenced, it goes directly onto
the list of pages to reclaim. All that this particular patch is doing is
pretending that a process which has exceeded its maximum resident set size
has not actually used any of its pages; as a result, the memory hog's pages
will be the first ones to be reclaimed.
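The effect of that override can be shown with a small user-space model. It is a sketch only: the rs_ structure and function are invented stand-ins, and the real page_referenced_one() does considerably more than this.

```c
/* Stand-in for the per-process memory descriptor; not the real one. */
struct rs_mm {
    unsigned long rss;          /* pages currently resident */
    unsigned long rlimit_rss;   /* resident set size limit, in pages */
};

/*
 * Model of the decision: hw_referenced says whether the hardware
 * recorded a use of the page. A process over its limit is treated as
 * if it had not touched the page, making the page a reclaim candidate.
 */
int rs_page_referenced(const struct rs_mm *mm, int hw_referenced)
{
    int referenced = hw_referenced ? 1 : 0;

    if (mm->rss > mm->rlimit_rss)
        referenced = 0;
    return referenced;
}
```

A process under its limit keeps the benefit of its recent references; one over the limit loses it, whatever the hardware reported.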
This patch applies on top of the token-based mechanism discussed last week. It modifies that code by depriving
a process of the swap token once it goes over its memory limit.
Many systems in the past have chosen to implement hard resident set size
limits. On such systems, a process which incurs a page fault will, if it's
at its memory limit, immediately surrender one other page back to the
memory management system. Rik's patch works differently, in that there are
no hard limits. If there is no particular memory pressure, a process can
grow to any size. The limit is only applied when the system starts looking
for pages to reclaim for other users. This approach is simple, which is
always good; it also allows the system to make full use of its memory when
there is not a lot of contention.
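Setting such a limit from user space would presumably go through the usual resource-limit interface. The sketch below uses the standard getrlimit() and setrlimit() calls with RLIMIT_RSS; the helper name set_rss_soft_limit() is invented here, and whether the limit has any effect depends on the running kernel actually enforcing it.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: lower the calling process's soft resident set
 * size limit to the given number of bytes, clamped to the hard limit.
 * Returns 0 on success, -1 on failure.
 */
int set_rss_soft_limit(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_RSS, &rl) != 0)
        return -1;

    /* The soft limit may not exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        bytes = rl.rlim_max;

    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_RSS, &rl);
}
```

Since limits are inherited across fork() and exec(), a wrapper could call this helper before starting a memory-hungry program.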
Spinlocks, as the core kernel synchronization primitive, are highly
performance critical. They are implemented differently on each
architecture, by way of some carefully-crafted assembly code, so that not
one extra cycle is spent there, especially when the lock is not contended.
They are also implemented as inline assembly, so that no function calls get
in the way of that fast path.
Recently, however, Zwane Mwaikambo has pulled a
patch out of the -tiny tree which moves spinlocks into normal,
out-of-line functions - at least, on the x86 and x86-64 architectures. The
reason for doing this is to shrink the kernel; there are a lot of
spinlock calls in the kernel, and the inline code gets replicated for every
one of them. Moving the spinlock code out of line gets rid of that
duplication, and shrinks the kernel text size by 50KB or so.
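The difference between the two forms can be sketched in portable C11 rather than architecture-specific assembly. This is an illustration only: the sl_ names are invented, the lock is a simple test-and-set loop unlike the kernel's real implementation, and the noinline attribute is a GCC/Clang extension.

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;
} sl_lock_t;

/* Inline form: the loop is replicated at every call site. */
static inline void sl_lock_inline(sl_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;                       /* spin until the flag was clear */
}

/* Out-of-line form: one shared copy, reached by a function call. */
__attribute__((noinline)) void sl_lock_outofline(sl_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;                       /* spin until the flag was clear */
}

void sl_unlock(sl_lock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
int sl_trylock(sl_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}
```

Both lock functions behave identically; the only difference is whether the compiler emits the loop once or at each caller, which is the text-size trade-off described above.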
Zwane posted some benchmarks showing that there are no performance
regressions. In fact, on some hardware, the improved cache utilization
brought about by pulling together the spinlock code can actually improve
performance by a slight amount.
The patch comes with a configuration option allowing the spinlock code to
be built in either mode. Given that moving the code out of line seems to
be a win, some have wondered if things shouldn't always be done that way.
Linus pointed out one advantage to the
inline code: it makes the sources of lock contention very clear in kernel
profiles. With out-of-line spinlocks, all a profile will show is that a
lot of time was spent waiting for locks; with the code inline, the function
which is actually waiting for the lock shows up instead. So out-of-line
locks may be best for production kernels, but developers may want to keep
the inline version.
The Minneapolis Cluster Summit, held on July 29 and 30, was a
gathering of developers interested in
pushing forward the state of the art in Linux clustering. The slides
from the presentations
have now been posted. The topics covered
include high availability, OpenSSI, cluster block devices, GFS, lock
management, and more.
Page editor: Jonathan Corbet