The current 2.6 release is 2.6.4-rc1, which was announced by Linus on
February 27.
This large patch contains support for Intel's "ia32e"
architecture, a new syscalls.h
include file with
prototypes for the various sys_*
functions, various network driver
fixes, a UTF-8 tty mode, dynamic PTY allocation (allowing up to a million
PTY devices), sysfs support for SCSI tapes and bluetooth devices, the
"large number of groups" patch (covered in the
October 2 Kernel Page
), the generic kernel thread code (January 7 Kernel Page
), an HFS filesystem
rewrite, and a massive
number of other fixes. See the long-format changelog for the details.
Linus's BitKeeper tree contains a number of parallel port fixes, various
architecture updates, the reversion of a patch which had removed threads
from /proc (and broke gdb), an XFS update, a FireWire update
(including a change noting that IEEE1394 support is no longer
experimental), and numerous fixes.
The current kernel tree from Andrew Morton is 2.6.4-rc1-mm2. Recent additions to the -mm
tree include more scheduler tweaks, some big NFS updates, the POSIX message
queues patch, a 4K stack option for the x86 architecture, some VM
optimizations, the removal of some old network device API functions (see
below), and numerous other fixes and updates.
The current 2.4 kernel is 2.4.25. Marcelo has released no 2.4.26
prepatches since 2.4.26-pre1.
Kernel development news
The asynchronous I/O infrastructure was added in 2.5 as a way to allow
processes to initiate I/O operations without having to wait for their
completion. The underlying mechanism is documented in this Driver Porting
Series article. The actual implementation of asynchronous I/O in the kernel
has been somewhat spotty, however. It works for some devices (which have
specifically implemented that support) and for direct file I/O. Other
sorts of potentially interesting uses, such as with regular buffered file
I/O, have remained unimplemented.
Part of the problem is that buffered file I/O integrates deeply with the
page cache and virtual memory subsystem. It is not all that easy to graft
asynchronous I/O operations into those complex bodies of code. So the
kernel developers have, for the most part, simply punted on cases like
asynchronous buffered I/O.
Suparna Bhattacharya, however, has not given up so easily. For over a
year now, she has been working on a set of patches which bring the
asynchronous mode to the buffered I/O realm. A new set of patches has
recently been posted which trims down the buffered AIO changes to the bare
minimum. So this seems like a good time to take a look at what is involved
in making asynchronous buffered I/O work.
The architecture implemented by these patches is based on retries. When an
asynchronous file operation is requested, the code gets things started and
goes as far as it can until something would block; at that point it makes a
note and returns to the caller. Later, when the roadblock has been taken
care of, the operation is retried until the next blocking point is hit.
Eventually, all the work gets done and user space can be notified that the
requested operation is complete. The initial work is done in the context
of the process which first requested the operation; the retries are handled
out of a workqueue.
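The retry model described above can be sketched in plain user-space C. This is not the kernel's code; the structure and function names here are invented for illustration, and the retry "workqueue" is collapsed into a simple loop:

```c
/* A minimal user-space sketch of the retry model: each pass makes as
 * much progress as it can, returns -EIOCBRETRY when it "would block",
 * and is re-entered later (in the kernel, from a workqueue) until the
 * whole request is done. All names here are illustrative. */
#include <assert.h>

#define EIOCBRETRY 530          /* numeric value is immaterial here */

struct aio_request {
    int done;                   /* bytes transferred so far */
    int total;                  /* bytes requested */
    int chunk;                  /* progress per pass before "blocking" */
};

/* One pass: advance until we would block, then ask to be retried. */
static int aio_pass(struct aio_request *req)
{
    req->done += req->chunk;
    if (req->done >= req->total) {
        req->done = req->total;
        return 0;               /* request complete */
    }
    return -EIOCBRETRY;         /* would block: come back later */
}

/* Driver loop standing in for the kernel's retry workqueue;
 * returns how many times the request had to be retried. */
static int aio_run(struct aio_request *req)
{
    int retries = 0;
    while (aio_pass(req) == -EIOCBRETRY)
        retries++;              /* in the kernel, a wakeup triggers this */
    return retries;
}
```

The point of the shape is that `aio_pass()` never sleeps: every blocking point becomes a return-and-resume boundary instead.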
For things to work in this mode, kernel code in the buffered I/O path must
be taught not to block when it is working on an asynchronous request. The
first step in this direction is the concept of an asynchronous wait queue
entry. Wait queue entries are generally used, unsurprisingly, for waiting;
they include a pointer to the process which is to be awakened when the wait
is complete. With the AIO retry patch, a wait queue entry which has a
NULL process pointer is taken to mean that actually waiting is not
desired. When this type of wait queue entry is encountered, functions like
prepare_to_wait() will not put the process into a sleeping state
(though they do add the wait queue entry to the associated wait queue),
and some functions will return the new error code -EIOCBRETRY
rather than actually sleeping.
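A user-space analogue of the NULL-task convention might look like the following. These are simplified stand-ins, not the kernel's actual structures or the patched prepare_to_wait():

```c
/* Sketch of the "asynchronous wait queue entry" idea: an entry whose
 * task pointer is NULL signals that the caller must not sleep; the
 * prepare_to_wait()-like helper still queues the entry, but returns
 * -EIOCBRETRY instead of preparing to block. Illustrative names only. */
#include <assert.h>
#include <stddef.h>

#define EIOCBRETRY 530

#define TASK_RUNNING         0
#define TASK_UNINTERRUPTIBLE 2

struct task { int state; };          /* stand-in for task_struct */

struct wait_entry {
    struct task *task;               /* NULL => asynchronous entry */
    int queued;
};

/* Queue the entry; only change task state for a real sleeper. */
static int prepare_to_wait_sketch(struct wait_entry *w)
{
    w->queued = 1;                   /* always goes on the wait queue */
    if (w->task == NULL)
        return -EIOCBRETRY;          /* async: tell caller to back out */
    w->task->state = TASK_UNINTERRUPTIBLE;
    return 0;                        /* sync: caller may now schedule() */
}
```

Note that the entry is queued in both cases; that is what allows a later wakeup to find it and restart the asynchronous operation.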
The next step is to add a new io_wait entry to the task
structure. When AIO retries are being performed, that entry is pointed to
an asynchronous wait queue entry associated with the specific AIO request.
This task structure field is, for all practical purposes, being used in a
hackish manner to pass the wait queue entry into functions deep inside the
virtual memory subsystem. It might have been clearer to pass it explicitly
as a parameter, but that would require changing large numbers of internal
interfaces to support a rarely-used functionality. The io_wait
solution is arguably less clean, but it also makes for a far less invasive patch.
It does mean, however, that work can only proceed on a single AIO request
at a time.
Finally, a few low-level functions have been patched to note the existence
of a special wait queue entry in the io_wait field and to use it
instead of the local entry that would normally have been used. In
particular, page cache functions like wait_on_page_locked() and
wait_on_page_writeback() have been modified in this way. These
functions are normally used to wait until file I/O has been completed on a
page; they are the point where buffered I/O often blocks. When AIO is
being performed, they will instead return the -EIOCBRETRY error code.
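The way a task-structure field lets code deep in the page cache discover the asynchronous caller can be sketched as follows (a user-space analogue; the real functions and the io_wait field are more involved):

```c
/* Sketch of the io_wait idea: page-cache code checks whether the
 * current task carries an async wait queue entry; if so, it queues
 * that entry and returns -EIOCBRETRY instead of sleeping. The names
 * are simplified analogues, not the kernel's. */
#include <assert.h>
#include <stddef.h>

#define EIOCBRETRY 530

struct wait_entry { int queued; };

struct task {
    struct wait_entry *io_wait;  /* non-NULL while an AIO retry runs */
};

static struct task current_task; /* stand-in for the kernel's "current" */

/* wait_on_page_locked() analogue: wait synchronously, unless the
 * current task has an asynchronous wait entry in io_wait. */
static int wait_on_page_locked_sketch(int page_locked)
{
    struct wait_entry *w = current_task.io_wait;

    if (!page_locked)
        return 0;                /* nothing to wait for */
    if (w) {
        w->queued = 1;           /* woken when the page unlocks */
        return -EIOCBRETRY;      /* AIO path: do not sleep */
    }
    /* the synchronous path would sleep here until the page unlocks */
    return 0;
}
```

This also shows why only one AIO request can proceed at a time: there is a single io_wait slot per task.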
The AIO code also takes advantage of the fact that wait queue entries, in
2.6, contain a pointer to the function to be called to wake up the waiting
process. With an asynchronous request, there may be no
such process; instead, the kernel needs to attempt the next retry. So the
AIO code sets up its own wakeup function which does not actually wake any
processes, but which does restart the relevant I/O request.
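Since 2.6 wait queue entries carry a wakeup callback, the AIO case can be modeled by swapping in a different function pointer. A toy version (invented names, with the workqueue handoff reduced to a counter):

```c
/* Sketch of the per-entry wakeup function: instead of waking a
 * sleeping task, the AIO entry's callback kicks off the next retry
 * pass for the request. Names are illustrative, not kernel code. */
#include <assert.h>
#include <stddef.h>

struct request { int retries_kicked; };

struct wait_entry {
    int (*wake)(struct wait_entry *);
    struct request *req;
};

/* Default wakeup: would wake the sleeping task (elided here). */
static int default_wake(struct wait_entry *w) { (void)w; return 1; }

/* AIO wakeup: no task to wake; schedule another retry instead
 * (the real code hands the request to the AIO workqueue). */
static int aio_wake(struct wait_entry *w)
{
    w->req->retries_kicked++;
    return 0;
}

static void wake_up_entry(struct wait_entry *w) { w->wake(w); }
```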
Once that structure is in place, all that's left is a bit of housekeeping
code to keep track of the status of the request between retries. This work
is done entirely within the AIO layer; as each piece of the request is
satisfied, the request itself as seen by the filesystem layer is modified
to take that into account. When the operation is retried to transfer the
next chunk of data, it looks like a new request with the already-completed
portion removed.
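The bookkeeping between retries amounts to advancing the request so each pass covers only the remainder; a minimal sketch (fields and names invented for illustration):

```c
/* Sketch of the retry bookkeeping: between passes the AIO layer
 * advances the request's file position and shrinks the remaining
 * count, so each retry looks to the filesystem like a fresh request
 * for what is left. Illustrative, not the kernel's structures. */
#include <assert.h>

struct iocb_state {
    long pos;      /* file position for the next pass */
    long left;     /* bytes still wanted */
};

/* Record that 'done' bytes were transferred by the last pass. */
static void aio_advance(struct iocb_state *s, long done)
{
    s->pos  += done;
    s->left -= done;
}
```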
Add in a few other hacks (telling the readahead code about the entire AIO
request, for example, and an AIO implementation for pipes) and the patch
set is complete. It does not attempt to fix every spot which might block
(that would be a large task), but it should take care of the most important
cases.
The last few 2.6 kernel releases have seen a lot of patches removing calls
to a set of network driver support functions, including
init_etherdev() and dev_alloc().
With the integration of networking and sysfs, static net_device
structures have become impossible to use in a safe way; these structures
must now be allocated dynamically and properly reference counted. See this
Driver Porting Series article for details on the currently supported
interface.
As of 2.6.3, there are no users of those functions in the mainline kernel
tree. There are, however, certain to be out-of-tree drivers which still
use them. Those drivers will need to be fixed soon; the 2.6.3-mm4 kernel
tree added a patch which removes those functions forevermore. Once that
patch works its way into the mainline kernel, any driver relying upon
init_etherdev() and friends will cease to work until it is fixed.
Don't say you haven't been warned.
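The pattern the new interface enforces, dynamic allocation with reference counting, cannot be demonstrated with real kernel code in user space, but a small analogue captures it. Everything below is an invented illustration, not the net_device API itself:

```c
/* User-space analogue of the alloc_netdev()-style pattern: the device
 * structure and its private data live in one heap allocation, and the
 * memory is only freed when the last reference is dropped. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct net_device_sketch {
    int refcount;
    char name[16];
    void *priv;                 /* points into the same allocation */
};

/* Constructor: one allocation, initial refcount of 1. */
static struct net_device_sketch *dev_alloc_sketch(size_t priv_size,
                                                  const char *name)
{
    struct net_device_sketch *d = malloc(sizeof(*d) + priv_size);

    if (!d)
        return NULL;
    d->refcount = 1;
    strncpy(d->name, name, sizeof(d->name) - 1);
    d->name[sizeof(d->name) - 1] = '\0';
    d->priv = d + 1;            /* private area follows the struct */
    return d;
}

static void dev_hold_sketch(struct net_device_sketch *d)
{
    d->refcount++;
}

/* Returns 1 when the device was actually freed. */
static int dev_put_sketch(struct net_device_sketch *d)
{
    if (--d->refcount == 0) {
        free(d);
        return 1;
    }
    return 0;
}
```

The key property is that a static structure can never satisfy this lifecycle: something may still hold a reference after the driver thinks it is done with the device.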
Steve Longerbeam (of MontaVista) has sent out an announcement
for a new
filesystem called "pramfs." He would like to see pramfs merged into the
mainline kernel in the near future; let it not be said that embedded Linux
companies do not contribute to the kernel.
Pramfs (the "protected and persistent RAM special filesystem") is a
specialized filesystem; it is intended for use in embedded systems which
provide a bank of non-volatile memory for user data storage. Think, for
example, of a phone book housed within a mobile telephone. Such memory
tends to be fast, but it is not normally part of the system's regular core
memory. It also tends to be important; cell phone users will not tolerate
a phone which scrambles their phone numbers.
To meet the special needs presented by non-volatile RAM filesystems, pramfs
does a number of things differently than normal filesystems. Since there
is no need to worry about the (nonexistent) performance impacts of block
positioning, pramfs doesn't. Since pramfs filesystems are expected to live
in fast memory, there is generally no performance benefit to caching pages
in main memory. So pramfs, interestingly, forces all file I/O to be
direct; essentially, it forces the O_DIRECT flag on all file
opens. In that way, pramfs gets the benefits of shorting out the page
cache without having to change applications to use O_DIRECT.
Pramfs also goes out of its way to avoid corruption of the filesystem. If
the underlying non-volatile RAM is represented in the system's page tables,
it is marked read-only to keep a stray write from trashing things. When an
explicit write to the filesystem is performed, the page permissions are
changed only for the time required to perform the I/O. Pramfs disallows
writes from the page cache; one practical result of that prohibition is
that shared mappings of pramfs-hosted files are not possible.
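The protection scheme, keeping the storage read-only except inside an explicit write, can be demonstrated in user space with mmap() and mprotect(). This is an illustration of the idea, not pramfs code:

```c
/* Runnable user-space illustration of the write-window scheme: the
 * "storage" region stays read-only, and write permission is opened
 * only for the duration of an explicit write, so a stray write
 * elsewhere would fault instead of corrupting the data. */
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void *protected_region;
static size_t region_size;

static void *region_create(void)
{
    region_size = (size_t)sysconf(_SC_PAGESIZE);
    protected_region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (protected_region == MAP_FAILED)
        return NULL;
    memset(protected_region, 0, region_size);
    /* normal state: read-only */
    mprotect(protected_region, region_size, PROT_READ);
    return protected_region;
}

/* An explicit write: open the window, copy, close the window. */
static int region_write(size_t off, const void *buf, size_t len)
{
    if (off + len > region_size)
        return -1;
    mprotect(protected_region, region_size, PROT_READ | PROT_WRITE);
    memcpy((char *)protected_region + off, buf, len);
    mprotect(protected_region, region_size, PROT_READ);
    return 0;
}
```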
See the pramfs web site for more information.
Those who have been watching kernel development for a little while will
remember the fun that
came with the 2.4.10 release, when Linus replaced the virtual memory
subsystem with a new implementation by Andrea Arcangeli. The 2.4 kernel
did end up with a stable VM some releases thereafter, but many developers
were upset that such a major change would be merged that far into a stable
series, especially since many of those developers were not convinced that
the previous VM was not fixable.
The 2.4 changes are long past, but the memories are fresh enough that when
Andrea put forward a set of VM changes
which, while they are for 2.4, are said to be applicable to 2.6 as well,
people took notice. Andrea's goals this time are a little more focused; he
is concerned with the performance of systems with at least 32GB of
installed memory and hundreds of processes with shared mappings of large
files. This, of course, is the sort of description that might fit a
high-end database server.
Andrea has found three problems which make those massive servers fail to
function well. The first has to do with how 2.4 performs swapout; it works
by scanning each process's virtual address space, and unmapping pages that
it would like to make free. When a page's mapping count reaches zero, it
gets kicked out of main memory. The problem is that this algorithm
performs poorly in situations where many processes have the same, large
file mapped. The VM will start by unmapping the entire file for the first
process, then another, and so on. Only when it has passed through all of
the processes mapping the file can it actually move pages out of main
memory. Meanwhile, all of those processes are incurring minor page faults
and remapping the pages. With enough memory and processes, the VM
subsystem is almost never able to actually free anything.
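A toy model makes the failure mode concrete: a page shared by N processes can only be freed after every process's mapping of it has been scanned away, so one full pass over one address space frees nothing. Names here are invented for illustration:

```c
/* Toy model of the 2.4 swapout behaviour described above: the page's
 * mapping count only reaches zero after every sharing process has
 * been visited, so reclaim makes no progress until the very end. */
#include <assert.h>

struct shared_page { int mapcount; };

/* Unmap the page from one process. Returns 1 only when the last
 * mapping is gone and the page can finally be freed. */
static int unmap_from_one_process(struct shared_page *p)
{
    return --p->mapcount == 0;
}

/* Count how many process scans were needed before any memory could
 * actually be reclaimed. */
static int scans_until_freeable(int nprocs)
{
    struct shared_page p = { nprocs };
    int scans = 1;

    while (!unmap_from_one_process(&p))
        scans++;
    return scans;
}
```

With hundreds of processes and faults remapping pages in the meantime, the count effectively never reaches zero, which is the behavior Andrea observed.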
This is the problem that the reverse-mapping VM (rmap) was added to 2.5 to
solve. By working directly with physical pages and following pointers to
the page tables which map them, the VM subsystem can quickly free pages for
other use. Andrea is critical of rmap, however; with his scenario of 32GB
of memory and hundreds of processes, the rmap infrastructure grows to a
point where the system collapses. Instead, for his patches, he has
implemented a variant of the object-based
reverse mapping scheme. Object-based reverse mapping works by
following the links from the object (a shared file, say) which backs up the
shared memory; in this way it is able to dispense with the rmap structures
in many situations. There are some concerns about pathological performance
issues with the object-based approach, but those problems do not seem to
arise in real-world use.
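The object-based traversal can be sketched as a walk over the VMAs attached to the backing file: for a given page of the file, each VMA that covers its offset yields the virtual address to unmap. A simplified, runnable model (invented structures, not the kernel's):

```c
/* Sketch of object-based reverse mapping: rather than a per-page list
 * of mappers, walk the VMAs attached to the backing file object and
 * compute where each one maps a given page of the file. */
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SK 4096UL

struct vma_sketch {
    unsigned long start;        /* virtual start address */
    unsigned long pgoff;        /* file offset of start, in pages */
    unsigned long npages;       /* length of the mapping, in pages */
    struct vma_sketch *next;    /* next VMA mapping the same file */
};

struct file_object { struct vma_sketch *vmas; };

/* For page 'pgoff' of the file, count the VMAs mapping it and report
 * the virtual address of the first hit (objrmap would unmap each). */
static int objrmap_lookup(struct file_object *f, unsigned long pgoff,
                          unsigned long *first_vaddr)
{
    int hits = 0;
    struct vma_sketch *v;

    for (v = f->vmas; v; v = v->next) {
        if (pgoff >= v->pgoff && pgoff < v->pgoff + v->npages) {
            if (hits == 0)
                *first_vaddr = v->start +
                               (pgoff - v->pgoff) * PAGE_SIZE_SK;
            hits++;
        }
    }
    return hits;
}
```

The memory cost is proportional to the number of VMAs, not the number of mapped pages, which is why this approach scales where per-page rmap structures collapse; the feared pathological case is a file with very many small, scattered VMAs.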
The second problem is a simple bug in the swapout code. When shared memory
is unmapped and set up for swap, the actual I/O to write it out to the swap
file is not started right away. By the time the system gets around to
actually performing I/O, there is a huge pile of pages waiting to be shoved
out, and an I/O storm results. Even then, the way the kernel tracks this
memory means that it takes a long time to notice that it is free even after
it has been written to swap. This problem is fixed by taking frequent
breaks to actually shove dirty memory out to disk.
Andrea's final problem came about when he tried to copy a large file while
all those database processes were running. It turns out that the system
was swapping out the shared database memory (which was dirty and in use)
rather than the data from the file just copied (which is clean). Tweaking
the memory freeing code to make it prefer clean cache pages over dirty
pages straightened this problem out, at the cost of a certain amount of
additional overhead.
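The preference itself is simple to state in code: when pages must be freed, take clean ones (droppable immediately) before dirty ones (which require writeback first). A minimal sketch of that selection order, with invented names:

```c
/* Sketch of the reclaim preference described above: free clean cache
 * pages before dirty ones, since dirty pages must be written to swap
 * or disk before their memory becomes available. Illustrative only. */
#include <assert.h>

struct page_sketch { int dirty; int freed; };

/* Free up to 'want' pages, clean ones first; returns how many of the
 * freed pages were dirty (i.e. required writeback). */
static int reclaim(struct page_sketch *pages, int n, int want)
{
    int freed = 0, dirty_freed = 0, pass, i;

    /* pass 0 takes clean pages, pass 1 falls back to dirty ones */
    for (pass = 0; pass < 2 && freed < want; pass++) {
        for (i = 0; i < n && freed < want; i++) {
            if (pages[i].freed || pages[i].dirty != pass)
                continue;
            pages[i].freed = 1;
            freed++;
            dirty_freed += pages[i].dirty;
        }
    }
    return dirty_freed;
}
```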
With these patches, Andrea claims, the 2.4 kernel can run heavy loads on
large systems which will immediately lock up a 2.6 system. So he is going
to start looking toward 2.6, with an eye toward beefing it up for this sort
of load. Andrew Morton has indicated that
he might accept some of this work - but not yet:
We need to understand that right now, 2.6.x is 2.7-pre. Once 2.7
forks off we are more at liberty to merge nasty highmem hacks which
will die when 2.6 is end-of-lined.
I plan to merge the 4g split immediately after 2.7 forks. I
wouldn't be averse to objrmap for file-backed mappings either - I
agree that the search problems which were demonstrated are unlikely
to bite in real life.
The "4g split" is Ingo Molnar's 4GB user-space
patch which makes more low memory available to the kernel, but at a
performance cost. Before Andrew merges any other patches, however, he
wants to see a convincing demonstration of why the current VM patches are
not enough for large loads. The 2.6 "stable" kernel may well see some
significant virtual memory work, but, with luck, it will not be subjected
to a 2.4.10-like abrupt switch.
Patches and updates
- Andrew Morton: 2.6.3-mm4.
(February 26, 2004)
Core kernel code
Filesystems and block I/O
Page editor: Jonathan Corbet