Brief items
The current development kernel is 2.5.64, which was
released by Linus on March 4. Changes in
2.5.64 include a
dentry cache performance improvement, code for finding "jiffy wraps," an
ACPI update, some new CPU frequency control code, USB updates, several
kbuild fixes, a number of sysfs tweaks, some module fixes, and, of course,
a great many spelling fixes. The
long-format
changelog has the details, as usual.
As of this writing, Linus's BitKeeper tree includes only some timekeeping
fixes and some SBUS frame buffer patches.
The current stable kernel is 2.4.20. Marcelo released the fifth 2.4.21 prepatch on
February 27; it includes some architecture updates, some IDE fixes,
fixes for the ethernet information leakage vulnerability, a JFS update,
and, of course, lots of other repairs.
Alan Cox released 2.2.24 on March 5.
There is, of course, no active development happening with 2.2, so this
release consists of fixes only - in particular, it includes fixes for the
ethernet information leakage vulnerability.
Kernel development news
Ingo Molnar's new
remap_file_pages() system call was first merged
into the 2.5.46 kernel. The final parts of that are just now circulating
in patch form, however. So it seems like a good time to look at what this
system call does.
Many kinds of applications use mmap() to map a file into virtual
memory. mmap() makes a simple, linear mapping between a region of
virtual memory and a corresponding part of the file on disk. Some
applications, however, have more complicated needs; they typically want to
map several pieces of a file into different parts of memory. This sort of
nonlinear mapping is used, for example, by large database management
systems as a way of managing the movement of data to and from the disk.
Nonlinear mappings can be created on any system which supports
mmap(); it's just a matter of creating a separate mapping for each
piece of the file. Such mappings can be expensive to set up, however, and
even more expensive to use. In the Linux kernel, each mapping creates a
separate virtual memory area (VMA). Each VMA uses kernel memory; the
presence of large numbers of VMAs will also slow down the VM subsystem.
The remap_file_pages() system call addresses these problems by
allowing a process to rearrange the memory mapping of a file on the fly.
It is called as:
int remap_file_pages(unsigned long start, unsigned long size,
                     unsigned long prot, unsigned long pgoff,
                     unsigned long flags);
Essentially, this call says that size bytes from the file,
starting at page offset pgoff, should be mapped into the process's
virtual memory beginning at start. The file should already be
mapped into a VMA which contains start. Since the system call
works entirely through page table manipulation, it is quite fast. It also
can create complicated nonlinear mappings without needing to create new
VMAs.
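A user-space sketch of the call might look like the following. It assumes the C library wrapper declared in <sys/mman.h>, whose prototype differs slightly from the kernel-level one shown above: it takes a pointer, a size in bytes, and a page offset, and it expects prot to be zero. The function name demo_remap_file_pages() and the scratch file are invented for the example. Since the call may be unavailable or refused on some systems, the sketch reports that case separately rather than treating it as a failure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Swap the first two pages of a file within a single shared mapping.
 * Returns 0 if the pages were rearranged, 1 if remap_file_pages() is
 * unavailable or refused the request, and -1 on any other failure.
 */
static int demo_remap_file_pages(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    size_t page;
    FILE *f;
    char *win;
    int fd, ret = -1;

    assert(psz > 0);
    page = (size_t)psz;
    f = tmpfile();                      /* scratch file for the demo */
    if (f == NULL)
        return -1;
    fd = fileno(f);
    if (ftruncate(fd, (off_t)(2 * page)) != 0)
        goto out;

    /* One ordinary, linear, shared mapping of both pages: a single VMA. */
    win = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (win == MAP_FAILED)
        goto out;
    memset(win, 'A', page);             /* file page 0 */
    memset(win + page, 'B', page);      /* file page 1 */

    /*
     * Rearrange the mapping in place: window page 0 now shows file
     * page 1, and window page 1 shows file page 0.  No new mmap()
     * calls are made.
     */
    if (remap_file_pages(win, page, 0, 1, 0) != 0 ||
        remap_file_pages(win + page, page, 0, 0, 0) != 0)
        ret = 1;
    else
        ret = (win[0] == 'B' && win[page] == 'A') ? 0 : -1;

    munmap(win, 2 * page);
out:
    fclose(f);
    return ret;
}
```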
remap_file_pages(), as found in the 2.5.64 kernel, only has one
little problem: the remapping information is lost if the page is swapped
out. Users must thus either lock the area in memory (which is generally
not a problem for the "big database management system" scenario, which
tends to perform this locking anyway), or take pains to reestablish the
mapping on swapin. Ingo's latest patch
clears up that last bit of trouble by storing the mapping information into
the page table entry when a page is swapped out. On 32-bit systems, this
technique limits the maximum size of a nonlinear mapping to 1-2TB
(depending on the architecture) because some of the PTE bits are not
available for this use. Given the trouble most 32-bit systems have in
simply addressing that much memory, this limitation is not likely to bother
too many people.
For now, it is not possible to change protections within a single VMA (the
prot parameter to remap_file_pages() is ignored). At
some future point, that could change. Some applications (e.g. memory
debuggers) currently struggle to control memory protection in a
fine-grained manner. Being able to simply set protections on a per-page
basis (without creating new VMAs) would make things much easier.
Development kernels typically go through a stage where half of the patches
seem to be spelling fixes. Correcting misspellings is an easy way for
people to help improve the code base without having to understand locking
rules - or even the C language. For the most part these changes are, at
worst, harmless.
2.5 seems to have inspired a more thorough than usual cleanup effort,
however. People have been fixing punctuation problems, and there is even
a special kernel source spellchecker out
there. All this work has caused some developers to wonder if things aren't
going a little too far, especially when the changes start breaking things. As Alan Cox put it:
People are going to far. Fixing typos that are confusing or
blatantly daft is one thing, but if you want to pick over
documentation line by line with a copy of Fowlers in hand the Gnome
and KDE projects would both love to have you working over their
documentation and end user manuals ;)
This is a good point: very few documentation projects complain about having
too many contributors. Improving documentation may not bring the
satisfaction of seeing your name in the kernel changelog, but it could well
be a better use of available time than correcting apostrophe errors in
kernel comments.
Speaking of improving documentation, Mel Gorman has been working for some
time to document how memory management works in the 2.4.20 kernel. He has
now
released the results of his work in
text, HTML, and PDF formats. There is also an extensive commentary on the
VM code itself. It is a large body of work, and a substantial contribution
to the development community; worth a read.
Driver porting
Below, you will find two new articles on porting drivers (and other kernel
code) to the 2.5 kernel; they discuss interrupt handling and asynchronous
I/O. Also new this week (but not included below) is an article describing
the completion event interface; that article, along with all the others in
this series, may be found on the
LWN
Driver Porting Series page.
The kernel's handling of device interrupts has been massively reworked in
the 2.6 series. Fortunately, very few of those changes are visible to the
rest of the kernel; most well-written code should "just work" (almost) under 2.6.
There are, however, two important exceptions: the return type of interrupt
handlers has changed, and drivers which depend on
being able to globally disable interrupts will require some changes for
2.6.
Interrupt handler return values
Prior to 2.5.69, interrupt handlers returned
void. There is,
however, one useful thing that interrupt handlers can tell the kernel:
whether the interrupt was something they could handle or not. If a device
starts generating spurious interrupts, the kernel would like to respond by
blocking interrupts from that device. If no interrupt handler for a given
IRQ has been registered, the kernel knows that any interrupt on that number
is spurious. When interrupt handlers exist, however, they must tell the
kernel about spurious interrupts.
So, interrupt handlers now return an irqreturn_t value;
void handlers will no longer compile. If your interrupt handler
recognizes and handles a given interrupt, it should return
IRQ_HANDLED. If it knows that the interrupt was not on a device
it manages, it can return IRQ_NONE instead. The macro:
IRQ_RETVAL(handled)
can also be used; handled should be nonzero if the handler could
deal with the interrupt. The "safe" value to return, if, for some reason,
you are not sure, is IRQ_HANDLED.
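As a rough illustration, a handler for a hypothetical device might be structured as follows. To keep the sketch buildable outside the kernel, irqreturn_t, IRQ_NONE, IRQ_HANDLED, and IRQ_RETVAL() are defined here as stand-ins; real driver code would get them from <linux/interrupt.h>. The three-argument handler prototype is assumed from kernels of this era, and struct my_device, with its status field modeling a hardware status register, is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel's definitions, so the sketch builds in user space. */
typedef int irqreturn_t;
#define IRQ_NONE        (0)
#define IRQ_HANDLED     (1)
#define IRQ_RETVAL(x)   ((x) != 0)
struct pt_regs;                         /* opaque here */

/* Hypothetical device state; 'status' models a hardware status register. */
struct my_device {
    unsigned int status;
    unsigned int events;                /* interrupts actually serviced */
};
#define MY_IRQ_PENDING  0x1u

static irqreturn_t my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct my_device *dev = dev_id;
    int handled = 0;

    (void)irq;
    (void)regs;

    /* Only claim the interrupt if our device actually raised it. */
    if (dev->status & MY_IRQ_PENDING) {
        dev->status &= ~MY_IRQ_PENDING; /* acknowledge the device */
        dev->events++;
        handled = 1;
    }
    /* IRQ_HANDLED if we did something, IRQ_NONE otherwise. */
    return IRQ_RETVAL(handled);
}
```

On a shared interrupt line, returning IRQ_NONE when the device's status shows nothing pending is what lets the kernel notice that nobody is claiming a stream of interrupts.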
Disabling interrupts
In the 2.6 kernel, it is no longer possible to globally disable
interrupts.
In particular, the
cli(),
sti(),
save_flags(),
and
restore_flags() functions are no longer available. Disabling
interrupts across all processors in the
system is simply no longer done. This behavior has been strongly
discouraged for some
time, so most code
should have been converted by now.
The proper way to fix such code, of course, is to figure out exactly which
resources were being protected by disabling interrupts. Those resources
can then be explicitly protected with spinlocks instead. The change is
usually fairly straightforward, but it does require an understanding of
what is really going on.
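The shape of such a conversion can be sketched as below. This is a user-space approximation only: spinlock_t, spin_lock_irqsave(), and spin_unlock_irqrestore() are defined here as simple stand-ins built on a C11 atomic flag (they do not touch interrupts at all), whereas kernel code would use the real primitives from <linux/spinlock.h>. The driver state (my_lock, my_pending) and my_add_request() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * User-space stand-ins for the kernel spinlock API.  The real
 * spin_lock_irqsave() also disables interrupts on the local processor
 * and saves the previous state in 'flags'; here 'flags' is a dummy.
 */
typedef struct { atomic_flag held; } spinlock_t;
#define spin_lock_irqsave(lock, flags)                          \
    do {                                                        \
        (flags) = 0;                                            \
        while (atomic_flag_test_and_set(&(lock)->held))         \
            ;                           /* spin until free */   \
    } while (0)
#define spin_unlock_irqrestore(lock, flags)                     \
    do {                                                        \
        (void)(flags);                                          \
        atomic_flag_clear(&(lock)->held);                       \
    } while (0)

/* Hypothetical driver state, formerly guarded by cli()/restore_flags(). */
static spinlock_t my_lock = { ATOMIC_FLAG_INIT };
static int my_pending;

/*
 * 2.4-style code (no longer available in 2.6):
 *
 *     save_flags(flags); cli();
 *     my_pending++;
 *     restore_flags(flags);
 *
 * The converted version names the resource being protected and takes
 * a lock dedicated to it.
 */
static int my_add_request(void)
{
    unsigned long flags;
    int now;

    spin_lock_irqsave(&my_lock, flags);
    now = ++my_pending;
    spin_unlock_irqrestore(&my_lock, flags);
    return now;
}
```

Any other code path which touches my_pending, including the interrupt handler, would have to take the same lock.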
It is still possible to disable all interrupts locally with
local_irq_save() or local_irq_disable(). A single
interrupt can be disabled globally with disable_irq(). Some of the
spinlock operations also disable interrupts on the local processor, of
course. None of these functions have changed (at least, with regard to
their external interface) since 2.4.
Various small changes
One function that
has changed is
synchronize_irq(). In
2.6, this function takes an integer IRQ number as a parameter. It spins
until no interrupt handler is running for the given IRQ. If the IRQ is
disabled prior to calling
synchronize_irq(), the caller will know
that no interrupt handler can be running after that call. The 2.6 version
of
synchronize_irq() only waits for handlers for the given IRQ
number; it is no longer possible to wait until no interrupt handlers at all
are running.
If your code has post-interrupt logic which runs as a bottom half, or out
of a task queue, it will need to be changed for 2.6. Bottom halves are
deprecated, and the task queue mechanism has been removed altogether.
Post-interrupt processing should now be done using tasklets or work queues.
A new function was added in 2.6.1:
int can_request_irq(unsigned int irq, unsigned long flags);
This function returns a true value if the given interrupt allocation
request would succeed, but does not actually allocate anything. Potential
users should always be aware that the situation could change after calling
can_request_irq().
Finally, the declarations of request_irq() and free_irq() have
moved from <linux/sched.h> to
<linux/interrupt.h>.
One of the key "enterprise" features added to the 2.6 kernel is
asynchronous I/O (AIO). The AIO facility allows user processes to initiate
multiple I/O operations without waiting for any of them to complete; the
status of the operations can then be retrieved at some later time. Block
and network drivers are already fully asynchronous, and thus there is
nothing special that needs to be done to them to support the new
asynchronous operations. Character drivers, however, have a synchronous
API, and will not support AIO without some additional work. For most char
drivers, there is little benefit to be gained from AIO support. In a few
rare cases, however, it may be beneficial to make AIO available to your
users.
AIO file operations
The first step in supporting AIO (beyond including
<linux/aio.h>) is the implementation of three new methods
which have been added to the
file_operations structure:
ssize_t (*aio_read) (struct kiocb *iocb, char __user *buffer,
                     size_t count, loff_t pos);
ssize_t (*aio_write) (struct kiocb *iocb, const char __user *buffer,
                      size_t count, loff_t pos);
int (*aio_fsync) (struct kiocb *, int datasync);
For most drivers, the real work will be in the implementation of
aio_read() and aio_write(). These functions are
analogous to the standard read() and write() methods,
with a couple of changes: the file parameter has been replaced
with an I/O control
block (iocb), and they (usually) need not complete the requested
operations
immediately. The iocb argument can usually be treated
as an opaque cookie used by the AIO subsystem; if you need the
struct file pointer for this file descriptor, however, you
can find it as iocb->ki_filp.
The aio_ operations can be synchronous. One
obvious example is when the requested operation can be completed without
blocking. If the operation is complete before aio_read() or
aio_write() returns, the return value should be the usual status
or error code. So, the following aio_read() method, while being
pointless, is entirely correct:
ssize_t my_aio_read(struct kiocb *iocb, char __user *buffer,
                    size_t count, loff_t pos)
{
        return my_read(iocb->ki_filp, buffer, count, &pos);
}
In some cases, synchronous behavior may actually be required. The
so-called "synchronous iocb's" allow the AIO subsystem to be used
synchronously when need be. The macro:
is_sync_kiocb(struct kiocb *iocb)
will return a true value if the request must be handled synchronously.
In most cases, though, it is assumed that the I/O request will not be
satisfied immediately by aio_read() or aio_write(). In
this case, those functions should do whatever is required to get the
operation started, then return -EIOCBQUEUED. Note that any work
that must be done within the user process's context must be done before
returning; you will not have access to that context later. In order to
access the user buffer, you will probably need to either set up a DMA
mapping or turn the buffer pointer into a series of
struct page pointers before returning.
Bear in mind also that there can be multiple asynchronous I/O requests active at
any given time. A driver which implements AIO will have to include proper
locking (and, probably, queueing) to keep these requests from interfering
with each other.
When the I/O operation completes, you must inform the AIO subsystem of the
fact by calling aio_complete():
int aio_complete(struct kiocb *iocb, long res, long res2);
Here, iocb is, of course, the IOCB you were given when the request
was initiated. res is the usual result of an I/O operation: the
number of bytes transferred, or a negative error code. res2 is a
second status value which will be returned to the user; currently (2.6.0-test9),
callers of aio_complete() within the kernel always set
res2 to zero. aio_complete() can be safely called
in an interrupt handler. Once you have called aio_complete(), you
no longer own the IOCB or the user buffer, and should not touch them again.
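The overall flow (start the operation, return -EIOCBQUEUED, and call aio_complete() later) might be sketched as follows. The sketch is meant to build in user space, so struct kiocb, aio_complete(), __user, and EIOCBQUEUED are all stand-ins defined locally; the stand-in aio_complete() merely records what it was told. The one-slot "queue" (my_pending) and the function my_transfer_done(), which plays the role of the device's interrupt handler, are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>                  /* ssize_t, loff_t */

/* ---- user-space stand-ins for kernel definitions ---- */
#define __user
#ifndef EIOCBQUEUED
#define EIOCBQUEUED 529
#endif
struct file;
struct kiocb {
    struct file *ki_filp;
    long done_res, done_res2;           /* stand-in bookkeeping only */
    int completed;
};

/* Stand-in: just records the values the driver reported. */
static int aio_complete(struct kiocb *iocb, long res, long res2)
{
    iocb->done_res = res;
    iocb->done_res2 = res2;
    iocb->completed = 1;
    return 0;
}

/* ---- hypothetical driver ---- */
struct my_request {
    struct kiocb *iocb;
    size_t count;
};
/* One-slot "queue"; a real driver needs a proper queue and locking. */
static struct my_request my_pending;

static ssize_t my_aio_read(struct kiocb *iocb, char __user *buffer,
                           size_t count, loff_t pos)
{
    (void)buffer;
    (void)pos;

    if (count == 0)
        return 0;                       /* nothing to do: finish synchronously */
    if (my_pending.iocb != NULL)
        return -EBUSY;                  /* the single slot is occupied */

    /*
     * Anything needing the caller's context (pinning the user buffer,
     * setting up DMA) would have to happen here, before returning.
     */
    my_pending.iocb = iocb;
    my_pending.count = count;
    return -EIOCBQUEUED;                /* completion will be signalled later */
}

/* Plays the part of the interrupt handler run when the transfer ends. */
static void my_transfer_done(void)
{
    struct kiocb *iocb = my_pending.iocb;
    long nbytes = (long)my_pending.count;

    if (iocb == NULL)
        return;
    my_pending.iocb = NULL;
    aio_complete(iocb, nbytes, 0);      /* the iocb is no longer ours after this */
}
```

Note that the driver clears its own reference to the IOCB before calling aio_complete(), in keeping with the rule that the IOCB must not be touched afterward.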
The aio_fsync() method serves the same purpose as the
fsync() method: ensuring that all pending data are
written to disk. As a general rule, device drivers will not need to
implement aio_fsync().
Cancellation
The design of the AIO subsystem includes the ability to cancel outstanding
operations. Cancellation may occur as the result of a specific user-mode
request, or during the cleanup of a process which has exited. It is worth
noting that, as of 2.6.0-test9, no code in the kernel actually performs
cancellation. So cancellation may not work properly, and the interface
could change in the process of making it work. That said, here is how the
interface looks today.
A driver which implements cancellation needs to implement a function for
that purpose:
int my_aio_cancel(struct kiocb *iocb, struct io_event *event);
A pointer to this function can be stored into any IOCB which can be
cancelled:
iocb->ki_cancel = my_aio_cancel;
Should the operation be cancelled, your cancellation function will be
called with pointers to the IOCB and an io_event structure. If it
is possible to cancel (or successfully complete) the operation prior to
returning from the cancellation function, the result of the operation
should be stored into the res and res2 fields of the
io_event structure, and the function should return zero. A non-zero return value from
the cancellation function indicates that cancellation was not possible.
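A cancellation function along these lines might look like the sketch below. Given the unsettled state of the interface, it should be read with extra caution. The structures are user-space stand-ins containing only the fields used here, the one-slot queue (my_queued) and my_queue_request() are hypothetical, and the choice of -ECANCELED as the stored result and -EAGAIN as the "cannot cancel" return value is an assumption; the text above only requires a non-zero return in the latter case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* ---- user-space stand-ins; only the fields used here ---- */
struct io_event {
    long res;
    long res2;
};
struct kiocb {
    int (*ki_cancel)(struct kiocb *iocb, struct io_event *event);
};

/* Hypothetical one-slot queue: a request not yet handed to the hardware. */
static struct kiocb *my_queued;

static int my_aio_cancel(struct kiocb *iocb, struct io_event *event)
{
    /* Already in flight, or already finished: too late to cancel. */
    if (my_queued != iocb)
        return -EAGAIN;                 /* assumption: any non-zero value */

    /* Still waiting in the queue: drop it and report the outcome. */
    my_queued = NULL;
    event->res = -ECANCELED;            /* assumption: result for a cancelled request */
    event->res2 = 0;
    return 0;
}

/* Called when a request is queued; marks the IOCB as cancellable. */
static void my_queue_request(struct kiocb *iocb)
{
    iocb->ki_cancel = my_aio_cancel;
    my_queued = iocb;
}
```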
Page editor: Jonathan Corbet