Kernel development

Brief items

Kernel release status

The current development kernel is 2.5.64, which was released by Linus on March 4. Changes in 2.5.64 include a dentry cache performance improvement, code for finding "jiffy wraps," an ACPI update, some new CPU frequency control code, USB updates, several kbuild fixes, a number of sysfs tweaks, some module fixes, and, of course, a great many spelling fixes. The long-format changelog has the details, as usual.

As of this writing, Linus's BitKeeper tree includes only some timekeeping fixes and some SBUS frame buffer patches.

The current stable kernel is 2.4.20. Marcelo released the fifth 2.4.21 prepatch on February 27; it includes some architecture updates, some IDE fixes, fixes for the ethernet information leakage vulnerability, a JFS update, and, of course, lots of other repairs.

Alan Cox released 2.2.24 on March 5. There is, of course, no active development happening with 2.2, so this release consists of fixes only - in particular, it includes fixes for the ethernet information leakage vulnerability.

Kernel development news

remap_file_pages()

Ingo Molnar's new remap_file_pages() system call was first merged into the 2.5.46 kernel. The final pieces of the implementation, however, are only now circulating in patch form, so this seems like a good time to look at what the system call does.

Many kinds of applications use mmap() to map a file into virtual memory. mmap() makes a simple, linear mapping between a region of virtual memory and a corresponding part of the file on disk. Some applications, however, have more complicated needs; they typically want to map several pieces of a file into different parts of memory. This sort of nonlinear mapping is used, for example, by large database management systems as a way of managing the movement of data to and from the disk.

Nonlinear mappings can be created on any system which supports mmap(); it's just a matter of creating a separate mapping for each piece of the file. Such mappings can be expensive to set up, however, and even more expensive to use. In the Linux kernel, each mapping creates a separate virtual memory area (VMA). Each VMA uses kernel memory; the presence of large numbers of VMAs will also slow down the VM subsystem.
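The one-mapping-per-piece approach can be sketched in user space as follows. This is a minimal illustration, assuming a Linux system with a writable /tmp; map_pieces() and demo_map_pieces() are hypothetical helper names, not part of any library.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: build a window of npieces pages in which window
 * page i shows file page order[i].  Returns MAP_FAILED on error.
 */
void *map_pieces(int fd, const size_t *order, size_t npieces)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Reserve one contiguous range of addresses for the whole window. */
    char *base = mmap(NULL, npieces * page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    for (size_t i = 0; i < npieces; i++) {
        /* One mmap() call per piece.  Pieces whose file offsets are not
         * contiguous cannot share a VMA, so the VMA count grows with the
         * number of pieces. */
        void *p = mmap(base + i * page, page, PROT_READ,
                       MAP_SHARED | MAP_FIXED, fd, (off_t)(order[i] * page));
        if (p == MAP_FAILED) {
            munmap(base, npieces * page);
            return MAP_FAILED;
        }
    }
    return base;
}

/* Demo: a three-page scratch file whose pages begin with 'A', 'B', 'C',
 * viewed in the order C, A, B.  Returns 0 if the window reads that way. */
int demo_map_pieces(void)
{
    char path[] = "/tmp/piecesXXXXXX";
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t order[3] = { 2, 0, 1 };
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                       /* file goes away with its last user */
    if (ftruncate(fd, (off_t)(3 * page)) != 0) {
        close(fd);
        return -1;
    }
    for (size_t i = 0; i < 3; i++) {
        char c = (char)('A' + i);
        if (pwrite(fd, &c, 1, (off_t)(i * page)) != 1) {
            close(fd);
            return -1;
        }
    }

    char *w = map_pieces(fd, order, 3);
    close(fd);                          /* the mappings keep the file alive */
    if (w == MAP_FAILED)
        return -1;
    int ok = w[0] == 'C' && w[page] == 'A' && w[2 * page] == 'B';
    munmap(w, 3 * page);
    return ok ? 0 : -1;
}
```

Three pieces cost three mmap() calls here; a database juggling thousands of pieces pays that price thousands of times over.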

The remap_file_pages() system call addresses these problems by allowing a process to rearrange the memory mapping of a file on the fly. It is called as:

    int remap_file_pages(unsigned long start, unsigned long size,
			 unsigned long prot, unsigned long pgoff,
			 unsigned long flags);

Essentially, this call says that size bytes of the file, starting at page offset pgoff, should be mapped into the process's virtual memory beginning at start. Note the differing units: size is a length in bytes, while pgoff counts pages. The file should already be mapped into a VMA which contains start. Since the system call works entirely through page table manipulation, it is quite fast. It also can create complicated nonlinear mappings without needing to create new VMAs.
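From user space, the call might be used along the following lines. This is a sketch only: it goes through the glibc wrapper, whose prototype (pointer and size_t arguments, available with _GNU_SOURCE) differs from the kernel-side declaration shown above, and map_nonlinear() and demo_map_nonlinear() are hypothetical names. Later kernels deprecated this system call, so the demo treats a refusal as "skipped" rather than as an error.

```c
#define _GNU_SOURCE             /* for the glibc remap_file_pages() wrapper */
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the first npages pages of fd with a single
 * mmap() call, then rearrange the window in place so that window page i
 * shows file page order[i].  Returns MAP_FAILED on error.
 */
void *map_nonlinear(int fd, const size_t *order, size_t npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* One ordinary, linear, shared mapping.  The descriptor should be
     * open read-write so that the mapping is treated as truly shared. */
    char *base = mmap(NULL, npages * page, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    for (size_t i = 0; i < npages; i++) {
        /* size is given in bytes, pgoff in pages; prot is passed as zero. */
        if (remap_file_pages(base + i * page, page, 0, order[i], 0) != 0) {
            munmap(base, npages * page);
            return MAP_FAILED;
        }
    }
    return base;
}

/* Demo: returns 0 if a three-page file is seen in the order C, A, B,
 * 1 if the system refused the remapping, and -1 on any other failure. */
int demo_map_nonlinear(void)
{
    char path[] = "/tmp/remapXXXXXX";
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t order[3] = { 2, 0, 1 };
    int fd = mkstemp(path);             /* mkstemp() opens read-write */

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)(3 * page)) != 0) {
        close(fd);
        return -1;
    }
    for (size_t i = 0; i < 3; i++) {
        char c = (char)('A' + i);
        if (pwrite(fd, &c, 1, (off_t)(i * page)) != 1) {
            close(fd);
            return -1;
        }
    }

    char *w = map_nonlinear(fd, order, 3);
    close(fd);
    if (w == MAP_FAILED)
        return 1;                       /* remapping not available here */
    int ok = w[0] == 'C' && w[page] == 'A' && w[2 * page] == 'B';
    munmap(w, 3 * page);
    return ok ? 0 : -1;
}
```

The window is set up with one mmap() call; every later rearrangement is a remap_file_pages() call against that same mapping.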

remap_file_pages(), as found in the 2.5.64 kernel, only has one little problem: the remapping information is lost if the page is swapped out. Users must thus either lock the area in memory (which is generally not a problem for the "big database management system" scenario, which tends to perform this locking anyway), or take pains to reestablish the mapping on swapin. Ingo's latest patch clears up that last bit of trouble by storing the mapping information into the page table entry when a page is swapped out. On 32-bit systems, this technique limits the maximum size of a nonlinear mapping to 1-2TB (depending on the architecture) because some of the PTE bits are not available for this use. Given the trouble most 32-bit systems have in simply addressing that much memory, this limitation is not likely to bother too many people.
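The 1-2TB figure is simple arithmetic. The sketch below assumes 4KB pages and that 28 or 29 bits of a 32-bit page table entry remain free to hold the file page offset; those bit counts are assumptions chosen to reproduce the range quoted above, not numbers taken from the patch, and max_nonlinear_bytes() is a hypothetical helper.

```c
#include <stdint.h>

/*
 * Largest file span, in bytes, whose page offsets can be stored in
 * offset_bits bits when each page holds 2^page_shift bytes.
 */
uint64_t max_nonlinear_bytes(unsigned offset_bits, unsigned page_shift)
{
    return (UINT64_C(1) << offset_bits) << page_shift;
}
```

With page_shift set to 12, 28 offset bits give 2^40 bytes (1TB) and 29 bits give 2^41 bytes (2TB).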

For now, it is not possible to change protections within a single VMA (the prot parameter to remap_file_pages() is ignored). At some future point, that could change. Some applications (e.g. memory debuggers) currently struggle to control memory protection in a fine-grained manner. Being able to simply set protections on a per-page basis (without creating new VMAs) would make things much easier.

The spelling fix backlash

Development kernels typically go through a stage where half of the patches seem to be spelling fixes. Correcting misspellings is an easy way for people to help improve the code base without having to understand locking rules - or even the C language. For the most part these changes are, at worst, harmless.

2.5 seems to have inspired a more thorough than usual cleanup effort, however. People have been fixing punctuation problems, and there is even a special kernel source spellchecker out there. All this work has caused some developers to wonder if things aren't going a little too far, especially when the changes start breaking things. As Alan Cox put it:

People are going to far. Fixing typos that are confusing or blatantly daft is one thing, but if you want to pick over documentation line by line with a copy of Fowlers in hand the Gnome and KDE projects would both love to have you working over their documentation and end user manuals ;)

This is a good point: very few documentation projects complain about having too many contributors. Improving documentation may not bring the satisfaction of seeing your name in the kernel changelog, but it could well be a better use of available time than correcting apostrophe errors in kernel comments.

Linux memory management documentation

Speaking of improving documentation, Mel Gorman has been working for some time to document how memory management works in the 2.4.20 kernel. He has now released the results of his work in text, HTML, and PDF formats. There is also an extensive commentary on the VM code itself. It is a large body of work and a substantial contribution to the development community; it is well worth a read.

Driver porting

This week's driver porting articles

Below, you will find two new articles on porting drivers (and other kernel code) to the 2.5 kernel; they discuss interrupt handling and asynchronous I/O. Also new this week (but not included below) is an article describing the completion event interface; that article, along with all the others in this series, may be found on the LWN Driver Porting Series page.

Driver porting: dealing with interrupts

This article is part of the LWN Porting Drivers to 2.6 series.
The kernel's handling of device interrupts has been massively reworked in the 2.6 series. Fortunately, very few of those changes are visible to the rest of the kernel; most well-written code should "just work" (almost) under 2.6. There are, however, two important exceptions: the return type of interrupt handlers has changed, and drivers which depend on being able to globally disable interrupts will require some changes for 2.6.

Interrupt handler return values

Prior to 2.5.69, interrupt handlers returned void. There is, however, one useful thing that interrupt handlers can tell the kernel: whether the interrupt was something they could handle or not. If a device starts generating spurious interrupts, the kernel would like to respond by blocking interrupts from that device. If no interrupt handler for a given IRQ has been registered, the kernel knows that any interrupt on that number is spurious. When interrupt handlers exist, however, they must tell the kernel about spurious interrupts.

So, interrupt handlers now return an irqreturn_t value; void handlers will no longer compile. If your interrupt handler recognizes and handles a given interrupt, it should return IRQ_HANDLED. If it knows that the interrupt was not on a device it manages, it can return IRQ_NONE instead. The macro:

    IRQ_RETVAL(handled)

can also be used; handled should be nonzero if the handler could deal with the interrupt. The "safe" value to return if, for some reason, you are not sure, is IRQ_HANDLED.
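A handler following these rules might look like the sketch below. So that it compiles outside the kernel, the definitions at the top are user-space stand-ins for what <linux/interrupt.h> provides; the mydev structure, its status bit, and the handler names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's definitions. */
typedef int irqreturn_t;
#define IRQ_NONE        (0)
#define IRQ_HANDLED     (1)
#define IRQ_RETVAL(x)   ((x) != 0)
struct pt_regs;                         /* opaque in this sketch */

/* Hypothetical device: bit 0 of status means "interrupt pending". */
#define MYDEV_IRQ_PENDING 0x1u

struct mydev {
    volatile uint32_t status;           /* stands in for a device register */
    unsigned long handled;              /* interrupts serviced so far */
};

/* Explicit form: say IRQ_NONE when the device did not interrupt. */
irqreturn_t mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct mydev *dev = dev_id;

    (void)irq;
    (void)regs;
    if (!(dev->status & MYDEV_IRQ_PENDING))
        return IRQ_NONE;                /* not ours: shared line or spurious */

    dev->status &= ~MYDEV_IRQ_PENDING;  /* acknowledge the interrupt */
    dev->handled++;
    return IRQ_HANDLED;
}

/* Same logic, using IRQ_RETVAL() on a "did we do anything" flag. */
irqreturn_t mydev_interrupt_retval(int irq, void *dev_id, struct pt_regs *regs)
{
    struct mydev *dev = dev_id;
    int handled = 0;

    (void)irq;
    (void)regs;
    if (dev->status & MYDEV_IRQ_PENDING) {
        dev->status &= ~MYDEV_IRQ_PENDING;
        dev->handled++;
        handled = 1;
    }
    return IRQ_RETVAL(handled);
}
```

Checking the device's own status before claiming the interrupt is what lets the kernel tell a shared or misbehaving line from a busy one.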

Disabling interrupts

In the 2.6 kernel, it is no longer possible to globally disable interrupts. In particular, the cli(), sti(), save_flags(), and restore_flags() functions are no longer available. Disabling interrupts across all processors in the system is simply no longer done. This behavior has been strongly discouraged for some time, so most code should have been converted by now.

The proper way to convert such code, of course, is to figure out exactly which resources were being protected by disabling interrupts. Those resources can then be explicitly protected with spinlocks instead. The change is usually fairly straightforward, but it does require an understanding of what is really going on.

It is still possible to disable all interrupts locally with local_irq_save() or local_irq_disable(). A single interrupt can be disabled globally with disable_irq(). Some of the spinlock operations also disable interrupts on the local processor, of course. None of these functions has changed (at least, with regard to its external interface) since 2.4.
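The shape of such a conversion can be shown with a user-space analogy. The spinlock below is a minimal stand-in built on C11 atomics, not the kernel's implementation, and the mydev structure and its functions are hypothetical. In a real driver the lock would be a kernel spinlock_t, and code sharing data with an interrupt handler would take it with one of the interrupt-disabling spinlock variants.

```c
#include <stdatomic.h>

/* Minimal user-space spinlock, standing in for the kernel's spinlock_t. */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock_init(spinlock_t *l)
{
    atomic_flag_clear(&l->locked);
}

static void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;                               /* spin until the holder releases */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Hypothetical driver state that 2.4 code might have guarded with cli(). */
struct mydev {
    spinlock_t lock;                    /* protects pending, nothing else */
    int pending;                        /* requests awaiting service */
};

void mydev_init(struct mydev *dev)
{
    spin_lock_init(&dev->lock);
    dev->pending = 0;
}

/* Process-context side: queue one more request. */
void mydev_submit(struct mydev *dev)
{
    spin_lock(&dev->lock);
    dev->pending++;
    spin_unlock(&dev->lock);
}

/* Interrupt-side work: take one request; returns 1 if there was one. */
int mydev_service_one(struct mydev *dev)
{
    int took = 0;

    spin_lock(&dev->lock);
    if (dev->pending > 0) {
        dev->pending--;
        took = 1;
    }
    spin_unlock(&dev->lock);
    return took;
}
```

The lock names the one thing it protects (the pending count), which is exactly the understanding that a blanket cli() never required.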

Various small changes

One function that has changed is synchronize_irq(). In 2.6, this function takes an integer IRQ number as a parameter. It spins until no interrupt handler is running for the given IRQ. If the IRQ is disabled prior to calling synchronize_irq(), the caller will know that no interrupt handler can be running after that call. The 2.6 version of synchronize_irq() only waits for handlers for the given IRQ number; it is no longer possible to wait until no interrupt handlers at all are running.

If your code has post-interrupt logic which runs as a bottom half, or out of a task queue, it will need to be changed for 2.6. Bottom halves are deprecated, and the task queue mechanism has been removed altogether. Post-interrupt processing should now be done using tasklets or work queues.

A new function was added in 2.6.1:

    int can_request_irq(unsigned int irq, unsigned long flags);

This function returns a true value if the given interrupt allocation request would succeed, but does not actually allocate anything. Potential users should always be aware that the situation could change after calling can_request_irq().

Finally, the declarations of request_irq() and free_irq() have moved from <linux/sched.h> to <linux/interrupt.h>.

Driver porting: supporting asynchronous I/O

This article is part of the LWN Porting Drivers to 2.6 series.
One of the key "enterprise" features added to the 2.6 kernel is asynchronous I/O (AIO). The AIO facility allows user processes to initiate multiple I/O operations without waiting for any of them to complete; the status of the operations can then be retrieved at some later time. Block and network drivers are already fully asynchronous, and thus there is nothing special that needs to be done to them to support the new asynchronous operations. Character drivers, however, have a synchronous API, and will not support AIO without some additional work. For most char drivers, there is little benefit to be gained from AIO support. In a few rare cases, however, it may be beneficial to make AIO available to your users.

AIO file operations

The first step in supporting AIO (beyond including <linux/aio.h>) is the implementation of three new methods which have been added to the file_operations structure:

    ssize_t (*aio_read) (struct kiocb *iocb, char __user *buffer, 
			 size_t count, loff_t pos);
    ssize_t (*aio_write) (struct kiocb *iocb, const char __user *buffer, 
			  size_t count, loff_t pos);
    int (*aio_fsync) (struct kiocb *, int datasync);

For most drivers, the real work will be in the implementation of aio_read() and aio_write(). These functions are analogous to the standard read() and write() methods, with a couple of changes: the file parameter has been replaced with an I/O control block (iocb), and they (usually) need not complete the requested operations immediately. The iocb argument can usually be treated as an opaque cookie used by the AIO subsystem; if you need the struct file pointer for this file descriptor, however, you can find it as iocb->ki_filp.

The aio_ operations can be synchronous. One obvious example is when the requested operation can be completed without blocking. If the operation is complete before aio_read() or aio_write() returns, the return value should be the usual status or error code. So, the following aio_read() method, while being pointless, is entirely correct:

    ssize_t my_aio_read(struct kiocb *iocb, char __user *buffer, 
                        size_t count, loff_t pos)
    {
	return my_read(iocb->ki_filp, buffer, count, &pos);
    }

In some cases, synchronous behavior may actually be required. The so-called "synchronous iocb's" allow the AIO subsystem to be used synchronously when need be. The macro:

    is_sync_kiocb(struct kiocb *iocb)

will return a true value if the request must be handled synchronously.

In most cases, though, it is assumed that the I/O request will not be satisfied immediately by aio_read() or aio_write(). In this case, those functions should do whatever is required to get the operation started, then return -EIOCBQUEUED. Note that any work that must be done within the user process's context must be done before returning; you will not have access to that context later. In order to access the user buffer, you will probably need to either set up a DMA mapping or turn the buffer pointer into a series of struct page pointers before returning. Bear in mind also that there can be multiple asynchronous I/O requests active at any given time. A driver which implements AIO will have to include proper locking (and, probably, queueing) to keep these requests from interfering with each other.

When the I/O operation completes, you must inform the AIO subsystem of the fact by calling aio_complete():

    int aio_complete(struct kiocb *iocb, long res, long res2);

Here, iocb is, of course, the IOCB you were given when the request was initiated. res is the usual result of an I/O operation: the number of bytes transferred, or a negative error code. res2 is a second status value which will be returned to the user; currently (2.6.0-test9), callers of aio_complete() within the kernel always set res2 to zero. aio_complete() can be safely called in an interrupt handler. Once you have called aio_complete(), you no longer own the IOCB or the user buffer, and should not touch them again.
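The queue-then-complete flow can be mocked up in user space as follows. Everything here is a stand-in: struct kiocb is reduced to the fields the sketch needs, aio_complete() is a mock that merely records its arguments, the position is a plain long long rather than loff_t, the __user annotation is dropped, and the mydev driver (limited to one outstanding read for brevity) is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#define EIOCBQUEUED 529                 /* kernel-internal "queued" code */

/* Stand-in for the kernel's struct kiocb. */
struct kiocb {
    long res, res2;                     /* filled in by the mock below */
    int completed;
};

/* Mock of the AIO core's aio_complete(): record the result, nothing more. */
int aio_complete(struct kiocb *iocb, long res, long res2)
{
    iocb->res = res;
    iocb->res2 = res2;
    iocb->completed = 1;
    return 0;
}

/* Hypothetical driver state: at most one read in flight. */
struct mydev {
    struct kiocb *pending;
};
static struct mydev the_dev;

/* Start the operation and return at once, without waiting for the data. */
ssize_t my_aio_read(struct kiocb *iocb, char *buffer, size_t count,
                    long long pos)
{
    (void)buffer;
    (void)count;
    (void)pos;
    if (the_dev.pending)
        return -EBUSY;                  /* a real driver would queue instead */

    /* A real driver would pin the user pages or set up DMA here, while
     * still running in the calling process's context. */
    the_dev.pending = iocb;
    return -EIOCBQUEUED;
}

/* Called when the (simulated) hardware finishes, e.g. from an interrupt. */
void mydev_transfer_done(long status)
{
    struct kiocb *iocb = the_dev.pending;

    if (!iocb)
        return;
    the_dev.pending = NULL;             /* drop our reference first ...      */
    aio_complete(iocb, status, 0);      /* ... the IOCB is not ours after this */
}
```

Clearing the driver's pointer before calling aio_complete() reflects the ownership rule above: after completion, the driver has nothing left that could touch the IOCB.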

The aio_fsync() method serves the same purpose as the fsync() method: it ensures that all pending data are written to disk. As a general rule, device drivers will not need to implement aio_fsync().

Cancellation

The design of the AIO subsystem includes the ability to cancel outstanding operations. Cancellation may occur as the result of a specific user-mode request, or during the cleanup of a process which has exited. It is worth noting that, as of 2.6.0-test9, no code in the kernel actually performs cancellation. So cancellation may not work properly, and the interface could change in the process of making it work. That said, here is how the interface looks today.

A driver which implements cancellation needs to implement a function for that purpose:

    int my_aio_cancel(struct kiocb *iocb, struct io_event *event);

A pointer to this function can be stored into any IOCB which can be cancelled:

    iocb->ki_cancel = my_aio_cancel;

Should the operation be cancelled, your cancellation function will be called with pointers to the IOCB and an io_event structure. If it is possible to cancel (or successfully complete) the operation prior to returning from the cancellation function, the result of the operation should be stored into the res and res2 fields of the io_event structure, and the function should return zero. A non-zero return value from the cancellation function indicates that cancellation was not possible.
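A cancellation function obeying that contract might be sketched as below. The structures are user-space stand-ins carrying only the fields discussed here; the in_flight flag, the choice of -ECANCELED as the stored result, and the -EAGAIN refusal value are all assumptions made for illustration, since the article specifies only zero versus non-zero.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-ins reduced to the fields this sketch uses. */
struct io_event {
    long res, res2;
};

struct kiocb {
    int (*ki_cancel)(struct kiocb *iocb, struct io_event *event);
    int in_flight;                      /* hypothetical: hardware has started */
};

/* Cancel a request that the hardware has not yet picked up. */
int my_aio_cancel(struct kiocb *iocb, struct io_event *event)
{
    if (iocb->in_flight)
        return -EAGAIN;                 /* too late: report "not cancelled" */

    event->res = -ECANCELED;            /* result of the cancelled operation */
    event->res2 = 0;
    return 0;                           /* zero: cancellation succeeded */
}

/* Called when the driver queues a request that may later be cancelled. */
void mydev_prepare(struct kiocb *iocb)
{
    iocb->in_flight = 0;
    iocb->ki_cancel = my_aio_cancel;
}
```

A real driver would also have to remove the request from its own queue, under its lock, before reporting success.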

Patches and updates

Kernel trees

Stephen Hemminger: 2.5.63-osdl3
Marcelo Tosatti: Linux 2.4.21-pre5
Con Kolivas: 2.4.20-ck4
Alan Cox: Linux 2.2.24-rc5

Architecture-specific

Christoph Hellwig: allow CONFIG_SWAP=n for i386

Build system

Sam Ravnborg: kbuild: Separate objdir

Documentation

Denis Vlasenko: lk maintainers
Andries.Brouwer@cwi.nl: man-pages 1.56 released

Memory management

Andrew Morton: 2.5.63-mm1
Andrew Morton: 2.5.63-mm2
William Lee Irwin III: pgcl-2.5.63-2
William Lee Irwin III: pgcl-2.5.64-[12]

Networking

Kazunori Miyazawa: IPv6 IPsec support

Page editor: Jonathan Corbet


Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds