Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.18-rc6, released by Linus on September 3. It is possibly the final prepatch before the 2.6.18 final release. There are a lot of fixes; one of those is the removal of much of the SMP alternatives work, which was causing build problems with some compilers. See the long-format changelog for the details.

A handful of additional fixes have gone into the mainline git repository since the -rc6 release.

The current -mm tree is 2.6.18-rc5-mm1. Recent changes to -mm include some enhancements to the no-MMU architecture support, a number of NFS server improvements, a steady trickle of reiser4 fixes, and a set of patches allowing a kernel to be built with no block device support (though those don't quite work yet).

On the 2.4 front, Willy Tarreau has released with a small set of important fixes, and 2.4.34-pre2 with a larger set of fixes. The current plan is that the gcc 4.x support discussed here a couple of weeks ago will be merged into 2.4.34-pre3.

Other notes: a number of developers have been having difficulties posting patches to lists hosted on recently. The list maintainers have begun using bogofilter in an attempt to cut down on the amount of spam getting through to the list - and, presumably, to reduce the amount of time they put into manually maintaining filter patterns. To date, bogofilter appears to believe that quite a few patches are spam. On the other hand, a substantial amount of distinctly non-technical mail involving nontraditional approaches to family quality time and members of the animal kingdom has been sailing through without a hitch. One can only assume that the training process will eventually iron out these little problems.

Kernel development news

Quote of the week

"Tiedostoa tai hakemistoa ei ole" just means "No such file or directory".

EVERYBODY knows that.

It's not like there's even a single ä or ö in the whole sentence, so you can't even blame the strange letters (and that's unusual, since in Finnish, usually every other letter is 'ä' if only just to confuse the uninitiated).

-- Linus Torvalds

Support for drivers in user space

Device drivers are generally done inside the kernel for the usual reasons of performance and control. There are times, however, when the ability to run a device driver in user space is helpful. These include situations where the code is far too large to go into the kernel (, for example) and where the author of the driver does not wish to place the code under the GPL. Some types of drivers (such as those for USB devices) are easily run in user space now, but others can be a bit more challenging. Very few PCI drivers, for example, are written in user space.

Thomas Gleixner has written an interface module which may help to change that situation. With this code in place, PCI drivers (some of them, at least) can be written almost entirely in user space, with only a small stub module loaded into the kernel.

That module has two specific jobs to carry out. The first is to register the device to be driven, with a couple of bits of important information. To that end, it should fill out an iio_device structure, which contains the following fields:

struct iio_device {
    char			*name;
    char			*version;
    unsigned long		physaddr;
    void			*virtaddr;
    unsigned long		size;
    long			irq;

    irqreturn_t (*handler)(int irq, void *dev_id, struct pt_regs *regs);
    ssize_t (*event_write)(struct file *filep, const char __user * buf,
    		       size_t count, loff_t *ppos);
    struct file_operations	*fops;
    void			*priv;
    /* ... */
};

The first part of the structure provides information about the hardware to be driven - its name, where its I/O memory area lives (physaddr), where that area has been mapped into the kernel (virtaddr), its size, and the interrupt being used by the device. If virtaddr is zero, then physaddr is interpreted as the beginning of a range of I/O ports, rather than a memory address.
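
The rule for telling memory regions from port ranges can be sketched in ordinary user-space C. This is only an illustration: struct iio_region, enum region_kind, and classify_region() are hypothetical names, and the structure is a simplified stand-in for the fields described above rather than the definition from the patch.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-in for the hardware-description fields
 * of iio_device; this is not the definition used by the patch. */
struct iio_region {
    const char    *name;
    unsigned long  physaddr;  /* start of I/O memory, or first I/O port */
    void          *virtaddr;  /* kernel mapping; NULL means "I/O ports" */
    unsigned long  size;
    long           irq;
};

enum region_kind { REGION_IOMEM, REGION_IOPORT };

/* Apply the rule from the text: a zero virtaddr means that physaddr
 * names the beginning of a port range instead of a memory address. */
static enum region_kind classify_region(const struct iio_region *r)
{
    return r->virtaddr == NULL ? REGION_IOPORT : REGION_IOMEM;
}
```

A region registered with a non-NULL virtaddr would be classified as I/O memory, while one registered with virtaddr left at zero would be treated as a range of ports beginning at physaddr.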

The fops field provides the file operations for the device; normally, they are set to the generic versions provided by the IIO (for "industrial I/O") driver: iio_open(), iio_read(), iio_mmap(), etc. With this setup, the driver can create a basic device which allows a user-space program to read from or write to device memory (or ports). I/O memory can also be mapped into user space.

The capabilities described thus far are not all that different from what can be done with /dev/mem; the main difference is that the stub driver can enable the PCI device and perform any other needed initialization. The real hitch in writing user-space PCI drivers, however, has been in the handling of interrupts. There is currently no way to write a user-space interrupt handler, and the IIO patch doesn't really change that. Instead, the stub driver is expected to provide a minimal interrupt handler of its own.

This handler is needed because every device requires its own specific interrupt acknowledgment ritual. The kernel must respond quickly to an interrupt and give the device the attention it craves so that said device will stop asserting the interrupt. After that, any additional processing can be done at relative leisure. So, once the handler provided with the stub driver acknowledges the interrupt, the rest of the work can normally be done by the user-space driver.

All that is needed is to let this driver know that the interrupt has happened. The IIO module provides a couple of mechanisms for that purpose. One is a second device node associated with the device; whenever an interrupt happens, a byte can be read from this "event device." So a user-space driver can simply block on a read from that device, or it can use poll() in more complicated situations. It is also possible for the user-space driver to receive SIGIO signals when an interrupt happens, but using signals will normally increase the ultimate response time to the interrupt.
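
The user-space side of that wait might look roughly like the sketch below. It assumes only what is described above: that a byte becomes readable on the event device when an interrupt has happened. The function name wait_for_irq_event() is hypothetical, and the caller is assumed to have opened the event device already (its path is not given here, so none is shown).

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds for event_fd to become readable,
 * then consume one byte from it.
 *
 * Returns 1 if an event byte was read, 0 on timeout or end of file,
 * and -1 on error. If poll() is interrupted by a signal, the wait is
 * simply restarted with the full timeout.
 */
static int wait_for_irq_event(int event_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = event_fd, .events = POLLIN };
    unsigned char token;
    ssize_t n;
    int ready;

    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);

    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;               /* timed out; no interrupt seen */

    do {
        n = read(event_fd, &token, 1);
    } while (n < 0 && errno == EINTR);

    if (n < 0)
        return -1;
    return n == 1 ? 1 : 0;      /* n == 0 means end of file */
}
```

A driver with nothing else to do could skip poll() and block in read() directly; the poll() form is useful when several descriptors must be watched at once. Since the function only needs a readable descriptor, it can be exercised with a pipe standing in for the event device.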

So, to make all this happen, the stub driver provides a minimal interrupt handler in the handler() field of the iio_device structure. When an interrupt happens, the IIO module will call this handler; if it returns IRQ_HANDLED, user space will be notified. If the stub driver provides an event_write() function, that function will be called in response to a write operation on the event device. This capability can be used to further control the kernel-space response to interrupts, request that interrupts be masked, etc.
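
The division of labor between the stub's handler and the IIO module can be modeled in user space as follows. Everything here is a simulation with hypothetical names (the sketch_ prefix marks them): the real handler runs in the kernel with the signature shown in the structure above, and the fake status register stands in for whatever acknowledgment a particular device requires.

```c
/* User-space simulation only; these values mimic the kernel's
 * IRQ_NONE and IRQ_HANDLED return codes. */
enum sketch_irqreturn { SKETCH_IRQ_NONE = 0, SKETCH_IRQ_HANDLED = 1 };

struct sketch_device {
    enum sketch_irqreturn (*handler)(int irq, void *dev_id);
    void          *priv;            /* passed to the handler as dev_id */
    unsigned long  pending_events;  /* what the event device would report */
};

/* Model of the generic module: call the stub's handler and queue a
 * user-space notification only if the handler claimed the interrupt. */
static void sketch_dispatch_irq(struct sketch_device *dev, int irq)
{
    if (dev->handler && dev->handler(irq, dev->priv) == SKETCH_IRQ_HANDLED)
        dev->pending_events++;
}

/* Model of a minimal stub handler. dev_id points at a fake status
 * register in which bit 0 means "this device is asserting its IRQ". */
static enum sketch_irqreturn sketch_ack(int irq, void *dev_id)
{
    unsigned int *status = dev_id;

    (void)irq;
    if (!(*status & 1u))
        return SKETCH_IRQ_NONE;   /* not ours, e.g. a shared line */
    *status &= ~1u;               /* acknowledge so the device deasserts */
    return SKETCH_IRQ_HANDLED;
}
```

In this model, dispatching an interrupt while the status bit is set clears the bit and queues one event; dispatching again with the bit clear queues nothing, which mirrors the rule that user space is notified only when the handler returns IRQ_HANDLED.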

Readers who think that the event mechanism shares some features with the proposed kevent subsystem are right. It is probable that the IIO event handling code will be rewritten to use kevents, if and when kevents are merged into the mainline.

Meanwhile, however, the IIO driver works. Thomas has posted an example driver (or parts of one, anyway) to show how this mechanism can be used. The real question which appears to be on a number of minds is: could ATI and nVidia use IIO to move their drivers out of the kernel? Only those vendors can answer that question, so, until they say something, nobody really knows.

Parallel IDE drivers

Back in 2003, Jeff Garzik announced the availability of "a new SCSI driver." That driver was, in fact, the libata subsystem, which was to be the foundation for serial ATA support in Linux. In the process, however, Jeff had thought a bit about supporting the current parallel ATA (PATA) drives, but that was not really his goal:

Note that PATA in my driver is only an afterthought. The main area of focus, now and in the future, is SATA.

In the last three years, the parallel ATA drives that most of us use have continued to be driven by the old IDE driver subsystem. Some of this code dates back to the beginning of Linux; since then it has been maintained by a substantial list of people, a number of whom are widely held to have been driven insane by the experience. The current maintainer, Bartlomiej Zolnierkiewicz, has kept a rather low profile for some time now; he signed off no patches in either the 2.6.17 or the upcoming 2.6.18 kernel. Not much has been happening in the IDE area.

That does not mean that things have been quiet in the parallel ATA area, however. Over the last year or so, Alan Cox has been working to bring full PATA support into the libata code. The resulting drivers have been sitting in the -mm tree for a while, but that period is about to end: the PATA driver set has been queued for merging into 2.6.19.

The stated advantages of the new PATA code are many. The code has been reworked from the beginning, and is up to current kernel standards. The use of libata means that these drivers are well integrated with their SATA cousins, bringing two divergent subsystems back together. The new drivers support a number of chipsets that the IDE layer doesn't handle. Error handling has been much improved. Also, according to Alan's announcement from August, the new drivers feature "active maintenance and updates" and "more interesting bugs to find and help fix."

On the other hand, the new PATA drivers are not considered to be ready for production use yet, and distributors are not expected to enable them in the near future. The merging into 2.6.19 is intended mainly to broaden the test base. A completely new disk subsystem is the sort of thing that one likes to test very well before entrusting it with data that one wishes to actually keep; that process may go on for a little while yet. It is also worth noting that the new PATA code drops support for some ancient IDE controllers.

The issue that gets everybody's attention, however, is that, as with all drives handled through libata, PATA drives show up as if they were SCSI disks, and are named /dev/sd*. Anybody who just switches to the new drivers without updating /etc/fstab (or using the mount-by-label feature) is likely to have a rough bootstrap experience. That is an easy problem to work around, but the use of the SCSI drive namespace seems to bother some people. What appears to be happening in reality is that Linux is slowly moving toward having a generic disk subsystem, where everything can just be called /dev/diskN. All that's left is a few details and a new set of udev rules to rename the device nodes.

Someday, most of us will be using the new PATA code. But this is not a process which is expected to go quickly, and there are no plans to remove or deprecate the existing IDE code:

At this point in time it is premature to discuss or plan the point at which the old IDE layer would go away. That discussion can start at the point where everyone is happy that the new libata based layer is providing better quality and coverage than the old one. Even then there would be no need to hurry.

So it appears that Linux will have parallel subsystems for parallel ATA support for some time.

Guest page hinting

Paravirtualized systems are operating systems unto themselves - they look like independent systems to the greatest extent possible. In the end, however, a paravirtualized system is still running under a host, and must interact with that host. A recent set of patches (entitled "guest page hinting") shows how running paravirtualized systems in a fully independent mode can hurt performance - and the sorts of tricks which can be required to make things run more efficiently.

Consider, for example, a short-lived application which runs on a guest system. That application may dirty a number of pages, then exit, its job finished. The guest system knows that the dirty pages are no longer in use, and can be recycled. From the host's point of view, however, the only thing known is that the pages are dirty. So the host will, if it needs to reclaim those pages, carefully write their (useless) data out to swap first. This is a wasted effort which would be nice to avoid.

The hinting patches add a couple of low-level primitives for use by guest operating systems: set_page_unused() and set_page_stable(). The former marks a page as being unneeded by the guest, while the latter marks the page as being in active use. The s/390 architecture (which is the main target for this patch set currently) can implement these states through a pair of page flags which the guest can set, making the operations fast. Once pages have been marked as unused, the host system can reclaim them with no further effort, making the whole virtual memory subsystem more efficient.

The next step is to consider page cache pages. These pages will contain data from a file found on a storage device somewhere, meaning that they can be recreated from the source if need be. That, in turn, means that the host could discard them in response to memory pressure. But, once again, the host knows nothing about the guests' page caches. So the hinting patches add another state, called "volatile," to mark pages with backing store. When the host is feeling memory pressure, it is free to discard volatile pages without saving their contents first. It must, however, make sure that the guest system knows that this action has taken place so that the page can be removed from the guest's page cache. In the current patch set, this notification only works for s/390 machines, however.

Pages which have been locked into memory pose an extra challenge here - they can be part of the page cache, but they still shouldn't be taken away by the host system. So such pages cannot be marked as "volatile." The problem is that figuring out if a page is locked is harder than it might seem; it can involve scanning a list of virtual memory area (VMA) structures, which is slow. So the hinting patches add a new flag to the address_space structure to note that somebody has locked pages from that address space in memory. When the flag is set, those pages are not marked as being volatile.
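
The states described so far, and what each one allows the host to do, can be summarized in a small user-space model. This is a sketch under stated assumptions, not code from the patch set: set_page_unused() and set_page_stable() take their names from the patches but are given simplified signatures here, while try_set_page_volatile(), host_reclaim_action(), and the sketch_ structures are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum page_state { PAGE_STABLE, PAGE_UNUSED, PAGE_VOLATILE };

enum host_action {
    HOST_PRESERVE_CONTENTS,    /* e.g. write the page to swap first */
    HOST_DISCARD,              /* guest no longer needs the page */
    HOST_DISCARD_AND_NOTIFY    /* guest must drop it from its page cache */
};

/* Hypothetical stand-in for the new address_space flag. */
struct sketch_mapping { bool has_locked_pages; };

struct sketch_page {
    enum page_state        state;
    struct sketch_mapping *mapping;  /* NULL: no backing store */
};

static void set_page_unused(struct sketch_page *p) { p->state = PAGE_UNUSED; }
static void set_page_stable(struct sketch_page *p) { p->state = PAGE_STABLE; }

/* Hypothetical helper: mark a page volatile only if it has backing
 * store and nobody has locked pages from its address space in memory.
 * Returns false, leaving the state unchanged, otherwise. */
static bool try_set_page_volatile(struct sketch_page *p)
{
    if (p->mapping == NULL || p->mapping->has_locked_pages)
        return false;
    p->state = PAGE_VOLATILE;
    return true;
}

/* What the host may do with a guest page it wants to reclaim. */
static enum host_action host_reclaim_action(const struct sketch_page *p)
{
    switch (p->state) {
    case PAGE_UNUSED:
        return HOST_DISCARD;
    case PAGE_VOLATILE:
        return HOST_DISCARD_AND_NOTIFY;
    default:
        return HOST_PRESERVE_CONTENTS;
    }
}
```

The flag check replaces the slow scan over VMA structures with a single test per address space, at the cost of being conservative: in this model, one locked page keeps every page in the same mapping from being marked volatile.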

The swap cache also benefits from some hinting work - once the guest has written a page to swap, that page has good backing store and can be grabbed by the host system. The approach taken is similar to that used with the page cache, though there are a few extra details to take care of. For example, the guest must take care to have the page marked stable (and deal with its potentially having been discarded by the host) before freeing the associated entry in the swap area.

Attentive readers may have noticed that these patches are heavily oriented toward the s/390 architecture. IBM has, of course, been doing virtualization for a very long time, so it is not surprising that some relatively advanced virtualization patches are coming from that direction - or that IBM's architectures are designed with virtualization in mind. Other paravirtualization projects will encounter many of the same issues, however, and may well benefit from this work. So the next stage for this patch set should be consideration by other projects and possible work to make the hinting features more generally applicable.

Patches and updates

Filesystems and block I/O

  • Andreas Gruenbacher: Tmpfs acls. (September 4, 2006)

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds