Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.20-rc3, released by Linus just before he went out to celebrate the new year. It contains the fix for the file corruption bug (see below) and a few hundred other fixes.

Previously, 2.6.20-rc2 was released on December 23 with another big set of fixes.

Just a few patches have been added to the mainline git repository since -rc3 came out. There are currently six entries in the known unfixed regressions list maintained by Adrian Bunk.

The current -mm tree is 2.6.20-rc2-mm1. Recent changes to -mm include a new version of the user-space drivers patch, more paravirtualization hooks, a generic time implementation for x86_64, and a generic GPIO driver core.

For older kernels: 2.6.16.37 was released on December 28 with a long list of fixes.

2.4.34 came out on December 28. It has a number of security fixes and support for the gcc 4.x compilers.


Kernel development news

Quote of the week

I don't care much, really. But then, I understand how all this stuff works. Try explaining to someone the relationship between pte-dirtiness, page-dirtiness, radix-tree-dirtiness and buffer_head-dirtiness.

-- Andrew Morton


Device resource management

Writing device drivers can be a tricky task. Simply getting a piece of hardware to operate as desired - perhaps working from erroneous or nonexistent documentation - can be a frustrating process. Beyond that, however, the driver must allocate several different types of resources for the device; these resources can include I/O memory mappings, interrupt lines, blocks of memory, DMA buffers, registrations with multiple subsystems, etc. All of these allocations must be returned to the system when the device (or its driver) goes away. It is not uncommon for driver writers to forget to deallocate something, leading to resource leaks.

The problem can get worse, however, in the face of initialization errors. If the driver is unable to properly set up its device, it must undo any registrations which had been done up to the point of failure. Attempts to handle initialization failures usually take the form of several goto labels within the initialization function or some sort of global "initialization state" variable describing where cleanup should begin. Either way, these paths tend not to be well tested, so the chances of an initialization failure leading to some sort of resource leak are quite good.
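The goto-label style described above can be sketched in a few lines of user-space C. This is only an illustration: acquire(), release(), example_init() and example_exit() are hypothetical stand-ins for real resource calls, and fail_at exists purely so that each error path can be exercised.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for real resources; a counter tracks live ones. */
static int live_resources;
static char slots[3];

static void *acquire(int idx, int should_fail)
{
    if (should_fail)
        return NULL;
    live_resources++;
    return &slots[idx];
}

static void release(void *res)
{
    (void)res;
    live_resources--;
}

/*
 * Set up three resources. fail_at selects the step (1..3) that fails;
 * 0 means success. On failure, whatever was acquired so far is undone
 * in reverse order by falling through the chain of labels.
 */
static int example_init(int fail_at, void **out)
{
    void *regs, *irq, *buf;

    regs = acquire(0, fail_at == 1);
    if (!regs)
        goto err_out;
    irq = acquire(1, fail_at == 2);
    if (!irq)
        goto err_free_regs;
    buf = acquire(2, fail_at == 3);
    if (!buf)
        goto err_free_irq;

    out[0] = regs;
    out[1] = irq;
    out[2] = buf;
    return 0;

err_free_irq:
    release(irq);
err_free_regs:
    release(regs);
err_out:
    return -1;
}

/* The exit path must mirror the init path by hand. */
static void example_exit(void **res)
{
    release(res[2]);
    release(res[1]);
    release(res[0]);
}
```

Every new resource means another label and another line in the exit function. Forgetting either one produces exactly the sort of leak which only shows up on a rarely-exercised path.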

Tejun Heo, who has done much to improve the Linux serial ATA subsystem over the last year, has had enough of these sorts of initialization problems. So he has put together a device resource management patch which, if accepted, has the potential to make driver code simpler and more robust. The core idea is simple: every time a driver allocates a resource, the management code remembers the allocation and any information needed to free that allocation. When the driver disconnects from the device, all of the remembered allocations are returned to the system.

This sort of allocation tracking cannot be added to the current API in any sort of coherent way. Tejun's patch, instead, creates new "managed" versions of various allocation functions. The new functions look like the old ones with (1) the addition of "m" (or "devm") to the name, and (2) a struct device argument if the function did not already have one. So, for example, the managed versions of the interrupt allocation functions are:

    int devm_request_irq(struct device *dev, unsigned int irq,
                         irq_handler_t handler, unsigned long irqflags,
                         const char *devname, void *dev_id);
    void devm_free_irq(struct device *dev, unsigned int irq,
                       void *dev_id);

The patch also includes managed functions for dealing with DMA buffers, I/O memory regions, plain memory allocations, and PCI device setup. They allow the driver author to replace a whole set of deallocation calls with a simple call to devres_release_all(), simplifying the code significantly. In fact, even that call is unnecessary; the driver core will call it when the driver detaches from the device.
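The tracking idea can be modeled in user space. The sketch below is not Tejun's implementation; struct model_device, model_res_add(), model_managed_alloc() and model_release_all() are hypothetical names, and a simple linked list is assumed as the record of allocations.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space model; not the kernel's devres code. */
struct res_node {
    void (*release)(void *data);  /* how to free this allocation */
    void *data;                   /* what to hand to release() */
    struct res_node *next;        /* the allocation made before this one */
};

struct model_device {
    struct res_node *resources;   /* most recent allocation first */
};

/* Remember an allocation, and how to undo it. */
static int model_res_add(struct model_device *dev,
                         void (*release)(void *), void *data)
{
    struct res_node *node = malloc(sizeof(*node));

    if (!node)
        return -1;
    node->release = release;
    node->data = data;
    node->next = dev->resources;
    dev->resources = node;
    return 0;
}

/* A "managed" malloc(): same result, but tracked against the device. */
static void *model_managed_alloc(struct model_device *dev, size_t size)
{
    void *p = malloc(size);

    if (p && model_res_add(dev, free, p) != 0) {
        free(p);
        return NULL;
    }
    return p;
}

/* Undo everything, newest first; returns the number of resources freed. */
static int model_release_all(struct model_device *dev)
{
    int count = 0;

    while (dev->resources) {
        struct res_node *node = dev->resources;

        dev->resources = node->next;
        node->release(node->data);
        free(node);
        count++;
    }
    return count;
}
```

With this structure, a failed initialization and a normal detach share the same cleanup call, so the error path needs no labels of its own.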

For more complicated situations, there is also a "group" concept. Groups can be thought of as markers in the stream of allocations associated with a given device. The allocations performed within a specific group can be rolled back without affecting any others. In brief, the group API is:

    void *devres_open_group(struct device *dev, void *id, gfp_t gfp);
    void devres_close_group(struct device *dev, void *id);
    void devres_remove_group(struct device *dev, void *id);
    int devres_release_group(struct device *dev, void *id);

A call to devres_open_group() will create a new group for the given device, identified by the id value. Any allocations performed thereafter will be considered to be a part of that group until devres_close_group() is called. If initialization works as desired, however, devres_remove_group() can be used to get rid of the group overhead while leaving the allocations (and their tracking information) untouched. In the failure path, devres_release_group() will return all allocations belonging to the given group.
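The group semantics can be modeled roughly as follows. This is a simplification under stated assumptions: every name here is hypothetical, groups do not nest, the id must be non-NULL, and each tracked resource simply records which group (if any) was open when it was allocated.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RES 16

/* Hypothetical model: a resource is tagged with the group that was open
 * when it was allocated, or NULL if no group was open. */
struct tracked {
    int in_use;
    void *group;
};

struct model_device {
    struct tracked res[MAX_RES];
    void *open_group;   /* id of the currently open group, if any */
};

static void model_open_group(struct model_device *dev, void *id)
{
    dev->open_group = id;
}

static void model_close_group(struct model_device *dev, void *id)
{
    if (dev->open_group == id)
        dev->open_group = NULL;
}

/* Track a new allocation; returns its slot, or -1 if the table is full. */
static int model_alloc(struct model_device *dev)
{
    for (int i = 0; i < MAX_RES; i++) {
        if (!dev->res[i].in_use) {
            dev->res[i].in_use = 1;
            dev->res[i].group = dev->open_group;
            return i;
        }
    }
    return -1;
}

/* Success path: forget the grouping, keep the allocations. */
static void model_remove_group(struct model_device *dev, void *id)
{
    for (int i = 0; i < MAX_RES; i++)
        if (dev->res[i].in_use && dev->res[i].group == id)
            dev->res[i].group = NULL;
    model_close_group(dev, id);
}

/* Failure path: give back only the group's allocations; returns the count. */
static int model_release_group(struct model_device *dev, void *id)
{
    int released = 0;

    for (int i = 0; i < MAX_RES; i++) {
        if (dev->res[i].in_use && dev->res[i].group == id) {
            dev->res[i].in_use = 0;
            dev->res[i].group = NULL;
            released++;
        }
    }
    model_close_group(dev, id);
    return released;
}

static int model_live(const struct model_device *dev)
{
    int n = 0;

    for (int i = 0; i < MAX_RES; i++)
        n += dev->res[i].in_use;
    return n;
}
```

Allocations made before the group was opened, or after it was closed, survive a release of that group.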

There has been very little discussion of this patch set, as of this writing. Driver writers, perhaps, are still recovering from the holiday festivities. It is not too hard to imagine that there could be some discomfort about the extra overhead involved in tracking all of those allocations - especially since things do function normally almost all of the time. In the end, however, the promise of correct operation in a wider range of situations may be enough to motivate the inclusion of the new interface.


A nasty file corruption bug - fixed

The December 20 LWN Kernel Page contained an article about a file corruption bug generally (but not exclusively) seen with ext3 filesystems. Certain applications which have unusual patterns of access to memory-mapped files could, at times, see gaps where data had not made it all the way to the disk. The rtorrent tool was one such application; other test cases were found (and developed) as the hunt for this problem intensified. The problem is now solved, but it offers some interesting lessons on how this kind of subtle bug can come about - and how to get it fixed.

[Cheezy diagram] In an attempt to explain what was going on, your editor will once again employ his rather dubious artistic skills. To that end, readers are kindly requested to look at the diagram to the right and suspend enough disbelief to imagine that it represents a page in memory - a page containing interesting data, and which represents an equivalent set of blocks found within a file on the disk. The distinction between the page and its component blocks is important, which is why the dotted lines divide up the page. A 4096-byte page in memory is likely represented by eight 512-byte disk blocks (which are, most likely, merged back together by the drive, but we'll pretend that isn't happening).

There are a couple of different kernel data structures which contain information about this page, making the diagram a bit more complicated:

[Second diagram]

The page may be mapped into one or more process address spaces. For each such mapping, there will be a page table entry (PTE) which performs the translation between the user-space virtual address and the physical address where the page actually lives. There is also some other information in the PTE, including a "dirty" bit. When the application modifies the page, the processor will set the dirty bit, allowing the operating system to respond by (for example) writing the page back to its backing store. Note that, if there are multiple PTEs pointing to a single page, they may well disagree on whether the page is dirty or not. The only way to know for sure is to scan all existing PTEs and see if any of them are marked dirty.
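The "scan all the PTEs" rule amounts to a logical OR over the dirty bits. A minimal sketch, assuming a made-up PTE layout (struct model_pte and MODEL_PTE_DIRTY are hypothetical and do not match any real architecture):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PTE_DIRTY 0x1u   /* hypothetical bit position */

struct model_pte {
    unsigned int flags;
};

/* The page is dirty if any one of the PTEs mapping it says so. */
static bool model_page_dirty_in_ptes(const struct model_pte *ptes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ptes[i].flags & MODEL_PTE_DIRTY)
            return true;
    return false;
}
```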

The kernel maintains a separate data structure known as the system memory map; it contains one struct page for every physical page known to exist. This structure contains a number of interesting bits of information, including a pointer to the page's backing store (if any), a data structure allowing the associated PTEs to be found relatively easily, and a set of page flags. One of those flags is a dirty bit - another flag which notes that the page is in need of writing to its backing store. (For those following closely, it may be worth pointing out that the red arrow pointing to the page does not actually exist as a pointer field; it is implicit in the structure's position within the memory map.)

Finally, there is another set of structures which may be associated with this page:

[Third diagram]

The "buffer head" (or "bh") goes back to the earliest days of Linux. It can be thought of as a mapping between a disk block and its copy in memory. The bh is not central to Linux memory management in the way it once was, but a number of filesystems still use it to handle their disk I/O tracking. Note that there is not necessarily a bh structure for every block found within a page; if a filesystem has reason to believe that only some blocks need writing, it does not need to create bh structures for the rest. Among other things, the bh structure contains yet another dirty flag.

With all of these different flags representing what is essentially the same information, it is not entirely surprising that some confusion eventually came about. The maintenance of redundant data structures can be a challenge in any setting, and the kernel environment adds difficulties of its own.

Deep within the kernel, there is a function called set_page_dirty(); it is used by the memory management code when it notices (via a PTE or a direct application operation) that a page is in need of writeback. Among other things, it copies the dirty bit from the page table entries into the page structure. If the page is part of a file, set_page_dirty() will call back into the relevant filesystem - but only if said filesystem has provided the appropriate method. Many filesystems do not provide a set_page_dirty() callback, however; for these filesystems, the kernel will, instead, traverse the list of associated bh structures and mark each of them dirty.
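The dispatch just described (use the filesystem's method if there is one, otherwise dirty every buffer head) can be modeled as follows. All of the model_ structures and functions are hypothetical simplifications, not the kernel's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page;

/* Hypothetical per-filesystem method table with one optional method. */
struct model_fs_ops {
    int (*set_page_dirty)(struct model_page *page);
};

struct model_bh {
    bool dirty;
};

struct model_page {
    bool dirty;
    const struct model_fs_ops *ops;   /* NULL if the fs provides nothing */
    struct model_bh *bhs;             /* buffer heads attached to the page */
    size_t nr_bhs;
};

/* Example filesystem method which tracks dirtiness without buffer heads. */
static int model_fs_set_page_dirty(struct model_page *page)
{
    page->dirty = true;
    return 0;
}

static int model_set_page_dirty(struct model_page *page)
{
    size_t i;

    /* The filesystem gets the first chance, if it supplied a method. */
    if (page->ops && page->ops->set_page_dirty)
        return page->ops->set_page_dirty(page);

    /* Default path: mark every attached buffer head dirty. */
    for (i = 0; i < page->nr_bhs; i++)
        page->bhs[i].dirty = true;
    page->dirty = true;
    return 0;
}
```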

And that is where the problem comes in. The filesystem may well have noticed that a block represented by a given bh was dirty and started I/O on it before the set_page_dirty() call. When the I/O is complete, the filesystem clears the dirty flag in the bh. If the set_page_dirty() call comes while the I/O on the block is active, the filesystem will not notice the fact that the block's data may have changed after it was written. Instead, the block will be marked clean, even though what was written does not correspond to what is currently in memory. File corruption results.

Linus's fix is simple. When the virtual memory subsystem decides that it is time to write a page, a new call to set_page_dirty() is made. That ensures that all buffer heads will be marked dirty at the time the filesystem's writepage() method is called. That change ensures that all blocks of the page will be written; testers have confirmed that it makes the file corruption problems go away. The patch has gone into the mainline git repository; it should show up in the next 2.6.19 stable update as well.
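The race and the effect of the extra set_page_dirty() call can be replayed in a single-threaded model. This is a toy under loud assumptions: one block with one character of data, hypothetical names throughout, and an explicit in_flight snapshot standing in for I/O already submitted to the disk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical one-block model of a page with a single buffer head. */
struct model_block {
    char memory;      /* current contents in the page cache */
    char disk;        /* contents which have reached the disk */
    char in_flight;   /* snapshot taken when the I/O was started */
    bool bh_dirty;    /* buffer-head dirty flag */
    bool page_dirty;  /* struct page dirty flag */
};

/* The application writes through its mapping; both flags end up set. */
static void app_store(struct model_block *b, char value)
{
    b->memory = value;
    b->page_dirty = true;
    b->bh_dirty = true;
}

static void fs_start_io(struct model_block *b)
{
    b->in_flight = b->memory;
}

/* On completion the filesystem marks the bh clean, whatever happened since. */
static void fs_end_io(struct model_block *b)
{
    b->disk = b->in_flight;
    b->bh_dirty = false;
}

/* VM-initiated writeback; redirty_first models the added call. */
static void vm_writepage(struct model_block *b, bool redirty_first)
{
    if (!b->page_dirty)
        return;
    if (redirty_first)
        b->bh_dirty = true;
    b->page_dirty = false;
    if (b->bh_dirty) {   /* the filesystem writes only dirty buffers */
        fs_start_io(b);
        fs_end_io(b);
    }
}

/* Replays the racy sequence; returns true if the disk matches memory. */
static bool model_run(bool with_fix)
{
    struct model_block b = {0};

    app_store(&b, 'A');
    fs_start_io(&b);      /* filesystem begins writing 'A' */
    app_store(&b, 'B');   /* page modified while the I/O is active */
    fs_end_io(&b);        /* bh marked clean although memory holds 'B' */
    vm_writepage(&b, with_fix);
    return b.disk == b.memory;
}
```

In this model, model_run(false) leaves 'A' on the disk while memory holds 'B'; model_run(true) forces the block to be written again.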

The longer-term solution is to continue pushing buffer heads out of the kernel's I/O paths. As Linus puts it:

The buffer head has been purely an "IO entity" for the last several years now, and it's not a cache entity. Anybody who does writeback by buffer heads is basically bypassing the real cache (the page cache), and that's why all the problems happen.

I think ext3 is terminally crap by now. It still uses buffer heads in places where it really really shouldn't, and as a result, things like directory accesses are simply slower than they should be. Sadly, I don't think ext4 is going to fix any of this, either.

Ted Ts'o responds that a fix for ext4 could yet happen, but it involves other filesystems as well. The ext3 filesystem is probably going to stay with buffer heads, however, meaning that the kernel will have to continue to work with them indefinitely.

Finally, this story illustrates just how hard it can be to track down and fix certain kinds of kernel bugs. Early in the process it was hard for the interested developers to reproduce the problem, so they had to rely on the initial reporters to try out various patches. Those reporters stuck with the process, building and testing a lot of kernels before the problem was flushed out. They deserve much of the credit for the resolution of this problem.


Asynchronous buffered file I/O

Asynchronous I/O (AIO) operations have the property of not blocking in the kernel. If an operation cannot be completed immediately, it is set in motion and control returns to the calling application while things are still in progress. This functionality allows a suitably-programmed application to keep multiple operations going in parallel without blocking on any of them.

While Linux has long offered a set of system calls for asynchronous I/O, support within the kernel has been spotty and slow in coming. Most char devices do not provide the necessary methods - generally because there is no pressing need for them to support asynchronous operations. Networking supports AIO reasonably well. At the block level, all I/O is asynchronous, but that is not true when dealing with the virtual filesystem layer. Quite a bit of work went into supporting asynchronous direct filesystem I/O, making the big database vendors happy. But most applications do not use direct I/O, and the system as a whole usually benefits from the use of buffered I/O. So asynchronous buffered I/O support is arguably the biggest remaining hole.

Various buffered filesystem AIO patches have been posted over the course of some three years, but none have made it into the kernel. Recently, Suparna Bhattacharya has restarted this work with a new file AIO patch which attempts to add this capability in the least intrusive way possible. This work may now be simple enough that few will be able to find things to object to.

Like previous versions of the patch, the current code adds a special wait queue to each process's task structure. That queue is used for normal synchronous operations, while asynchronous operations each have their own, dedicated queue. The current wait queue is passed into filesystem I/O operations which could block. That enables a couple of special tricks to be performed:

  • The I/O wait code checks to see if an asynchronous wait queue is in use. If so, it simply returns -EIOCBRETRY rather than waiting. This return code indicates that the operation is still in progress; among other things, it is used to ensure that the wait queue entry remains on the queue until the operation completes.

  • Normally, wait queues wake up whatever process is waiting on them. They are, however, rather more general than that. By changing the wakeup function (see this LWN article for information on how to do that), the AIO code can use wait queues as a notification service. When a "wakeup" happens on a queue being used for AIO, the kernel, rather than waking up a process, starts up a workqueue with an entry that will take the next step in the I/O operation.
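Both tricks can be modeled in user space. The sketch below is not the patch's code: struct model_wait_entry and the functions around it are hypothetical, MODEL_EIOCBRETRY is an arbitrary stand-in value, and "sleeping" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_EIOCBRETRY 1   /* arbitrary stand-in for the kernel's code */

/* Hypothetical wait queue entry carrying its own wakeup function. */
struct model_wait_entry {
    void (*wake)(struct model_wait_entry *entry);
    bool is_async;     /* true for an AIO request's dedicated entry */
    int slept;         /* sync: the caller would have blocked here */
    int woken;         /* sync: the sleeping process was woken */
    int work_queued;   /* async: retry work items handed to a workqueue */
};

/* Default behavior: wake the sleeping process. */
static void model_wake_process(struct model_wait_entry *e)
{
    e->woken = 1;
}

/* AIO behavior: queue work which takes the next step of the operation. */
static void model_wake_aio(struct model_wait_entry *e)
{
    e->work_queued++;
}

/* The I/O wait path: asynchronous callers are told to retry later. */
static int model_wait_on_page(struct model_wait_entry *e, bool page_ready)
{
    if (page_ready)
        return 0;
    if (e->is_async)
        return -MODEL_EIOCBRETRY;
    e->slept = 1;   /* stand-in for really sleeping until woken */
    return 0;
}

/* I/O completion simply calls whatever wakeup function is installed. */
static void model_io_complete(struct model_wait_entry *e)
{
    e->wake(e);
}
```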

The normal buffered filesystem read code, simplified almost into oblivion, looks something like this:

    for each file page to be read
        get the page into the page cache
        copy the contents to the user buffer

The real code can be found in mm/filemap.c as do_generic_mapping_read(), but the leading comment notes that "this is really ugly." It is one of only three functions so marked in that file, so, trust your editor, and go with the simple version above.

In the pseudocode version, the place where things block is clearly the step where the file page is read into the page cache. If the page is not already cached, the kernel will have to set up a disk I/O operation and wait for it to be carried out. That code proceeds the way it always did, until it gets to the "wait" part, at which point the AIO wait queue will be noticed and the code will return to whatever it was doing before. Once the read completes, the special wakeup function associated with the AIO queue will pick up where things left off.

One might well wonder just how that "pick up" part works. The wakeup function will not be running in the process of the original calling application, and may well not be running in process context at all. So it queues up a workqueue function which will examine the state of the outstanding I/O operation and, if necessary, jump back into the loop above to continue the work. Before doing so, however, the workqueue function carefully tweaks its memory management context so that it shares the original application's address space. That tweak is necessary to make the final line above (copy the page to the user buffer) work as expected. The workqueue function will perform that copy, then proceed on to the next page (if any). Likely as not, that next page will need to be read in from disk, so the workqueue function will, after ensuring that the operation is started, simply quit. This process repeats until all of the requested data has been read, at which point the application can be notified that the operation is complete.
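The stop-and-resume behavior can be sketched as a loop which saves its position in the request. Everything here is a hypothetical model: a four-page "file", a cached[] array standing in for the page cache, and model_io_done_and_retry() playing the part of both the disk completion and the workqueue function. The address-space switch has no counterpart in a single-process model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES  4
#define PAGE_SZ 4
#define MODEL_RETRY (-1)   /* hypothetical "still in progress" code */

struct model_file {
    char data[NPAGES][PAGE_SZ];   /* what is on "disk" */
    bool cached[NPAGES];          /* is the page in the page cache? */
};

struct model_aio_read {
    struct model_file *file;
    char *user_buf;
    size_t next_page;   /* where to pick up on the next pass */
    size_t nr_pages;
    int io_started;     /* number of disk reads set in motion */
};

/* One pass through the read loop; never waits for the disk. */
static int model_read_step(struct model_aio_read *req)
{
    while (req->next_page < req->nr_pages) {
        size_t i = req->next_page;

        if (!req->file->cached[i]) {
            req->io_started++;    /* start the read, then get out */
            return MODEL_RETRY;
        }
        memcpy(req->user_buf + i * PAGE_SZ, req->file->data[i], PAGE_SZ);
        req->next_page++;
    }
    return 0;   /* all requested data has been copied */
}

/* Stand-in for I/O completion followed by the workqueue retry. */
static int model_io_done_and_retry(struct model_aio_read *req)
{
    req->file->cached[req->next_page] = true;
    return model_read_step(req);
}
```

Each retry copies one page and, if the next page is not cached, starts another read and quits, matching the repeat-until-done pattern described above.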

On the write side, one might think that no changes are required - buffered file writes are already asynchronous, with the flush to disk happening in the background. The exception, however, is when O_SYNC is in use. There are situations where applications want to know when the data has found its way to the disk platter, but they still don't want to block waiting for that to happen. A very similar approach is used to make asynchronous O_SYNC writes work, though the patch is a little larger. A couple of the low-level page writeback functions required modifications so that they would pass the relevant wait queue around.

Even with this change in place, writes can still block on occasion. In particular, any operation which requires allocating disk blocks for the file may block while those allocations are performed. This issue can probably be worked around, but that work has not yet been done.

The result of all this is a working asynchronous buffered file I/O capability which makes almost no changes to (and adds little overhead to) the "normal" synchronous code. If no serious objections are raised, the Linux AIO subsystem might just become a little more complete in the near future.



Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds