The current 2.6 prepatch is 2.6.20-rc3,
released by Linus just before he
went out to celebrate the new year. It contains the fix for the file
corruption bug (see below) and a few hundred other fixes.
Previously, 2.6.20-rc2 was
released on December 23 with another big set of fixes.
Just a few patches have been added to the mainline git repository since
-rc3 came out. There are currently six entries in the known unfixed regressions list maintained by
The current -mm tree is 2.6.20-rc2-mm1. Recent changes
to -mm include a new version of the user-space drivers patch, more
paravirtualization hooks, a generic time implementation for x86_64, and a
generic GPIO driver core.
For older kernels: 126.96.36.199 was released on
December 28 with a long list of fixes.
2.4.34 came out on
December 28. It has a number of security fixes and support for the
gcc 4.x compilers.
Kernel development news
I don't care much, really. But then, I understand how all this
stuff works. Try explaining to someone the relationship between
pte-dirtiness, page-dirtiness, radix-tree-dirtiness and
-- Andrew Morton
Writing device drivers can be a tricky task. Simply getting a piece of
hardware to operate as desired - perhaps working from erroneous or
nonexistent documentation - can be a frustrating process. Beyond that,
however, the driver must allocate several different types of resources for
the device; these resources can include I/O memory mappings, interrupt
lines, blocks of memory, DMA buffers, registrations with multiple
subsystems, etc. All of these allocations must be returned to the system
when the device (or its driver) goes away. It is not uncommon for driver
writers to forget to deallocate something, leading to resource leaks.
The problem can get worse, however, in the face of initialization errors.
If the driver is unable to properly set up its device, it must undo any
registrations which had been done up to the point of failure. Attempts to
handle initialization failures usually take the form of several
goto labels within the initialization function or some sort of
global "initialization state" variable describing where cleanup should
begin. Either way, these paths tend not to be well tested, so the chances
of an initialization failure leading to some sort of resource leak are
high.
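The goto-label style of unwinding mentioned above usually looks something like the following. This is a user-space sketch, not code from any real driver: `fake_dev`, `acquire()`, and the `fail_at` parameter are hypothetical names, and `malloc()` stands in for the various kernel resources.

```c
#include <stdlib.h>

/* Hypothetical device state; each pointer stands in for one resource. */
struct fake_dev {
    void *regs;    /* stands in for an I/O memory mapping */
    void *dma_buf; /* stands in for a DMA buffer */
    void *irq;     /* stands in for an interrupt registration */
};

/* Allocation helper which can be forced to fail to exercise error paths. */
static void *acquire(int should_fail)
{
    return should_fail ? NULL : malloc(16);
}

/*
 * Set up all three resources. fail_at selects which step (1..3) fails;
 * 0 means no failure. On error, everything acquired so far is released
 * in reverse order and -1 is returned.
 */
int fake_dev_init(struct fake_dev *dev, int fail_at)
{
    dev->regs = dev->dma_buf = dev->irq = NULL;

    dev->regs = acquire(fail_at == 1);
    if (!dev->regs)
        goto err_out;
    dev->dma_buf = acquire(fail_at == 2);
    if (!dev->dma_buf)
        goto err_free_regs;
    dev->irq = acquire(fail_at == 3);
    if (!dev->irq)
        goto err_free_dma;
    return 0;

err_free_dma:
    free(dev->dma_buf);
    dev->dma_buf = NULL;
err_free_regs:
    free(dev->regs);
    dev->regs = NULL;
err_out:
    return -1;
}

/* Normal teardown: the same releases, written out a second time. */
void fake_dev_exit(struct fake_dev *dev)
{
    free(dev->irq);
    free(dev->dma_buf);
    free(dev->regs);
    dev->irq = dev->dma_buf = dev->regs = NULL;
}
```

Every new resource means a new label and a matching line in the exit function; forgetting either one produces a leak which only shows up on a rarely taken path.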
Tejun Heo, who has done much to improve the Linux serial ATA subsystem over
the last year, has had enough of these sorts of initialization problems.
So he has put together a device
resource management patch which, if accepted, has the potential to make
driver code simpler and more robust. The core idea is simple: every time
a driver allocates a resource, the management code remembers the allocation
and any information needed to free that allocation. When the driver
disconnects from the device, all of the remembered allocations are returned
to the system.
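The core idea can be modelled in user space with a per-device list of "how to undo this" records. The sketch below is only an illustration of the bookkeeping, not Tejun's implementation; `model_dev`, `model_track()`, `model_managed_alloc()`, and `model_release_all()` are hypothetical names.

```c
#include <stdlib.h>

/* One remembered allocation: what to free and how to free it. */
struct res_entry {
    void (*release)(void *data);
    void *data;
    struct res_entry *next;
};

/* Hypothetical stand-in for struct device; holds the tracking list. */
struct model_dev {
    struct res_entry *res_head; /* most recent allocation first */
};

/* Remember an allocation; returns 0 on success, -1 if out of memory. */
int model_track(struct model_dev *dev, void (*release)(void *), void *data)
{
    struct res_entry *e = malloc(sizeof(*e));

    if (!e)
        return -1;
    e->release = release;
    e->data = data;
    e->next = dev->res_head;
    dev->res_head = e;
    return 0;
}

/* Managed allocation: allocate, then record how to undo it. */
void *model_managed_alloc(struct model_dev *dev, size_t size)
{
    void *p = malloc(size);

    if (p && model_track(dev, free, p) != 0) {
        free(p);
        p = NULL;
    }
    return p;
}

/* Release everything, newest first; returns the number of releases. */
int model_release_all(struct model_dev *dev)
{
    int n = 0;

    while (dev->res_head) {
        struct res_entry *e = dev->res_head;

        dev->res_head = e->next;
        e->release(e->data);
        free(e);
        n++;
    }
    return n;
}
```

Because the list is walked newest-first, resources are returned in the reverse of the order in which they were obtained, which is the order a hand-written error path would use.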
This sort of allocation tracking cannot be added to the current API in any
sort of coherent way. Tejun's patch, instead, creates new "managed"
versions of various allocation functions. The new functions look like the
old ones with (1) the addition of "m" (or "devm") to
the name, and (2) a struct device argument if the function
did not already have one. So, for example, the managed versions of the
interrupt allocation functions are:
    int devm_request_irq(struct device *dev, unsigned int irq,
                         irq_handler_t handler, unsigned long irqflags,
                         const char *devname, void *dev_id);
    void devm_free_irq(struct device *dev, unsigned int irq,
                       void *dev_id);
The patch also includes managed functions for dealing with DMA buffers, I/O
memory regions, plain memory allocations, and PCI device setup. They allow
the driver author to replace a whole set of deallocation calls with a
simple call to devres_release_all(), simplifying the code
significantly. In fact, even that call is unnecessary; the driver core
will call it when the driver detaches from the device.
For more complicated situations, there is also a "group" concept. Groups
can be thought of as markers in the stream of allocations associated with a
given device. The allocations performed within a specific group can be
rolled back without affecting any others. In brief, the group API is:
    void *devres_open_group(struct device *dev, void *id, gfp_t gfp);
    void devres_close_group(struct device *dev, void *id);
    void devres_remove_group(struct device *dev, void *id);
    int devres_release_group(struct device *dev, void *id);
A call to devres_open_group() will create a new group for the
given device, identified by the id value. Any allocations
performed thereafter will be considered to be a part of that group until
devres_close_group() is called. If initialization works as
desired, however, devres_remove_group() can be used to get rid of
the group overhead while leaving the allocations (and their tracking
information) untouched. In the failure path,
devres_release_group() will return all allocations belonging to
the given group.
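The "markers in the stream" description can be made concrete with a small user-space model. It is an assumption-laden sketch rather than the patch's data structure: the allocation stream is a fixed-size array, and all of the `model_*` names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 32

enum entry_kind { ENTRY_RES, ENTRY_GROUP_OPEN, ENTRY_GROUP_CLOSE };

struct entry {
    enum entry_kind kind;
    void *ptr; /* the allocation for ENTRY_RES, the group id for markers */
};

struct model_dev {
    struct entry log[MAX_ENTRIES]; /* allocation stream, oldest first */
    int count;
};

static int push(struct model_dev *dev, enum entry_kind kind, void *ptr)
{
    if (dev->count == MAX_ENTRIES)
        return -1;
    dev->log[dev->count].kind = kind;
    dev->log[dev->count].ptr = ptr;
    dev->count++;
    return 0;
}

static int find(const struct model_dev *dev, enum entry_kind kind,
                const void *id, int from)
{
    for (int i = from; i < dev->count; i++)
        if (dev->log[i].kind == kind && dev->log[i].ptr == id)
            return i;
    return -1;
}

/* Drop log[first..last] and close the gap. */
static void cut(struct model_dev *dev, int first, int last)
{
    memmove(&dev->log[first], &dev->log[last + 1],
            (size_t)(dev->count - last - 1) * sizeof(dev->log[0]));
    dev->count -= last - first + 1;
}

void *model_alloc(struct model_dev *dev, size_t size)
{
    void *p = malloc(size);

    if (p && push(dev, ENTRY_RES, p) != 0) {
        free(p);
        p = NULL;
    }
    return p;
}

int model_open_group(struct model_dev *dev, void *id)
{
    return push(dev, ENTRY_GROUP_OPEN, id);
}

int model_close_group(struct model_dev *dev, void *id)
{
    return push(dev, ENTRY_GROUP_CLOSE, id);
}

/* Forget the markers for id; the allocations themselves stay tracked. */
void model_remove_group(struct model_dev *dev, void *id)
{
    int i;

    while ((i = find(dev, ENTRY_GROUP_CLOSE, id, 0)) >= 0)
        cut(dev, i, i);
    while ((i = find(dev, ENTRY_GROUP_OPEN, id, 0)) >= 0)
        cut(dev, i, i);
}

/* Free everything allocated inside group id; returns the number freed. */
int model_release_group(struct model_dev *dev, void *id)
{
    int first = find(dev, ENTRY_GROUP_OPEN, id, 0);
    int last, freed = 0;

    if (first < 0)
        return 0;
    last = find(dev, ENTRY_GROUP_CLOSE, id, first);
    if (last < 0)
        last = dev->count - 1; /* group still open: it runs to the end */
    for (int i = last; i > first; i--) {
        if (dev->log[i].kind == ENTRY_RES) {
            free(dev->log[i].ptr);
            freed++;
        }
    }
    cut(dev, first, last);
    return freed;
}
```

An initialization function would open a group at the top, call `model_remove_group()` on success, and call `model_release_group()` from its single error path.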
There has been very little discussion of this patch set, as of this
writing. Driver writers, perhaps, are still recovering from the holiday
festivities. It is not too hard to imagine that there could be some
discomfort about the extra overhead involved in tracking all of those
allocations - especially since things do function normally almost all of
the time. In the end, however, the promise of correct operation in a wider
range of situations may be enough to motivate the inclusion of the new
The December 20 LWN Kernel Page contained an article
about a file
corruption bug generally (but not exclusively) seen with ext3 filesystems.
Certain applications which have unusual patterns of access to memory-mapped
files could, at times, see gaps where data had not made it all the way to
the disk. The rtorrent tool was one such application; other test cases
were found (and developed) as the hunt for this problem intensified.
The problem is now solved, but it offers some interesting lessons on how
this kind of subtle bug can come about - and how to get it fixed.
In an attempt to explain what was going on, your editor will once again
employ his rather dubious artistic skills. To that end, readers are kindly
requested to look at the diagram to the right and suspend enough disbelief
to imagine that it
represents a page in memory - a page containing interesting data, and which
represents an equivalent set of blocks found within a file on the disk.
The distinction between the page and its component blocks is important,
which is why the dotted lines divide up the page. A 4096-byte page in
memory is likely represented by eight 512-byte disk blocks (which are, most
likely, merged back together by the drive, but we'll pretend that isn't
the case).
There are a couple of different kernel data structures which contain
information about this page, making the diagram a bit more complicated:
The page may be mapped into one or more process address spaces. For each
such mapping, there will be a page table entry (PTE) which performs the
translation between the user-space virtual address and the physical address
where the page actually lives. There is also some other information in the
PTE, including a "dirty" bit. When the application modifies the page, the
processor will set the dirty bit, allowing the operating system to respond
by (for example) writing the page back to its backing store. Note that, if
there are multiple PTEs pointing to a single page, they may well disagree
on whether the page is dirty or not. The only way to know for sure is to
scan all existing PTEs and see if any of them are marked dirty.
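That "scan them all" rule is easy to state in code. The following is a simplified model, not the kernel's reverse-mapping code; `model_pte`, `PTE_DIRTY`, and `page_dirty_in_any_pte()` are hypothetical names, and the PTEs are assumed to sit in a plain array.

```c
#include <stdbool.h>
#include <stddef.h>

#define PTE_DIRTY 0x1u /* hypothetical flag bit for this model */

struct model_pte {
    unsigned long pfn;  /* physical page frame the entry maps */
    unsigned int flags;
};

/* True if any PTE which maps pfn has its dirty bit set. */
bool page_dirty_in_any_pte(const struct model_pte *ptes, size_t n,
                           unsigned long pfn)
{
    for (size_t i = 0; i < n; i++)
        if (ptes[i].pfn == pfn && (ptes[i].flags & PTE_DIRTY))
            return true;
    return false;
}
```

A single dirty entry is enough to make the page dirty, no matter how many clean mappings of the same page exist.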
The kernel maintains a separate data structure known as the system memory
map; it contains one struct page for every physical page known to
exist. This structure contains a number of interesting bits of
information, including a pointer to the page's backing store (if any), a
data structure allowing the associated PTEs to be found relatively easily,
and a set of page flags. One of those flags is a dirty bit - another flag
which notes that the page is in need of writing to its backing store. (For
those following closely, it may be worth pointing out that the red arrow
pointing to the page does not actually exist as a pointer field; it is
implicit in the structure's position within the memory map).
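The "implicit pointer" amounts to simple array arithmetic. The sketch below assumes a single flat array covering all of memory, which is a simplification; `model_page`, `mem_map_model`, and the two helper functions are hypothetical names.

```c
#include <stddef.h>

#define NR_PAGES 16

/* Hypothetical, much-reduced stand-in for struct page. */
struct model_page {
    unsigned long flags;
};

/* One structure per physical page, indexed by page frame number. */
static struct model_page mem_map_model[NR_PAGES];

/* The frame number is just the structure's index within the map. */
size_t model_page_to_pfn(const struct model_page *page)
{
    return (size_t)(page - mem_map_model);
}

/* Going the other way is an ordinary array lookup. */
struct model_page *model_pfn_to_page(size_t pfn)
{
    return &mem_map_model[pfn];
}
```

No pointer to the physical page need be stored in the structure itself; its position in the array says everything.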
Finally, there is another set of structures which may be associated with
the page.
The "buffer head" (or "bh") goes back to the earliest days of Linux. It
can be thought of as a mapping between a disk block and its copy in
memory. The bh is not central to Linux memory management in the way it
once was, but a number of filesystems still use it to handle their disk I/O
tracking. Note that there is not necessarily a bh structure for every
block found within a page; if a filesystem has reason to believe that only
some blocks need writing, it does not need to create bh structures for the
rest. Among other things, the bh structure contains yet another dirty
flag.
With all of these different flags representing what is essentially the same
information, it is not entirely surprising that some confusion eventually
came about. The maintenance of redundant data structures can be a
challenge in any setting, and the kernel environment adds difficulties of
its own.
Deep within the kernel, there is a function called
set_page_dirty(); it is used by the memory management code when it
notices (via a PTE or a direct application operation) that a page is in
need of writeback. Among other things, it copies the dirty bit from the
page table entries into the page structure. If the page is part of a
file, set_page_dirty() will call back into the relevant filesystem
- but only if said filesystem has provided the appropriate method. Many
filesystems do not provide a set_page_dirty() callback, however; for
these filesystems, the kernel will, instead, traverse the list of
associated bh structures and mark each of them dirty.
And that is where the problem comes in. The filesystem may well have
noticed that a block represented by a given bh was dirty and started I/O on
it before the set_page_dirty() call. When the I/O is complete,
the filesystem clears the dirty flag in the bh. If the
set_page_dirty() call comes while the I/O on the block is active,
the filesystem will not notice the fact that the block's data may have
changed after it was written. Instead, the block will be marked clean,
even though what was written does not correspond to what is currently in
memory. File corruption results.
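The sequence of events can be replayed in a small, single-threaded model. This is a sketch of the race as described above, not kernel code: one buffer is tracked, `in_flight` holds the snapshot being written, and every name here is hypothetical.

```c
#include <stdbool.h>

struct model_bh {
    bool dirty;      /* the bh dirty flag */
    bool io_active;  /* a write of this block is in progress */
    char mem;        /* block contents in memory */
    char disk;       /* block contents on disk */
    char in_flight;  /* snapshot handed to the disk when I/O started */
};

/* Filesystem notices the dirty block and starts writing it. */
void fs_start_io(struct model_bh *bh)
{
    bh->in_flight = bh->mem;
    bh->io_active = true;
}

/* I/O completion: the snapshot reaches the disk; the flag is cleared. */
void fs_end_io(struct model_bh *bh)
{
    bh->disk = bh->in_flight;
    bh->io_active = false;
    bh->dirty = false;
}

/* Application stores to the memory-mapped page. */
void app_write(struct model_bh *bh, char value)
{
    bh->mem = value;
}

/* Default set_page_dirty() behavior: mark the buffer dirty. */
void model_set_page_dirty(struct model_bh *bh)
{
    bh->dirty = true;
}

/* Returns true if the block ends up marked clean but stale on disk. */
bool replay_race(void)
{
    struct model_bh bh = { .dirty = true, .mem = 'A', .disk = '-' };

    fs_start_io(&bh);          /* filesystem begins writing 'A' */
    app_write(&bh, 'B');       /* page modified while I/O is active */
    model_set_page_dirty(&bh); /* dirty state arrives during the I/O */
    fs_end_io(&bh);            /* completion clears the flag anyway */
    return !bh.dirty && bh.disk != bh.mem;
}
```

In this model `replay_race()` returns true: the disk holds 'A', memory holds 'B', and nothing remains to tell the filesystem that the block needs writing again.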
Linus's fix is simple. When the virtual
memory subsystem decides that it is time to write a page, a new call to
set_page_dirty() is made. That ensures that all buffer heads
will be marked dirty at the time the filesystem's writepage()
method is called. That change ensures that all blocks of the page will be
written; testers have confirmed that it makes the file corruption problems
go away. The patch has gone into the mainline git repository; it should
show up in the next 2.6.19 stable update as well.
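The effect of the fix can be illustrated with a page-level model. This is a sketch of the idea as described above, not Linus's patch; the `redirty` switch and all of the `model_*` names are hypothetical, and a page is assumed to hold eight blocks.

```c
#include <stdbool.h>
#include <stddef.h>

#define BLOCKS_PER_PAGE 8

struct model_block {
    bool dirty;
    char mem;   /* contents in memory */
    char disk;  /* contents on disk */
};

struct model_page {
    bool dirty;
    struct model_block blocks[BLOCKS_PER_PAGE];
};

/* Mark the page and every one of its buffers dirty. */
void model_set_page_dirty(struct model_page *page)
{
    page->dirty = true;
    for (size_t i = 0; i < BLOCKS_PER_PAGE; i++)
        page->blocks[i].dirty = true;
}

/* Filesystem side: write only the buffers flagged dirty. */
int model_writepage(struct model_page *page)
{
    int written = 0;

    for (size_t i = 0; i < BLOCKS_PER_PAGE; i++) {
        struct model_block *b = &page->blocks[i];

        if (b->dirty) {
            b->disk = b->mem;
            b->dirty = false;
            written++;
        }
    }
    page->dirty = false;
    return written;
}

/* VM side: with redirty set, re-mark all buffers before writepage. */
int model_vm_write(struct model_page *page, bool redirty)
{
    if (redirty)
        model_set_page_dirty(page);
    return model_writepage(page);
}

/* Count blocks whose on-disk copy differs from memory. */
int model_stale_blocks(const struct model_page *page)
{
    int stale = 0;

    for (size_t i = 0; i < BLOCKS_PER_PAGE; i++)
        if (page->blocks[i].disk != page->blocks[i].mem)
            stale++;
    return stale;
}
```

Starting from a page with one modified block whose buffer was wrongly marked clean, `model_vm_write(page, false)` writes nothing and leaves the block stale, while `model_vm_write(page, true)` writes all eight blocks. The cost is some redundant I/O; the benefit is that no modified block can be skipped.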
The longer-term solution is to continue pushing buffer heads out of the
kernel's I/O paths. As Linus puts it:
The buffer head has been purely an "IO entity" for the last
several years now, and it's not a cache entity. Anybody who does writeback
by buffer heads is basically bypassing the real cache (the page cache),
and that's why all the problems happen.
I think ext3 is terminally crap by now. It still uses buffer heads in
places where it really really shouldn't, and as a result, things like
directory accesses are simply slower than they should be. Sadly, I don't
think ext4 is going to fix any of this, either.
Ted Ts'o responds that a fix for ext4 could
yet happen, but it involves other filesystems as well. The ext3 filesystem
is probably going to stay with buffer heads, however, meaning that the
kernel will have to continue to work with them indefinitely.
Finally, this story illustrates just how hard it can be to track down and
fix certain kinds of kernel bugs. Early in the process it was hard for the
interested developers to reproduce the problem, so they had to rely on the
initial reporters to try out various patches. Those reporters stuck with
the process, building and testing a lot of kernels before the
problem was flushed out. They deserve much of the credit for the
resolution of this problem.
Asynchronous I/O (AIO) operations have the property of not blocking in the
kernel. If an operation cannot be completed immediately, it is set in
motion and control returns to the calling application while things are
still in progress. This functionality allows a suitably-programmed
application to keep multiple operations going in parallel without blocking
on any of them.
While Linux has long offered a set of system calls for asynchronous I/O,
support within the kernel has been spotty and slow in coming. Most char
devices do not provide the necessary methods - generally because there is
no pressing need for them to support asynchronous operations. Networking
supports AIO reasonably well. At the block level, all I/O is asynchronous,
but that is not true when dealing with the virtual filesystem layer. Quite
a bit of work went into supporting asynchronous direct filesystem I/O,
making the big database vendors happy. But most applications do not use
direct I/O, and the system as a whole usually benefits from the use of
buffered I/O. So asynchronous buffered I/O support is arguably the biggest
Various buffered filesystem AIO patches have been posted over the course of
some three years, but none have made it into the kernel. Recently, Suparna
Bhattacharya has restarted this work with a new
file AIO patch which attempts to add this capability in the least
intrusive way possible. This work may now be simple enough that few will
be able to find things to object to.
Like previous versions of the patch, the current code adds a special wait
queue to each process's task structure. That queue is used for normal
synchronous operations, while asynchronous operations each have their own,
dedicated queue. The current wait queue is passed into filesystem I/O
operations which could block. That enables a couple of special tricks to be
performed:
- The I/O wait code checks to see if an asynchronous wait queue is
in use. If so, it simply returns -EIOCBRETRY rather than
waiting. This return code indicates that the operation is still in
progress; among other things, it is used to ensure that the wait queue
entry remains on the queue until the operation completes.
- Normally, wait queues wake up whatever process is waiting on them.
They are, however, rather more general than that. By changing the
wakeup function (see this LWN
article for information on how to do that), the AIO code can use
wait queues as a notification service. When a "wakeup"
happens on a queue being used for AIO, the kernel, rather than waking
up a process, starts up a workqueue with an entry that will take the
next step in the I/O operation.
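The second trick, a wait queue whose entries each carry their own wakeup function, can be modelled as a list of callbacks. This is a user-space sketch only; the kernel's wait queue structures differ in detail, and every `model_*` name below is hypothetical.

```c
#include <stddef.h>

struct model_wait_entry;
typedef void (*model_wake_fn)(struct model_wait_entry *entry);

struct model_wait_entry {
    model_wake_fn func;   /* what "waking up" means for this entry */
    void *private_data;   /* the task or work item to act upon */
    struct model_wait_entry *next;
};

struct model_wait_queue {
    struct model_wait_entry *head;
};

void model_add_wait(struct model_wait_queue *q, struct model_wait_entry *e)
{
    e->next = q->head;
    q->head = e;
}

/* Call each entry's wakeup function; returns the number called. */
int model_wake_up(struct model_wait_queue *q)
{
    int n = 0;

    for (struct model_wait_entry *e = q->head; e; e = e->next) {
        e->func(e);
        n++;
    }
    return n;
}

/* The usual behavior: make a sleeping process runnable. */
struct model_task { int runnable; };

void model_wake_task(struct model_wait_entry *e)
{
    ((struct model_task *)e->private_data)->runnable = 1;
}

/* The AIO behavior: queue work which will take the next I/O step. */
struct model_work { int pending; };

void model_wake_aio(struct model_wait_entry *e)
{
    ((struct model_work *)e->private_data)->pending = 1;
}
```

The code performing the wakeup does not need to know which kind of waiter it is dealing with; the entry's function pointer decides.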
The normal buffered filesystem read code, simplified almost into oblivion,
looks something like this:
    for each file page to be read
        get the page into the page cache
        copy the contents to the user buffer
The real code can be found in mm/filemap.c as
do_generic_mapping_read(), but the leading comment notes that
"this is really ugly." It is one of only three functions so
marked in that file, so, trust your editor, and go with the simple version
above.
In the pseudocode version, the place where things block is clearly the step
where the file page is read into the page cache. If the page is not
already cached, the kernel will have to set up a disk I/O operation and
wait for it to be carried out. That code proceeds the way it always did,
until it gets to the "wait" part, at which point the AIO wait queue will be
noticed and the code will return to whatever it was doing before. Once the
read completes, the special wakeup function associated with the AIO queue
will pick up where things left off.
One might well wonder just how that "pick up" part works. The wakeup
function will not be running in the process of the original calling
application, and may well not be running in process context at all. So it
queues up a workqueue function which will examine the state of the
outstanding I/O operation and, if necessary, jump back into the loop above
to continue the work. Before doing so, however, the workqueue function
carefully tweaks its memory management context so that it shares the
original application's address space. That tweak is necessary to make the
final line above (copy the page to the user buffer) work as expected. The
workqueue function will perform that copy, then proceed on to the next page
(if any). Likely as not, that next page will need to be read in from disk,
so the workqueue function will, after ensuring that the operation is
started, simply quit. This process repeats until all of the requested data
has been read, at which point the application can be notified that the
operation is complete.
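The stop-and-resume behavior can be sketched as a small state machine. This is a user-space model under several assumptions: the page cache is a boolean array, pages are four bytes long, `MODEL_EIOCBRETRY` is a placeholder value rather than the kernel's constant, and the address-space switch performed by the workqueue function is left out entirely. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4
#define MODEL_MAX_PAGES 8
#define MODEL_EIOCBRETRY 1000 /* placeholder, not the kernel's value */

struct model_file {
    const char *disk;             /* backing store contents */
    bool cached[MODEL_MAX_PAGES]; /* which pages are in the page cache */
    size_t npages;
};

struct model_iocb {
    struct model_file *file;
    char *user_buf;
    size_t next_page;   /* where the loop resumes on retry */
    size_t ios_started; /* number of disk reads set in motion */
};

/*
 * The read loop. Returns 0 when all pages have been copied, or
 * -MODEL_EIOCBRETRY after starting I/O on a page which is not cached.
 */
int model_aio_read(struct model_iocb *iocb)
{
    struct model_file *f = iocb->file;

    while (iocb->next_page < f->npages) {
        size_t pg = iocb->next_page;

        if (!f->cached[pg]) {
            iocb->ios_started++; /* start the read, but do not wait */
            return -MODEL_EIOCBRETRY;
        }
        memcpy(iocb->user_buf + pg * MODEL_PAGE_SIZE,
               f->disk + pg * MODEL_PAGE_SIZE, MODEL_PAGE_SIZE);
        iocb->next_page++;
    }
    return 0;
}

/*
 * Stand-in for I/O completion: the page arrives in the cache, then the
 * (modelled) workqueue function re-enters the loop where it left off.
 */
int model_io_complete(struct model_iocb *iocb)
{
    iocb->file->cached[iocb->next_page] = true;
    return model_aio_read(iocb);
}
```

The caller of `model_aio_read()` gets control back as soon as a page is found to be missing; each later call to `model_io_complete()` copies what it can and stops again at the next uncached page.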
On the write side, one might think that no changes are required - buffered
file writes are already asynchronous, with the flush to disk happening in
the background. The exception, however, is when O_SYNC is in
use. There are situations where applications want to know when the data
has found its way to the disk platter, but they still don't want to block
waiting for that to happen. A very similar approach is used to make
asynchronous O_SYNC writes work, though the patch is a little
larger. A couple of the low-level page writeback functions required
modifications so that they would pass the relevant wait queue around.
Even with this change in place, writes can still block on occasion. In
particular, any operation which requires allocating disk blocks for the
file may block while those allocations are performed. This issue can
probably be worked around, but that work has not yet been done.
The result of all this is a working asynchronous buffered file I/O
capability which makes almost no changes to (and adds little overhead to)
the "normal" synchronous code. If no serious objections are raised, the
Linux AIO subsystem might just become a little more complete in the near
future.
Page editor: Jonathan Corbet