
Asynchronous buffered file I/O

Asynchronous I/O (AIO) operations have the property of not blocking in the kernel. If an operation cannot be completed immediately, it is set in motion and control returns to the calling application while things are still in progress. This functionality allows a suitably-programmed application to keep multiple operations going in parallel without blocking on any of them.
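From the application's side, the programming model might look like the following minimal sketch. It uses the POSIX aio_* interface, which glibc implements with threads in user space rather than with the kernel facility discussed in this article, so it illustrates only the "start the operation, get control back, collect the result later" pattern. The function name read_file_async() and the spins counter are illustrative assumptions.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to len bytes from the start of path
 * using POSIX AIO. *spins counts how often control came back to us
 * while the read was still in flight. Returns the number of bytes
 * read, or -1 on error.
 */
ssize_t read_file_async(const char *path, char *buf, size_t len,
                        unsigned long *spins)
{
    struct aiocb cb;
    ssize_t n;
    int fd;

    assert(buf != NULL && spins != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    /* Set the read in motion; this call does not wait for the data. */
    if (aio_read(&cb) < 0) {
        close(fd);
        return -1;
    }

    /*
     * The application could do useful work here. This sketch merely
     * counts iterations; a real program would do something else or
     * block in aio_suspend() once it has nothing left to do.
     */
    *spins = 0;
    while (aio_error(&cb) == EINPROGRESS)
        (*spins)++;

    /* Collect the final status exactly once. */
    n = aio_return(&cb);
    close(fd);
    return n;
}
```

A caller passes a path, a buffer, and a counter, and gets back the byte count once the operation has finished. On older glibc versions the program may need to be linked with -lrt.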

While Linux has long offered a set of system calls for asynchronous I/O, support within the kernel has been spotty and slow in coming. Most char devices do not provide the necessary methods - generally because there is no pressing need for them to support asynchronous operations. Networking supports AIO reasonably well. At the block level, all I/O is asynchronous, but that is not true when dealing with the virtual filesystem layer. Quite a bit of work went into supporting asynchronous direct filesystem I/O, making the big database vendors happy. But most applications do not use direct I/O, and the system as a whole usually benefits from the use of buffered I/O. So asynchronous buffered I/O support is arguably the biggest remaining hole.

Various buffered filesystem AIO patches have been posted over the course of some three years, but none have made it into the kernel. Recently, Suparna Bhattacharya has restarted this work with a new file AIO patch which attempts to add this capability in the least intrusive way possible. This work may now be simple enough that few will be able to find things to object to.

Like previous versions of the patch, the current code adds a special wait queue to each process's task structure. That queue is used for normal synchronous operations, while asynchronous operations each have their own, dedicated queue. The current wait queue is passed into filesystem I/O operations which could block. That enables a couple of special tricks to be performed:

  • The I/O wait code checks to see if an asynchronous wait queue is in use. If so, it simply returns -EIOCBRETRY rather than waiting. This return code indicates that the operation is still in progress; among other things, it is used to ensure that the wait queue entry remains on the queue until the operation completes.

  • Normally, wait queues wake up whatever process is waiting on them. They are, however, rather more general than that. By changing the wakeup function (see this LWN article for information on how to do that), the AIO code can use wait queues as a notification service. When a "wakeup" happens on a queue being used for AIO, the kernel, rather than waking up a process, queues a workqueue entry that will take the next step in the I/O operation.
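The two tricks above can be sketched as a small user-space simulation. This is not the patch's code: every structure and function name below is hypothetical, the is_sync flag stands in for whatever test the real code applies, and the numeric value of EIOCBRETRY is an assumption taken from kernel headers of that era.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EIOCBRETRY 530  /* assumed value; kernel-internal error code */

struct wait_entry;
typedef void (*wake_fn)(struct wait_entry *we);

/* Hypothetical stand-in for a kernel wait queue entry. */
struct wait_entry {
    wake_fn func;             /* what a "wakeup" does for this entry */
    bool is_sync;             /* true for the per-task synchronous entry */
    bool queued;              /* currently on a wait queue? */
    bool slept;               /* sync only: a sleep would have happened */
    int woken;                /* sync only: wakeups delivered */
    int retries_queued;       /* AIO only: retry work items queued */
    struct wait_entry *next;
};

struct wait_queue {
    struct wait_entry *head;
};

/* Default wakeup: make the sleeping process runnable again. */
void wake_process(struct wait_entry *we)
{
    we->woken++;
}

/* AIO wakeup: queue work that takes the next step of the operation. */
void wake_queue_retry(struct wait_entry *we)
{
    we->retries_queued++;
}

/*
 * Wait for a page to become ready. An asynchronous entry is left on
 * the queue and the caller gets -EIOCBRETRY instead of sleeping.
 */
int wait_on_page(struct wait_queue *q, struct wait_entry *we,
                 const bool *page_ready)
{
    assert(q != NULL && we != NULL && page_ready != NULL);

    if (*page_ready)
        return 0;

    if (we->is_sync) {
        /*
         * A real kernel would queue the entry, sleep, and be woken by
         * wake_process(). A single-threaded simulation cannot sleep,
         * so it only records that a sleep would have happened.
         */
        we->slept = true;
        return 0;
    }

    if (!we->queued) {
        we->next = q->head;
        q->head = we;
        we->queued = true;
    }
    return -EIOCBRETRY;
}

/* Deliver a wakeup to every queued entry via its own function. */
void wake_up_all(struct wait_queue *q)
{
    struct wait_entry *we = q->head;

    q->head = NULL;
    while (we != NULL) {
        struct wait_entry *next = we->next;

        we->next = NULL;
        we->queued = false;
        we->func(we);
        we = next;
    }
}
```

The point of the design is that the waiting code needs no separate asynchronous path: the same wait call either sleeps or returns early, depending only on which entry it was handed.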

The normal buffered filesystem read code, simplified almost into oblivion, looks something like this:

    for each file page to be read
        get the page into the page cache
        copy the contents to the user buffer

The real code can be found in mm/filemap.c as do_generic_mapping_read(), but the leading comment notes that "this is really ugly." It is one of only three functions so marked in that file, so, trust your editor, and go with the simple version above.

In the pseudocode version, the place where things block is clearly the step where the file page is read into the page cache. If the page is not already cached, the kernel will have to set up a disk I/O operation and wait for it to be carried out. With the patch applied, that code proceeds the way it always did until it gets to the "wait" part; at that point the AIO wait queue will be noticed and, instead of sleeping, the code will return -EIOCBRETRY to its caller. Once the read completes, the special wakeup function associated with the AIO queue will pick up where things left off.

One might well wonder just how that "pick up" part works. The wakeup function will not be running in the process of the original calling application, and may well not be running in process context at all. So it queues up a workqueue function which will examine the state of the outstanding I/O operation and, if necessary, jump back into the loop above to continue the work. Before doing so, however, the workqueue function carefully tweaks its memory management context so that it shares the original application's address space. That tweak is necessary to make the final line above (copy the page to the user buffer) work as expected. The workqueue function will perform that copy, then proceed on to the next page (if any). Likely as not, that next page will need to be read in from disk, so the workqueue function will, after ensuring that the operation is started, simply quit. This process repeats until all of the requested data has been read, at which point the application can be notified that the operation is complete.
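The retry cycle described above can be modeled as a resumable loop. The sketch below is a user-space simulation with hypothetical names throughout (sim_file, sim_iocb, sim_read_retry() and friends); it leaves out the address-space switch and real workqueues, and the value of EIOCBRETRY is again an assumption. The essential idea is that the operation's state records how far the read has progressed, so each re-entry continues from the first page not yet copied.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define EIOCBRETRY 530  /* assumed value; kernel-internal error code */
#define PAGE_SZ 4       /* tiny "pages" keep the example readable */
#define NPAGES 3

/* Hypothetical file with a backing store and a page cache. */
struct sim_file {
    const char *disk;               /* NPAGES * PAGE_SZ bytes */
    char cache[NPAGES][PAGE_SZ];
    bool cached[NPAGES];
    int io_started;                 /* simulated disk reads submitted */
};

/* Hypothetical per-operation state that survives between retries. */
struct sim_iocb {
    struct sim_file *file;
    char *user_buf;
    int next_page;                  /* first page not yet copied */
    int npages;
};

/*
 * One pass through the read loop. Returns the total byte count when
 * every page has been copied, or -EIOCBRETRY after starting I/O on a
 * page that is not yet in the cache.
 */
int sim_read_retry(struct sim_iocb *iocb)
{
    while (iocb->next_page < iocb->npages) {
        int p = iocb->next_page;

        if (!iocb->file->cached[p]) {
            iocb->file->io_started++;   /* "submit" the disk read */
            return -EIOCBRETRY;         /* quit instead of waiting */
        }
        /* The real code needs the submitter's address space here. */
        memcpy(iocb->user_buf + p * PAGE_SZ, iocb->file->cache[p], PAGE_SZ);
        iocb->next_page++;
    }
    return iocb->npages * PAGE_SZ;
}

/* Simulated disk completion: fill the page the operation waits on. */
void sim_complete_io(struct sim_iocb *iocb)
{
    struct sim_file *f = iocb->file;
    int p = iocb->next_page;

    memcpy(f->cache[p], f->disk + p * PAGE_SZ, PAGE_SZ);
    f->cached[p] = true;
}

/* Drive the operation to completion, counting the retries needed. */
int sim_run(struct sim_iocb *iocb, int *retries)
{
    int ret;

    *retries = 0;
    while ((ret = sim_read_retry(iocb)) == -EIOCBRETRY) {
        sim_complete_io(iocb);  /* disk finishes, the wakeup fires */
        (*retries)++;           /* queued work re-enters the loop */
    }
    return ret;
}
```

In this model a page that is already cached costs no retry at all, which matches the article's description: the loop only quits when it actually has to wait for the disk.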

On the write side, one might think that no changes are required - buffered file writes are already asynchronous, with the flush to disk happening in the background. The exception, however, is when O_SYNC is in use. There are situations where applications want to know when the data has found its way to the disk platter, but they still don't want to block waiting for that to happen. A very similar approach is used to make asynchronous O_SYNC writes work, though the patch is a little larger. A couple of the low-level page writeback functions required modifications so that they would pass the relevant wait queue around.

Even with this change in place, writes can still block on occasion. In particular, any operation which requires allocating disk blocks for the file may block while those allocations are performed. This issue can probably be worked around, but that work has not yet been done.

The result of all this is a working asynchronous buffered file I/O capability which makes almost no changes to (and adds little overhead to) the "normal" synchronous code. If no serious objections are raised, the Linux AIO subsystem might just become a little more complete in the near future.



Asynchronous buffered file I/O

Posted Jan 5, 2007 10:32 UTC (Fri) by kleptog (subscriber, #1183)

Yay! It will be nice to finally get this feature.

Asynchronous buffered file I/O

Posted Jan 9, 2007 8:11 UTC (Tue) by ldo (subscriber, #40946)

Some operating systems from nearly thirty years ago were already providing this feature, in a very simple way: decouple I/O from process scheduling. This way, there is no distinction between "synchronous" and "asynchronous" I/O in the kernel at all--as far as the kernel is concerned, all I/O operations are asynchronous.

Instead, the distinction is implemented entirely in userspace. The synchronous versions of the I/O calls actually make two kernel calls: "request I/O operation" followed by "wait for completion". No need for special paths through the kernel for handling synchronous versus asynchronous kinds of I/O operations.

Asynchronous buffered file I/O

Posted Jan 9, 2007 16:53 UTC (Tue) by nix (subscriber, #2304)

The problem is that the vast majority of all I/O ops are synchronous, so needing multiple syscalls for them is unnecessary overhead. (You might think `who cares, the I/O will dominate', but it may not with e.g. fast network cards).

Plus, synchronous I/O was there first (I suspect this is the *real* reason why it gets syscalls of its own).

Asynchronous buffered file I/O

Posted Jan 10, 2007 5:57 UTC (Wed) by ldo (subscriber, #40946)

The extra overhead didn't matter for low-performance applications.

The high-performance applications tended to be written to use ASTs to report I/O completion.

Asynchronous buffered file I/O

Posted Jan 24, 2007 1:44 UTC (Wed) by schabi (subscriber, #14079)

You write:
"Networking supports AIO reasonably well."
That might be true for TCP, but AFAICS there's no API that lets us use UDP with asynchronous I/O. The problem is that, besides the actual data, address information has to be passed on send and receive, and the aio_* API does not allow that information to be passed.

Asynchronous buffered file I/O

Posted Dec 4, 2012 18:41 UTC (Tue) by VBart (subscriber, #88003)

What happened with the patches? It seems they didn't get into the kernel.

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds