Asynchronous buffered file I/O
While Linux has long offered a set of system calls for asynchronous I/O, support within the kernel has been spotty and slow in coming. Most char devices do not provide the necessary methods - generally because there is no pressing need for them to support asynchronous operations. Networking supports AIO reasonably well. At the block level, all I/O is asynchronous, but that is not true when dealing with the virtual filesystem layer. Quite a bit of work went into supporting asynchronous direct filesystem I/O, making the big database vendors happy. But most applications do not use direct I/O, and the system as a whole usually benefits from the use of buffered I/O. So asynchronous buffered I/O support is arguably the biggest remaining hole.
Various buffered filesystem AIO patches have been posted over the course of some three years, but none have made it into the kernel. Recently, Suparna Bhattacharya has restarted this work with a new file AIO patch which attempts to add this capability in the least intrusive way possible. This work may now be simple enough that few will be able to find things to object to.
Like previous versions of the patch, the current code adds a special wait queue to each process's task structure. That queue is used for normal synchronous operations, while asynchronous operations each have their own, dedicated queue. The current wait queue is passed into filesystem I/O operations which could block. That enables a couple of special tricks to be performed:
- The I/O wait code checks to see if an asynchronous wait queue is
in use. If so, it simply returns -EIOCBRETRY rather than
waiting. This return code indicates that the operation is still in
progress; among other things, it is used to ensure that the wait queue
entry remains on the queue until the operation completes.
- Normally, wait queues wake up whatever process is waiting on them. They are, however, rather more general than that. By changing the wakeup function (see this LWN article for information on how to do that), the AIO code can use wait queues as notification service. When a "wakeup" happens on a queue being used for AIO, the kernel, rather than waking up a process, starts up a workqueue with an entry that will take the next step in the I/O operation.
The normal buffered filesystem read code, simplified almost into oblivion, looks something like this:
for each file page to be read
get the page into the page cache
copy the contents to the user buffer
The real code can be found in mm/filemap.c as
do_generic_mapping_read(), but the leading comment notes that
"this is really ugly
". It is one of only three functions so
marked in that file, so, trust your editor, and go with the simple version
above.
In the pseudocode version, the place where things block is clearly the step where the file page is read into the page cache. If the page is not already cached, the kernel will have to set up a disk I/O operation and wait for it to be carried out. That code proceeds the way it always did, until it gets to the "wait" part, at which point the AIO wait queue will be noticed and the code will return to whatever it was doing before. Once the read completes, the special wakeup function associated with the AIO queue will pick up where things left off.
One might well wonder just how that "pick up" part works. The wakeup function will not be running in the process of the original calling application, and may well not be running in process context at all. So it queues up a workqueue function which will examine the state of the outstanding I/O operation and, if necessary, jump back into the loop above to continue the work. Before doing so, however, the workqueue function carefully tweaks its memory management context so that it shares the original application's address space. That tweak is necessary to make the final line above (copy the page to the user buffer) work as expected. The workqueue function will perform that copy, then proceed on to the next page (if any). Likely as not, that next page will need to be read in from disk, so the workqueue function will, after ensuring that the operation is started, simply quit. This process repeats until all of the requested data has been read, at which point the application can be notified that the operation is complete.
On the write side, one might think that no changes are required - buffered file writes are already asynchronous, with the flush to disk happening in the background. The exception, however, is when O_SYNC is in use. There are situations where applications want to know when the data has found its way to the disk platter, but they still don't want to block waiting for that to happen. A very similar approach is used to make asynchronous O_SYNC writes work, though the patch is a little larger. A couple of the low-level page writeback functions required modifications so that they would pass the relevant wait queue around.
Even with this change in place, writes can still block on occasion. In particular, any operation which requires allocating disk blocks for the file may block while those allocations are performed. This issue can probably be worked around, but that work has not yet been done.
The result of all this is a working asynchronous buffered file I/O
capability which makes almost no changes to (and adds little overhead to)
the "normal" synchronous code. If no serious objections are raised, the
Linux AIO subsystem might just become a little more complete in the near
future.
| Index entries for this article | |
|---|---|
| Kernel | Asynchronous I/O |
