A retry-based AIO infrastructure
Part of the problem is that buffered file I/O integrates deeply with the page cache and virtual memory subsystem. It is not all that easy to graft asynchronous I/O operations into those complex bodies of code. So the kernel developers have, for the most part, simply punted on cases like that.
Suparna Bhattacharya, however, has not given up so easily. For over a year now, she has been working on a set of patches which bring asynchronous operation to the buffered I/O realm. A new set of patches has recently been posted which trims down the buffered AIO changes to the bare minimum. So this seems like a good time to take a look at what is involved in making asynchronous buffered I/O work.
The architecture implemented by these patches is based on retries. When an asynchronous file operation is requested, the code gets things started and goes as far as it can until something would block; at that point it makes a note and returns to the caller. Later, when the roadblock has been taken care of, the operation is retried until the next blocking point is hit. Eventually, all the work gets done and user space can be notified that the requested operation is complete. The initial work is done in the context of the process which first requested the operation; the retries are handled out of a workqueue.
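The retry driver itself can be sketched in a few lines of C. What follows is illustrative pseudo-code rather than the patch itself: the ki_retry() method and aio_complete() are real parts of the 2.6 AIO layer, but run_iocb() is an invented name and the real logic in fs/aio.c is more involved.

    /* Illustrative sketch of the retry model. */
    static ssize_t run_iocb(struct kiocb *iocb)
    {
        /* Make as much progress as possible without blocking. */
        ssize_t ret = iocb->ki_retry(iocb);

        /* -EIOCBRETRY: the operation hit a blocking point and has
         * queued itself for a wakeup; a later retry will continue. */
        if (ret == -EIOCBRETRY)
            return ret;

        /* Done (or failed outright); notify user space. */
        aio_complete(iocb, ret, 0);
        return ret;
    }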
For things to work in this mode, kernel code in the buffered I/O path must be taught not to block when it is working on an asynchronous request. The first step in this direction is the concept of an asynchronous wait queue entry. Wait queue entries are generally used, unsurprisingly, for waiting; they include a pointer to the process which is to be awakened when the wait is complete. With the AIO retry patch, a wait queue entry which has a NULL process pointer is taken to mean that actually waiting is not desired. When this type of wait queue entry is encountered, functions like prepare_to_wait() will not put the process into a sleeping state (though they do still add the wait queue entry to the associated wait queue), and some functions will return the new error code -EIOCBRETRY rather than actually sleeping.
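In condensed form (with locking details trimmed), the test looks something like this; is_sync_wait() is the macro the patch introduces for the NULL-pointer check:

    /* A NULL task pointer marks a wait queue entry as asynchronous. */
    #define is_sync_wait(wait)  (!(wait) || ((wait)->task))

    void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
    {
        unsigned long flags;

        spin_lock_irqsave(&q->lock, flags);
        if (list_empty(&wait->task_list))
            __add_wait_queue(q, wait);
        /* Queue an asynchronous entry, but do not go to sleep for it. */
        if (is_sync_wait(wait))
            set_current_state(state);
        spin_unlock_irqrestore(&q->lock, flags);
    }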
The next step is to add a new io_wait entry to the task structure. When AIO retries are being performed, that entry is pointed to an asynchronous wait queue entry associated with the specific AIO request. This task structure field is, for all practical purposes, being used in a hackish manner to pass the wait queue entry into functions deep inside the virtual memory subsystem. It might have been clearer to pass it explicitly as a parameter, but that would require changing large numbers of internal interfaces to support a rarely-used functionality. The io_wait solution is arguably less clean, but it also makes for a far less invasive patch. It does mean, however, that work can only proceed on a single AIO request at a time.
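In simplified form, the dance around each retry looks about like this; aio_retry_iocb() is an invented name, while io_wait, ki_wait, and ki_retry() come from the patch:

    /* Simplified: run one retry on behalf of an iocb. */
    static ssize_t aio_retry_iocb(struct kiocb *iocb)
    {
        ssize_t ret;

        /* Tell the deeper layers which wait queue entry to use. */
        current->io_wait = &iocb->ki_wait;
        ret = iocb->ki_retry(iocb);
        current->io_wait = NULL;
        return ret;
    }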
Finally, a few low-level functions have been patched to note the existence of a special wait queue entry in the io_wait field and to use it instead of the local entry that would normally have been used. In particular, page cache functions like wait_on_page_locked() and wait_on_page_writeback() have been modified in this way. These functions are normally used to wait until file I/O has been completed on a page; they are the point where buffered I/O often blocks. When AIO is being performed, however, they will instead return the -EIOCBRETRY error code immediately.
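A condensed sketch of the patched page-wait logic might look like the following; the function name is invented and many details have been elided:

    static int wait_on_page_bit_async(struct page *page, int bit_nr)
    {
        wait_queue_head_t *wqh = page_waitqueue(page);
        DEFINE_WAIT(local_wait);
        /* Prefer the AIO entry planted in the task structure, if any. */
        wait_queue_t *wait = current->io_wait ? current->io_wait : &local_wait;

        do {
            prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
            if (test_bit(bit_nr, &page->flags)) {
                if (!is_sync_wait(wait))
                    return -EIOCBRETRY;  /* do not block; retry later */
                io_schedule();
            }
        } while (test_bit(bit_nr, &page->flags));
        finish_wait(wqh, wait);
        return 0;
    }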
The AIO code also takes advantage of the fact that wait queue entries, in 2.6, contain a pointer to the function to be called to wake up the waiting process. With an asynchronous request, there may be no such process; instead, the kernel needs to attempt the next retry. So the AIO code sets up its own wakeup function which does not actually wake any processes, but which does restart the relevant I/O request.
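The resulting callback is short; this version is modeled on the aio_wake_function() that eventually appeared in fs/aio.c, with kick_iocb() queueing the next retry:

    /* Invoked in place of a normal wakeup; restarts the iocb instead. */
    static int aio_wake_function(wait_queue_t *wait, unsigned mode,
                                 int sync, void *key)
    {
        struct kiocb *iocb = container_of(wait, struct kiocb, ki_wait);

        /* Take the entry off the wait queue... */
        list_del_init(&wait->task_list);
        /* ...and queue the next retry rather than waking a process. */
        kick_iocb(iocb);
        return 1;
    }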
Once that structure is in place, all that's left is a bit of housekeeping code to keep track of the status of the request between retries. This work is done entirely within the AIO layer; as each piece of the request is satisfied, the request itself as seen by the filesystem layer is modified to take that into account. When the operation is retried to transfer the next chunk of data, it looks like a new request with the already-done portion removed.
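In terms of the kiocb fields, that bookkeeping boils down to a few adjustments after each partial transfer; advance_iocb() is an invented helper, but ki_buf, ki_left, and ki_pos are real 2.6 kiocb fields:

    /* Illustrative: after a retry moves "done" bytes, shrink the
     * request so the next retry looks like a fresh, smaller one. */
    static void advance_iocb(struct kiocb *iocb, ssize_t done)
    {
        iocb->ki_buf  += done;   /* next byte of the user buffer */
        iocb->ki_left -= done;   /* bytes still to be transferred */
        iocb->ki_pos  += done;   /* updated file position */
    }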
Add in a few other hacks (telling the readahead code about the entire AIO request, for example, and an AIO implementation for pipes) and the patch set is complete. It does not attempt to fix every spot which might block (that would be a large task), but it should take care of the most important ones.
A retry-based AIO infrastructure
Posted Mar 8, 2004 13:55 UTC (Mon) by mwilck (subscriber, #1966)

Hmm... and I thought the point of AIO was to be able to just fire off dozens of IO requests and not bother until some of them signal completion. What am I missing here?

One at a time
Posted Mar 8, 2004 15:01 UTC (Mon) by corbet (editor, #1)

> It does mean, however, that work can only proceed on a single AIO request at a time.

Perhaps I didn't express that quite as well as I could have. What it means is that the kernel can be actively working on only one request per process at a time. There can be several requests with I/O outstanding, but, once the CPU's attention is required, only one at a time can be worked on, even on multiprocessor systems.

One at a time
Posted Mar 8, 2004 20:23 UTC (Mon) by mwilck (subscriber, #1966)

I figured that. Yet the term "asynchronous" suggests to me that the requests should be progressing independently - not the kind of serialized behavior that you describe. What if the current request is progressing slowly (think of a floppy) and others in the queue never get worked on until it's completed?

My impression is that it should have been the other way around: instead of implementing AIO on top of buffered IO, asynchronous requests should be the basic IO primitive and all other IO should be implemented on top of that. I can't foresee what that'd imply for the page cache, though.

One at a time
Posted Mar 9, 2004 8:08 UTC (Tue) by larryr (guest, #4030)

I think the thread that is handling access to a floppy is going to be asleep almost all the time, so there will be plenty of time for other threads to wake up, do some stuff, and either finish or go back to sleep themselves.

Larry

AIO as replacement for multithreading?
Posted Mar 9, 2004 21:47 UTC (Tue) by mwilck (subscriber, #1966)

What are you referring to as a "thread"? An AIO request? How can other AIO requests "wake up, do some stuff" if the kernel is still handling the floppy request which blocks the single AIO entry for the process?

I guess I must read the code myself. I may have put my concern in the wrong words, though. One idea I have about AIO is that you can have an application behave like a multithreaded application with a single thread. Instead of creating threads for certain IO tasks, you just fire off AIO requests and they _proceed simultaneously_, as if they were driven by different threads. I wonder how that would be possible with the serialized approach described here. Perhaps the whole notion is wrong, though?

AIO as replacement for multithreading?
Posted Mar 11, 2004 17:34 UTC (Thu) by larryr (guest, #4030)

Most of the (wall clock) time the kernel is handling the floppy request is going to be spent sleeping (waiting for the floppy device), and while that request is sleeping another request can proceed until it either completes or has to wait for a device (sleep).

Larry

One at a time
Posted Apr 29, 2004 11:29 UTC (Thu) by suparna (guest, #7766)

> How can other AIO requests "wake up, do some stuff" if the kernel is still handling the floppy request which blocks the single AIO entry for the process?

I'm not sure I read this correctly, but I don't think we have that kind of a limitation in the code. A workqueue thread's tsk->io_wait pointer is set to the address of the ki_wait field inside the iocb that it is handling at that particular time. There is no reason why a worker thread on another CPU cannot process another iocb, by setting its own tsk->io_wait pointer to point to another iocb's ki_wait. The point to notice here is that, in these situations, "tsk" refers to the task which is processing the iocb at a given time (not the task which originally issued the IO), and the io_wait pointer reflects the wait context of the iocb on whose behalf, so to speak, the code is being executed.

That said, it may sometimes be more efficient not to have worker threads on multiple CPUs trying to process iocbs for the same ioctx at the same time (it reduces spinlock bouncing on the ioctx lock, as observed by Chris Mason). Hope that clarifies!