
Asynchronous block loop I/O

By Jonathan Corbet
January 30, 2013
The kernel's block loop driver has a conceptually simple job: take a file located in a filesystem somewhere and present it as a block device that can contain a filesystem of its own. It can be used to manipulate filesystem images; it is also useful for the management of filesystems for virtualized guests. Despite having had some optimization effort applied to it, the loop driver in current kernels is not as fast as some would like it to be. But that situation may be about to change, thanks to an old patch set that has been revived and prepared for merging in a near-future development cycle.

As a block driver, the loop driver accepts I/O requests described by struct bio (or "BIO") structures; it then maps each request to a suitable block offset in the file serving as backing store and issues I/O requests to perform the desired operations on that file. Each loop device has its own thread, which, at its core, runs a loop like this:

    while (1) {
	wait_for_work();
	bio = dequeue_a_request();
	execute_request(bio);
    }

(The actual code can be seen in drivers/block/loop.c.) This code certainly works, but it has an important shortcoming: it performs I/O in a synchronous, single-threaded manner. Block I/O is normally done asynchronously when possible; write operations, in particular, can be done in parallel with other work. In the loop above, though, a single, slow read operation can hold up many other requests, and there is no ability for the block layer or the I/O device itself to optimize the ordering of requests. As a result, the performance of loop I/O traffic is not what it could be.

In 2009, Zach Brown set out to fix this problem by changing the loop driver to execute multiple, asynchronous requests at the same time. That work fell by the wayside when other priorities took over Zach's time, so his patches were never merged. More recently, Dave Kleikamp has taken over this patch set, ported it to current kernels, and added support to more filesystems. As a result, this patch set may be getting close to being ready to go into the mainline.

At the highest level, the goal of this patch set is to use the kernel's existing asynchronous I/O (AIO) mechanism in the loop driver. Getting there takes a surprising amount of work, though; the AIO subsystem was written to manage user-space requests and is not an easy fit for kernel-generated operations. To make these subsystems work together, the 30-part patch set takes a bottom-up approach to the problem.

The AIO code is based around a couple of structures, one of which is struct iovec:

    struct iovec {
	void __user *iov_base;
	__kernel_size_t iov_len;
    };

This structure is used by user-space programs to describe a segment of an I/O operation; it is part of the user-space API and cannot be changed. Associated with this structure is the internal iov_iter structure:

    struct iov_iter {
	const struct iovec *iov;
	unsigned long nr_segs;
	size_t iov_offset;
	size_t count;
    };

This structure (defined in <linux/fs.h>) is used by the kernel to track its progress as it works through an array of iovec structures.
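
As a rough illustration of how it is used today, generic file I/O code wraps the caller's iovec array in an iterator and advances it as data is copied. The sketch below relies on the existing helpers declared in <linux/fs.h>; copy_one_chunk() is a hypothetical stand-in for whatever copy routine the caller actually uses (for example, iov_iter_copy_from_user_atomic() on the buffered write path):

    struct iov_iter i;

    /* 'iov', 'nr_segs', and 'count' describe the caller's request */
    iov_iter_init(&i, iov, nr_segs, count, 0);

    while (iov_iter_count(&i)) {
	/* Copy some data to or from the current segment... */
	size_t copied = copy_one_chunk(&i);

	/* ...then move the iterator past what was consumed */
	iov_iter_advance(&i, copied);
    }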

Any kernel code submitting asynchronous I/O must express it in terms of these structures. The problem, from the perspective of the loop driver, is that struct iovec deals in user-space addresses, while the BIO structures representing block I/O operations deal in physical memory, in the form of struct page pointers. That impedance mismatch between the two subsystems makes AIO unusable for the loop driver as it stands.
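
The mismatch is easy to see in the data structures themselves: where an iovec holds a pointer into a process's address space, each segment of a BIO is described by a struct bio_vec naming a physical page:

    struct bio_vec {
	struct page	*bv_page;	/* the page holding the data */
	unsigned int	bv_len;		/* number of bytes in this segment */
	unsigned int	bv_offset;	/* starting offset within the page */
    };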

Fixing that involves changing the way struct iov_iter works. The iov pointer becomes a generic pointer called data that can point to an array of iovec structures (as before) or, instead, an array of kernel-supplied BIO structures. Direct access to structure members by kernel code is discouraged in favor of a set of defined accessor operations; the iov_iter structure itself gains a pointer to an operations structure that can be changed depending on whether iovec or bio structures are in use. The end result is an enhanced iov_iter structure and surrounding support code that allows AIO operations to be expressed in either user-space (struct iovec) or kernel-space (struct bio) terms. Quite a bit of code using this structure must be adapted to use the new accessor functions; at the higher levels, code that worked directly with iovec structures is changed to work with the iov_iter interface instead.
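
The details differ in the patches themselves, but, based on the description above, the reworked structure can be pictured roughly like this; the field and type names here are illustrative rather than copied from the patch set:

    struct iov_iter {
	const struct iov_iter_ops *ops;	/* iovec- or BIO-specific methods */
	unsigned long data;		/* points to an iovec array or a BIO */
	unsigned long nr_segs;
	size_t iov_offset;
	size_t count;
    };

Callers then go through accessors such as iov_iter_count() and iov_iter_advance(), which dispatch through ops as needed, rather than reaching into the structure directly.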

The next step is to make it possible to pass iov_iter structures directly into filesystem code. That is done by adding two more functions to the (already large) file_operations structure:

    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, loff_t);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, loff_t);

These functions are meant to work much like the existing aio_read() and aio_write() functions, except that they work with iov_iter structures rather than with iovec structures directly. A filesystem supporting the new operations must be able to cope with I/O requests expressed directly in BIO structures — usually just a matter of bypassing the page-locking and mapping operations required for user-space addresses. If these new operations are provided, the aio_*() functions will never be called and can be removed.
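
For a filesystem that relies on the generic code, hooking up the new operations might look something like the fragment below; the filesystem name and the generic helper names are assumptions for illustration, not taken directly from the patches:

    const struct file_operations myfs_file_operations = {
	.read		= do_sync_read,
	.write		= do_sync_write,
	.read_iter	= generic_file_read_iter,	/* replaces .aio_read */
	.write_iter	= generic_file_write_iter,	/* replaces .aio_write */
	/* ... mmap, fsync, and friends as before ... */
    };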

After that, the patch set adds a new interface to make it easy for kernel code to submit asynchronous I/O operations. In short, it's a matter of allocating an I/O control block with:

    struct kiocb *aio_kernel_alloc(gfp_t gfp);

That block is filled in with the relevant information describing the desired operation and a pointer to a completion callback, then handed off to the AIO subsystem with:

    int aio_kernel_submit(struct kiocb *iocb);

Once the operation is complete, the completion function is called to inform the submitter of the final status.
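
Putting the pieces together, a kernel-side submission (as the loop driver would perform for each incoming request) might look like the sketch below. Only aio_kernel_alloc() and aio_kernel_submit() are named above; the helpers used to fill in the control block and the shape of the completion callback are assumptions for illustration:

    static void my_io_complete(u64 data, long res)
    {
	/* 'res' is the byte count on success or a negative error code */
    }

    static int submit_async_write(struct file *file, struct iov_iter *iter,
				  loff_t pos)
    {
	struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO);

	if (!iocb)
		return -ENOMEM;

	/* Hypothetical helpers: describe the operation and its callback */
	aio_kernel_init_rw(iocb, file, iov_iter_count(iter), pos);
	aio_kernel_init_callback(iocb, my_io_complete, 0);

	/* Hand the request to the AIO subsystem; completion is asynchronous */
	return aio_kernel_submit(iocb);
    }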

A substantial portion of the patch set is dedicated to converting filesystems to provide read_iter() and write_iter() functions. In most cases the patches are relatively small; most of the real work is done in generic code, so it is mostly a matter of changing declared types and making use of some of the new iov_iter accessor functions. See the ext4 patch for an example of what needs to be done.

With all that infrastructural work done, actually speeding up the loop driver becomes straightforward. If the backing store for a given loop device implements the new operations, the loop driver will use aio_kernel_submit() for each incoming I/O request. As a result, requests can be run in parallel with, one hopes, a significant improvement in performance.

The patch set has been through several rounds of review, and most of the concerns raised would appear to have been addressed. Dave is now asking that it be included in the linux-next tree, suggesting that he intends to push it into the mainline during the 3.9 or 3.10 development cycle. Quite a bit of kernel code will be changed in the process, but almost no differences should be visible from user space — except that block loop devices will run a lot faster than they used to.

Index entries for this article
Kernel: Block layer/Loopback device
Kernel: Loopback device



Asynchronous block loop I/O

Posted Jan 31, 2013 6:22 UTC (Thu) by alonz (subscriber, #815)

The end result is an enhanced iov_iter structure and surrounding support code that allows AIO operations to be expressed in either user-space (struct iovec) or kernel-space (struct bio) terms.

Wouldn't it be better to just have the new functions (read_iter and write_iter) always take struct bio-based structures, and convert the struct iovecs to these structures in the generic aio_* functions?

This would result in a far cleaner abstraction, IMO…

Asynchronous block loop I/O

Posted Feb 2, 2013 16:21 UTC (Sat) by butlerm (subscriber, #13312)

This is great! I would just like to cheer on the folks willing to take this on. It is a much needed improvement.

Asynchronous block loop I/O

Posted Feb 2, 2013 18:33 UTC (Sat) by nix (subscriber, #2304) (4 responses)

the AIO subsystem was written to manage user-space requests
... a task for which it is almost unused. glibc doesn't use it to implement the POSIX aio functionality. Is QEMU its only user?

Asynchronous block loop I/O

Posted Feb 2, 2013 21:31 UTC (Sat) by raven667 (subscriber, #5198) (3 responses)

I thought Oracle was the big user of AIO

Asynchronous block loop I/O

Posted Feb 3, 2013 0:41 UTC (Sun) by nix (subscriber, #2304) (2 responses)

Oh yes, the database. So the whole point of this aio infrastructure is... to seduce Oracle away from the evil of raw partitions? Is that *all*?

It seems odd that glibc isn't using it to implement the user-side aio calls, is all. What's missing?

Asynchronous block loop I/O

Posted Feb 4, 2013 16:40 UTC (Mon) by raven667 (subscriber, #5198)

Well, Linus doesn't merge major interfaces like that for just one vendor, they must be generally useful. You could make the same kinds of arguments around btrfs or ocfs2 or the clustering lock manager or other features which were implemented by and for Oracle products primarily.

I don't know enough about AIO to comment intelligently about why it's not exposed by glibc.

Asynchronous block loop I/O

Posted Feb 8, 2013 7:01 UTC (Fri) by joib (subscriber, #8541)

IIRC the kernel AIO interface requires files to be opened with O_DIRECT, as well as page (or block?)-aligned I/O. So it's not really a general-purpose interface.

There have been a number of people working on "buffered AIO" over the years, but so far nothing has been merged.

