
Page-based direct I/O

By Jonathan Corbet
August 25, 2009
An "address space" in kernel jargon is a mapping between a range of addresses and their representation in an underlying filesystem or device. There is an address space associated with every open file; any given address space may or may not be tied to a virtual memory area in a process's virtual (memory) address space. In a typical process, a number of address spaces will exist for mappings of the executable being run, files the process has open, and ranges of anonymous user memory (which use swap as their backing store). Processes can operate on their address spaces in a number of ways, one of the stranger of which is direct I/O. A new patch series from Jens Axboe looks to rationalize the direct I/O path a bit, making it more flexible in the process.

The idea behind direct I/O is that data blocks move directly between the storage device and user-space memory without going through the page cache. Developers use direct I/O for either (or both) of two reasons: (1) they believe they can manage caching of file contents better than the kernel can, or (2) they want to avoid overflowing the page cache with data which is unlikely to be of use in the near future. It is a relatively little-used feature which is often combined with another obscure kernel capability: asynchronous I/O. The biggest consumers, by far, of this functionality are large relational database systems, so it is not entirely surprising that a developer currently employed by Oracle is working in this area.
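From user space, direct I/O is requested by opening a file with the O_DIRECT flag and then issuing reads and writes on suitably aligned buffers. What follows is a minimal sketch of that pattern, not code from the patch series; the names dio_alloc(), dio_read(), and DIO_ALIGN are invented for illustration, and the 4096-byte alignment is a conservative assumption (the real requirement depends on the filesystem and device).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed alignment for buffer address, length, and file offset. */
#define DIO_ALIGN 4096

/* Allocate a buffer whose address is a multiple of DIO_ALIGN.
 * Returns NULL on failure; release with free(). */
void *dio_alloc(size_t len)
{
    void *buf = NULL;

    if (posix_memalign(&buf, DIO_ALIGN, len) != 0)
        return NULL;
    return buf;
}

/* Read len bytes at the given offset, bypassing the page cache.
 * Returns the number of bytes read, or -1 on error. */
ssize_t dio_read(const char *path, void *buf, size_t len, off_t offset)
{
    int fd;
    ssize_t n;

    /* Direct I/O generally rejects unaligned requests. */
    assert((uintptr_t)buf % DIO_ALIGN == 0);
    assert(len % DIO_ALIGN == 0 && offset % DIO_ALIGN == 0);

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    n = pread(fd, buf, len, offset);
    close(fd);
    return n;
}
```

A caller would obtain a buffer with dio_alloc(), pass it to dio_read(), and free() it afterward. Some filesystems do not support O_DIRECT at all, in which case open() fails with EINVAL and the sketch simply returns -1.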

When the kernel needs to do something with an address space, it usually looks into the associated address_space_operations structure for an appropriate function. So, for example, normal file I/O is handled with:

    int (*writepage)(struct page *page, struct writeback_control *wbc);
    int (*readpage)(struct file *filp, struct page *page);

As with the bulk of low-level, memory-oriented kernel operations, these functions operate on page structures. When memory is managed at this level, there is little need to worry about whether it is user-space or kernel memory, or whether it is in the high-memory zone. It's all just memory. The function which handles direct I/O looks a little different, though:
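The lookup-and-call pattern described above can be modeled in user space as a table of function pointers hanging off the mapping. This is only an illustrative sketch: every model_ and demo_ name here is hypothetical, the structures are drastically simplified stand-ins for the kernel's own, and the only detail borrowed from the kernel is that the operations table is reached through a field called a_ops.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct page. */
struct model_page {
    unsigned long index;
    int dirty;
    int uptodate;
};

/* Simplified stand-in for struct address_space_operations. */
struct model_aops {
    int (*writepage)(struct model_page *page);
    int (*readpage)(struct model_page *page);
};

/* Simplified stand-in for struct address_space. */
struct model_address_space {
    const struct model_aops *a_ops;
};

/* A pretend filesystem's implementations. */
int demo_readpage(struct model_page *page)
{
    page->uptodate = 1;     /* pretend the data arrived from disk */
    return 0;
}

int demo_writepage(struct model_page *page)
{
    page->dirty = 0;        /* pretend the data reached the disk */
    return 0;
}

const struct model_aops demo_aops = {
    .writepage = demo_writepage,
    .readpage  = demo_readpage,
};

/* Generic code: find the operation in the table and call it. */
int model_read(struct model_address_space *mapping, struct model_page *page)
{
    assert(mapping != NULL && mapping->a_ops != NULL);
    if (mapping->a_ops->readpage == NULL)
        return -1;          /* operation not provided */
    return mapping->a_ops->readpage(page);
}

int model_write(struct model_address_space *mapping, struct model_page *page)
{
    assert(mapping != NULL && mapping->a_ops != NULL);
    if (mapping->a_ops->writepage == NULL)
        return -1;
    return mapping->a_ops->writepage(page);
}
```

Note that the generic functions never ask where the page's memory lives; they only hand a page to whichever implementation the table supplies.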

    ssize_t (*direct_IO)(int rw, struct kiocb *iocb, const struct iovec *iov,
                         loff_t offset, unsigned long nr_segs);

The use of the kiocb structure shows the assumption that direct I/O will be submitted through the asynchronous I/O path. Beyond that, though, the iovec structure pointing to the buffers to be transferred comes directly from user space, and it contains user-space addresses. That, in turn, implies that the direct_IO() function must itself deal with the process of getting access to the user-space buffers. That task is generally handled in VFS-layer generic code, but there's another problem: the direct_IO() function cannot be called on kernel memory.

The kernel does not normally need to use the direct I/O paths itself, but there is one exception: the loopback driver. This driver allows an ordinary file to be mounted as if it were a block device; it can be most useful for accessing filesystem images stored within disk files. But files accessed via a loopback mount may well be represented in the page cache twice: once on each side of the loopback mount. The result is a waste of memory which could probably be put to better uses.

It would, in summary, be nice to change the direct_IO() interface to avoid this memory waste, and to make it a little bit more consistent with the other address space operations. That is what Jens's patch does. With that patch, the interface becomes:

    struct dio_args {
        int rw;
        struct page **pages;
        unsigned int first_page_off;
        unsigned long nr_segs;
        unsigned long length;
        loff_t offset;

        /*
         * Original user pointer, we'll get rid of this
         */
        unsigned long user_addr;
    };

    ssize_t (*direct_IO)(struct kiocb *iocb, struct dio_args *args);

In the new API, many of the relevant parameters have been grouped into the dio_args structure. The memory to be transferred can be found by way of the pages array. The higher-level VFS direct I/O code now handles the task of mapping user-space buffers and creating the pages array.
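To give a feel for the bookkeeping involved in turning a user-space buffer into a pages array, here is a small user-space sketch of the arithmetic. It is not taken from the patch: the model_ names are hypothetical, a 4096-byte page size is assumed, and first_page_off is assumed (from its name) to be the byte offset of the data within the first page.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size; the kernel uses PAGE_SIZE for the running architecture. */
#define MODEL_PAGE_SIZE 4096UL

/* The subset of dio_args-style fields that can be derived from a buffer. */
struct model_dio_span {
    unsigned long first_page_off;   /* offset of the data in the first page */
    unsigned long length;           /* total bytes to transfer */
    unsigned long nr_pages;         /* entries needed in the pages array */
};

/* Describe the pages spanned by [user_addr, user_addr + length). */
struct model_dio_span model_span(unsigned long user_addr, unsigned long length)
{
    struct model_dio_span s;

    assert(length > 0);
    s.first_page_off = user_addr % MODEL_PAGE_SIZE;
    s.length = length;
    /* Round up: a buffer that starts mid-page may spill into one more page. */
    s.nr_pages = (s.first_page_off + length + MODEL_PAGE_SIZE - 1)
                 / MODEL_PAGE_SIZE;
    return s;
}
```

For example, an 8192-byte buffer starting 512 bytes into a page touches three pages rather than two, which is why the offset within the first page has to be carried along with the array.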

The impact on the code is, for the most part, small; it's mostly a matter of moving the location where the translation from user-space address to page structures is done. The current code does have a potential problem in that it only processes one I/O segment at a time, possibly creating performance problems for some kinds of applications. That mode of operation is not really wired into the system, though, and can presumably be fixed at some point.

The only other objection came from Andrew Morton, who does not like the way Jens implemented the process of working through the array of page structures. The index into this array (called head_page) is built into struct dio and hidden from the code which is actually working through the pages; that leads to potential confusion, especially if the operation aborts partway through. Andrew called it "a disaster waiting to happen" and recommended that indexing be made explicit where the pages array is processed.
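One way to read that recommendation, sketched here in user-space C, is to have the loop that walks the pages array take its index explicitly, so that the caller can see exactly how far processing got if it stops early. This is purely an illustration of the idea and not code from the patch or from Andrew's review; all of the model_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page handler: returns 0 on success, nonzero to abort. */
typedef int (*model_page_fn)(void *page, void *ctx);

/* Walk pages[*next .. nr_pages), advancing *next only after each success.
 * On an abort, *next still names the page that failed. */
int model_process_pages(void **pages, size_t nr_pages, size_t *next,
                        model_page_fn fn, void *ctx)
{
    assert(pages != NULL && next != NULL && fn != NULL);

    while (*next < nr_pages) {
        int err = fn(pages[*next], ctx);

        if (err)
            return err;
        (*next)++;
    }
    return 0;
}

/* Example handler: counts pages in *ctx and aborts on a NULL entry. */
int model_count_page(void *page, void *ctx)
{
    if (page == NULL)
        return -1;
    (*(size_t *)ctx)++;
    return 0;
}
```

Because the index lives with the caller, a partially completed operation leaves behind an unambiguous record of which pages were handled and which were not.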

That is a detail, though, albeit a potentially important one. The core goals and implementation appear to have been received fairly well. It seems highly unlikely that this code could be ready for the 2.6.32 merge window, but we might see it aiming for the mainline in a subsequent development cycle.



Page-based direct I/O

Posted Aug 27, 2009 22:10 UTC (Thu) by giraffedata (subscriber, #1954)

Developers use direct I/O for either (or both) of two reasons: (1) they believe they can manage caching of file contents better than the kernel can, or (2) they want to avoid overflowing the page cache with data which is unlikely to be of use in the near future.

I agree that's what they use it for, but neither of these is really the point of direct I/O. Developers use it for this side effect, because Linux doesn't offer what they really want: simply a cache replacement policy that says, "try not to keep anything for this file in cache."

The reasons for direct I/O per se are different. I believe there are two: 1) something besides this kernel image is accessing the file, so there's no way for the user space program and this other thing to synchronize with kernel caching in the mix. 2) you want to save the expense of an extra copy of data through the kernel's cache. (You can get this with mmap too, but it has drawbacks compared to read/write).

Page-based direct I/O

Posted Aug 28, 2009 2:05 UTC (Fri) by quotemstr (subscriber, #45331)

Linux doesn't offer what they really want: simply a cache replacement policy that says, "try not to keep anything for this file in cache."

    posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

Page-based direct I/O

Posted Aug 28, 2009 2:55 UTC (Fri) by giraffedata (subscriber, #1954)

Thanks for pointing out fadvise; I forgot about it. That's probably a better option than having the application choose the actual cache behavior.

But I believe most users of direct I/O today need a new option to get what they want. NOREUSE says "I'm not going to access this data again soon," but we also need, "I don't have anything better to do than wait for I/O, so don't buffer writes on my account."

The beauty of this, as opposed to direct I/O, is the kernel can still do readahead and write behind in order to do more efficient disk scheduling, which is something the application really isn't in a position to do -- a user of a filesystem isn't supposed to know anything about seeks and such.

Page-based direct I/O

Posted Sep 7, 2009 10:34 UTC (Mon) by jlokier (guest, #52227)

When doing video streaming on small embedded systems (small = 32MB RAM, slow processor, no MMU), kernel read-ahead and write-behind turn out to be problematic because they cause too much memory pressure and in a rather lumpy way (which is the real problem - memory allocation failures start happening, and the I/O rate is very variable from one second to the next).

But not having read-ahead and write-behind makes latency too high, unless asynchronous I/O is used to keep the queues full. Asynchronous I/O doesn't work on Linux except with direct I/O. So we're dabbling in asynchronous, direct I/O for video streaming on small devices to make it more reliable.

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds