LWN.net Logo

Scatterlist chaining

High-performance I/O generally involves the use of direct memory access (DMA) operations. With DMA, the I/O device transfers data directly to and from main memory without the intervention of the CPU. In the simplest form of DMA, the controller is handed a pointer to a region of memory, given the length, and told to do its thing. The processor can then forget about the operation until the device signals that the work is done.

This simple view has a drawback, however, in that it assumes that the data to be transferred is stored contiguously in memory. When the I/O buffer is in kernel space, the kernel can often arrange for it to be physically contiguous - though that gets harder as the size of the buffers gets larger. If the buffer is in user space, it is guaranteed to be scattered around physical memory. So it would be nice if DMA operations could work with buffers which are split into a number of distinct pieces.

In fact, with any reasonably capable peripheral device, buffers can be split this way. The term for operations on such buffers is "scatter/gather I/O"; scatter/gather has been well supported under Linux for some time. The DMA chapter of Linux Device Drivers covers scatter/gather in a fair amount of detail. In short, a driver starts by filling in an array of scatterlist structures, which (on the i386 architecture) look like:

    struct scatterlist {
        struct page	*page;
    	unsigned int	offset;
    	dma_addr_t	dma_address;
    	unsigned int	length;
    };

For each segment, the page pointer tells where the segment is to be found in memory, offset tells where the data begins within the page, and length is the length of the segment. Once the list has been filled in, the driver calls:

    int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
                   enum dma_data_direction direction);

This operation, at a minimum, fills in the dma_address field of each scatterlist entry with an address which can be given to the peripheral. It might do more, though: physically contiguous pages may be coalesced into a single scatterlist entry, or the system's I/O memory management unit might be programmed to make parts (or all) of the list virtually contiguous from the device's point of view. All of this - including the exact form of struct scatterlist - is architecture dependent, but the scatter/gather interface is set up so that drivers need not worry about architecture details.

Recently, a particular constraint in the scatter/gather interface has turned up. For various reasons, scatterlists must fit within a single page; that restriction puts an upper limit on the number of segments which may be represented. On the i386 architecture, with high memory enabled, struct scatterlist requires 20 bytes, which limits a scatterlist to 204 entries. If each scatterlist entry points to a full page, the maximum size for a DMA transfer will be about 800KB. On an x86-64 system, the situation is worse: the structure uses 40 bytes, cutting the maximum length in half.

There are situations where larger I/O operations are desirable. The block I/O subsystem is one of them, but there are certainly others: high-resolution video capture devices are an example. The limitation on scatterlist length is one of the factors motivating developers who are working on large block size support. By increasing the effective page size, they are able to increase the maximum I/O operation size.

Increasing the page size is not the only feasible approach, though; another is simply to make scatterlists longer. Multi-page contiguous scatterlists are not really in the cards, but chaining single-page scatterlists can be done. Jens Axboe has been working on that approach; his scatterlist chaining patch is on its sixth revision as of this writing.

Chaining is done by overloading the page pointer in the last scatterlist entry in a page. The least significant bit is set to indicate that the entry is, in fact, a chain link rather than another segment to transfer. The change is almost transparent to drivers. In current kernels, the code which iterates through a scatterlist usually looks something like this:

    struct scatterlist *sg = &the_scatterlist[0];

    for (i = 0; i < nentries; i++) {
	program_io_operation(sg);
	sg++;
    }

When chaining is being used, simply incrementing through the array no longer works. So Jens has added a simple sg_next() macro to follow the the chain links when necessary. So the sg++ line above turns into something like:

    sg = sg_next(sg);

Since a driver change is required, chained scatterlists should not be used unless one knows for sure that the driver is prepared for them. The patch from Jens fixes up a number of drivers, especially in the block subsystem. Even so, the maximum I/O size must be raised explicitly by the administrator (via a sysfs file) before chaining will be turned on. Once it's enabled, however, multi-megabyte I/O operations become possible. No intrusive memory management changes required.


(Log in to post comments)

Scatterlist chaining

Posted May 17, 2007 14:16 UTC (Thu) by i3839 (guest, #31386) [Link]

> No intrusive memory management changes required.

I assume you hint at the variable pagecache patch that's floating around, but is this scatterlist chaining a reaction to that, or coincidentally at around the same time?

Scatterlist chaining

Posted May 17, 2007 17:51 UTC (Thu) by pj (subscriber, #4506) [Link]

Are there performance comparison numbers with/without this change? How much of a speedup is it likely to be?

Scatterlist chaining (performance)

Posted May 18, 2007 16:03 UTC (Fri) by nevyn (subscriber, #33129) [Link]

I was also interested in that. There doesn't seem to be any direct posting of perf. numbers, but it does hint at it indirectly (from http://lkml.org/lkml/2007/5/9/251):

"""Performance wise, it's meant to help higher end hardware that need 2-4mb (or bigger) commands to get good performance. That also includes things like tapes that have big block sizes, getting a command of the right size there is the difference between good and abysmal performance."""

Scatterlist chaining (why scatterlist needs to fit into single page)

Posted Jun 12, 2007 14:52 UTC (Tue) by tom.tuytschaever (guest, #45731) [Link]

> For various reasons, scatterlists must fit within a single page

What would these reasons be (in short) ?
Is is because of the way that dma_map_sg & dma_unmap_sg are implemented ?

Or is it because it is harder to allocate more than one page contiguously in memory ? Couldn't really find this in article http://lwn.net/Articles/233599/

Scatterlist chaining

Posted Oct 18, 2007 6:15 UTC (Thu) by sonnyrao (subscriber, #11351) [Link]

Does the one page limit exist for architectures with an IOMMU ?

Scatterlist chaining

Posted Dec 3, 2007 16:11 UTC (Mon) by mdr (guest, #49391) [Link]

I was wondering, does this mean that the speed of disk I/O (I/O to the raid controller) is
being negatively impacted by this constraint, if we can only transfer a maximum of 800KB per
interrupt. 

This means that either a processor needs to spend more time servicing interrupts, because more
interrupts need to be generated to move the data about, or that less data is going to get
transferred?

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds