
A better DMA memory allocator

By Jonathan Corbet
March 10, 2008
As any device driver author knows, hardware can be a pain sometimes. In the early days of Linux, peripherals attached to the ISA bus inflicted their particular variety of pain by being unable to use more than 24 bits to access memory. What that meant, in practical terms, was that ISA devices could not perform DMA operations on memory above 16MB. The PCI bus lifted that restriction, but, for some time, there were quite a few "PCI" devices that were minimally modified ISA peripherals; many of those retained the 16MB limit.

To handle the needs of these devices, Linux has long maintained the DMA memory zone. Drivers needing memory from that zone specify GFP_DMA in their allocation requests. The memory management code takes special care to keep memory in that zone available so that DMA requests can be satisfied. In this way, the system can provide reasonable assurance that memory will be available to perform DMA in ways which meet the special needs of this particularly challenged hardware.
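
A driver with that limitation would request its buffer with something like the following (a minimal sketch of the longstanding GFP_DMA usage; buffer_size is a stand-in for whatever size the driver needs):

    /* Ask for memory guaranteed to be reachable with 24-bit addressing. */
    void *buf = kmalloc(buffer_size, GFP_KERNEL | GFP_DMA);
    if (!buf)
        return -ENOMEM;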

The only problem is that there aren't a whole lot of devices out there which still have the old 24-bit addressing limitation. So the DMA zone tends to sit idle. Meanwhile, there are devices with other sorts of limitations. Many peripherals only handle 32-bit addresses, so their DMA buffers must be allocated in the bottom 4GB of memory. There is a subset, however, with stranger limitations - 30 or 31-bit addresses, for example. The kernel's DMA library provides a way for drivers to disclose that sort of embarrassing limitation, but the memory management code does not really help the DMA layer make allocations which satisfy those constraints. So drivers for such devices must use the DMA zone (which may not be present on all architectures), or hope that normal zone memory fits the bill.

Andi Kleen has set out to clean up this situation with a new DMA memory allocator. His solution is to take a chunk of memory out of the kernel's buddy allocator entirely and manage it separately, forming a reserve pool for DMA allocations. The result is a bit of a departure from normal Linux memory management algorithms, but it may well be better suited to the task at hand.

The new "mask" allocator grabs a configurable chunk of low memory at boot time. Allocations from this region are made with a separate set of calls, with the core API being:

    struct page *alloc_pages_mask(gfp_t gfp, unsigned size, u64 mask);
    void __free_pages_mask(struct page *page, unsigned size);

    void *get_pages_mask(gfp_t gfp, unsigned size, u64 mask);
    void free_pages_mask(void *mem, unsigned size);

alloc_pages_mask() looks a lot like the longstanding alloc_pages() function, but there are some important differences. The size parameter is the desired size of the allocation in bytes, rather than the "order" value used by alloc_pages(), and mask describes the range of usable addresses for this allocation. Though mask looks like a bitmask, it is really better understood as the maximum address that the allocated memory may have; "holes" in the mask would make no sense.
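
As an illustration, a driver needing a buffer below the 2GB boundary might use the new calls like this (a minimal sketch; the surrounding error handling is an assumption about how callers would use the API):

    /* Allocate a 64KB buffer that must lie below the 31-bit limit. */
    void *buf = get_pages_mask(GFP_KERNEL, 64 * 1024, 0x7fffffffULL);
    if (!buf)
        return -ENOMEM;    /* nothing suitable below the limit */
    /* ... set up and perform DMA on buf ... */
    free_pages_mask(buf, 64 * 1024);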

A call to alloc_pages_mask() will first attempt to allocate the requested memory using the normal Linux memory allocator, on the assumption that the reserved DMA memory is an especially limited resource. If the allocation fails, perhaps because there's no physically-contiguous chunk of sufficient size available, then the allocator will dip into the reserved DMA pool. If the normal allocation succeeds, though, the allocated memory must still be tested against the maximum allowable address: the normal memory allocator, remember, has no support for allocating below an arbitrary address. So if the returned memory is out of bounds, it must be immediately freed and the reserved pool will be used instead.
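
In rough outline, that strategy looks something like the sketch below; pool_alloc_pages() is a hypothetical name standing in for the reserved-pool allocator, not a function from the patch set:

    static struct page *do_alloc_mask(gfp_t gfp, unsigned size, u64 mask)
    {
        struct page *page = alloc_pages(gfp, get_order(size));

        /* The buddy allocator cannot honor an address limit, so check
           the result and discard it if it lies above the mask. */
        if (page && page_to_phys(page) + size - 1 > mask) {
            __free_pages(page, get_order(size));
            page = NULL;
        }
        if (!page)
            page = pool_alloc_pages(size, mask);  /* hypothetical helper */
        return page;
    }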

That reserved pool is not managed like the rest of memory. Rather than the buddy lists maintained by the normal page allocator, the DMA allocator has a simple bitmap describing which pages are available. It will normally cycle through the entire memory region, allocating the next available chunk of sufficient size. If that chunk is above the memory limit, though, the allocator will move back to the lower end of the reserved pool and allocate from there instead. Since DMA allocations tend to be short-lived, one would expect that a suitable block of memory would either be available or become available in the near future.
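
That search amounts to a "next fit" scan over the bitmap with a wrap-around when the cursor runs past the address limit. A minimal sketch, assuming hypothetical bookkeeping (pool_bitmap, pool_pages, pool_start, and a rotating next_page cursor) and the kernel's generic bitmap helpers:

    /* Find and claim a run of free pages below mask; returns the index
       of the first page in the pool, or -1 on failure. Assumes the pool
       itself starts at physical address pool_start, below the mask. */
    static long pool_find_pages(unsigned pages, u64 mask)
    {
        unsigned long max_idx = ((mask - pool_start) >> PAGE_SHIFT) + 1;
        unsigned long limit = min((unsigned long)pool_pages, max_idx);
        unsigned long idx;

        /* Start at the rotating cursor ... */
        idx = bitmap_find_next_zero_area(pool_bitmap, limit, next_page, pages, 0);
        if (idx >= limit)  /* ... and wrap back to the bottom if needed. */
            idx = bitmap_find_next_zero_area(pool_bitmap, limit, 0, pages, 0);
        if (idx >= limit)
            return -1;
        bitmap_set(pool_bitmap, idx, pages);
        next_page = idx + pages;
        return idx;
    }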

One other difference of note is that, unlike the page allocator, the DMA allocator does not round memory allocation sizes up to the next power of two. DMA allocations can be relatively large, so that rounding can result in significant internal fragmentation and memory waste; a 40KB buffer, for example, would cost 64KB from the page allocator, while the mask allocator need only set aside ten 4KB pages.

At the next level up, Andi has added a new form of mempool which uses the DMA allocator:

    mempool_t *mempool_create_pool_pmask(int min_nr, int size, u64 mask);

Pools created this way behave like normal mempools, with the exception that all allocations will be below the limit passed in as mask. These pools are used in the block layer, where memory allocations for DMA must succeed.
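
Usage follows the familiar mempool pattern; a hedged sketch of how a block driver might create and draw from such a pool, assuming 512-byte buffers that must sit below 4GB:

    mempool_t *pool = mempool_create_pool_pmask(4, 512, 0xffffffffULL);
    if (!pool)
        return -ENOMEM;

    /* With a sleeping gfp, mempool_alloc() will wait for an element
       to come back to the pool rather than fail. */
    void *buf = mempool_alloc(pool, GFP_NOIO);
    /* ... submit the I/O using buf ... */
    mempool_free(buf, pool);
    mempool_destroy(pool);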

One might object that reserving a big chunk of low memory for this purpose reduces the total amount of memory available to the system - especially if the DMA allocator is cherry-picking normal memory whenever it can anyway. But the cost is not as bad as one might think. These patches do away with the old DMA zone, which, for all practical purposes, was already managed as a reserved (and often unused) memory area. Some 64-bit architectures also set aside a significant chunk (around 64MB) of low memory for the swiotlb - essentially a set of bounce buffers used for impedance matching between high memory (>4GB) buffers and devices which cannot handle more than 32-bit addresses. With Andi's patch set, the swiotlb, too, makes allocations from the DMA area and no longer has its own dedicated memory pool. So the total amount of memory set aside for I/O will not change very much; it could, in fact, get smaller.

For most driver authors, there will be little in the way of required changes if this patch set gets merged. The DMA layer already allows drivers to specify an address mask with dma_set_mask(); with the DMA allocator in place, that mask will be better observed. The one change which might affect a few drivers is further down the line: eventually the GFP_DMA memory allocation flag will go away. Any driver which still uses this flag should set a proper mask instead.
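
Declaring the limit is a one-line affair today; a sketch of what the probe routine for a 31-bit-limited device might contain (dev is assumed to be the driver's struct device pointer):

    /* Tell the DMA layer that this device can only address 2GB. */
    if (dma_set_mask(dev, DMA_BIT_MASK(31))) {
        dev_err(dev, "no usable DMA address configuration\n");
        return -EIO;
    }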

So far, there has been little discussion resulting from the posting of these patches. Silence does not mean assent, of course, but it would appear that there is little opposition to this set of changes.



A better DMA memory allocator

Posted Mar 15, 2008 1:04 UTC (Sat) by iabervon (subscriber, #722)

I seem to recall that there are actually devices out there whose mask of usable DMA addresses really does have holes in it, due to some of their address lines being miswired or something like that. It's hard to come up with a hardware quirk so nonsensical that no hardware manages to have it.

Why it's a "mask" ...

Posted Mar 20, 2008 7:29 UTC (Thu) by HalfMoon (guest, #3211)

The classic example of that is the SA-1100 (old ARMv4, no longer manufactured), whose DMA controller had an erratum which meant that one address bit could never be used ... every other MByte was unusable for DMA. So this is a case where the functionality of a DMA address "mask" was appropriate, instead of just a "biggest address" value. Intel never fixed that bug (or a boatload of others in that chip).

It escapes me why Linux calls what it has a "mask"; it's long overdue to change its name to reflect the fact that it's just a ceiling on the addresses usable for DMA. Calling it a "mask" makes it seem like the complete inability to handle that SA-1100 erratum is a bug.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds