A handful of DMA topics
Many devices can perform full 64-bit DMA operations. This capability is nice on large-memory systems, but working with larger addresses can also bring a performance penalty. As a way of helping drivers pick the optimal size for DMA address descriptors, James Bottomley has proposed the creation of a new function called dma_get_required_mask().
The current API already has dma_set_mask(), which tells the kernel about the range of DMA addresses the device can access. The new function would be called after an invocation of dma_set_mask(); it would return a new bitmask describing what the platform sees as the optimal set of DMA addresses, taking the device's original DMA mask into account. If the specific hardware situation does not require the use of larger addresses, the platform can suggest using the faster, 32-bit mode even when the device can handle larger addresses. The driver can then use that advice to set a new mask describing what it will actually use.
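As a rough sketch (not taken from the proposal itself; the device setup function, error handling, and literal mask values below are illustrative), a driver's DMA setup might look something like this once the new call is available:

#include <linux/dma-mapping.h>

static int mydev_setup_dma(struct device *dev)
{
        u64 required;

        /* Tell the kernel the device can generate full 64-bit addresses. */
        if (dma_set_mask(dev, 0xffffffffffffffffULL))
                return -EIO;

        /* Ask the platform what it thinks this device actually needs. */
        required = dma_get_required_mask(dev);

        /*
         * If 32-bit addressing is sufficient on this system, fall back
         * to it; the device can then use smaller, faster descriptors.
         */
        if (required <= 0xffffffffULL)
                dma_set_mask(dev, 0xffffffffULL);

        return 0;
}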
The "scatterlist" mechanism is another part of the DMA subsystem; it allows drivers to set up scatter/gather I/O, where the buffer to be transferred is split into multiple, distinct chunks of memory. Scatter/gather is useful in a number of situations, including network packets (which are assembled from multiple chunks), the readv() and writev() system calls, and for I/O directly to or from user-space buffers, which can be spread out in physical memory. The mapping functions for scatter/gather I/O will coalesce pieces of the buffer which turn out to be physically adjacent in memory. In practice, that has turned out not to happen very often; one recent report showed that, out of approximately 32,000 segments, all of 40 had been merged in this manner.
It turns out, however, that the Linux memory allocator is not helping the situation. When the allocator breaks up a large block of pages to satisfy smaller requests (a frequent occurrence), it returns the highest page in the block. A series of allocations will, thus, obtain pages in descending order. If those pages are assembled into an I/O buffer, each page will need to be a separate segment in a scatter/gather operation, since the reverse-allocated pages cannot be merged.
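To make the failure mode concrete: two pages can only be merged into one segment when the second one starts at the page frame immediately after the first. A hypothetical helper expressing that test might look like the following; with pages handed out in descending order, the condition never holds, so every page becomes its own segment.

#include <linux/mm.h>

/*
 * Hypothetical helper: merging is only possible when "next" begins at
 * the page frame immediately following "prev".  Pages obtained in
 * descending order run the other way, so this test never succeeds.
 */
static int pages_are_adjacent(struct page *prev, struct page *next)
{
        return page_to_pfn(prev) + 1 == page_to_pfn(next);
}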
William Lee Irwin put together a patch which causes the allocator to hand out pages from the bottom of a block instead of the top. With this patch applied, the merge rate in this particular test went up to over 55%. Larger segments lead to faster I/O setup and execution, which is a good thing. Sometimes a tiny patch can make a big difference, once you know where the problem is.
Meanwhile, Ian Molton turned up a different sort of problem. Some types of interfaces have their own onboard memory. This memory is, often, accessible to the CPU, and it can be used by devices attached to the interface for DMA operations. But that memory is not part of the regular system RAM, and it typically does not show up in the system memory map. As a result, the generic DMA functions will not make use of this memory when allocating DMA buffers.
It would be nice to be able to make use of this memory, however. It is there, and it can be used to offload some DMA buffers from main memory. On some systems, it may be the only memory which is usable for DMA operations to certain devices. The DMA API has even been set up with this sort of memory in mind; it can handle cases where, for example, the memory in question has a different address from the device's point of view than it does for the processor. It would seem that the addition of an architecture-specific module to the DMA API could enable such memory to be allocated on platforms which have it, when the DMA target is a device which can make use of it.
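One can imagine, as a sketch only, how a driver might hand such onboard memory to the DMA layer using the dma_declare_coherent_memory() call mentioned in the discussion below. The signature, flags, addresses, and return convention used here are assumptions; the interface was still being worked out when this article was written.

#include <linux/dma-mapping.h>

/* Hypothetical addresses for a device's onboard SRAM. */
#define MYDEV_SRAM_PHYS 0x40000000      /* where the CPU sees it */
#define MYDEV_SRAM_DEV  0x00000000      /* where the device sees it */
#define MYDEV_SRAM_SIZE (64 * 1024)

static int mydev_declare_sram(struct device *dev)
{
        /*
         * Hand the onboard memory to the DMA layer; note that the CPU
         * and device addresses need not be the same.  The return
         * convention assumed here (nonzero on success) may not match
         * the interface as finally adopted.
         */
        if (!dma_declare_coherent_memory(dev, MYDEV_SRAM_PHYS,
                                         MYDEV_SRAM_DEV, MYDEV_SRAM_SIZE,
                                         DMA_MEMORY_MAP))
                return -ENOMEM;

        /* dma_alloc_coherent() for this device now draws from the SRAM. */
        return 0;
}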
The biggest problem would appear to be that this sort of remote memory is not part of the system's memory map, and, thus, there is no struct page structure which describes it. The lack of a page structure makes certain macros fail. It also completely breaks any driver which tries to map the buffer into user space via the nopage() VMA operation. And, it turns out, drivers really do that; the ALSA subsystem, for example, maps buffers to user space in this manner.
Once a problem is identified, it can usually be fixed. The right approach in this case would appear to be a combination of two things. The first is to simply fix any bad assumptions in drivers with regard to how they can treat DMA buffers. If the driver expects that a page structure exists for a DMA buffer, it is broken and simply needs to be fixed. The second part is to provide an architecture-independent way for device drivers to map DMA buffers into user space.
To that end, Russell King has proposed yet another DMA API function:
int dma_map_coherent(struct device *dev, struct vm_area_struct *vma,
                     void *cpu_addr, dma_addr_t handle, size_t size);
This function would take the given mapped DMA buffer (as described by cpu_addr and handle) and map it into the requested VMA. Device drivers could use this function to make a buffer available to user space, and would be able to discard their existing nopage() methods. The new interface would thus simplify things, though it does still leave a reference counting problem on the driver side of things: freeing the DMA buffer before user space has unmapped it would be a big mistake.
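To illustrate how a driver might use the proposed call, here is a sketch of an mmap() method built on dma_map_coherent(); the mydev structure and its fields are hypothetical, and the buffer is assumed to have been obtained from dma_alloc_coherent().

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/dma-mapping.h>

/* Hypothetical per-device structure; buf comes from dma_alloc_coherent(). */
struct mydev {
        struct device   *dev;
        void            *buf;
        dma_addr_t      buf_handle;
        size_t          buf_size;
};

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct mydev *md = file->private_data;
        size_t len = vma->vm_end - vma->vm_start;

        if (len > md->buf_size)
                return -EINVAL;

        /* Map the whole coherent buffer at once; no nopage() method needed. */
        return dma_map_coherent(md->dev, vma, md->buf, md->buf_handle, len);
}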
Index entries for this article:
  Kernel: Direct memory access
  Kernel: dma_get_required_mask()
Posted Jun 24, 2004 14:30 UTC (Thu) by mmarkov (guest, #4978) (3 responses)

How does the CPU access this memory? Via the data bus? What addresses does it use -- and aren't all the valid addresses referring to the RAM?
Posted Jun 24, 2004 16:06 UTC (Thu) by Duncan (guest, #6647)

Wouldn't a graphics card be the best known of these? I haven't read the main article as I'm just skimming them now to come back to later (after some sleep, when I can make better sense of them <g>), so I don't know if that fits in the context or not, but it's immediately what *I* think of when I think of an interface with its own memory.

Duncan

Posted Jun 25, 2004 14:38 UTC (Fri) by cthulhu (guest, #4776)

Check out Ian's email - the chip memory is in the CPU I/O address space. If it were accessible in the PCI or AGP space, it would show up in the CPU memory space.

What I'm wondering now is how such chips have been used by drivers in the past. My guess is that there were alternatives to using the I/O space, or they didn't use DMA. In this case, the chip *must* DMA from its own internal memory, and hence forced this enhancement.

Posted Jul 1, 2004 12:49 UTC (Thu) by jamey (guest, #22733)

For the devices for which Ian and I are writing drivers, the CPU accesses the device's internal memory via the CPU memory bus. On other platforms, the CPU might access the internal memory via an IO bus. The new DMA calls proposed by James Bottomley (dma_declare_coherent_memory and dma_release_coherent_memory) handle either case.

Many drivers for such devices used ioremap and private memory allocators. What caused us to propose this change was the use of standard device hardware interfaces (USB OHCI) in the context of a chip with direct access only to private SRAM. In the past, we made extensive modifications to drivers such as the USB OHCI driver to support these devices. However, it appeared that the 2.6 device model and the generic DMA API would enable us to clean this up and produce drivers acceptable to the mainstream kernel.