|
|
Log in / Subscribe / Register

Goodbye to GFP_TEMPORARY and dma_alloc_noncoherent()

By Jonathan Corbet
August 28, 2017
Like most actively developed programs, the kernel grows over time; there have only been two development cycles ever (2.6.36 and 3.17) where the kernel as a whole was smaller than its predecessor. The kernel's internal API tends to grow in size and complexity along with the rest. The good thing about the internal API, though, is that it is completely under the control of the development community and can be changed at any time. Among other things, that means that parts of the kernel's internal API can be removed if they are no longer needed — or if their addition in the first place is deemed to be a mistake. A pair of pending removals in the memory-management area shows how this process can work.

GFP_TEMPORARY

One of the many challenges faced by the kernel's memory-management subsystem is fragmentation. If allocations are not placed carefully, the system's free memory can end up split into many small chunks that cannot be coalesced; that can lead to allocations failing even though much of the system's memory is idle. This is particularly true of memory allocations for use by the kernel itself. Those allocations can be long-lived and there is usually no way to relocate them if they are in the way. A single small allocation can prevent the reuse of an entire page; that, in turn, can block the creation of larger chunks of memory around that page.

It has long been understood that not all kernel memory allocations are equal. Some data structures are critical to the operation of the system and cannot be removed; consider the structures describing a mounted filesystem or a running process, for example. Others, though, exist to improve the system's performance and can be dropped if needed; the inode and dentry caches in the virtual filesystem layer are perhaps the biggest examples of this type of structure. The latter type of structure is called "reclaimable".

A key heuristic used within the memory-management subsystem is to try to separate reclaimable and non-reclaimable allocations. A page full of reclaimable allocations can, in theory at least, be recovered for other uses when memory is tight. But a single non-reclaimable allocation will prevent the reuse of the entire page. Separating the two types increases the probability that pages containing reclaimable allocations can, in truth, be reclaimed should the need arise.

Back in 2007, Mel Gorman added the GFP_TEMPORARY allocation type in an attempt to make memory allocation more flexible. The reasoning was this: some memory allocations last a long time, while others are highly transient. A structure allocated to represent a newly added device may persist for the lifetime of the system, while memory allocated to satisfy a system call may be returned within milliseconds. When an allocation is short-lived, it doesn't matter whether it is reclaimable or not; since it will be returned shortly regardless, it is unlikely to hold up the reclaim of a page full of otherwise reclaimable allocations. So GFP_TEMPORARY allocations were allowed to draw from the reclaimable pool, even though there was no mechanism by which they could be reclaimed.

Earlier this year, GFP_TEMPORARY was the subject of an extensive discussion that was, at the start, focused on a seemingly simple question: what does "temporary" mean? Is there a limit on how long the allocation can be held? Is the holder of a GFP_TEMPORARY allocation allowed to block or take locks? It turns out that this was not specified when GFP_TEMPORARY was added. The discussion failed to fill that void, and a review of the GFP_TEMPORARY call sites in the kernel revealed some decidedly non-temporary uses. It became clear that nobody really knew what a "temporary" allocation was supposed to be.

There was talk of trying to nail down that definition, but Michal Hocko pushed a different approach: remove GFP_TEMPORARY entirely. The current uses, he said, did not justify keeping it around:

I have checked some random users and none of them has added the flag with a specific justification. I suspect most of them just copied from other existing users and others just thought it might be a good idea to use without any measuring. This suggests that GFP_TEMPORARY just motivates for cargo cult usage without any reasoning.

There were a handful of complaints about the loss of the flag, but no serious opposition to the change. Other developers, including Neil Brown, agreed with the change:

If we have a flag that doesn't have a well defined meaning that actually affects behavior, it will not be used consistently, and if we ever change exactly how it behaves we can expect things to break. So it is better not to have a flag, than to have a poorly defined flag.

He suggested improving the kernel's notion of reclaimability of allocations instead. That may happen in the future, but the removal of GFP_TEMPORARY is set to happen more quickly. The patches are in linux-next now, meaning they are on track to hit the mainline during the 4.14 merge window. Should that happen, GFP_TEMPORARY will itself prove to have been temporary — for a "ten years" value of "temporary".

dma_alloc_noncoherent()

The allocation of memory for direct memory access (DMA) operations is not as simple as it might seem. Devices often have a different view of memory than the CPU does, and allocations for DMA must bridge that gap. These allocations must usually be physically contiguous, and they have to be in a region of memory that the target device is able to access, for example. An interesting additional requirement is handled by dma_alloc_noncoherent():

    void *dma_alloc_noncoherent(struct device *dev, size_t size,
    				dma_addr_t *dma_handle, gfp_t flag);

A call to dma_alloc_noncoherent() is an explicit request to allocate a DMA buffer in a noncoherent region of memory. Memory that is cache-coherent looks the same to both the CPU and I/O devices. If the CPU writes to that memory, its writes will be visible to the device; similarly, if the device writes a region of memory, the CPU will immediately see the new data. Noncoherent memory lacks that guarantee; if the CPU wants to read data placed into memory via a DMA operation, it must take care to invalidate its own memory caches after the completion of the I/O operation, but before its first access.

Noncoherent memory is clearly trickier to work with; without sufficient care, it is easy to end up with corrupted data. So one might wonder why anybody would want to ask for it specifically. The answer is that on architectures where cache coherence doesn't come naturally (ARM, for example), coherent memory is far slower. Turning on coherence generally involves turning off caching, with a predictable effect on performance. For situations involving any significant data processing, using coherent memory is just not an option.

It is thus important to be able to allocate noncoherent memory for DMA buffers, which raises the question of why Christoph Hellwig is working to remove dma_alloc_noncoherent(). The answer is that, on any reasonably current system, control over memory access modes is more sophisticated than simply turning caching on or off. Memory can be configured to allow write combining (where multiple write operations can be grouped by the hardware for performance), for example, or it can be set to allow operations to be reordered. Many of these features can be configured together. Creating new allocation function for each combination is clearly unlikely to lead to joy, so the kernel developers added a new set of functions in the 3.4 development cycle, including:

    void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle,
			  gfp_t flag, unsigned long attrs);

The attrs field can be used to specify a whole range of attributes, including DMA_ATTR_NON_CONSISTENT to obtain a noncoherent mapping. This function clearly fills the same role as dma_alloc_noncoherent() and a lot more besides, so there is little reason to keep dma_alloc_coherent() around. Hellwig has been working to remove it, which means updating all of its callers to use dma_alloc_attrs() instead. Much of that work went in during the 4.13 merge window; only three call sites remain. His current patches remove those last three, along with the function itself. Any out-of-tree drivers using dma_alloc_noncoherent() will have to be updated separately, of course.

In both cases, the kernel's internal API is getting (slightly) smaller, but no functionality is being lost. This work is an example of the sort of cleanups that are possible when there is no need to maintain API compatibility. Interfaces exposed to user space must be preserved, but the ability to evolve internally is a big part of why the kernel remains maintainable despite having just celebrated its 26th birthday.

Index entries for this article
KernelDirect memory access
KernelMemory management/GFP flags


to post comments

Goodbye to GFP_TEMPORARY and dma_alloc_noncoherent()

Posted Aug 28, 2017 18:41 UTC (Mon) by mwsealey (guest, #71282) [Link] (6 responses)

Intel, ARM and the Linux DMA API all disagree on the meaning of coherent and consistent (the DMA-API docs are even inconsistent with it's usage) which doesn't map well at all to, say, ARM's memory models and types.

That's because the DMA API is written from the point of view of the DMA device, although that doesn't make sense either; implying coherency of memory by using the property of consistency (which isn't the same thing at all) is really a very strange concept, and doesn't involve cache-coherency at all (although it does involve disabling caching to try and gain the property of coherency).

I fail to see how removing dma_alloc_noncoherent would be controversial. But I also think someone should really consider the insanity of an API that writing software that runs on the CPU yet defining memory allocation and memory usage from the point of view of the device. As far as I can tell WRITE_COMBINE, WEAK_ORDERING, NON_CONSISTENT could be cacheable memory or not (the Device-GRnE type covers it beautifully!) - it would replace dma_alloc_noncoherent by making the same relaxations, but it isn't specific enough to cover the possibilities. Hopefully the next step is actually fixing that..

Goodbye to GFP_TEMPORARY and dma_alloc_noncoherent()

Posted Aug 29, 2017 7:34 UTC (Tue) by arnd (subscriber, #8866) [Link] (5 responses)

dma_alloc_noncoherent is very different from the others, and was never supported on ARM because it must be paired with dma_cache_sync() that is not implementable on architectures on which we must use cache flushes to make normal memory coherent.

I think it was introduced for SGI Origin or Altix machines, big NUMA boxes on which dma_alloc_coherent memory was very expensive and other memory could be made coherent using a simple synchronizing operation on the interconnect.

If we remove dma_alloc_noncoherent, we have to kill off dma_cache_sync as well, and maybe think about removing support for the machines that relied on it.

Goodbye to GFP_TEMPORARY and dma_alloc_noncoherent()

Posted Aug 29, 2017 18:41 UTC (Tue) by willy (subscriber, #9762) [Link]

I think it was actually PA-RISC. There are some early machines (710, 720, 730, 750) where there is no IOMMU, and no concept of cache coherent DMA. I think that supporting V class would also require noncoherent memory, but that effort is extremely unlikely to be undertaken at this point.

Goodbye to GFP_TEMPORARY and dma_alloc_noncoherent()

Posted Aug 30, 2017 20:31 UTC (Wed) by mwsealey (guest, #71282) [Link] (3 responses)

> dma_alloc_noncoherent is very different from the others, and was never supported on ARM because it must be paired with dma_cache_sync() that is not
> implementable on architectures on which we must use cache flushes to make normal memory coherent.

I'm not sure I understand that. ARC and MIPS do exactly what you describe as not-implementable.

Goodbye to GFP_TEMPORARY and dma_alloc_noncoherent()

Posted Aug 31, 2017 8:29 UTC (Thu) by arnd (subscriber, #8866) [Link] (2 responses)

My mistake, I misremembered indeed. On platform with explicit (but sane) cache management, the dma_cache_sync() model can work but is twice as expensive as the ownership model of the streaming API, as it requires the flush and/or invalidate to be performed for every sync rather than every other one.

On platforms that require bouncing (e.g. swiotlb) however it is worse. Here, the operation for DMA_BIDIRECTIONAL is undefined since dma_cache_sync() cannot decide which direction to copy the data.

On cache architectures that do speculative prefetching and also write back prefetched (clean) cache lines during a flush you get the same problem, where dma_cache_sync has to decide between writeback and invalidate. No sane architecture would do such a thing of course, which means that it probably happens on at least one ARM implementation but not elsewhere :-).

Goodbye to GFP_TEMPORARY and dma_alloc_noncoherent()

Posted Sep 1, 2017 19:56 UTC (Fri) by mwsealey (guest, #71282) [Link] (1 responses)

> My mistake, I misremembered indeed. On platform with explicit (but sane) cache management, the dma_cache_sync() model can work but is twice as expensive as
> the ownership model of the streaming API, as it requires the flush and/or invalidate to be performed for every sync rather than every other one.

Every other sync?

In terms of DMA direction, if you put data in memory from the CPU and DMA needs to read it, you Clean to PoC. If you need to read memory that DMA wrote, you need to Invalidate to PoC. The requirements are super explicit (and a natural consequence of software-managed coherency). There's no every-other-sync.

Lets imagine you have two buffers - a source and a destination. Your CPU writes to the source and expects the DMA to write to the destination. You need to dma_cache_sync(DMA_TO_DEVICE) to make sure the source buffer is not just in the CPU caches, and the DMA can see it. That's your Clean to PoC. We actually talk about C&I here because there's no point not invalidating those lines if you're going to discard the source buffer and deal with the destination, but dma_cache_sync() implementations usually err on the side of letting natural cache behaviour deal with it, just in case the CPU goes back to the source.

When your CPU needs to read the destination data, it MUST perform an architecturally required invalidate of that range. Why? Because of speculation. The CPU pipeline is constantly fetching instructions ahead, guessing at addresses for loads and stores, and speculatively issuing them to address generation and load-store units. If you dma_unmap_foo() a region and then access it in some unbounded period of time, the CPU may have guessed this, and prefetched the data before the DMA has finished it's operation. The invalidate and the required barriers (see Barrier Litmus Tests) prevent the CPUs from having stale (inconsistent) data in the caches.

There's no elision of cache maintenance that lets you only manage the caches every other time, or only on one side, otherwise you essentially create a mess. You might not notice on an ARM7TDMI but you will notice a hell of a lot on a Cortex-A9, Cortex-A15, or any of the shinier 64-bit cores.

Personally, I don't think that users should have to care about this. They may have some extreme requirements that the memory absolutely not be cached for a particular reason, but otherwise the DMA API should give you Normal memory, cacheable if possible. Unless you are dealing with phenomenally small amounts of data where a cache line size dwarfs the data you want to keep consistent, or phenomenally large areas where you'd just pollute the cache by dealing with any of it on the CPU, you want the CPU to be able to cache it. And as CPUs get faster and faster and more complex, generating non-cacheable transactions is a performance bear of it's own making; the consensus (which is wrong) is that "DMA activity must happen in uncacheable areas to save on explicit maintenance, which is costly." Significant amounts of software make this assumption. Significant amounts of software then cannot deal with the fact that uncacheable areas of memory are extremely costly to access (even if you have the best merging store buffer or load queueing on the planet) compared to cacheable ones. 1000s of cycles of latency to go all the way out to DRAM vs. single digit cycles, maybe double digits to get to cache. You forfeit prefetching (both automatic and explicit) COMPLETELY if it's uncacheable. Significant amounts of software hobbles the CPU on every load and store, so that you can save a couple thousand cycles at the pinch point of CPU<>DMA handoff.

Significant amounts of software therefore don't use dma_alloc_noncoherent (which is your easiest way to hint you want cacheable memory) and if it does, it can get it another way, but the specificity of dma_alloc_attrs doesn't even cover the case - it's merely "please don't give a crap about being strongly ordered and sequentially consistent, I can deal with cache maintenance happening at that pinch point if need be, or I know my device is IO coherent". The funny thing about being IO coherent (i.e. DMA can snoop CPU caches - AXI4 ACE-Lite at least) is that it's completely pointless to implement if you can't guarantee cacheable memory with the appropriate type for CPUs and interconnect to manage coherency. If you can't SPECIFICALLY define it as Normal Cacheable then there's technology going to waste.

And there's my point, right there; the Linux DMA API is hideous and losing dma_alloc_noncoherent won't hurt anyone any more than it already hurts.

Ta :]

Goodbye to GFP_TEMPORARY and dma_alloc_noncoherent()

Posted Sep 1, 2017 21:46 UTC (Fri) by arnd (subscriber, #8866) [Link]

What I mean with "every other sync" is that we can save one of the dma_sync_*_to_{device,cpu} operations. In the example you just gave, the streaming DMA API will always write back the cache for dma_sync_*_to_device(DMA_TO_DEVICE) and it will always invalidate the cache for dma_sync_*_to_cpu(DMA_TO_CPU), while the corresponding dma_sync_*_to_cpu(DMA_TO_DEVICE) and dma_sync_*_to_device(DMA_TO_CPU) operations can often be implemented as no-op (on ARM, we do an extra invalidate for the latter in case we have a dirty cache line. IIRC some others only do this for map() but not sync(), and rely on the CPU not dirtying the cache lines while the CPU owns the buffer).

Having both dma_sync_*_to_device() and dma_sync_*_to_cpu() do the same operation the way that MIPS (on speculating CPU cores) does means that you need to do two writeback operations for one DMA_TO_DEVICE and two invalidations for one DMA_TO_CPU (which may well be necessary anyway, depending who you ask) in the streaming API.

With dma_alloc_noncoherent()/dma_cache_sync() you have no ownership information at all, so you either end up doing the double operations as well, or the driver has to make assumptions about the hardware platform to decide to skip either the equivalent of the dma_sync_*_to_cpu() or the dma_sync_*_to_device() call (I assume most drivers do the latter).
Note that using dma_alloc_noncoherent() with dma_sync_*_to_cpu/device(), or using dma_map_*() with dma_cache_sync() is not well-defined and both can blow up spectacularly depending on the implementation of the dma_map_ops.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds