Some kernel memory-allocation improvements
The allocator in question here is the low-level page allocator (or "buddy allocator"). The smallest unit of memory it deals with is a full page (4096 bytes on most systems). The slab allocators (including kmalloc()) are built on top of the page allocator; they have their own complexities that we'll not be concerned with here.
The page allocator is the ultimate source of memory in the system; if that allocator can't fulfill a request, the memory simply cannot be had. So considerable effort is put into ensuring that memory is available at all times, especially for high-priority callers that cannot wait for some memory to be reclaimed from elsewhere. High-order allocations (those requiring more than one page of physically contiguous memory) complicate the problem; memory tends to fragment over time, making contiguous chunks hard to find. Balancing memory use across NUMA systems adds yet another twist. All of these constraints (and more) must be managed without slowing down the system in the allocator itself. Solving this problem involves a significant amount of complex code, scary heuristics, and more; it is not surprising that memory-management changes can be hard to merge into the mainline.
The zone cache
The page allocator divides physical memory into "zones," each of which corresponds to memory with specific characteristics. ZONE_DMA contains memory at the bottom of the address range for use by severely challenged devices, for example, while ZONE_NORMAL may contain most memory on the system. 32-bit systems have a ZONE_HIGHMEM for memory that is not directly mapped into the kernel's address space. Each NUMA node has its own set of zones as well. Depending on the characteristics of any given allocation request, the page allocator will search the available zones in a specific priority order. For the curious, /proc/zoneinfo gives a lot of information about the zones in use on any given system.
Checking a zone to see whether it has the memory to satisfy a given request can be more work than one might expect. Except for the highest-priority requests, a given zone shouldn't be drawn below specific low "watermarks," and comparing a zone's available memory against a watermark can require a significant calculation. If the "zone reclaim" feature is enabled, checking a nearly-empty zone can cause the memory-management subsystem to try to reclaim memory in that zone. For these reasons and more, the "zonelist cache" was added to the 2.6.20 kernel in 2006. This cache simply tries to remember which zones have been found to be full in the recent past, allowing allocation requests to avoid checking full zones.
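The shape of that watermark calculation can be sketched in Python. This is a simplified model of the kernel's `__zone_watermark_ok()` loop as it existed before the patch set; the data layout and numbers are illustrative, not the kernel's actual values:

```python
def zone_watermark_ok(free_pages_by_order, order, watermark):
    """Rough model of the kernel's __zone_watermark_ok().

    free_pages_by_order[o] holds the number of free blocks of
    2**o pages in the zone; 'watermark' is in single pages.
    """
    # Total free pages in the zone.
    free = sum(n << o for o, n in enumerate(free_pages_by_order))
    if free <= watermark:
        return False
    # For a high-order request, pages that are only free at lower
    # orders cannot help; discount them, halving the watermark at
    # each order, and recheck.  This is the per-order loop that
    # makes the check relatively expensive.
    for o in range(order):
        free -= free_pages_by_order[o] << o
        watermark >>= 1
        if free <= watermark:
            return False
    return True
```

A zone with plenty of single pages can thus pass the order-0 check while failing an order-2 check, which is exactly the situation that led the zonelist cache to mark such zones "full".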
The case for the zonelist cache has been weakening for some time, and Mel's patch set weakens it further by making the watermark checks cheaper. Zone reclaim has been fingered as a performance problem for a number of workloads (including PostgreSQL server workloads) and is now disabled by default. But the biggest problem seems to be that, if a zone is unable to satisfy a high-order allocation, it will be marked as "full" even if single pages abound there. Subsequent single-page allocations will then avoid the zone, even though that zone is entirely capable of satisfying those allocations. That can cause allocations to be needlessly performed on remote NUMA nodes, worsening performance.
Mel notes that this problem "could be addressed with additional complexity but as the benefit of zlc [the zonelist cache] is questionable, it is better to remove it". This part of the patch series removes nearly 300 lines of complex memory-management code and improves some benchmarks in the process. The fact that zones are always checked has some other effects; the most notable is apparently that more direct reclaim work (attempts to reclaim memory by the process performing an allocation) results from checking zones that would have previously been skipped.
The atomic high-order reserve
Within a zone, memory is grouped into "page blocks," each of which can be marked with a "migration type" describing how the block should be allocated. In current kernels, one of those types is MIGRATE_RESERVE; it marks memory that simply will not be allocated at all unless the alternative is to fail an allocation request entirely. Since a physically contiguous range of blocks is so marked, the effect of this policy is to maintain a minimum number of high-order pages in the system. That, in turn, means that high-order requests (within reason) can be satisfied even when memory is generally fragmented.
Mel added the migration reserve during the 2.6.24 development cycle in 2007. The reserve improved the situation at the time but, in the end, it relied on an accidental property of the minimum watermark implemented in the page allocator many years before. The reserve does not actively keep high-order pages around; it simply steers requests away from a specific range of memory unless there is no alternative, in the hope that said range will remain contiguous. The reserve also predates the current memory-management code, which does a far better job of avoiding fragmentation and performing compaction when fragmentation does occur. Mel's current patch set implements the conclusion that this reserve has done its time and removes it.
There is still value in reserving blocks of memory for high-order allocations, though; fragmentation is still a concern in current kernels. So another part of Mel's patch set creates a new MIGRATE_HIGHATOMIC reserve that serves this purpose, but in a different way. Initially, this reserve contains no page blocks at all. If a high-order allocation cannot be satisfied without breaking up a previously whole page block, that block will be marked as being part of the high-order atomic reserve; thereafter, only higher-order allocations (and only high-priority ones at that) can be satisfied from that page block.
The kernel will limit the size of this reserve to about 1% of memory, so it cannot grow overly large. Page blocks remain in this reserve until memory pressure reaches a point where a single-page allocation is about to fail; at this point, the kernel will take a page block out of the reserve to be able to satisfy that request. The end result is a high-order page reserve that is more flexible, growing or shrinking in response to the demands of the current workload. Since the demand for high-order pages can vary significantly from one system (and one workload) to the next, it makes sense to tune the reserve to what is actually running; the result should be more flexible allocations and higher-reliability access to high-order pages.
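The growth side of that policy can be sketched as follows. This is a Python model of the behavior the article describes, loosely patterned on the patch set's `reserve_highatomic_pageblock()`; the dictionary fields and page-block size are illustrative assumptions:

```python
def reserve_highatomic_pageblock(zone):
    """Move one page block into the high-order atomic reserve,
    unless the reserve has reached its cap (a sketch, not the
    kernel code)."""
    # Limit the reserve to roughly 1% of the zone's managed memory.
    max_reserved = zone['managed_pages'] // 100
    if zone['nr_reserved_highatomic'] >= max_reserved:
        return False  # reserve is already as large as allowed
    # Mark one whole page block as MIGRATE_HIGHATOMIC.
    zone['nr_reserved_highatomic'] += zone['pageblock_pages']
    return True
```

The shrink side (not shown) is triggered from the allocation slow path: when an ordinary single-page allocation is about to fail, one reserved page block is converted back for general use.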
Lest kernel developers think that they can be more relaxed about high-order allocations in the future, though, Mel notes that, as a result of the limited size of the reserve, "callers that speculatively abuse atomic allocations for long-lived high-order allocations to access the reserve will quickly fail". He gives no indication of just who he thinks those callers are, though. There is one other potential pitfall with this reserve that bears keeping in mind: since the first page block doesn't enter the reserve until a high-order allocation is made, the reserve may remain empty until the system has been running for a long time. By that point, memory may be so fragmented that the reserve can no longer be populated. Should such situations arise in real-world use, they could be addressed by proactively putting a minimum amount of memory into the reserve at boot time.
The high-order reserve also makes it possible to remove the separate watermarks for high-order pages. These watermarks try to ensure that each zone has a minimal number of pages available at each order; the allocator will fail allocations that drive the level below the relevant watermark for all but the highest-priority allocations. These watermarks are relatively expensive to implement and can cause normal-priority allocations to fail even though suitable pages are available. With the patch set applied, the code continues to enforce the single-page watermark, but, for higher-order allocations, it merely checks that a suitable page is available, counting on the high-order reserve to ensure that pages will be kept available for high-priority allocations.
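In Python terms, the new higher-order check reduces to little more than a free-list scan; this is a sketch of the behavior described above, not the kernel's code:

```python
def high_order_page_available(free_pages_by_order, order):
    """After the patch set, a higher-order request (order > 0) only
    needs a free block of at least the requested order to exist;
    the per-order watermark arithmetic is gone."""
    return any(n > 0 for n in free_pages_by_order[order:])
```

The single-page watermark check still runs first, so this cheaper test only decides whether a suitably sized block can be handed out once the zone has passed the order-0 gate.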
Flag day
Memory-allocation requests in the kernel are always qualified by a set of "GFP flags" ("GFP" initially came from "get free page") describing what can and cannot be done to satisfy the request. The most commonly used flags are GFP_ATOMIC and GFP_KERNEL, though they are actually built up from lower-level flags. GFP_ATOMIC is the highest-priority request; it can dip into reserves and is not allowed to sleep. Underneath the hood, GFP_ATOMIC is defined as the single bit __GFP_HIGH, marking a high-priority request. GFP_KERNEL cannot use the reserves but can sleep; it is the combination of __GFP_WAIT (can sleep), __GFP_IO (can start low-level I/O), and __GFP_FS (can invoke filesystem operations). The full set of flags is huge; they can be found in include/linux/gfp.h.
Interestingly, the highest-priority requests are not marked with __GFP_HIGH; instead, they are marked by the absence of __GFP_WAIT. Requests with __GFP_HIGH can push memory below the watermarks, but only non-__GFP_WAIT requests can access the atomic reserves. This mechanism doesn't work as well as it could in current kernels, where many subsystems may make allocations they don't want to wait for (often because there is a fallback mechanism available if the allocation fails), but those subsystems do not need access to the deepest reserves. But, by leaving off __GFP_WAIT, such code will access those reserves anyway.
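The pre-patch rule can be made concrete with a small model. The bit values below are made up for illustration (the real ones are in include/linux/gfp.h), but the flag combinations and the reserve-access test mirror what the article describes:

```python
# Illustrative bit values; see include/linux/gfp.h for the real ones.
__GFP_HIGH = 0x01   # high priority: may allocate below the watermarks
__GFP_WAIT = 0x02   # may sleep
__GFP_IO   = 0x04   # may start low-level I/O
__GFP_FS   = 0x08   # may invoke filesystem operations

GFP_ATOMIC = __GFP_HIGH
GFP_KERNEL = __GFP_WAIT | __GFP_IO | __GFP_FS

def may_use_atomic_reserves(gfp):
    # Pre-patch rule: the deepest reserves go to any request that
    # cannot wait, whether or not it asked for high priority.
    return not (gfp & __GFP_WAIT)
```

Under this rule, any caller that merely omits __GFP_WAIT, perhaps only because it has a fallback and would rather fail fast, is treated the same as a genuinely atomic caller.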
This problem, along with a general desire to have more explicit control over how memory-allocation requests are satisfied, has led Mel to rework the set of GFP flags somewhat. To do so, he has added some new flags to the (already long) list:
- __GFP_ATOMIC identifies requests that truly come from atomic context, where no sort of blocking or delay is acceptable and there is no fallback in case of failure. If an allocation is being made with __GFP_ATOMIC, it may mean, for example, that spinlocks are currently held. These requests are the ones that get access to the atomic reserves.
- __GFP_DIRECT_RECLAIM indicates that the caller is willing to go into direct reclaim. Doing so implies that the request can block if need be. This flag does not imply __GFP_FS or __GFP_IO; they must be specified separately if they are applicable (though in such cases it probably makes sense to just use GFP_KERNEL).
- __GFP_KSWAPD_RECLAIM says that the kswapd kernel thread can be woken up to perform reclaim. Poking kswapd does not imply blocking, but it may start activity in the system that can affect performance in general. As an example, consider a driver that would very much like to allocate a high-order chunk of memory, but it can get by with a bunch of single pages if that chunk is not available. The high-order allocation may be best tried without __GFP_KSWAPD_RECLAIM, since things will still work fine if that allocation fails and there is no real need to start aggressively reclaiming memory.
With this set of flags, code can express the difference between being absolutely unable to sleep and simply not wanting to sleep. The "must succeed" nature of a request has been separated from the "don't sleep" aspect, eliminating situations where allocations dip unnecessarily into the atomic reserves. For users of the basic GFP_ATOMIC and GFP_KERNEL flag sets, little will change, but Mel's patch set makes changes to several dozen call sites that deal with GFP flags at a lower level.
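Extending the earlier model to the new flags shows the separation. Again, the bit values and the exact composition of the convenience flags are illustrative assumptions, not copied from the patch set:

```python
# Illustrative bit values for the new flags.
__GFP_ATOMIC         = 0x10  # truly atomic context, no fallback
__GFP_DIRECT_RECLAIM = 0x20  # caller may enter direct reclaim (may block)
__GFP_KSWAPD_RECLAIM = 0x40  # kswapd may be woken to reclaim in background

# Plausible reworked composites (GFP_KERNEL would also carry the
# __GFP_IO and __GFP_FS bits in the real kernel).
GFP_ATOMIC = __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM
GFP_KERNEL = __GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM

def may_use_atomic_reserves(gfp):
    # New rule: only requests explicitly marked atomic reach the
    # deepest reserves.
    return bool(gfp & __GFP_ATOMIC)

def may_sleep(gfp):
    return bool(gfp & __GFP_DIRECT_RECLAIM)
```

A driver with a fallback path can now drop __GFP_DIRECT_RECLAIM (and perhaps __GFP_KSWAPD_RECLAIM) without thereby claiming access to the atomic reserves, which was impossible when "cannot wait" and "needs the reserves" were the same bit.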
As a whole, this patch set touches 101 files and removes a net 240 lines of code. With luck, a number of core memory-management algorithms have been simplified while improving performance and making the system more reliable. Mel's strongly benchmark-focused approach will help to build confidence in this work, but it's still a set of significant changes to a complex kernel subsystem, so it's not surprising that the patches have had to go through a number of revisions and extensive review. It seems likely that this process is coming to an end, and that this work will find its way into the mainline in the next development cycle or two.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/GFP flags |

Some kernel memory-allocation improvements
Posted Oct 1, 2015 21:20 UTC (Thu) by Shabbyx (guest, #104730) [Link]

- With swap enabled: All goes well, but when I shut the system down, there is a huge amount of wait (minutes) before something related to swap happens and the computer shuts down. I have come to use Alt+SysRq+s-then-u-then-o to turn off since then whenever I use VirtualBox.
- With swap disabled: When (according to System Monitor) I have used about 60% of my memory (the rest used as cache), apparently OOM killer is invoked, which makes the system go to a crawl for a few seconds and complete unresponsiveness afterwards. After Alt+SysRq+s-then-u, it comes back alive and I can see the OOM logs.

Now that I have vented, I really hope the problem goes away with this patch-set!

Some kernel memory-allocation improvements
Posted Oct 4, 2015 0:04 UTC (Sun) by toyotabedzrock (guest, #88005) [Link] (1 responses)

It's a shame Intel doesn't have a memory translation system like a ssd drive so the OS can pretend it has contiguous blocks of memory.

Some kernel memory-allocation improvements
Posted Oct 4, 2015 4:29 UTC (Sun) by Fowl (subscriber, #65667) [Link]

Perhaps something with multiple levels of page tables and a fairly complicated set of protection bits... ;)