
Top-tier memory management

May 28, 2021

This article was contributed by Marta Rybczyńska

Modern computing systems can feature multiple types of memory that differ in their performance characteristics. The most common example is NUMA architectures, where memory attached to the local node is faster to access than memory on other nodes. Recently, persistent memory has started appearing in deployed systems as well; this type of memory is byte-addressable like DRAM, but it is available in larger sizes and is slower to access, especially for writes. This new memory type makes memory allocation even more complicated for the kernel, driving the need for a method to better manage multiple types of memory in one system.

NUMA architectures contain some memory that is close to the current CPU, and some that is further away; remote memory is typically attached to different NUMA nodes. There is a difference in access performance between local and remote memory, so the kernel has gained support for NUMA topologies over the years. To maximize NUMA performance, the kernel tries to keep pages close to the CPU where they are used, but also allows the distribution of virtual memory areas across the NUMA nodes for deterministic global performance. The kernel documentation describes ways that tasks may influence memory placement on NUMA systems.
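
As a point of reference, a task can already steer its own placement with the existing NUMA memory-policy system calls covered by that documentation. The following is a minimal sketch (not taken from the article) that uses mbind() to request a preferred node; node 0 is an arbitrary example, and it assumes libnuma's <numaif.h> is installed (build with "cc example.c -lnuma"):

    /*
     * Minimal sketch of a task steering its own memory placement with the
     * existing NUMA policy API described in the kernel documentation.
     * Requires libnuma's <numaif.h>; node 0 is an arbitrary example.
     */
    #include <numaif.h>
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t len = 64UL * 1024 * 1024;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }

        /*
         * Ask the kernel to prefer node 0 for this range; allocations may
         * still fall back to other nodes if node 0 runs out of memory.
         */
        unsigned long nodemask = 1UL << 0;
        if (mbind(buf, len, MPOL_PREFERRED, &nodemask,
                  sizeof(nodemask) * 8, 0) != 0)
            perror("mbind");

        /* Touch the memory so pages are actually allocated under the policy. */
        ((char *)buf)[0] = 1;

        return EXIT_SUCCESS;
    }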

The NUMA mechanism can be extended to handle persistent memory as well, but it was not really designed for that case. The future may bring even more types of memory, such as High Bandwidth Memory (HBM), which stacks DRAM silicon dies and provides a larger memory bus. Sooner or later, it seems that a different approach will be needed.

Recently, kernel developers have been discussing a possible solution to the problem of different memory types: adding the notion of memory tiers. The proposed code extends the NUMA code with features like migrating infrequently used pages to slow memory and migrating hot pages back to fast memory, along with a control mechanism for the feature. The changes to the memory-management subsystem to support different tiers are complex; the developers are discussing three related patch sets, each building on those that came before.

Migrating from fast to slow memory

The first piece of the puzzle takes the form of a patch set posted by Dave Hansen. It improves the memory reclaim process, which normally kicks in when memory is tight and pushes out the content of rarely used pages. Hansen said that, in a system with persistent memory, those pages could instead be migrated from DRAM to the slower memory, maintaining access to them if they are needed again. Hansen noted in the cover letter that this is a partial solution, as migrated pages will be stuck in slow memory with no path back to faster DRAM. The mechanism is optional; users will be able to enable it on demand by setting the bit with value 8 in the vm.zone_reclaim_mode sysctl knob (exposed as /proc/sys/vm/zone_reclaim_mode).
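
As a rough illustration of what enabling the mode would look like, the sketch below sets that bit in /proc/sys/vm/zone_reclaim_mode while preserving any bits that are already set. The bit value of 8 is taken from the patch set under discussion and could change before any merge; the program must run as root.

    /*
     * Illustrative sketch only: turn on the proposed "migrate instead of
     * discard" reclaim behavior by setting the bit with value 8 in
     * /proc/sys/vm/zone_reclaim_mode, preserving the bits already set.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *path = "/proc/sys/vm/zone_reclaim_mode";
        unsigned int mode = 0;
        FILE *f;

        f = fopen(path, "r");
        if (!f) {
            perror(path);
            return EXIT_FAILURE;
        }
        if (fscanf(f, "%u", &mode) != 1) {
            fprintf(stderr, "could not parse %s\n", path);
            fclose(f);
            return EXIT_FAILURE;
        }
        fclose(f);

        f = fopen(path, "w");   /* requires root */
        if (!f) {
            perror(path);
            return EXIT_FAILURE;
        }
        fprintf(f, "%u\n", mode | 8);   /* set the migrate-on-reclaim bit */
        fclose(f);

        return EXIT_SUCCESS;
    }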

The patch set received some initial positive reviews, including one from Michal Hocko, who noted that the feature could also be useful in traditional NUMA systems without memory tiers.

...and back

The second part of the puzzle is a migration of frequently used pages from slow to fast memory. This has been proposed in a patch set by Huang Ying.

In current kernels, NUMA balancing works by periodically unmapping a process's pages. When there is a page fault caused by access to an unmapped page, the migration code decides whether the affected page should be moved to the memory node where the page fault occurred. The migration decision depends on a number of factors, including frequency of access and the availability of memory on the accessing node.

The proposed patch takes advantage of those page faults to make a better estimation of which pages are hot; it replaces the current algorithm, which considers the most recently accessed pages to be hot. The new estimate uses the elapsed time between the unmapping of a page and the subsequent page fault as a measure of how hot the page is, and offers a sysctl knob to define a threshold: kernel.numa_balancing_hot_threshold_ms. All pages with a fault latency lower than the threshold will be considered hot. Correctly setting this threshold may be difficult for the system administrator, so the final patch of the series implements a method to adjust it automatically: the kernel compares the rate at which pages are being migrated against the user-configurable rate limit numa_balancing_rate_limit_mbps, then increases or decreases the threshold to bring the migration rate closer to that limit.
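
The following user-space sketch illustrates that heuristic as described above; the function names, the fixed 100ms adjustment step, and the default values are invented for illustration and do not come from the patches themselves.

    /*
     * User-space sketch of the hot-page heuristic described above; the
     * real logic lives in the kernel's NUMA-balancing code as modified by
     * the patch set.
     */
    #include <stdbool.h>
    #include <stdio.h>

    /* Roughly corresponds to kernel.numa_balancing_hot_threshold_ms. */
    static unsigned long hot_threshold_ms = 1000;
    /* Roughly corresponds to numa_balancing_rate_limit_mbps. */
    static const unsigned long rate_limit_mbps = 60;

    /* A page is "hot" if it was touched soon after being unmapped. */
    static bool page_is_hot(unsigned long unmap_time_ms,
                            unsigned long fault_time_ms)
    {
        return (fault_time_ms - unmap_time_ms) < hot_threshold_ms;
    }

    /*
     * Nudge the threshold so the observed promotion rate approaches the
     * configured limit: promoting too much means the threshold is too
     * generous, promoting too little means it is too strict.
     */
    static void adjust_threshold(unsigned long promoted_mbps)
    {
        if (promoted_mbps > rate_limit_mbps && hot_threshold_ms > 100)
            hot_threshold_ms -= 100;
        else if (promoted_mbps < rate_limit_mbps)
            hot_threshold_ms += 100;
    }

    int main(void)
    {
        /* A fault 500ms after unmapping is below the threshold: hot. */
        printf("hot? %d\n", page_is_hot(0, 500));
        /* Promoting 80MB/s against a 60MB/s limit tightens the threshold. */
        adjust_threshold(80);
        printf("new threshold: %lu ms\n", hot_threshold_ms);
        return 0;
    }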

Controlling memory tiers

Finally, Tim Chen submitted a proposal for managing the configuration of memory tiers, in particular the top tier containing the fastest memory. The proposal is based on control groups version 1 (Chen noted that support for version 2 is in the works); it monitors the amount of top-tier memory used by the system and by each control group individually. It uses soft limits, enabling kswapd to move pages to slower memory in control groups that exceed their soft limit. Since the limit is soft, a control group may be allowed to exceed the limit when top-tier memory is plentiful, but may be quickly cut back to the limit when that resource is tight.

In Chen's proposal, the soft limit for a control group is called toptier_soft_limit_in_bytes. The code also tracks the global usage of top-tier memory; if the free top-tier memory on a node falls below toptier_scale_factor/10000 of that node's overall memory, kswapd will start memory reclaim focused on control groups that exceed their soft limit.
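
Expressed as a sketch, that watermark check might look like the following; the helper name and the example numbers are hypothetical, and only the toptier_scale_factor/10000 calculation reflects the proposal.

    /*
     * Sketch of the watermark check described above; names and example
     * numbers are hypothetical.
     */
    #include <stdbool.h>
    #include <stdio.h>

    /* Should kswapd start reclaiming top-tier memory on this node? */
    static bool toptier_under_pressure(unsigned long long free_pages,
                                       unsigned long long total_pages,
                                       unsigned long long toptier_scale_factor)
    {
        /*
         * Pressure when free memory drops below
         * total * toptier_scale_factor / 10000.
         */
        return free_pages * 10000 < total_pages * toptier_scale_factor;
    }

    int main(void)
    {
        /* A 4GB node counted in 4KB pages, with a 20/10000 (0.2%) factor. */
        unsigned long long total = 1048576, factor = 20;

        printf("%d\n", toptier_under_pressure(1500, total, factor)); /* 1: reclaim */
        printf("%d\n", toptier_under_pressure(8192, total, factor)); /* 0: ok */
        return 0;
    }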

Hocko disliked the idea of using soft limits:

In the past we have learned that the existing implementation is unfixable and changing the existing semantic impossible due to backward compatibility. So I would really prefer the soft limit just find its rest rather than see new potential use cases.

The likely reasons for Hocko's dislike of soft limits stem from previous attempts to change the interface (LWN covered the discussions in 2013 and 2014). The default soft limit is "unlimited", and this cannot be changed without risking backward-compatibility problems.

Further into the discussion, Shakeel Butt asked about a use case where high-priority tasks would obtain better access to the top-tier memory, which would be more strictly limited for low-priority tasks. Yang Shi pointed to earlier work that divided fast and slow memory for different tasks, and concluded that the solution was hard to use in practice, as it required good knowledge of the hot and cold memory in the specific system. The developers discussed more fine-grained control of the type of memory used, but did not reach a conclusion.

Before the discussion stopped, Hocko offered some ideas on how the interface could work: different types of memory would be configured into separate NUMA nodes, tasks could indicate their preferences for which nodes should host their memory, and some nodes might be reclaimed ahead of others when memory pressure hits. He further noted that this mechanism should be generic, not based on the location of persistent memory in the CPU nodes:

I haven't been proposing per NUMA limits. [...] All I am saying is that we do not want to have an interface that is tightly bound to any specific HW setup (fast RAM as a top tier and PMEM as a fallback) that you have proposed here. We want to have a generic NUMA based abstraction.

Next steps

None of the patch sets had been merged as of this writing, and that does not look likely to happen soon. Changes to memory management take time, and it seems the developers will need to agree on how to control the use of fast and slow memory across different workloads before a solution appears. The top-tier management patches are explicitly meant as discussion fodder and are not intended for merging in their current form in any case. We will likely see more discussion on the subject in the coming months.



Top-tier memory management

Posted May 28, 2021 16:13 UTC (Fri) by NHO (guest, #104320) [Link] (8 responses)

What is swap but a very slow memory tier?

Top-tier memory management

Posted May 28, 2021 16:26 UTC (Fri) by kay (subscriber, #1362) [Link] (1 responses)

Indeed and to extend your remark: this sounds very similar to the LRU page reclaim idea described in https://lwn.net/Articles/856931/

Top-tier memory management

Posted Jun 4, 2021 4:53 UTC (Fri) by krakensden (subscriber, #72039) [Link]

I'm pretty excited about that patch set, the abstract makes it sound like a huge win.

Swap

Posted May 28, 2021 16:28 UTC (Fri) by corbet (editor, #1) [Link] (5 responses)

Swap is not directly addressable, though, so it doesn't really qualify as a "tier" in the sense being discussed here.

Swap

Posted May 28, 2021 16:35 UTC (Fri) by kay (subscriber, #1362) [Link]

technically yes, it's not direct-access memory

I remember computers using magnetic tape as memory ;-), so you can see it as veeery slow to access memory ;-)

Swap

Posted May 28, 2021 17:35 UTC (Fri) by calumapplepie (guest, #143655) [Link] (3 responses)

How hard would it be to rework swap into just another tier, from the perspective of memory load distribution?

Obviously, since swap can't be mapped for direct access by processes, it'd never be quite like the others. But duplicating the code for "move between fast and slow storage" seems like a worse call. Obviously swap needs a fair amount of unique logic: but so would many other different tiers. Further, generalizing a single system to allow for arbitrary tiers that can be as different as DRAM and spinning rust swapfiles would mean that the next weird memory system that is dreamed up by the hardware designers can be added without much effort.

Swap

Posted May 28, 2021 17:51 UTC (Fri) by corbet (editor, #1) [Link] (2 responses)

In a sense, what you're asking for already exists; it's called "virtual memory". In that sense it is directly addressable and the kernel will automatically move data in and out.

Swap

Posted May 28, 2021 18:41 UTC (Fri) by faramir (subscriber, #2327) [Link] (1 responses)

While virtual memory is transparent to the software which is using it, it is not transparent to the system. It requires the system to find some real memory that can be (re)used. This can sometimes be a problem. With NUMA or persistent memory, that isn't a problem. Memory access may still be slower, but at least it will still work without any immediate work by the system.

Swap

Posted May 28, 2021 22:31 UTC (Fri) by MattBBaker (guest, #28651) [Link]

The system takes care of that feature too. It's fair to say that if a page was being evicted from DRAM, then on a system with just DRAM it would have been swapped instead. So why not assume that everything in slow space is going to be swapped too. Especially if the slow bank is an HBM space then you can migrate DRAM pages into the HBM banks and once you get HBM_LINE_SIZE pages then page them all out in bulk.

Top-tier memory management

Posted May 28, 2021 19:06 UTC (Fri) by droundy (subscriber, #4559) [Link] (1 responses)

I wonder why something closer to the LRU used for the page cache wouldn't be used for determining which tier a page belongs in? It kind of seems like the same problem.

Top-tier memory management

Posted May 30, 2021 14:54 UTC (Sun) by Paf (subscriber, #91811) [Link]

I’m not sure of the precise cost of the mechanisms being proposed here, but the page cache LRU is kind of expensive (at least to me as an FS developer optimizing throughput). I’m not sure that kind of management is appropriate for all of memory. (But I might be off in terms of scale.)

Top-tier memory management

Posted May 30, 2021 7:11 UTC (Sun) by pabs (subscriber, #43278) [Link] (4 responses)

I wonder if this sort of thing could allow driving memory sticks of different speeds at different rates and preferring the faster ones.

Top-tier memory management

Posted May 30, 2021 8:06 UTC (Sun) by flussence (guest, #85566) [Link] (2 responses)

That sounds like a nice thing to have, but it seems more likely we'll have practical nuclear fusion before mainboards that support independent memory channel clocks.

Top-tier memory management

Posted May 30, 2021 8:47 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

That's actually not at all a problem. CPUs talk with DRAM through a memory controller, and it can already change the DRAM frequency.

Top-tier memory management

Posted May 30, 2021 12:29 UTC (Sun) by flussence (guest, #85566) [Link]

It's *possible*, but at present it also sounds like an incredibly niche feature for a few extra MHz that'd get shot down by manufacturers just telling users to not use mismatched RAM sticks.

This idea isn't too far from how old CPUs had independent functional units for x87/SSE/3DNow. I've heard urban legends about clever asm programmers wringing a few percent more speed out of those, but it wasn't a party trick worth spending silicon on in the long run.

Top-tier memory management

Posted May 30, 2021 10:30 UTC (Sun) by willy (subscriber, #9762) [Link]

While that sounds appealing, you're really just giving up a lot of bandwidth. If you have N sticks of RAM, you want to let the CPU put cache line L on stick L % N.

What you're proposing would put pages A-B on stick 0, B-C on stick 1, etc. It's like choosing concatenation instead of RAID 0.

