Weighted interleaving for memory tiering
The days when computers simply had "memory" have passed. Systems can now contain numerous types of memory with wildly different performance profiles. Persistent memory can be available in large quantities, but it is relatively slow. CXL memory may be faster, though still not as fast as normal DRAM, and it can come and go during the life of the system. High-bandwidth memory can be faster than normal DRAM. Devices, too, can make their own memory available on the system's memory bus. Each memory provider is represented in the system as one or more CPU-less NUMA nodes.
Tiering is the process of grouping these nodes into levels (tiers) with similar performance, then working to place memory optimally in each tier; it is a work in progress. The kernel only gained a formal concept of memory tiers in the 6.1 release, when this series from Aneesh Kumar K.V. was merged. The kernel now assigns each node an "abstract distance" reflecting its relative performance; by default, all nodes have an abstract distance of 512. Memory with a lower abstract distance than that is expected to be faster than ordinary DRAM, while a higher abstract distance indicates slower memory. Notably, the distance is set by the driver that makes the memory available to the system; it is not subject to direct administrator control.
As memory is initialized, the kernel will organize it into distinct tiers, with the tier number for any given node obtained by dividing its abstract distance by 128. Normal memory, with its default abstract distance of 512, ends up in tier 4 as a result. There is, thus, room for four tiers of memory that is faster than DRAM, and a vast number of slower tiers.
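As a rough illustration, that distance-to-tier mapping can be modeled in a few lines of C; the constants come from the numbers above (512 for DRAM, 128 distance units per tier), and the names are invented for this sketch rather than being the kernel's own macros:

/*
 * Illustrative sketch of the distance-to-tier mapping described above.
 * The constant names are made up for this example.
 */
#define EXAMPLE_ADISTANCE_DRAM	512
#define EXAMPLE_TIER_CHUNK	128

static inline int tier_for_distance(int abstract_distance)
{
	return abstract_distance / EXAMPLE_TIER_CHUNK;
}

/*
 * tier_for_distance(512)  == 4	 normal DRAM
 * tier_for_distance(256)  == 2	 faster-than-DRAM memory (e.g. HBM)
 * tier_for_distance(1024) == 8	 slower memory (e.g. a CXL tier)
 */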
The kernel can use tiering to make decisions about page placement; quite a bit of work has gone into trying to figure out how to move pages between the tiers. When pages are unused, they can be demoted to slower tiers rather than being reclaimed entirely; that is a relatively easy decision that fits into existing memory-management practices. Deciding when pages should be promoted back to faster memory, though, is a bit trickier. Importantly, this work has been focused on moving memory to the correct tier after it has been in use for a while and its relative busyness can be assessed.
Gregory Price's patch set builds on the tiering mechanism, but is aimed at a different problem: what is the optimal placement for pages when they are first allocated? In particular, it addresses the practice of NUMA interleaving, wherein allocations are spread out across a set of NUMA nodes with the goal of providing consistent performance to any of the CPUs on the system. NUMA interleaving has worked for years, but it is relatively simple; it spreads pages equally across the set of nodes provided in the memory policy set by the application. Interleaving has no concept of tiers and no ability to bias allocations in any specific direction.
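For reference, this is roughly how an application asks for classic, unweighted interleaving today; the sketch below uses the set_mempolicy() system call (via the wrapper declared in libnuma's numaif.h) and interleaves across two placeholder nodes, 0 and 1:

/*
 * Classic, unweighted NUMA interleaving from user space.  The node
 * numbers are placeholders; build with -lnuma for the set_mempolicy()
 * wrapper declared in <numaif.h>.
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	/* Bitmask naming the nodes to interleave across: nodes 0 and 1 */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
			  8 * sizeof(nodemask)) != 0) {
		perror("set_mempolicy");
		return EXIT_FAILURE;
	}

	/* Pages backing this allocation will be spread equally across
	 * the two nodes as they are faulted in. */
	size_t size = 64UL << 20;	/* 64MB */
	char *buf = malloc(size);
	if (buf)
		memset(buf, 0, size);
	return 0;
}

The kernel simply distributes the faulted-in pages round-robin across the nodes in the mask; nothing in this interface knows anything about tiers or their relative performance.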
The patch series adds the concept of a "weight" to each NUMA tier. Weights are relative to each CPU, so the weight of tier 2 as seen from CPU 0 may differ from the weight of that tier as seen from CPU 10. These weights determine how strongly allocation decisions should be biased toward each tier; a higher weight for a given tier results in more pages being allocated on nodes contained within that tier. Unlike the abstract distance, the tier weights are controlled by the administrator; a command like:
echo 0:20 > /sys/devices/virtual/memory_tiering/memory_tier2/interleave_weight
would set the weight of tier 2, as seen by CPU 0, to 20. By default, all tiers have a weight of one.
These weights are used by the NUMA interleaving code when pages are allocated. If a system has two tiers, tier 4 (where normal DRAM ends up) and tier 8 (with slower memory), and the weights of those tiers for the current CPU are set to 20 and 10 respectively, then interleaved allocations will place twice as many pages on tier 4 as on tier 8. It's worth noting that, while the application specifies its NUMA allocation policy (which nodes it wants to allocate memory on), the weights are solely under the control of the administrator.
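To make the arithmetic concrete, here is a small user-space model of that 20:10 split; it is not the kernel's implementation, just an illustration of how weighted round-robin placement divides allocations between the two tiers in the example:

/*
 * Illustrative model of weighted interleaving across two tiers; the
 * tier numbers and weights match the 20:10 example above.  This is
 * not the kernel's code.
 */
#include <stdio.h>

struct tier { int id; unsigned int weight; unsigned long pages; };

int main(void)
{
	struct tier tiers[] = { { 4, 20, 0 }, { 8, 10, 0 } };
	unsigned int ntiers = 2, cur = 0, left = tiers[0].weight;

	/* Hand out 3000 page-sized allocations in weighted round-robin
	 * order: 20 pages to tier 4, then 10 to tier 8, and repeat. */
	for (unsigned long i = 0; i < 3000; i++) {
		tiers[cur].pages++;
		if (--left == 0) {
			cur = (cur + 1) % ntiers;
			left = tiers[cur].weight;
		}
	}

	for (unsigned int t = 0; t < ntiers; t++)
		printf("tier %d: %lu pages (weight %u)\n",
		       tiers[t].id, tiers[t].pages, tiers[t].weight);
	/* Prints 2000 pages on tier 4 and 1000 on tier 8: a 2:1 split. */
	return 0;
}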
While promotion and demotion between tiers is aimed at putting the most-used pages on the fastest memory, weighted allocation has a different objective — necessarily, since the kernel cannot know at allocation time which pages will be the most heavily used. The intent, instead, is to try to obtain an optimal split of memory usage across memory controllers, maximizing the bandwidth utilization of each. Moving some pages from faster memory to a slower tier can result in an overall performance increase, even if the moved pages are actively used, if the end result is to keep both tiers busily moving data.
According to Price, this approach works:
Observed a significant increase (165%) in bandwidth utilization with the newly proposed multi-tier interleaving compared to the traditional 1:1 interleaving approach between DDR and CXL tier nodes, where 85% of the bandwidth is allocated to DDR tier and 15% to CXL tier with MLC -w2 option.
(MLC refers to Intel's proprietary Memory Latency Checker tool.)
This is the second version of this patch set (the first was posted by Ravi Jonnalagadda in late September). There have been a number of comments about the details of the implementation, but there does not appear to be any opposition to the concept in general. Since all tiers start with an identical weight, there will be no changes to NUMA interleaving unless the administrator has explicitly changed the system's configuration. There would thus appear to be little reason for this work not to advance once the details have been taken care of.