
The future of memory tiering

By Jonathan Corbet
May 12, 2023

LSFMM+BPF
Memory tiering is the practice of dividing physical memory into separate levels according to its performance characteristics, then allocating that memory in a (hopefully) optimal manner for the workload the system is running. The subject came up repeatedly during the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. One session, led by David Rientjes, focused directly on tiering and how it might be better supported by the Linux kernel.

Tiering was often mentioned in the context of CXL memory but, Rientjes began, it is not just about CXL. Instead, tiering presents a number of "generic use cases for hardware innovation". There are a lot of ways of thinking about tiering and what is covered by the term. The management of NUMA systems, where some memory is closer to a given CPU (and thus faster to access) than the rest, is a classic example. Swapping could be seen as a form of tiering, as could non-volatile memory or high-bandwidth memory. And, of course, so could mechanisms like CXL memory expansion and memory pooling. It is, he said, leading to "a golden age of optimized page placement".

[David Rientjes]

The discussion briefly digressed into whether swapping really qualifies as tiering. In the end, the consensus seemed to be that, to be a memory tier, a location must be byte-addressable by the CPU. So swapping is off the list.

Michal Hocko said that there are two dimensions to the tiering problem. One is the user-space APIs to be provided by the kernel; somehow user space has to be given the control it needs over memory placement. The relevant question here is whether the existing NUMA API is sufficient, or whether something else is needed. The other aspect of the problem, though, is the kernel implementation, which should handle tiering well enough that user space does not actually need to care about it most of the time.

Rientjes responded that the NUMA API has been a part of the kernel for around 20 years. Whether it is suitable for the tiering use case depends on the answers to a number of questions, including whether it can properly describe and control all of the types of tiering that are coming. Slower expansion memory is the case that is cited most often, but there are others, including memory stored on coprocessors, network interfaces, and GPUs. He wondered what kinds of incremental changes to the NUMA API would be needed; the one-dimensional concept of NUMA distance may not be enough to properly describe the differences between tiers. The group should also, he said, consider what the minimal level of kernel support should be, and which optimizations should be left to user space.
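The "one-dimensional concept of NUMA distance" Rientjes mentioned is the scalar access-cost figure (in the style of the ACPI SLIT table, where local access is conventionally 10) that the kernel uses to rank candidate nodes. A toy sketch of that ranking follows; the matrix values are invented and this is an illustration of the concept, not the kernel's actual implementation:

```python
# Hypothetical SLIT-style distance matrix: distances[i][j] is the relative
# cost of a CPU on node i accessing memory on node j (local access = 10,
# following the ACPI convention). All values here are invented.
distances = [
    [10, 20, 40],   # node 0: local DRAM, remote DRAM, CXL expander
    [20, 10, 40],   # node 1
    [40, 40, 10],   # node 2 (e.g. a CPU-less expander would have no row)
]

def preferred_nodes(cpu_node):
    """Order memory nodes by their one-dimensional distance from cpu_node."""
    return sorted(range(len(distances)), key=lambda n: distances[cpu_node][n])

print(preferred_nodes(0))  # [0, 1, 2]
```

The limitation Rientjes alluded to is visible here: a single scalar per node pair cannot express that one tier is better on latency while another is better on bandwidth, a point the commenters below take up at length.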

One problem, Dan Williams said, is that vendors (and their devices) often lie to the kernel about their capabilities. Getting to the truth of the matter is not a problem that can just be punted to user space. There need to be ways for user space to indicate its intent, which can then be translated by the kernel into actual placement decisions.

Matthew Wilcox said that systems will have multiple levels of tiering; the kernel will have to decide how to move pages up and down through those tiers. Specifically, should movement be done one step at a time, or might it be better to just relegate memory directly to the slowest tier (or to swap) if it is known not to be needed for a while? And if multi-tier movement is the right thing to do, how does the kernel figure out when it is warranted? After a bit of inconclusive discussion, Hocko repeated that, while it would be nice to push decisions like that to user space, the kernel has the responsibility to do the right thing as much as possible.
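Wilcox's question (step-by-step demotion versus jumping straight to the slowest tier) can be made concrete with a small policy sketch. The tier names, thresholds, and the idea of keying on idle time are all invented for illustration; the kernel's reclaim and demotion paths are far more involved:

```python
# Toy demotion policy: tiers indexed fastest (0) to slowest. The idle-time
# thresholds are invented and exist only to illustrate the trade-off.
TIERS = ["DRAM", "CXL", "slow-expander"]

def demotion_target(current_tier, idle_seconds):
    """Pick a destination tier for a cold page.

    Recently-idle pages move one step down, keeping them cheap to promote
    back; pages idle long enough that reuse looks unlikely are relegated
    directly to the slowest tier, as Wilcox suggested might be better.
    """
    if idle_seconds < 10:
        return current_tier                            # still warm
    if idle_seconds < 300:
        return min(current_tier + 1, len(TIERS) - 1)   # one step down
    return len(TIERS) - 1                              # straight to the bottom
```

The hard part, as the discussion noted, is not writing such a policy but knowing which variant of it is warranted for a given workload without help from user space.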

Rientjes had a number of other questions to discuss, but the time allotted to the session was running out. The biggest problem for memory tiering still appears to be page promotion; it is not particularly hard to tell when pages are not in use and should be moved to slower memory, but it is rather more difficult to determine when a page has become hot and should be migrated back to faster memory. There are a number of options being explored by CPU vendors to help in this area; the kernel is going to have to find a generic way to support these forthcoming architecture-specific features.
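The shape of the promotion problem can be sketched abstractly: some mechanism (NUMA hinting faults today, vendor access counters in the future) feeds per-page access events into a detector that decides when a page has crossed a hotness threshold. Everything below (the class, the threshold, the windowing scheme) is an invented illustration, not a real kernel mechanism:

```python
from collections import defaultdict

# Toy hot-page detector: a page is flagged for promotion once it accrues
# enough observed accesses within one sampling window. The threshold and
# the windowing scheme are invented for illustration.
PROMOTE_THRESHOLD = 4

class HotnessTracker:
    def __init__(self):
        self.counts = defaultdict(int)

    def record_access(self, page):
        """Record one observed access; return True if the page is now hot."""
        self.counts[page] += 1
        return self.counts[page] >= PROMOTE_THRESHOLD

    def end_window(self):
        """Age out history so stale accesses don't trigger promotion later."""
        self.counts.clear()
```

The generic-support question from the session maps onto `record_access` here: each architecture-specific feature is a different (and differently-priced) source of those access events.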

A few other questions had to be skipped over. One of these was what the interface for the configuration of memory devices as tiered nodes should look like. User space will want to influence tiering policies, but that interface has yet to be designed as well. Probably some sort of integration with control groups will be necessary. The list of questions went on from there, but they will have to be discussed some other time.




The future of memory tiering

Posted May 15, 2023 10:09 UTC (Mon) by paulj (subscriber, #341)

One dimensional distance is actually a good metric. At least, it is the required /end/ metric. Ultimately, one needs to map different tiers with different properties into some kind of well-defined order of preference. This implies the order of preference does /not/ depend on the order in which tiers are compared - i.e., it should be a true order, where the order is transitive.

Ultimately, this means the collection of properties of any given memory system should be mappable onto a position in an order - independently of knowledge of the properties of other memory systems. That order should map onto the set of integers.

If this seems a bit obvious, I'm saying it because computer engineers have managed to get this wrong in other important and critical systems, and we're still paying for it today. (I.e., BGP - at the centre of how the Internet works - and its routing system, which is non-deterministic in certain ways, which has caused problems).

So please DO consider that memory tier properties SHOULD ultimately be mappable onto a one-dimensional "distance" metric. That's a _good_ thing, in terms of ensuring your system has a well-defined order of preference. :)
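paulj's requirement - rank each tier from its own properties alone, so that the resulting preference order is transitive by construction - corresponds to sorting by a scalar key function rather than by pairwise comparisons. A minimal sketch, with invented tier names, figures, and weights:

```python
# Each tier is ranked by a scalar computed from its own properties alone,
# never by comparison against another tier; sorting on such a key is a
# true (transitive) order by construction. All figures are invented.
tiers = {
    "local-DRAM": {"latency_ns": 80,  "bandwidth_gbps": 200},
    "cxl-dram":   {"latency_ns": 250, "bandwidth_gbps": 90},
    "cxl-pooled": {"latency_ns": 400, "bandwidth_gbps": 60},
}

def rank(props):
    # Lower is better: latency is penalized, bandwidth rewarded.
    # The weight of 2 is an arbitrary illustrative choice.
    return props["latency_ns"] - 2 * props["bandwidth_gbps"]

order = sorted(tiers, key=lambda name: rank(tiers[name]))
```

Because `rank()` never looks at a second tier, the order it induces cannot depend on the sequence in which tiers are compared - exactly the property paulj is asking for.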

The future of memory tiering

Posted May 18, 2023 13:00 UTC (Thu) by willy (subscriber, #9762)

The two fundamental ways we might choose to categorise memory are latency and bandwidth. If one sorts only on bandwidth, one will get ordering A, while if one sorts only on latency, one will get ordering B. Should we devise some combination of bandwidth and latency to give us an ordering C? Which of the orders will give us the best performance generally? Should an application be able to choose between the A, B and C orderings for its memory in order to get the best performance for itself?
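willy's orderings A, B, and C can be demonstrated with invented figures for three memory types; the only point of the numbers is that the latency ordering and the bandwidth ordering genuinely disagree:

```python
# Invented figures for three memory types. HBM is high-bandwidth but
# higher-latency than DRAM, so orderings A and B must differ.
mem = {
    "HBM":  {"latency_ns": 120, "bandwidth_gbps": 800},
    "DRAM": {"latency_ns": 80,  "bandwidth_gbps": 200},
    "CXL":  {"latency_ns": 250, "bandwidth_gbps": 90},
}

order_a = sorted(mem, key=lambda m: -mem[m]["bandwidth_gbps"])  # bandwidth only
order_b = sorted(mem, key=lambda m: mem[m]["latency_ns"])       # latency only

def combined(m, w_lat=1.0, w_bw=0.25):
    # One possible ordering C: weighted latency minus weighted bandwidth,
    # lower is better. The weights are arbitrary illustrative choices.
    return w_lat * mem[m]["latency_ns"] - w_bw * mem[m]["bandwidth_gbps"]

order_c = sorted(mem, key=combined)
```

With these numbers A puts HBM first while B puts DRAM first, which is precisely willy's question: no single ordering is "the" right one, and C merely picks a point on the spectrum between them.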

The future of memory tiering

Posted May 18, 2023 13:24 UTC (Thu) by paulj (subscriber, #341)

All interesting questions. There are many possible ordering functions. E.g., for C, you could have:

"Apply order A, then B to tie-break"

or you could have:

"Multiply x*bandwidth by y/latency, order by resulting magnitude" - x and y are eminently tunable obviously.

No doubt there are many, many other possibilities. Whatever function you choose, the result should be a true order - i.e., transitive - if you want the system to have well-defined, deterministic, easy-to-reason-about behaviour. If the function is a composition of (sub)functions, those should be linear and/or well-ordered themselves.

Many possible such functions, but just make sure they produce an actual order.
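paulj's first composition - "apply order A, then B to tie-break" - maps directly onto a tuple sort key: lexicographic comparison of tuples is transitive whenever each component is. A sketch with invented tiers and figures:

```python
# "Order by bandwidth, tie-break on latency" as a tuple key. Lexicographic
# tuple comparison is a true (transitive) order when each component is,
# satisfying paulj's requirement. The tiers and figures are invented.
tiers = [
    ("local", {"bandwidth": 200, "latency": 80}),
    ("hbm",   {"bandwidth": 200, "latency": 120}),  # ties local on bandwidth
    ("cxl",   {"bandwidth": 90,  "latency": 250}),
]

# Negate bandwidth so higher bandwidth sorts first; latency breaks the tie.
ordered = sorted(tiers, key=lambda t: (-t[1]["bandwidth"], t[1]["latency"]))
names = [name for name, _ in ordered]
```

Here "local" and "hbm" tie on bandwidth, so the latency component decides between them - the tie-break never has to compare tiers against each other directly.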

The future of memory tiering

Posted May 18, 2023 16:51 UTC (Thu) by farnz (subscriber, #17727)

The problem is that no 1D ordering function will capture the "correct" order for two different applications, since some are insensitive to throughput as long as it's above a minimum acceptable level, but very sensitive to latency, while others are insensitive to latency as long as it's not so large that caching cannot hide it, but very sensitive to throughput.

You can score memory nodes as (x / latency) + (y * throughput), but each application you want to run will have different ideas for what x and y should be - so you either choose a compromise and do the wrong thing for both types of application, or you have a 2D map of possible nodes, and rely on the application telling the kernel what the 1D function for *this* application is.
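farnz's scoring formula can be run directly: the same 2D (latency, throughput) map produces different 1D preference orders once each application supplies its own x and y. The node figures and the two workload profiles are invented for illustration:

```python
# farnz's score, (x / latency) + (y * throughput), applied per application.
# All node figures and both weight profiles are invented.
nodes = {
    "DRAM": {"latency_ns": 80,  "throughput_gbps": 200},
    "HBM":  {"latency_ns": 120, "throughput_gbps": 800},
    "CXL":  {"latency_ns": 250, "throughput_gbps": 90},
}

def preference(x, y):
    """Rank nodes by (x / latency) + (y * throughput); higher scores first."""
    return sorted(nodes, key=lambda n: -(x / nodes[n]["latency_ns"]
                                         + y * nodes[n]["throughput_gbps"]))

latency_bound = preference(x=1000.0, y=0.001)  # e.g. pointer-chasing workload
bandwidth_bound = preference(x=1.0, y=1.0)     # e.g. streaming workload
```

With these weights the latency-bound profile prefers DRAM while the bandwidth-bound one prefers HBM - the "2D map plus per-application 1D function" arrangement farnz describes, as opposed to one kernel-chosen compromise.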

The future of memory tiering

Posted May 18, 2023 13:37 UTC (Thu) by paulj (subscriber, #341)

Just to clarify this sentence "order of preference does /not/ depend on the order in which tiers are compared".

I meant the order in which the comparisons are made of instances in the set of objects. I.e., the ordering relation should be transitive.

The ordering relation may, of course, first order by one type of property and then another type of property. E.g. "First prefer the fastest memory bandwidth, and then the lowest latency".

The future of memory tiering

Posted May 15, 2023 19:28 UTC (Mon) by mwsealey (subscriber, #71282)

The MM folks might want to have a chat with the scheduler folks about big.LITTLE and EAS and how they solved those problems.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds