
CXL 1: Management and tiering

By Jonathan Corbet
May 13, 2022

LSFMM
Compute Express Link (CXL) is an upcoming memory technology that is clearly on the minds of Linux memory-management developers; there were five sessions dedicated to the topic at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). The first three sessions, on May 3, covered various aspects of memory management in the presence of CXL. It seems that CXL may bring some welcome capabilities, especially for cloud-service providers, but that will come at the cost of some headaches on the kernel-development side.

At its core, CXL is a new way to connect memory to a CPU. That memory need not be on the local memory bus; indeed, it is likely to be located on a different device entirely. CXL vendors seemingly envision "memory appliances" that can provide memory to multiple systems in a flexible manner. Supporting CXL raises a number of interesting issues around system boot, memory hotplug, memory tiering, and more.

A CXL memory interface for containers

The first session was led by Hongjian Fan over a remote link; it was focused on how to use CXL memory to support containers. Figuring this out, he said, is complicated by the fact that CXL is new technology and there are no real devices to play with yet. So, much of the work being done is at the conceptual level. The Kubernetes container storage interface provides a flexible way to allocate storage to containers; he is working on a "container memory interface" (CMI) to do the same thing with CXL memory.

Systems can use CMI to provide functionality like memory tiering and to manage resources in a pooled-memory system. There are a few scenarios that Fan envisions for how this would all work. One would be that containers would have access to all of the memory available to the system (though managed by the control-group memory controller, of course); in this case, CXL would bring little change. If, instead, the container implements tiered memory, then CMI will control access to the different memory types. There are also pooled-memory scenarios, where the memory is located on an appliance somewhere.

Fan had a series of questions he was seeking to answer. The first would be whether it is possible to create a common CMI standard that would work across all CXL vendors. With regard to memory tiering, he asked, is it better to do it within the containers, or instead at the host level? There are also open questions about how to manage pooled-memory servers. An attendee started the discussion by asking whether all of this could be managed with control groups, with different types of memory packaged as if they were CPU-less NUMA nodes. That might be the simplest place to start, Fan answered, but he was not sure that control groups had sufficient flexibility.

Michal Hocko said that cpusets could perhaps help with the management, but they provide no way to control how memory is distributed across nodes. Dave Hansen said that there is interest in providing control over memory allocation; providers could charge lower rates for access to slower memory, for example. The problem exists now, and people try to manage things with the numactl utility, but it's not up to the task. It can block users from certain types of RAM, he said, but it's an all-or-nothing deal. It can't provide the finer quality-of-service control that providers want.
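
The all-or-nothing behavior Hansen described is visible in the existing user-space API. The following sketch (an illustration only; it assumes that node 1 is a CPU-less node standing in for slower memory, and it must be built with -lnuma) binds a mapping to a single node with mbind(); the policy can allow or forbid a node, but it cannot express anything finer:

    /* Illustration of the "all-or-nothing" binding available today.
     * Node 1 standing in for a CPU-less (persistent-memory or CXL-like)
     * node is an assumption about the test machine; build with -lnuma. */
    #define _GNU_SOURCE
    #include <numaif.h>          /* mbind(), MPOL_BIND */
    #include <sys/mman.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 64UL << 20;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Restrict the mapping to node 1: either the allocation comes
         * from that node or it fails; there is no way to ask for a
         * ratio or a quality-of-service level. */
        unsigned long nodemask = 1UL << 1;
        if (mbind(buf, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, 0)) {
            perror("mbind");
            return 1;
        }

        memset(buf, 0, len);     /* fault the pages in on node 1 */
        printf("bound %zu bytes to node 1\n", len);
        return 0;
    }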

Dan Williams said that the current work has been focused on DRAM and slower types of memory. CXL is going to bring a broader spectrum of vendors and speeds, and multiple performance classes. While it might make sense to design a system to handle two tiers of memory service now, developers should be thinking about five tiers in the future. Matthew Wilcox said that enterprise vendors are unlikely to want to manage that many tiers, though.

Adam Manzanares suggested starting with well-defined use cases and just two tiers. Otherwise, he worried, things will get out of control quickly. Wilcox said that there is a sane three-tier case consisting of CXL memory, DRAM, and persistent memory. But Hansen warned that there could be multiple CXL-attached tiers, and that developers should expect "a lot of weird CXL devices". It is an open standard, and vendors are free to do interesting things with it.

Fan said that, for any sort of management to work, the kernel will need some idea of the relative performance of each available memory tier. Hansen answered that there is a lot of standards work in this area. ACPI has a way of enumerating NUMA latency, for example, and other mechanisms are under development. The Heterogeneous Memory Attribute Table (HMAT), for example, can provide bandwidth information for each memory type. UEFI, meanwhile, has specified the Coherent Device Attribute Table (CDAT) with CXL memory, among other types, in mind.
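
Some of this firmware-provided information already reaches user space on kernels that parse an HMAT; it shows up in the node "access class" files under /sys/devices/system/node/. The sketch below simply reads those attributes; the exact paths, and their presence at all, are assumptions that hold only when the firmware supplies the table:

    /* Read the HMAT-derived attributes that recent kernels export via
     * the node "access class"; the accessN/initiators paths are an
     * assumption that only holds when the firmware provides an HMAT. */
    #include <stdio.h>

    static void print_attr(int node, const char *name)
    {
        char path[256], buf[64];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/access0/initiators/%s",
                 node, name);
        f = fopen(path, "r");
        if (!f)
            return;                      /* no HMAT data for this node */
        if (fgets(buf, sizeof(buf), f))
            printf("node%d %-15s %s", node, name, buf);
        fclose(f);
    }

    int main(void)
    {
        for (int node = 0; node < 8; node++) {
            print_attr(node, "read_bandwidth");   /* MB/s */
            print_attr(node, "write_bandwidth");  /* MB/s */
            print_attr(node, "read_latency");     /* ns */
            print_attr(node, "write_latency");    /* ns */
        }
        return 0;
    }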

Williams said that Linux is too dependent on the notion of NUMA distance as a way of describing memory capabilities. There is better information about memory available from the firmware now, but the memory-management code does not make use of it. A baby step might be to boil that information down into a single distance value to at least make some use of it. Manzanares said that distance doesn't work for persistent memory, though, since it cannot capture the asymmetry between read and write speeds.

Hansen said that the relevant information is available now if an application knows where to look. The harder problem is making decisions about memory placement in the kernel. Different workloads may have different preferences depending on their access patterns; currently, applications have to figure out which memory they want and set up an appropriate NUMA policy. But the kernel could be using memory information to make smarter decisions; moving frequently written pages off of persistent memory, for example.
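
The explicit policies that applications set today are NUMA policies like the one in the sketch below (node 2 standing in for a slower tier is an assumption about the machine; build with -lnuma). Nothing in this interface asks the kernel to later move frequently written pages off the slow node, which is the kind of decision Hansen would like the kernel to be making:

    /* The kind of explicit policy an application must set today to
     * steer its own allocations; node 2 standing in for a slow tier is
     * an assumption about the test machine.  Build with -lnuma. */
    #define _GNU_SOURCE
    #include <numaif.h>          /* set_mempolicy(), MPOL_PREFERRED */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        unsigned long nodemask = 1UL << 2;      /* prefer node 2 */

        if (set_mempolicy(MPOL_PREFERRED, &nodemask,
                          sizeof(nodemask) * 8)) {
            perror("set_mempolicy");
            return 1;
        }

        /* New allocations land on node 2 when it has free memory and
         * fall back elsewhere otherwise; the policy says nothing about
         * what should happen to the pages later. */
        char *p = malloc(16UL << 20);
        memset(p, 0, 16UL << 20);
        puts("allocated under an MPOL_PREFERRED policy");
        free(p);
        return 0;
    }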

There was some discussion about where decisions on tiering should be made. Putting the logic into the kernel makes life easy for applications that don't care about NUMA placement, which is most of them, Williams said, but he worried that there could be fights between the kernel and user space about tiering. Hansen said those fights could happen now, but the kernel's NUMA-placement logic mostly stays out of the way if user space has set an explicit policy. That may be sufficient for future needs as well.

Williams asked for an explanation of the perceived deficiencies in the current NUMA API. Fan answered that there needs to be a way to set memory limits on a per-node basis; that will require a new control-group or numactl knob. Manzanares suggested adding better tiered-memory support to QEMU so that this work could go forward, but Davidlohr Bueso pointed out that it's not possible to get real performance numbers that way. The concern at this point, Manzanares said, is to work out the interface issues rather than to optimize performance. Hansen said that a lot can be done by putting some persistent memory into a system and treating it like another tier; the result "kind of looks like CXL if you squint at it funny". That would give ways to play with interfaces and get some initial performance data.

Fan thanked the group for having provided a bunch of good information for him to work with, and the session drew to a close.

Managing CXL memory

The next session, led by Jon Trantham, delved into some of the other issues that come up when trying to manage CXL memory. CXL, he said, is a way to attach memory devices that cannot go onto the DDR memory bus. Putting DDR interfaces onto devices can be hard for manufacturers, and DDR does not work all that well with persistent memory. But CXL memory has different performance characteristics than normal RAM. Its latency and bandwidth will differ, and they can change as the device ages. Persistence, endurance, and reliability can all differ as well.

There are various ways of reporting the characteristics and status of CXL memory, starting with the above-mentioned CDAT table. The CDAT is useful in that it can be updated as performance changes. CXL devices can also produce a stream of event records, indicating that maintenance is required or that performance is falling, for example. CXL 2.0 enables switches that can sit between memory and the computer, allowing memory to live in a different enclosure entirely. That makes actions like hot unplugging possible, but it will be necessary to figure out how to communicate that to the kernel.

Decisions must be made about how much memory to allocate to each processor; this can involve an "out-of-band fabric manager" to control the switches. Memory can be interleaved at a granularity as small as 64 bytes, which is great for performance but harder for error recovery; memory failure can leave small holes in the address space. Wilcox observed that the usual management technique in such situations is crashing.

On the security side, there are access and encryption keys shared between hosts and devices; that brings in the whole key-management problem. The sum of all this, he said, is that help is needed. How is all of this to be managed? Should it be done in the kernel or in user space?

Williams asked if the encryption features were only for persistent memory. Evidently CXL can provide link encryption for DRAM, but does not encrypt data at rest. Hansen said that it will never be possible for the kernel to recover from errors on a 64-byte boundary; it only handles memory at the page level. He suggested looking at the existing mechanisms and asking whether anything different was really needed; perhaps all of those CXL capabilities aren't really necessary.

Williams said that CXL makes it possible to turn bare metal into virtual machines; techniques like memory ballooning become possible. So it seems that the same interfaces should be used. Hocko said that ballooning relies on memory hotplug, which "mostly works", but shrinking memory is hard. The memory to be removed can only be used for movable allocations. This is equivalent to a return to the old high-memory systems, where much of the installed memory could not be used by the kernel.

Hansen answered that the kernel does a reasonable job of emptying a memory area that is to be removed, but there is always the case where a few pages simply cannot be cleared. If there were some way to retain those pages after the memory goes, he said, life would be easier and the whole mechanism would be more reliable.
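
For reference, the hotplug interface in question is the sysfs one sketched below (the block number is arbitrary, and the operation needs root). The write fails when the block still contains unmovable pages, which is exactly the situation Hansen was describing:

    /* The user-space side of the existing hotplug interface Hocko
     * referred to: offlining one memory block via sysfs.  The block
     * number (42) is arbitrary and the write requires root; it fails
     * (typically with EBUSY) when the block still holds unmovable
     * pages, which is the case Hansen was describing. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *path = "/sys/devices/system/memory/memory42/state";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror("open state file");
            return 1;
        }
        if (fputs("offline", f) == EOF || fflush(f) == EOF)
            fprintf(stderr, "offline failed: %s\n", strerror(errno));
        else
            printf("memory block offlined\n");
        fclose(f);
        return 0;
    }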

The session closed with Manzanares suggesting more coordination between developers and vendors. Perhaps there needs to be some sort of regular group call where these issues are worked out. Chances are that something like that will be set up soon.

Tiering

The final CXL session on Tuesday was led by Jongmin Gim, who wanted to talk about tiering in particular. A lot of things are changing in the CXL 2.0 specification, he began, including the addition of a number of memory types. Tiering will allow the system to make the best use of those memory types, putting frequently used pages in fast memory while using slower memory to hold pages that are not needed as often.

Support for tiering is not currently upstream, but developers are working on it. There are various issues around promotion and demotion of pages between tiers to be worked out. The demotion side is easy, he said; if there is not enough fast memory available, kick out some pages. Promotion turns out to be harder, though. Current patches (described in an earlier article) use the NUMA-balancing scan to try to determine which pages in slower memory are currently being used. When hot pages are found, they can be migrated to faster memory. A heuristic requiring two accesses before promoting a page helps to prevent rapid bouncing of pages between memory types.
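
The migration step itself has long been available to user space through move_pages(); the promotion patches do the equivalent from inside the kernel once the scan has flagged a page as hot. A minimal sketch (with node 0 assumed to be the fast tier; build with -lnuma) looks like this:

    /* User-space page migration with move_pages(); the in-kernel
     * promotion path does the equivalent once the NUMA-balancing scan
     * decides a page on a slow node is hot.  Treating node 0 as the
     * fast tier is an assumption; build with -lnuma. */
    #define _GNU_SOURCE
    #include <numaif.h>          /* move_pages(), MPOL_MF_MOVE */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        void *page;

        if (posix_memalign(&page, pagesize, pagesize))
            return 1;
        memset(page, 0x55, pagesize);    /* fault the page in somewhere */

        void *pages[1]  = { page };
        int   nodes[1]  = { 0 };         /* "promote" to node 0 */
        int   status[1] = { -1 };

        if (move_pages(0 /* this process */, 1, pages, nodes,
                       status, MPOL_MF_MOVE)) {
            perror("move_pages");
            return 1;
        }
        printf("page now on node %d\n", status[0]);
        return 0;
    }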

One possible optimization might be to promote contiguous groups of pages together in a single operation. There was some discussion of implementing some sort of predictive algorithm to improve page promotion, but it was all at a fairly high level.

Manzanares said that the kernel's NUMA balancing was designed when all nodes in a system were more-or-less equal, and it is CPU-centric. He wondered whether the assumptions built into NUMA balancing are still valid in the CXL world. Mel Gorman said that there is no assumption that nodes are the same size in the current code. Hansen added that NUMA balancing is already used for moving data to and from slower persistent-memory nodes, which are always mismatched in size, and it seems to be working.

The discussion wandered around the details of NUMA balancing with no real conclusion. At the end of the session, though, there were two points of agreement: that CXL devices are highly diverse, and that tiering is the way to manage them.

Index entries for this article
Kernel: Compute Express Link (CXL)
Kernel: Memory management/Tiered-memory systems
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2022



CXL 1: Management and tiering

Posted May 13, 2022 20:37 UTC (Fri) by MattBBaker (subscriber, #28651) [Link] (7 responses)

Will there be an option to bypass the kernel and program the memory directly? It's nice when the kernel can hide the details, but for some applications it's really better if the application is aware of the different tiers of memory access and can explicitly pass messages instead of getting surprised when memory access is magically slow.

CXL 1: Management and tiering

Posted May 14, 2022 16:01 UTC (Sat) by Paf (subscriber, #91811) [Link] (3 responses)

As someone who’s worked in HPC and watched the shared memory machines be displaced by true clustered systems, despite the intense and remarkable engineering that went into keeping those shared memory machines coherent at scale … yeah.

So this is the thing we’re doing again this week.

That doesn’t mean it’s not worth it - those machines made a lot of sense for a while and shifting trends may make that true again - but the cost of coherency across links like these is *intense*. I wonder where and how much use this will see, what cases it will be fast enough for, etc.

CXL 1: Management and tiering

Posted May 14, 2022 20:16 UTC (Sat) by willy (subscriber, #9762) [Link] (2 responses)

I don't think we're going to see 2048-node clusters built on top of CXL. The physics just doesn't support it.

The use cases I'm seeing are:

- Memory-only devices. Sharing (and cache coherency) is handled by the CPUs that access them. Basically CXL as a replacement for the DDR bus.

- GPU/similar devices. They can access memory coherently, but if you have any kind of contention between the CPU and the GPU, performance will tank. Programs are generally written to operate in phases of GPU-only and CPU-only access, but want migration handled for them.

Maybe there are other uses, but there's no getting around physics.

CXL 1: Management and tiering

Posted May 15, 2022 19:46 UTC (Sun) by Paf (subscriber, #91811) [Link] (1 responses)

This makes more sense - it’s mostly a way to make the clean cases easier and well supported by hardware.

Looking it up, I’m seeing a lot of stuff about disaggregated systems which just seems crazy. But marketing doesn’t have to match reality of intent for the main implementers.

CXL 1: Management and tiering

Posted May 15, 2022 21:58 UTC (Sun) by willy (subscriber, #9762) [Link]

Oh yes, the CXL boosters have a future where everything becomes magically cheap. I don't believe that world will come to pass. I think the future of HPC will remain as one-two socket CPU boxes with one-two GPUs, much more closely connected over CXL, but the scale-out interconnect isn't going away, and I doubt the scale-out interconnect will be CXL. Maybe it will; I've been wrong before.

I have no faith in disaggregated systems. You want your memory close to your [CG]PU. If you upgrade your [CG]PU, you normally want to upgrade your interconnect and memory at the same time. The only way this makes sense is if we've actually hit a limit in bandwidth and latency, and that doesn't seem to have happened yet, despite the sluggish adoption of PCIe4.

The people who claim "oh you want a big pool of memory on the fabric behind a switch connected to lots of front end systems" have simply not thought about the reality of firmware updates on the switch or the memory controller. How do you schedule downtime for the N customers using the [CG]PUs? Tuesday Mornings 7-9am Are Network Maintenance Windows just aren't a thing any more.

CXL 1: Management and tiering

Posted May 15, 2022 3:17 UTC (Sun) by marcH (subscriber, #57642) [Link] (2 responses)

> instead of getting surprised when memory access is magically slow.

Well, it's not like single-thread performance is deterministic either. However I agree shared memory is really crossing a line, it's the biggest programming footgun ever invented by hardware engineers.

Message passing requires an extra programming effort but unlike shared memory, performance and correctness issues can be traced and debugged in a reasonable amount of time.

https://queue.acm.org/detail.cfm?id=3212479

> Caches are large, but their size isn't the only reason for their complexity. The cache coherency protocol is one of the hardest parts of a modern CPU to make both fast and correct. Most of the complexity involved comes from supporting a language in which data is expected to be both shared and mutable as a matter of course.

CXL 1: Management and tiering

Posted May 16, 2022 16:30 UTC (Mon) by ballombe (subscriber, #9523) [Link] (1 responses)

> Message passing requires an extra programming effort but unlike shared memory, performance and correctness issues can be traced and debugged in a reasonable amount of time.

But message passing often requires more total memory for the same task.
And often the performance issue is traced to "not enough memory per node".
So...

CXL 1: Management and tiering

Posted May 17, 2022 8:27 UTC (Tue) by sdalley (subscriber, #18550) [Link]

Personally, I'm more than happy to trade a little more memory for a little more sanity.

And message passing needn't consume more memory if the message buffer simply changes hands between owners rather than being copied. A good interface to such a message system can also ensure, under the hood, that references to buffers no longer owned are always NULLed.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds