CXL 2: Pooling, sharing, and I/O-memory resources
Pooled and shared memory
Hongjian Fan, who led one of Tuesday's CXL sessions, returned on Wednesday (via videoconference) for a discussion dedicated to pooled and shared memory. These are concepts that apply to memory appliances, where the goals are to share memory across multiple systems, improve memory utilization, and, naturally, reduce costs. Sharing memory from a central appliance can reduce the need to put large amounts of memory into every server; when a given machine needs more, it can get a temporary allocation from the appliance.
Pooled memory is partitioned on the appliance and allocated in chunks to servers, which only have access to the memory that has been given to them. Requesting memory from a pooled appliance creates a hotplug event, where new memory suddenly becomes addressable. Supporting pooled memory requires the ability to generate and manage the hotplug events, as well as a virtual-device driver that monitors memory use and requests or releases memory as appropriate.
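Fan did not show the driver itself, but the user-space half of handling such a hotplug event is well established: newly added memory appears as memory blocks under sysfs and must be brought online before the kernel will use it. The following is a minimal sketch of that step; it scans the documented /sys/devices/system/memory interface once, rather than reacting to udev events as a real agent would, and uses the "online_movable" policy so that the blocks remain removable later.

    /*
     * Minimal sketch: online newly hot-added memory blocks from user space
     * through the documented sysfs memory-hotplug interface.  A real
     * pooled-memory agent would react to udev hotplug events instead of
     * scanning all blocks once.
     */
    #include <dirent.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>

    #define MEM_SYSFS "/sys/devices/system/memory"

    static void online_block(const char *name)
    {
        char path[PATH_MAX];
        char state[32] = "";
        FILE *f;

        snprintf(path, sizeof(path), MEM_SYSFS "/%s/state", name);

        f = fopen(path, "r");
        if (!f)
            return;
        if (!fgets(state, sizeof(state), f)) {
            fclose(f);
            return;
        }
        fclose(f);

        if (strncmp(state, "offline", 7) == 0) {
            f = fopen(path, "w");
            if (f) {
                /* "online_movable" keeps the block removable later */
                fputs("online_movable", f);
                fclose(f);
            }
        }
    }

    int main(void)
    {
        DIR *dir = opendir(MEM_SYSFS);
        struct dirent *de;

        if (!dir) {
            perror(MEM_SYSFS);
            return 1;
        }
        while ((de = readdir(dir)) != NULL)
            if (strncmp(de->d_name, "memory", 6) == 0)
                online_block(de->d_name);
        closedir(dir);
        return 0;
    }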
Shared memory is, instead, shared across all servers, though it will probably not be possible for any given server to allocate it all. With a shared appliance, the memory is always in each server's physical address space, but it may not all be usable. The kernel can provide a sysfs file that indicates which memory is available at any given time; tracking of allocations can be done by the appliance or via communication between servers, though the latter mode can create a lot of traffic.
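No such sysfs interface exists yet, so any example can only be speculative. Purely to illustrate what consuming such a file might look like, this sketch parses a made-up availability attribute with an assumed "start-end" line format; both the path and the format are inventions for the example, not anything defined by today's CXL code.

    /*
     * Hypothetical illustration only: the availability file, its path, and
     * its "start-end" line format are all invented for this example.
     */
    #include <inttypes.h>
    #include <stdio.h>

    #define AVAIL_FILE "/sys/bus/cxl/devices/mem0/available_ranges"  /* made up */

    int main(void)
    {
        FILE *f = fopen(AVAIL_FILE, "r");
        uint64_t start, end;

        if (!f) {
            perror(AVAIL_FILE);
            return 1;
        }
        /* Assumed format: one "start-end" physical range per line. */
        while (fscanf(f, "%" SCNx64 "-%" SCNx64, &start, &end) == 2)
            printf("usable: %#" PRIx64 "-%#" PRIx64 " (%" PRIu64 " MiB)\n",
                   start, end, (end - start + 1) >> 20);
        fclose(f);
        return 0;
    }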
Dave Hansen said that CXL memory behaves a lot like RAM today, but it requires some extra care. There may be cache-coherency issues not present with RAM, and the kernel can't keep any of its own data structures in this memory, since those structures cannot be moved and would thus block removal. Fan said that cache coherency is part of the CXL protocol and shouldn't be a problem. Hansen added that there is little that is new with CXL memory appliances; they are much like how memory is managed with virtualization. But now it is being done in hardware, which scares him a bit. Memory removal success is "a matter of luck" now, he said, and calling this memory "CXL" won't change that.
An attendee asked what the benefit of the shared mode was, given that all memory will still be used exclusively by one system at any given time. Fan answered that the problem with pooled access is fast and reliable hotplugging, while the problem with shared access is communication between the systems. Hansen asked how access to shared memory is cut off when memory is reallocated, but Fan was unable to answer the question.
Dan Williams said that access control is not really visible to the kernel, and that it was necessary to "trust the switch". He added that users want to be able to manage this memory with the existing NUMA APIs, but they also want hard guarantees that it will be possible to remove memory from a system; those two goals are in conflict. It will be necessary to reset expectations about removal, he said; it will be a learning experience for the industry. Hansen said that the use of hotplug will be no different in this scenario, but Williams said there will now be a whole new level of software behind hotplug to manage the physical address space. That is something that the firmware has always done, but now the kernel will have to deal with it; the CXL specification group is still trying to figure out the details of how that will work.
Fan said some other changes will be necessary as well. There will need to be a mechanism to warn about available capacity on the appliance. Since memory can be requested and added to the system on the fly, the out-of-memory handler should perhaps wait for more memory to materialize before it starts killing processes. David Hildenbrand said that the out-of-memory scenario scares him; people think that it's possible to just wait for memory to appear, but it's not true. If the system is going into the out-of-memory state, there will be other allocations failing at the same time. What is needed is a way to determine that the system is short of memory, then wait for more memory in a safe way, before running out. Hansen added that plugging in more memory is an act that, in itself, requires allocating memory, and an out-of-memory situation is not a good time to try to do that. Williams said, as the session came to a close, that the system cannot be reactionary, and that memory requirements should be handled in user space at the job-scheduling level.
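Williams's "handle it in user space" position amounts to something like the sketch below: watch the kernel's pressure-stall information (PSI) for memory and ask the appliance for more capacity well before the OOM killer gets involved. The threshold and the request_from_appliance() placeholder are assumptions for the example; how a request actually reaches the appliance is left undefined here.

    /*
     * Sketch of a user-space, job-scheduler-level approach: watch memory
     * PSI and request more capacity from the appliance before the system
     * reaches the out-of-memory state.
     */
    #include <stdio.h>
    #include <unistd.h>

    static void request_from_appliance(void)
    {
        /* Placeholder: e.g. poke a fabric manager or orchestration API. */
        fprintf(stderr, "memory pressure high, requesting more capacity\n");
    }

    int main(void)
    {
        for (;;) {
            FILE *f = fopen("/proc/pressure/memory", "r");
            double avg10 = 0.0;

            if (f) {
                /* First line: "some avg10=X avg60=Y avg300=Z total=N" */
                if (fscanf(f, "some avg10=%lf", &avg10) != 1)
                    avg10 = 0.0;
                fclose(f);
            }
            if (avg10 > 10.0)   /* arbitrary threshold for the example */
                request_from_appliance();
            sleep(5);
        }
    }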
Managing the resource tree
Management of the physical address space was the topic of the second CXL session of the day. The resource structure is one of the oldest data structures in the kernel; it was added in the 2.3.11 release in 1999. Its job is to track the resources available to the system and, in the form of the iomem_resource variable, the layout of the computer's physical address space. It forms a tree structure with some resources (a PCI bus, for example) containing other resources (attached devices) within their address ranges. This tree is represented in /proc/iomem, which must be opened as root to show the actual addresses involved.
The kernel's I/O-memory resource tree was not designed with CXL in mind; for Linus Torvalds to have been so short-sighted in 1999 is perhaps forgivable. But, said Ben Widawsky in his session, that shortcoming is threatening to create problems now. In current systems, iomem_resource is initially created from the memory map provided by the boot firmware; architecture-specific code and drivers then modify it and subdivide the resources there as needed. Once a given range of physical address space has been assigned to a specific use, it can never be reassigned — only subdivided.
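As a reminder of how that subdivision normally happens, here is a minimal kernel-module sketch that claims a range under iomem_resource with request_mem_region(), making it appear as a child entry in /proc/iomem. The base address and size are placeholders; a real driver would discover them from its bus (PCI BAR, ACPI, devicetree) rather than hard-coding them.

    /*
     * Sketch of how a driver normally subdivides the resource tree: claim
     * a range under iomem_resource and it shows up in /proc/iomem.
     */
    #include <linux/ioport.h>
    #include <linux/module.h>

    #define DEMO_BASE 0xfed40000UL   /* placeholder physical address */
    #define DEMO_SIZE 0x1000UL

    static struct resource *demo_res;

    static int __init demo_init(void)
    {
        demo_res = request_mem_region(DEMO_BASE, DEMO_SIZE, "demo-device");
        if (!demo_res)
            return -EBUSY;   /* the range is already claimed by someone else */
        return 0;
    }

    static void __exit demo_exit(void)
    {
        release_mem_region(DEMO_BASE, DEMO_SIZE);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");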
The core of the problem is that CXL memory can come and go, and it may not be present at boot time. When this memory is added, it essentially overrides a piece of the physical address space, which is something that iomem_resource is not prepared to handle. If the space used by CXL were disjoint from local system resources, Widawsky said, there wouldn't be a problem; traditional resources could be put into one range, and CXL in another. But that is not how things are going to work. RAM added via CXL will overlap the space already described by iomem_resource. What, he asked, can be done to properly represent these resources?
Mike Rapoport questioned the need to put CXL memory into iomem_resource at all. The problem, Hansen explained, is that CXL memory might be the only memory in the system. People tend to see CXL as a sort of add-on card, but it is closer to the core than that. On a system using only CXL, it would not be possible to boot without having that memory represented in iomem_resource. David Hildenbrand said that iomem_resource should describe everything in the system.
Widawsky said that there is a need to keep device-private memory from taking address space intended for CXL; this is another reason to represent CXL memory in the resource tree. He suggested that attempts to take pieces of memory assigned to CXL should be blocked. Hildenbrand suggested creating the CXL region as a device and adding some special calls to allocate space from that region. This could be tricky, Widawsky said. System RAM may already be set up in the resource tree; making it part of a special device would involve reparenting that RAM, which, he said, has never been done. Matthew Wilcox contradicted the "never been done" claim, but without details on when it had been done.
John Hubbard said that the kernel should keep iomem_resource as "the one truth" about the layout of the physical address space. Williams said that struct resource is old; there are people around who love to add new structures to the kernel, and perhaps the time has come to do that for this problem. Wilcox referenced a "20-year-old patch" in Andrew Morton's tree, but didn't identify it. Hildenbrand said that the structure as a whole is difficult to traverse and work with; any work to improve it would be appreciated.
Widawsky asked if there was a path to a solution that involved a bit less hard work. Williams suggested adding resources in smaller chunks, with a number of entries for the CXL CFMWS ("fixed memory window structures") areas. Some of those entries could later be removed, Widawsky added, if it turned out they weren't being used for CXL memory.
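A rough sketch of that suggestion, assuming the existing insert_resource() and remove_resource() interfaces are used directly: carve each CFMWS window into a handful of smaller placeholder resources, any of which can be dropped later if no CXL memory ends up behind it. The chunk count, naming, and calling context are invented for the example.

    /*
     * Rough sketch: split one CFMWS window into smaller placeholder
     * resources so that unused pieces can later be removed individually.
     * This would be called from whatever code parses the CFMWS entries.
     */
    #include <linux/errno.h>
    #include <linux/ioport.h>
    #include <linux/slab.h>

    #define NR_CHUNKS 8

    static int carve_cfmws_window(resource_size_t base, resource_size_t size)
    {
        resource_size_t chunk = size / NR_CHUNKS;
        int i, ret;

        for (i = 0; i < NR_CHUNKS; i++) {
            struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);

            if (!res)
                return -ENOMEM;
            res->name = "CXL window (placeholder)";
            res->start = base + i * chunk;
            res->end = res->start + chunk - 1;
            res->flags = IORESOURCE_MEM;

            ret = insert_resource(&iomem_resource, res);
            if (ret) {
                kfree(res);
                return ret;
            }
            /*
             * A chunk that never hosts CXL memory could later be dropped
             * with remove_resource(res) and freed.
             */
        }
        return 0;
    }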
The session came to an end with Wilcox asking what would happen in response to a discovery that an assigned resource's range is too small. Could it be expanded somehow? Williams said it would be good to be able to update the address map as more information became available. All told, the session described a problem but did not get close to finding a solution. This is a problem that has been seen in numerous other contexts as computers have become more dynamic. Solutions have been found in the past and will surely be found this time too, but it may be challenging to find one that doesn't involve a fair amount of hard work.
Index entries for this article
Kernel | Compute Express Link (CXL)
Kernel | I/O memory
Conference | Storage, Filesystem, Memory-Management and BPF Summit/2022
Posted May 19, 2022 18:00 UTC (Thu) by MattBBaker (guest, #28651)
So if you want the kernel to manage this memory there will have to be two separate concepts, the memory the device owns and cannot be taken away, and memory that it does not own and subject to being ripped away. I'm half of the mind that a better model for this is to expose remote CXL devices in /dev/ and then mmap() that like a normal 'file'.
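For what that model might look like from an application's point of view, here is a small sketch that opens a device node and mmap()s it like a file. The /dev/cxl/mem0 path is made up for illustration; nothing guarantees that real CXL device nodes will offer an mmap() interface like this.

    /*
     * Sketch of the "expose it in /dev/ and mmap() it" model proposed in
     * this comment.  The device path is hypothetical.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1UL << 30;            /* ask for 1 GiB of remote memory */
        int fd = open("/dev/cxl/mem0", O_RDWR);  /* hypothetical device node */
        void *p;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* Use it like ordinary memory, then give it back. */
        munmap(p, len);
        close(fd);
        return 0;
    }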
Posted May 19, 2022 22:35 UTC (Thu) by ejr (subscriber, #51652)
This all needs specified in terms of failure modes. It may be so specified; I haven't kept up.
Posted Jun 5, 2022 8:05 UTC (Sun) by njs (subscriber, #40338)
I have no idea how well or poorly this will extend to CXL failure modes, but it's at least precedented.
Posted May 21, 2022 1:45 UTC (Sat) by willy (subscriber, #9762)
https://www.intel.com/content/www/us/en/products/details/...
And, yes, the kernel needed (and still needs more) modification to support it well.
It's funny you bring it up though, since one of the use cases is to put PMem on CXL.
Posted May 21, 2022 1:50 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
You can buy a 256GB module for an affordable $3,386.99; a 512GB module is on discount right now at $9,768.99.
It's fair to say that Optane memory is vaporware compared to the initial vision of systems with tens of terabytes of persistent RAM.
Posted May 21, 2022 3:38 UTC (Sat) by willy (subscriber, #9762)
https://www.insight.com/en_US/shop/product/P00924-B21/HEW...
It's not "vaporware" just because you don't want to pay for it. And you can buy an Exadata X9M with 18TB of PMem per rack (admittedly that's spread over three servers each with 6TB). I did think we'd have more capacity by now (about double what we have). But an undershoot on capacity is hardly the same thing as "doesn't exist". It was always going to be a high-end option.
Posted May 21, 2022 6:53 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
But yep, this means that the persistent RAM is pretty much a non-entity now. It needs special chipsets, it's expensive, it's not available through most cloud computing providers.
CXL will likely be similar once people find out that it's nowhere close in speed to normal RAM.
Posted May 21, 2022 11:58 UTC (Sat) by willy (subscriber, #9762)
If you were expecting it in your laptop by now, then you weren't paying attention. You also don't have 100Gbps networking in your laptop, but it very much exists.
I think I'm done here. You said something stupid and hyperbolic; now you're determined to Be Right. It doesn't really matter what the facts are.
Posted May 21, 2022 18:56 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
Except that it's not. At best it's about the same, and certainly machines with tens of terabytes of NVRAM are in the realm of exoticware for at least a decade more. This situation is far away from the hyped state where NVRAM was going to be ubiquitous.
> You also don't have 100Gbps networking in your laptop, but it very much exists.
I have 40Gbps networking at home, and it's not even that expensive. It's not in my laptop (for some reason adapters max out at 10Gbps), but I certainly can use it otherwise.
Posted May 21, 2022 8:10 UTC (Sat) by zdzichu (subscriber, #17118)
Or HW vendors designing CPU features conflicting with the way Linux operates (like Intel's memory tags). It's like they sit in a cave, completely ignoring software supposed to run on their silicon.