CXL 2: Pooling, sharing, and I/O-memory resources
Pooled and shared memory
Hongjian Fan, who led one of Tuesday's CXL sessions, returned on Wednesday (via videoconference) for a discussion dedicated to pooled and shared memory. These are concepts that apply to memory appliances, where the goals are to share memory across multiple systems, improve memory utilization, and, naturally, reduce costs. Sharing memory from a central appliance can reduce the need to put large amounts of memory into every server; when a given machine needs more, it can get a temporary allocation from the appliance.
Pooled memory is partitioned on the appliance and allocated in chunks to servers, which only have access to the memory that has been given to them. Requesting memory from a pooled appliance creates a hotplug event, where new memory suddenly becomes addressable. Supporting pooled memory requires the ability to generate and manage the hotplug events, as well as a virtual-device driver that monitors memory use and requests or releases memory as appropriate.
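Fan did not show the driver itself, but the user-space half of handling such a hotplug event is well established: newly added memory appears as memory blocks under sysfs and must be brought online before the kernel will use it. The following is a minimal sketch of that step; it scans the documented /sys/devices/system/memory interface once, rather than reacting to udev events as a real agent would, and uses the "online_movable" policy so that the blocks remain removable later.

    /*
     * Minimal sketch: online newly hot-added memory blocks from user space
     * through the documented sysfs memory-hotplug interface.  A real
     * pooled-memory agent would react to udev hotplug events instead of
     * scanning all blocks once.
     */
    #include <dirent.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>

    #define MEM_SYSFS "/sys/devices/system/memory"

    static void online_block(const char *name)
    {
        char path[PATH_MAX];
        char state[32] = "";
        FILE *f;

        snprintf(path, sizeof(path), MEM_SYSFS "/%s/state", name);

        f = fopen(path, "r");
        if (!f)
            return;
        if (!fgets(state, sizeof(state), f)) {
            fclose(f);
            return;
        }
        fclose(f);

        if (strncmp(state, "offline", 7) == 0) {
            f = fopen(path, "w");
            if (f) {
                /* "online_movable" keeps the block removable later */
                fputs("online_movable", f);
                fclose(f);
            }
        }
    }

    int main(void)
    {
        DIR *dir = opendir(MEM_SYSFS);
        struct dirent *de;

        if (!dir) {
            perror(MEM_SYSFS);
            return 1;
        }
        while ((de = readdir(dir)) != NULL)
            if (strncmp(de->d_name, "memory", 6) == 0)
                online_block(de->d_name);
        closedir(dir);
        return 0;
    }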
Shared memory is, instead, shared across all servers, though it will probably not be possible for any given server to allocate it all. With a shared appliance, the memory is always in each server's physical address space, but it may not all be usable. The kernel can provide a sysfs file that indicates which memory is available at any given time; tracking of allocations can be done by the appliance or via communication between servers, though the latter mode can create a lot of traffic.
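No such sysfs interface exists yet, so any example can only be speculative. Purely to illustrate what consuming such a file might look like, this sketch parses a made-up availability attribute with an assumed "start-end" line format; both the path and the format are inventions for the example, not anything defined by today's CXL code.

    /*
     * Hypothetical illustration only: the availability file, its path, and
     * its "start-end" line format are all invented for this example.
     */
    #include <inttypes.h>
    #include <stdio.h>

    #define AVAIL_FILE "/sys/bus/cxl/devices/mem0/available_ranges"  /* made up */

    int main(void)
    {
        FILE *f = fopen(AVAIL_FILE, "r");
        uint64_t start, end;

        if (!f) {
            perror(AVAIL_FILE);
            return 1;
        }
        /* Assumed format: one "start-end" physical range per line. */
        while (fscanf(f, "%" SCNx64 "-%" SCNx64, &start, &end) == 2)
            printf("usable: %#" PRIx64 "-%#" PRIx64 " (%" PRIu64 " MiB)\n",
                   start, end, (end - start + 1) >> 20);
        fclose(f);
        return 0;
    }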
Dave Hansen said that CXL memory behaves a lot like RAM today, but it requires some extra care. There may be cache-coherency issues not present with RAM, and the kernel can't keep any of its own data structures in this memory, since those structures cannot be moved and would thus block removal. Fan said that cache coherency is part of the CXL protocol and shouldn't be a problem. Hansen added that there is little that is new with CXL memory appliances; they are much like how memory is managed with virtualization. But now it is being done in hardware, which scares him a bit. Memory removal success is "a matter of luck" now, he said, and calling this memory "CXL" won't change that.
An attendee asked what the benefit of the shared mode was, given that all memory will still be used exclusively by one system at any given time. Fan answered that the problem with pooled access is fast and reliable hotplugging, while the problem with shared access is communication between the systems. Hansen asked how access to shared memory is cut off when memory is reallocated, but Fan was unable to answer the question.
Dan Williams said that access control is not really visible to the kernel, and that it was necessary to "trust the switch". He added that users want to be able to manage this memory with the existing NUMA APIs, but they also want hard guarantees that it will be possible to remove memory from a system; those two goals are in conflict. It will be necessary to reset expectations about removal, he said; it will be a learning experience for the industry. Hansen said that the use of hotplug will be no different in this scenario, but Williams said there will now be a whole new level of software behind hotplug to manage the physical address space. That is something that the firmware has always done, but now the kernel will have to deal with it; the CXL specification group is still trying to figure out the details of how that will work.
Fan said some other changes will be necessary as well. There will need to be a mechanism to warn about available capacity on the appliance. Since memory can be requested and added to the system on the fly, the out-of-memory handler should perhaps wait for more memory to materialize before it starts killing processes. David Hildenbrand said that the out-of-memory scenario scares him; people think that it's possible to just wait for memory to appear, but it's not true. If the system is going into the out-of-memory state, there will be other allocations failing at the same time. What is needed is a way to determine that the system is short of memory, then wait for more memory in a safe way, before running out. Hansen added that plugging in more memory is an act that, in itself, requires allocating memory, and an out-of-memory situation is not a good time to try to do that. Williams said, as the session came to a close, that the system cannot be reactionary, and that memory requirements should be handled in user space at the job-scheduling level.
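Williams's "handle it in user space" position amounts to something like the sketch below: watch the kernel's pressure-stall information (PSI) for memory and ask the appliance for more capacity well before the OOM killer gets involved. The threshold and the request_from_appliance() placeholder are assumptions for the example; how a request actually reaches the appliance is left undefined here.

    /*
     * Sketch of a user-space, job-scheduler-level approach: watch memory
     * PSI and request more capacity from the appliance before the system
     * reaches the out-of-memory state.
     */
    #include <stdio.h>
    #include <unistd.h>

    static void request_from_appliance(void)
    {
        /* Placeholder: e.g. poke a fabric manager or orchestration API. */
        fprintf(stderr, "memory pressure high, requesting more capacity\n");
    }

    int main(void)
    {
        for (;;) {
            FILE *f = fopen("/proc/pressure/memory", "r");
            double avg10 = 0.0;

            if (f) {
                /* First line: "some avg10=X avg60=Y avg300=Z total=N" */
                if (fscanf(f, "some avg10=%lf", &avg10) != 1)
                    avg10 = 0.0;
                fclose(f);
            }
            if (avg10 > 10.0)   /* arbitrary threshold for the example */
                request_from_appliance();
            sleep(5);
        }
    }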
Managing the resource tree
Management of the physical address space was the topic of the second CXL session of the day. The resource structure is one of the oldest data structures in the kernel; it was added in the 2.3.11 release in 1999. Its job is to track the resources available to the system and, in the form of the iomem_resource variable, the layout of the computer's physical address space. It forms a tree structure with some resources (a PCI bus, for example) containing other resources (attached devices) within their address ranges. This tree is represented in /proc/iomem, which must be opened as root to show the actual addresses involved.
The kernel's I/O-memory resource tree was not designed with CXL in mind; for Linus Torvalds to have been so short-sighted in 1999 is perhaps forgivable. But, said Ben Widawsky in his session, that shortcoming is threatening to create problems now. In current systems, iomem_resource is initially created from the memory map provided by the boot firmware; architecture-specific code and drivers then modify it and subdivide the resources there as needed. Once a given range of physical address space has been assigned to a specific use, it can never be reassigned — only subdivided.
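As a reminder of how that subdivision normally happens, here is a minimal kernel-module sketch that claims a range under iomem_resource with request_mem_region(), making it appear as a child entry in /proc/iomem. The base address and size are placeholders; a real driver would discover them from its bus (PCI BAR, ACPI, devicetree) rather than hard-coding them.

    /*
     * Sketch of how a driver normally subdivides the resource tree: claim
     * a range under iomem_resource and it shows up in /proc/iomem.
     */
    #include <linux/ioport.h>
    #include <linux/module.h>

    #define DEMO_BASE 0xfed40000UL   /* placeholder physical address */
    #define DEMO_SIZE 0x1000UL

    static struct resource *demo_res;

    static int __init demo_init(void)
    {
        demo_res = request_mem_region(DEMO_BASE, DEMO_SIZE, "demo-device");
        if (!demo_res)
            return -EBUSY;   /* the range is already claimed by someone else */
        return 0;
    }

    static void __exit demo_exit(void)
    {
        release_mem_region(DEMO_BASE, DEMO_SIZE);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");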
The core of the problem is that CXL memory can come and go, and it may not be present at boot time. When this memory is added, it essentially overrides a piece of the physical address space, which is something that iomem_resource is not prepared to handle. If the space used by CXL were disjoint from local system resources, Widawsky said, there wouldn't be a problem; traditional resources could be put into one range, and CXL in another. But that is not how things are going to work. RAM added via CXL will overlap the space already described by iomem_resource. What, he asked, can be done to properly represent these resources?
Mike Rapoport questioned the need to put CXL memory into iomem_resource at all. The problem, Hansen explained, is that CXL memory might be the only memory in the system. People tend to see CXL as a sort of add-on card, but it is closer to the core than that. On a system using only CXL, it would not be possible to boot without having that memory represented in iomem_resource. David Hildenbrand said that iomem_resource should describe everything in the system.
Widawsky said that there is a need to keep device-private memory from taking address space intended for CXL; this is another reason to represent CXL memory in the resource tree. He suggested that attempts to take pieces of memory assigned to CXL should be blocked. Hildenbrand suggested creating the CXL region as a device and adding some special calls to allocate space from that region. This could be tricky, Widawsky said. System RAM may already be set up in the resource tree; making it part of a special device would involve reparenting that RAM, which, he said, has never been done. Matthew Wilcox contradicted the "never been done" claim, but without details on when it had been done.
John Hubbard said that the kernel should keep iomem_resource as "the one truth" about the layout of the physical address space. Williams said that struct resource is old; there are people around who love to add new structures to the kernel, and perhaps the time has come to do that for this problem. Wilcox referenced a "20-year-old patch" in Andrew Morton's tree, but didn't identify it. Hildenbrand said that the structure as a whole is difficult to traverse and work with; any work to improve it would be appreciated.
Widawsky asked if there was a path to a solution that involved a bit less hard work. Williams suggested adding resources in smaller chunks, with a number of entries for the CXL CFMWS ("fixed memory window structures") areas. Some of those entries could later be removed, Widawsky added, if it turned out they weren't being used for CXL memory.
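A rough sketch of that suggestion, assuming the existing insert_resource() and remove_resource() interfaces are used directly: carve each CFMWS window into a handful of smaller placeholder resources, any of which can be dropped later if no CXL memory ends up behind it. The chunk count, naming, and calling context are invented for the example.

    /*
     * Rough sketch: split one CFMWS window into smaller placeholder
     * resources so that unused pieces can later be removed individually.
     * This would be called from whatever code parses the CFMWS entries.
     */
    #include <linux/errno.h>
    #include <linux/ioport.h>
    #include <linux/slab.h>

    #define NR_CHUNKS 8

    static int carve_cfmws_window(resource_size_t base, resource_size_t size)
    {
        resource_size_t chunk = size / NR_CHUNKS;
        int i, ret;

        for (i = 0; i < NR_CHUNKS; i++) {
            struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);

            if (!res)
                return -ENOMEM;
            res->name = "CXL window (placeholder)";
            res->start = base + i * chunk;
            res->end = res->start + chunk - 1;
            res->flags = IORESOURCE_MEM;

            ret = insert_resource(&iomem_resource, res);
            if (ret) {
                kfree(res);
                return ret;
            }
            /*
             * A chunk that never hosts CXL memory could later be dropped
             * with remove_resource(res) and freed.
             */
        }
        return 0;
    }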
The session came to an end with Wilcox asking what would happen in response to a discovery that an assigned resource's range is too small. Could it be expanded somehow? Williams said it would be good to be able to update the address map as more information became available. All told, the session described a problem but did not get close to finding a solution. This is a problem that has been seen in numerous other contexts as computers have become more dynamic. Solutions have been found in the past and will surely be found this time too, but it may be challenging to find one that doesn't involve a fair amount of hard work.
Index entries for this article
Kernel | Compute Express Link (CXL)
Kernel | I/O memory
Conference | Storage, Filesystem, Memory-Management and BPF Summit/2022
Posted May 19, 2022 18:00 UTC (Thu) by MattBBaker (guest, #28651)
So if you want the kernel to manage this memory there will have to be two separate concepts, the memory the device owns and cannot be taken away, and memory that it does not own and subject to being ripped away. I'm half of the mind that a better model for this is to expose remote CXL devices in /dev/ and then mmap() that like a normal 'file'.
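For what that model might look like from an application's point of view, here is a small sketch that opens a device node and mmap()s it like a file. The /dev/cxl/mem0 path is made up for illustration; nothing guarantees that real CXL device nodes will offer an mmap() interface like this.

    /*
     * Sketch of the "expose it in /dev/ and mmap() it" model proposed in
     * this comment.  The device path is hypothetical.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1UL << 30;            /* ask for 1 GiB of remote memory */
        int fd = open("/dev/cxl/mem0", O_RDWR);  /* hypothetical device node */
        void *p;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* Use it like ordinary memory, then give it back. */
        munmap(p, len);
        close(fd);
        return 0;
    }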
Posted May 19, 2022 22:35 UTC (Thu) by ejr (subscriber, #51652)
This all needs specified in terms of failure modes. It may be so specified; I haven't kept up.
Posted Jun 5, 2022 8:05 UTC (Sun) by njs (subscriber, #40338)
I have no idea how well or poorly this will extend to CXL failure modes, but it's at least precedented.
Posted May 21, 2022 1:45 UTC (Sat) by willy (subscriber, #9762)
https://www.intel.com/content/www/us/en/products/details/...
And, yes, the kernel needed (and still needs more) modification to support it well.
It's funny you bring it up though, since one of the use cases is to put PMem on CXL.
Posted May 21, 2022 1:50 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
You can buy a 256GB module for an affordable $3,386.99; a 512GB module is on discount right now at $9,768.99.
It's fair to say that Optane memory is vaporware compared to the initial vision of systems with tens of terabytes of persistent RAM.
Posted May 21, 2022 3:38 UTC (Sat) by willy (subscriber, #9762)
https://www.insight.com/en_US/shop/product/P00924-B21/HEW...
It's not "vaporware" just because you don't want to pay for it. And you can buy an Exadata X9M with 18TB of PMem per rack (admittedly that's spread over three servers each with 6TB). I did think we'd have more capacity by now (about double what we have). But an undershoot on capacity is hardly the same thing as "doesn't exist". It was always going to be a high-end option.
Posted May 21, 2022 6:53 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
But yep, this means that the persistent RAM is pretty much a non-entity now. It needs special chipsets, it's expensive, it's not available through most cloud computing providers.
CXL will likely be similar once people find out that it's nowhere close in speed to normal RAM.
Posted May 21, 2022 11:58 UTC (Sat) by willy (subscriber, #9762)
If you were expecting it in your laptop by now, then you weren't paying attention. You also don't have 100Gbps networking in your laptop, but it very much exists.
I think I'm done here. You said something stupid and hyperbolic; now you're determined to Be Right. It doesn't really matter what the facts are.
Posted May 21, 2022 18:56 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
Except that it's not. At best it's about the same, and certainly machines with tens of terabytes of NVRAM are in the realm of exoticware for at least a decade more. This situation is far away from the hyped state where NVRAM was going to be ubiquitous.
> You also don't have 100Gbps networking in your laptop, but it very much exists.
I have 40Gbps networking at home, and it's not even that expensive. It's not in my laptop (for some reason adapters max out at 10Gbps), but I certainly can use it otherwise.
Posted May 21, 2022 8:10 UTC (Sat) by zdzichu (subscriber, #17118)
Or HW vendors designing CPU features conflicting with the way Linux operates (like Intel's memory tags). It's like they sit in a cave, completely ignoring software supposed to run on their silicon.