Management of volatile CXL devices
When people think about the management of CXL memory, they often have tiering in mind, Gregory Price said. But there are more basic questions that need to be answered before it is possible to just put memory on the PCIe bus and expose it to the page allocator. There needs to be a way, he said, to keep some applications out of some classes of memory. The mempolicy API is the obvious tool to use, but he is not sure whether it is the right one.
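For reference, that kind of restriction can be expressed with the existing mempolicy API today. The following is a minimal sketch, assuming node 0 is ordinary DRAM and node 1 is a CXL-backed node (the numbering is entirely system-specific); it binds all of a process's future allocations to the DRAM node:

```c
/* Minimal sketch: keep this process's allocations off a CXL-backed node.
 * Assumes node 0 is regular DRAM and node 1 is the CXL node; the node
 * numbering is system-specific. Build with -lnuma for <numaif.h>. */
#include <numaif.h>     /* set_mempolicy(), MPOL_BIND */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Node mask with only bit 0 (the DRAM node) set. */
    unsigned long nodemask = 1UL << 0;

    /* Bind all future allocations of this process to node 0, so nothing
     * lands on the CXL node (node 1 in this example). */
    if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8) != 0) {
        perror("set_mempolicy");
        return EXIT_FAILURE;
    }

    /* Allocations made from here on are restricted to node 0. */
    void *buf = malloc(1 << 20);
    free(buf);
    return EXIT_SUCCESS;
}
```

Whether a per-process node mask like this is the right way to say "never use CXL memory" is exactly the question Price was raising.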
A running system, he said, has three broad components: the platform, the
kernel, and the workloads. The platform (or its firmware) dictates how the
kernel objects representing the system, such as NUMA nodes, memory banks,
and memory tiers, are created; these objects are then consumed by the
mempolicy machinery, by DAMON, and the
rest of the system. But the question of how and when those
objects are created is important. For example, can the contiguous memory allocator (CMA) deal
properly with NUMA nodes that are not online at boot time? Getting the
answers wrong, he said, will make it hard to get tiering right.
Adam Manzanares suggested that it would be possible to create QEMU configurations that would match the expected hardware and asked if that would be helpful. Price said it would, enabling developers to figure out some basic guidelines governing what works and what does not.
The system BIOS, Price said, tells the kernel where the available memory
ends up; the kernel then uses that information to create NUMA nodes
describing that memory. A suitably informed kernel could create multiple
nodes with different use policies if needed to use the memory effectively.
Jonathan Cameron interjected that NUMA nodes are "garbage", at least
for this use case; Price agreed that NUMA might not be the right
abstraction for the management of CXL memory.
At a minimum, he continued, somebody should create documents describing what the kernel expects. He has done some work in that direction; links can be found in his proposal for this session. Any vendors that go outside those guidelines would be expected, at least, to provide an example of how they think things should work. Dan Williams said that a wider set of tests would also be helpful here; the current tests are mostly there to prevent regressions rather than ensuring that the hardware is behaving as expected.
Price would like to isolate the kernel from CXL memory as much as possible — if kernel allocations land in CXL memory, the performance and reliability of the system as a whole could suffer. There are a lot of tests to ensure that specific tiering mechanisms work well, but not so many when it comes to landing kernel allocations in the right place. There is no easy way to say which kernel memory, if any, should be in CXL memory. Configuring that memory as being in ZONE_MOVABLE would solve that problem (kernel allocations are not movable, so cannot be made from that zone), but it creates other problems. For example, hugetlb memory cannot be in ZONE_MOVABLE either.
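As an illustration of the ZONE_MOVABLE approach: hot-added memory blocks can be onlined as movable through sysfs. This is a minimal sketch; the block number (memory256) is invented for the example, and which blocks belong to the CXL-backed node is system-specific:

```c
/* Minimal sketch: online a hot-plugged memory block into ZONE_MOVABLE so
 * that unmovable (kernel) allocations cannot land there. The block number
 * is purely illustrative. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/system/memory/memory256/state";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return 1;
    }
    /* "online_movable" asks the kernel to place the block in ZONE_MOVABLE
     * rather than ZONE_NORMAL. */
    if (fputs("online_movable", f) == EOF)
        perror("write");
    fclose(f);
    return 0;
}
```

The memhp_default_state=online_movable boot parameter should apply the same policy to all hot-added memory, though that does nothing about the hugetlb limitation described above.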
John Hubbard suggested creating a more general mechanism; perhaps there is a need for a ZONE_NO_KERNEL_ALLOC. He, too, asked whether NUMA is the right abstraction for managing this memory. Williams suggested just leaving the memory offline when it is not in use, but Hubbard said that would run into trouble when the memory is put back into the offline state after use. If any stray, non-movable allocations have ended up there, it will not be possible to make the memory go offline. Williams said that the moral of that story is to never put memory online if you want to get it back later; Hubbard answered that this is part of the problem. The rules around this memory do not match its real-world use.
Alistair Popple suggested that this memory could be set up as
device-private; that, however, would prevent the use of the NUMA API
entirely, so processes could not bind to it. Williams said that
device-private memory is kept in a driver that would manage how it is
presented to user space. Popple said that approach can work, but is a
problem for developers who want to use standard memory-management system
calls. Hubbard said that he would rather not be stuck with a
"niche
" approach like device-private memory.
Price continued, noting that the standard in-kernel allocation interface is kmalloc(). He wondered whether developers would be willing to change such a widely used interface to add a new memory abstraction. If they do, though, perhaps adding something for high-bandwidth memory at the same time would make sense. Michal Hocko suggested that it would be better to make a dedicated interface than to overload kmalloc() further.
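To make the shape of that discussion concrete: kmalloc() carries no notion of memory type, and the nearest existing control is the per-node kmalloc_node(). The sketch below shows both, plus a purely hypothetical prototype at the end; the name kmalloc_class(), the enum, and the whole interface are illustrative only, not an existing or proposed kernel API:

```c
#include <linux/slab.h>

struct session_state {
	int id;
};

static struct session_state *alloc_state(int dram_node)
{
	/* Today: the allocator places this wherever it sees fit,
	 * potentially including a CXL-backed node. */
	struct session_state *s = kmalloc(sizeof(*s), GFP_KERNEL);

	if (!s)
		return NULL;

	/* Today's closest control: pin an allocation to one NUMA node. */
	struct session_state *t = kmalloc_node(sizeof(*t), GFP_KERNEL,
					       dram_node);
	kfree(t);
	return s;
}

/* Hypothetical, for illustration only: a dedicated interface keyed on a
 * memory class, rather than another kmalloc() flag or a raw node number. */
enum mem_class { MEM_CLASS_DRAM, MEM_CLASS_CXL, MEM_CLASS_HBM };
void *kmalloc_class(size_t size, gfp_t flags, enum mem_class class);
```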
User-space developers want something similar; there tend to be a number of
different workloads running on a system, some of which should be isolated
from CXL memory. Page demotion can ignore memory configurations now, with
the result that it can force pages into slower memory even if the
application has tried to prevent that. He has, he said, even seen stack
pages demoted to slow memory, which is "an awful idea". That said,
he added, there are some times when it might make sense to demote unused
stack memory.
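As background, the only knob currently controlling that behavior is global: reclaim-based demotion can be switched on or off for the whole system, as in the minimal sketch below, but there is no per-application equivalent of "never demote my pages", which is the gap Price was describing.

```c
/* Minimal sketch: toggle reclaim-based demotion system-wide via the
 * existing /sys/kernel/mm/numa/demotion_enabled knob ("1" enables,
 * "0" disables). There is no finer-grained, per-workload control. */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *value = (argc > 1) ? argv[1] : "0";   /* default: off */
    FILE *f = fopen("/sys/kernel/mm/numa/demotion_enabled", "w");

    if (!f) {
        perror("demotion_enabled");
        return 1;
    }
    fputs(value, f);
    fclose(f);
    return 0;
}
```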
An important objective is "host/guest kernel parity", where both a
host system and any guests it is running have full control over memory
placement. The guests should be able to do their own tiering, and should
be able to limit what ends up in slower memory, just like the host does.
He would like to define some KVM interfaces to pass information about
memory layout and needs back and forth.
A big problem is the current inability to place 1GB huge pages in ZONE_MOVABLE. Allocating those pages to hand to guests can improve performance, but this limitation keeps them from being allocated in CXL memory. He has heard suggestions to use CMA, but the CXL memory is not online when CMA initializes, so CMA cannot use it.
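For context, guest memory backed by gigantic pages is normally a hugetlb mapping along the lines of the sketch below. It assumes that 1GB pages were reserved beforehand (for example with hugepagesz=1G hugepages=N, or hugetlb_cma= for runtime allocation through CMA), and, per the limitation Price described, none of those pages can currently come from CXL memory placed in ZONE_MOVABLE:

```c
/* Minimal sketch: a 1GB hugetlb mapping of the sort a VMM would use to
 * back guest RAM. Assumes 1GB pages have already been reserved at boot. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)   /* log2(1GB) encoded in the flags */
#endif

int main(void)
{
    size_t size = 1UL << 30;   /* one gigantic page */
    void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                     -1, 0);

    if (ram == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB | MAP_HUGE_1GB)");
        return 1;
    }
    /* ... hand this region to the hypervisor as guest memory ... */
    munmap(ram, size);
    return 0;
}
```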
There are other problems as well, he said at the end. If the placement of a bank of CXL memory is not properly aligned, that memory will be trimmed to the page-block size, wasting some of that memory. There are a lot of decisions that can ripple through the system and create problems later on. Hocko commented that the page-block size is arbitrary, and that the memory-management developers would like to get rid of it. The fact that it is exported via sysfs makes that change harder, though. A lot of these problems, he said, are self-inflicted and will be fixed eventually.
Price concluded with an observation that memory tiers, as currently implemented, do not make much sense. A common CXL use case involves interleaving pages across multiple devices; that forces those devices into a single NUMA node. That creates confusion on multi-socket systems and really needs to be rethought, he said.
Price has posted the slides for this session.
| Index entries for this article | |
|---|---|
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2025 |
Posted Apr 10, 2025 18:39 UTC (Thu)
by gmprice (subscriber, #167884)
[Link]
The current memory-tier component will lump multiple CXL-backed NUMA nodes with similar performance characteristics into a single tier - even if those nodes are attached to different sockets. This can create a confusing environment for tiering (you might end up demoting or promoting across a socket, which may be sub-optimal).
Thank you for the coverage!
Posted Apr 10, 2025 22:43 UTC (Thu)
by willy (subscriber, #9762)
[Link] (31 responses)
Posted Apr 10, 2025 23:09 UTC (Thu)
by gmprice (subscriber, #167884)
[Link] (10 responses)
Posted Apr 10, 2025 23:36 UTC (Thu)
by willy (subscriber, #9762)
[Link] (9 responses)
Posted Apr 10, 2025 23:50 UTC (Thu)
by gmprice (subscriber, #167884)
[Link] (8 responses)
Which, fair enough, we can disagree.
But people buy systems with HBM and DDR, and why would I buy a system with DDR when I could buy a system with HBM? At a certain point, the $ matters more than the latency lol.
Posted Apr 11, 2025 2:27 UTC (Fri)
by jedix (subscriber, #116933)
[Link] (4 responses)
If you want some fancy 'pay for faster memory on my cloud', then yes you need your application to know how to allocate from a special interface.
I also don't think your argument of HBM vs DDR is valid; what you are asking is for someone to buy a system with both HBM and DDR and, with no extra work (python, go, rust, c++, etc), make it run faster. Another way to look at it is you are asking someone to buy EXTRA hardware to run slower; $ == latency.
Posted Apr 11, 2025 6:12 UTC (Fri)
by gmprice (subscriber, #167884)
[Link] (3 responses)
Approaching it purely as a swap-ish tier is missing the boat.
I like data, so here's some data from a real system.
FIO benchmark w/ a file loaded up into the page cache - since that was a concern.
100% DRAM (128GB file)
bw=85.8GiB/s (92.1GB/s)
lat (usec) : 50=98.34%, 100=1.61%, 250=0.05%, 500=0.01%

100% CXL (128GB file)
bw=41.1GiB/s (44.2GB/s) - note this test actually caps the usable bandwidth.
lat (usec) : 100=86.55%, 250=13.45%, 500=0.01%

DRAM+CXL (~900GB file cache, >128GB in CXL via demotions)
bw=64.5GiB/s (69.3GB/s)
lat (usec) : 50=69.26%, 100=13.79%, 250=16.95%, 500=0.01%

DRAM+Disk (~900GB, but that >128GB now faults from disk)
bw=10.2GiB/s (10.9GB/s)
lat (usec) : 50=39.73%, 100=31.34%, 250=4.71%, 500=0.01%, 750=0.01%
lat (usec) : 1000=1.78%
lat (msec) : 2=21.63%, 4=0.81%, 10=0.01%, 20=0.01%
This is a fairly naive test; you'd really right-size your workloads to ensure projected hot capacity fits (mostly) in DRAM, and let CXL act as a zswap-ish tier (compress if you need to, but otherwise leave it as pages!) to avoid the degenerate scenario (faulting from disk).
What we have now (ZONE_MOVABLE + demotion in reclaim + zswap) is already sufficient to do this - we just need to add some tweaks to deal with corner cases (transitory case 2). It'd also be nice to get better levels to allow opt-in/opt-out for this capacity (both for kernel and userland), but having pagecache use it seems perfectly fine in most every real-world test I've run.
I'm guessing I'll die on this hill - the latencies aren't as scary as people make them out to be, and approaching it with solutions built with disk-latencies in mind really misses the boat on how decent a memory solution this is. This isn't some weird proprietary memory technology behind a weird proprietary bus, it's literally DDR with extra steps.
It's a neat trick to point at the DRAM-only number and say "Look how much you're losing!" but hard to admit you'd triple your server cost to get there in the real world. So no, you're not paying more money to go slower. You're paying significantly less money to go quite a bit faster, just not as fast as possible - and that's ok!
Posted Apr 11, 2025 18:12 UTC (Fri)
by willy (subscriber, #9762)
[Link]
Latency is about 2.5x higher for CXL than for DDR, which is honestly even worse than I thought. I guess part of that can be attributed to the immaturity of the technology, but I've demonstrated that it will always be slower than DDR.
Using it for zswap seems like a great idea. Using it for slab, stack, page-cache; all of these seem like terrible ideas.
Posted Apr 11, 2025 19:45 UTC (Fri)
by jedix (subscriber, #116933)
[Link] (1 responses)
Trying to 'not miss the boat' is like predicting the future - you don't know what's going to be needed in memory - otherwise we'd not need CXL at all. If you could solve the issue of knowing what is needed to be fast vs slow, then you could just page the slow stuff out and move on. So, I think we either make the (z)swap approach work or spend the money on dram and not developer effort.
Look at your numbers, it's a solid win against disk. I don't think that's your point, but the tail is a _lot_ better. And that is using the swap code as it is today.
I mean, reusing your old ram is still stupid because you will spend the money you saved on dram on an army of people walking the data center floor replacing dead dimms (and probably killing some servers by accident while they go).
But let's say you are making new shiny tech from 2007 (ddr3) or 2014 (ddr4) to use in your 2025 servers, then you can have faster zswap. If you don't feel old enough yet (I do..): DDR3 is old enough to have voted in the last election. Although DDR4 is newer, it came out 11 years ago, the same year as Shake It Off topped the charts, so.. not yesterday (but, haters gonna hate).
The thing that is really starting to bother me is the amount of time we are spending trying to solve a self-made issue and burning good developers on it.
Posted Apr 11, 2025 19:58 UTC (Fri)
by gmprice (subscriber, #167884)
[Link]
It's actually not. The pages are pages. The tail latency is avoided precisely because the page is still present and mapped, it's just been demoted to CXL instead of removed from the page tables.
So it's "zswap-ish", because reclaim migrates the page somewhere else - but in this case that somewhere else is another plain old page on another node.
If you enable zswap on top of this and give it the same size as your CXL tier, you'll end up with CXL consumed like zswap as you suggested - but the tail latencies will be higher (maybe not multi-millisecond, but i haven't tested this yet).
Posted Apr 11, 2025 3:20 UTC (Fri)
by willy (subscriber, #9762)
[Link] (2 responses)
And yes, the notion absolutely was that "enlightening" the interpreter / runtime for managed languages was a viable approach. I mean ... maybe? But we're back to the question of "why would I deliberately choose to use slow memory".
What I do see as a realistic workload is "This is a cheap VM that doesn't get to use DRAM". So that's a question of enlightening the hypervisor to give CXL memory to the guest.
HBM is a distraction. That's never been available for CPUs in large quantities; it's always been used as L4 cache. I see it's available in real quantities for GPUs now, but I'm not seeing it on any CPU roadmap.
And, yes, I do want the technology to die. It's the same mistake as ATM and InfiniBand.
Posted Apr 11, 2025 6:37 UTC (Fri)
by gmprice (subscriber, #167884)
[Link] (1 responses)
The answer is pretty clearly - to avoid having to use all of those things.
There is maybe a good argument to formalize a zswap backend allocator that can consume dax-memory, and hide that from the page allocator - but there is still clear value in just exposing it as a page. It's literally DDR behind a controller, not some pseudo-nand or page-fetch interconnect.
Posted Apr 11, 2025 18:29 UTC (Fri)
by willy (subscriber, #9762)
[Link]
That's actually worse than the 3dxp business case. Because in the future where 3dxp had worked, the argument was that it came in VAST quantities. If I remember correctly, the argument at the time was for a 2 socket machine with 128GB of DRAM, 8TB of 3dxp and some amount of NVMe storage.
So you might reasonably want to say "Hey, give me 4TB of slow memory" because you'd designed your algorithm to be tolerant of that latency.
And then 3dxp also had the persistence argument for it; maybe there really was a use-case for storage-presented-as-memory. Again, CXL as envisaged by you doesn't provide that either. It's just DRAM behind a slow interconnect.
Posted Apr 10, 2025 23:17 UTC (Thu)
by cesarb (subscriber, #6266)
[Link] (19 responses)
Posted Apr 10, 2025 23:34 UTC (Thu)
by willy (subscriber, #9762)
[Link] (13 responses)
Posted Apr 11, 2025 14:29 UTC (Fri)
by edeloget (subscriber, #88392)
[Link] (9 responses)
Other than that, I too have some difficulty understanding the need, especially since existing CXL modules seem to be quite expensive when compared to normal RAM modules.
Posted Apr 11, 2025 16:37 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (8 responses)
One use for CXL is to support large memory configurations on servers that can't handle that much RAM natively. Another is to have a set of "small memory" servers with (say) 64 GiB of local RAM, attached to a 1 TiB CXL DRAM unit. Then, if you need more than 64 GiB on a single server, it allocates a chunk of CXL-attached DRAM, and reconfigures a "64 GiB" server as a 256 GiB or even a 512 GiB server, getting you more efficiency from your data center. Both of these still have the DRAM dedicated to the host that's using it - all CXL memory is either idle, or assigned exclusively to a host.

And then there's fun ideas like having VMs allocated entirely in CXL, so that migration to a different host can be done quickly (since the CXL memory is shared), or using CXL memory as a communication channel.
All of these use cases have to account for the increased latency of CXL, and come up with a good way of using it in conjunction with host memory, somehow. And it's entirely plausible that the cost savings of CXL as opposed to more DRAM in the hosts will be entirely wiped out by the increased costs from managing the different pools of memory.
Posted Apr 15, 2025 17:07 UTC (Tue)
by marcH (subscriber, #57642)
[Link] (7 responses)
It's always funny when hardware engineers forget that hardware is bought to run software, not micro-benchmarks. Yes, software is obsessed about the best flops / s / Watt but... only software that _already exists_ actually obsesses about performance. Software that does not exist does not care. Isn't that what killed 3D XP despite magnificent hardware specs?
To survive, _one_ "killer app" is enough. Good engineers love generic code and doing more with less (and rightly so) but this is not the time. Find the most promising use case (not easy, granted) and put all efforts behind that one. Once you've escaped the chicken and egg problem and have hardware widely available on the shelves for a reasonable price, all other potential use cases will follow eventually. Then, the initial killer app code will be rewritten and made more generic. A few times over.
If you struggle to find even just one great use case, then... good bye? That does not necessarily mean the idea was bad. It could just mean that hardware companies (and everyone in general) was not willing to pay enough for software - especially not open software that helps the competition. Clearly, managing memory is hard, really hard. Neither the first time nor the last time software would be a hardware bottleneck.
Posted Apr 15, 2025 17:42 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (6 responses)
The hyperscalers love the "pool of memory that can be attached to any server in the rack" application; that's the one use case for CXL DRAM that I think is most likely to get somewhere, since it lets you have fewer different types of rack in the data centre. Instead of having "general purpose", "compute-optimized" and "memory-optimized" systems in the rack, every rack is a set of "compute-optimized" servers with a hefty chunk of CXL-attached memory available, and "general purpose" and "memory-optimized" servers are formed by using a "compute-optimized" server and attaching a chunk of CXL memory to it from the lump in the rack.

The other applications don't feel like they have a strong economic incentive attached to them; more memory than the base system can support was one of 3DXP's selling points, which failed, and the "fun" ideas are ones that feel more like "we can do this now, let's see if it's saleable".
That said, it's worth noting what CXL actually is; it's a set of cache coherency protocols layered atop PCIe, with some of the more esoteric options for the PCIe layer banned by the CXL spec. Anything that supports CXL can also trivially support PCIe, just losing cache coherency when attached via PCIe. I therefore wouldn't be hugely surprised if the long-term outcome is that CXL-attached DRAM is a flash in the pan, but CXL survives because it allows CPUs and GPUs/NPUs to be connected such that each device's local DRAM is visible cache-coherently to the other, making it simpler for software to get peak performance from CXL attached devices.
Posted Apr 15, 2025 21:52 UTC (Tue)
by marcH (subscriber, #57642)
[Link] (3 responses)
Thanks, then the next questions are: how far from this is software? And: is this use case big enough to sustain the whole CXL ecosystem? At least for a while. Cause Small Fish Eat Big Fish etc.
"Killer App" or just "App"? And dies young like most technologies.
> but CXL survives because it allows CPUs and GPUs/NPUs to be connected such that each device's local DRAM is visible cache-coherently to the other, making it simpler for software to get peak performance from CXL attached devices.
Mmmm... I'm afraid there could be "too much" software there! GPUs are already talking to each other using NVLink or whatever and the relevant software frameworks already know how to manage communications without hardware provided coherence. So, what will coherence bring to the table? Potentially better performance? Not easy to demonstrate when the raw link rate is much higher in the first place...
There's a saying that goes like "if you have too many excuses, it's probably because none of them is good". There are many potential use cases for CXL that make sense in theory. But AFAIK none you can just get and leverage in production yet. We'll see.
Posted Apr 16, 2025 10:31 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (2 responses)
CXL is basically a less proprietary variant on what NVLink offers for GPU→GPU comms, and thus supports more device types (like NICs, SSDs, and memory). If CXL is dead on arrival, NVLink should also have been dead on arrival.

Instead, I expect that CXL will gradually replace PCIe as the interface of choice for GPUs, higher speed NICs, SSDs etc, since it's backwards-compatible with PCIe (so you're not cutting off part of your market by putting CXL on your device instead of PCIe), but is a net improvement if the rest of the system supports CXL. And as it's mostly the same as PCIe, it's not a significant extra cost to support CXL as well as PCIe.
And CXL memory support as needed for the hyperscaler application is there already today; this is not a case of "write software to make it happen", this is a case of "if we don't improve software, then this application is less efficient than it might be", since from the host OS's point of view, CXL memory might as well be IMC-attached DDR, just with higher latency than the IMC-attached DRAM. There's wins if software can make use of the fact that 64 GiB of RAM has lower latency than the other 192 GiB, 448 GiB or 960 GiB of RAM, but you can meet the requirement with CXL-unaware software today. In this respect, it's like NUMA; there's wins on offer if you are NUMA-aware, but you still run just fine if you're not.
In particular, you can support CXL memory by rebooting to add or remove it - it's a quicker version of "turn the machine off, plug in more DIMMs, turn the machine on", since you're instead doing "boot to management firmware, claim/release a CXL chunk, boot to OS". It'd be nicer if you can do that without a reboot (by hotplugging CXL memory), but that's a nice-to-have, not a needed to make this product viable.
Posted Apr 16, 2025 19:29 UTC (Wed)
by marcH (subscriber, #57642)
[Link] (1 responses)
I don't know NVLink but it does not seem to offer hardware coherence. Does it?
> If CXL is dead on arrival, NVLink should also have been dead on arrival.
There many intermediate possibilities between "dead on arrival" and "commercial success", notably: losing to the competition, "interesting idea but no thanks", "Embrace, Extend and Extinguish", etc.
> since it's backwards-compatible with PCIe
That's a big advantage, yes.
> it's not a significant extra cost to support CXL as well as PCIe.
I think it really depends what you're looking at. From a pure hardware, CPU development perspective, you could argue most of the development work is done but is it really? You know for sure only when entering actual production and I'm not aware of that yet. Moreover, "developers" tend to ignore everything outside development, notably testing and on-going validation costs.
From a hardware _device_ perspective I'm not so sure. I guess "it depends". CXL smart NICs anyone? A lot of that stuff is obviously confidential. If CXL devices are not commercially successful, CXL support on the CPU side will "bitrot" and could die.
From a software cost perspective, this looks very far from "done" https://docs.kernel.org/driver-api/cxl/maturity-map.html
> And CXL memory support as needed for the hyperscaler application is there already today;
Is it really? Genuine question, I really don't know enough but what I see and read here and there does not give a lot of confidence. I understand there are many different use cases and this seems like the simplest one.
> In this respect, it's like NUMA; there's wins on offer if you are NUMA-aware, but you still run just fine if you're not.
Good!
Posted Apr 17, 2025 8:52 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
NVLink is a brand name for multiple different (and incompatible) things. Some variants on NVLink do support cache coherency between GPUs, some don't (it depends on the generation of GPU you're using it with); the current generation does, in part because "AI" workloads need so much GPU memory that Nvidia is using NVLink to support attaching a large chunk of slightly higher latency RAM to a processing board.

And yes, CXL is basically done and ready to use if you're happy using it as "just" cache-coherent PCIe (which is what the AI accelerator world wants from it). The software stuff you've linked there is the stuff you need to do if you want to do more than cache-coherent PCIe - online reallocation of memory ownership, standardised EDAC (rather than per-board EDAC like in PCIe), multi-host support (rather than single-host), and so on. A lot of this is stuff that exists on an ad-hoc basis in various GPUs, NICs and SSDs already; the difference CXL makes is that instead of doing it differently in each driver, you're doing it in the CXL subsystem.
The specific thing that works today is booting systems with a mix of CXL and IMC memory, and rebooting to change the CXL memory configuration. That's enough for the basic hyperscaler application of "memory pool in a rack"; everything else is enhancements to make it better (e.g. being able to assign CXL memory at runtime, having shared CXL memory between two hosts in a rack and more).
Posted Apr 17, 2025 14:47 UTC (Thu)
by gmprice (subscriber, #167884)
[Link] (1 responses)
Do they though? Is this actually deployed anywhere in any reasonable capacity, or is it some idea some business wonk loves because he can make the numbers look pretty?
I ran the numbers at one point, and pooling seems like a mistake unless you have a large read-only shared memory use case (like FAMFS is trying to service). All you get with pooling is a giant failure domain waiting to blow up and cost you millions upon millions of dollars in downtime. The flexibility might let you provide more flexible VM shapes, but the question is how valuable such a service would be.
There are papers that say it "solves stranded memory" and other papers that say "Actually, you have a bin-packing problem, get good". Hard to say who has it right, but I can't say definitively that CXL provides a novel and useful solution to that problem.
Remember that this all takes rack space and power. For every 1U of memory-only space, you have to balance this against 1U of additional compute space. The numbers don't work out the way you think they do - the opportunity costs are real.
Posted Apr 17, 2025 15:27 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
I can't comment in detail, because of NDAs, but yes, they do, because they already have the bin packing problem, and CXL moves when you deal with it from "building the DC" to "while operating the DC".

Today, you typically build a rack of either compute-optimized, general purpose or memory-optimized servers, and you get most active servers per rack if they're compute-optimized (since you can't actually power up all 42U or whatever height you have of rack at once, due to cooling and power constraints), and fewest if they're memory-optimized. This forces you into a balancing act; you want to bias towards compute-optimized servers, but you need enough general purpose and memory-optimized servers to handle workloads that need more RAM than a compute-optimized server has.
The CXL memory promise is that you have only compute-optimized racks with powered-down CXL memory devices in the rack. If you need general purpose or memory-optimized servers, you power down some compute-optimized servers to power up some CXL memory, and change the rack configuration from (numbers invented) 10 compute-optimized servers to 3 memory-optimized servers and 3 compute-optimized on the fly. When the workload balance changes (and you pressure your internal teams to get the balance to change if at all possible, because of the aforementioned power and cooling limits), you switch the rack back to compute-optimized servers.
Posted Apr 11, 2025 21:02 UTC (Fri)
by epa (subscriber, #39769)
[Link] (1 responses)
Posted Apr 11, 2025 21:08 UTC (Fri)
by willy (subscriber, #9762)
[Link]
The relative latency, the relative bandwidth, the relative capacities all drive towards different solutions. That's why it's so unhelpful when the CXL advocates bring up HBM as an example.
Posted Apr 12, 2025 2:59 UTC (Sat)
by hnaz (subscriber, #67104)
[Link]
I would call it the "warm set problem". Not all data that a workload references needs the same access speeds, and there is more and more data between the hottest and the coldest bits for which storage is too slow, but first-class RAM is getting a bit too overkill and too expensive in terms of capex and power. Compression is a pretty great solution for this, which is why it's used by pretty much every phone, laptop and tablet currently sold, and widely used by hyper-scalers. It's kind of insane how far you can stretch a few gigabytes of RAM with compression, for workloads that would otherwise be dog slow or routinely OOM with nothing in between RAM and storage.
But compression is still using first-class DRAM, specced, clocked and powered for serving the hottest pages to the higher-level caches.
And folks who are worried about CXL access latencies likely won't be excited about the cycles spent on faults and decompression.
I'm not a fan of dumping the placement problem on the OS/userspace. This may kill it before any sort of market for second tier dimms can establish itself. And I'm saying all this despite having been critical of CXL adoption at my company based on the (current) implications on the software stack, the longevity of that design, and the uncertainty around sourcing hardware for that role long-term.
But I can definitely see how it's very attractive to provision less of the high-performance memory for that part of the data that doesn't really need it. Or would only need it if the alternative is compression or storage, which would waste or strand much more CPU. That's just a memory/cache hierarchy thing: the shittier level L is, the more you need of level L-1 to keep the CPU busy.
So I don't quite understand the argument to restrict it to certain types of memory, like zswap. If the latency is good enough for access AND decompression, why wouldn't it be good enough for access? Somewhat slower page cache seems better than the, what, thousand-fold cost of a miss. It's not just about the performance gap to first-class DRAM, but also about the gap to the next best thing.
Posted Apr 10, 2025 23:38 UTC (Thu)
by gmprice (subscriber, #167884)
[Link]
Keep things you KNOW can cause issues out and it limits the complexity when you go hunting performance regressions.
Posted Apr 11, 2025 5:03 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
1 light-nanosecond is 30 centimeters (around 1ft in Freedom Units).
Posted Apr 11, 2025 15:47 UTC (Fri)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Oh you wanted your light to go faster? Picky, picky! ;-)
Posted Apr 13, 2025 1:08 UTC (Sun)
by Kamilion (guest, #42576)
[Link]
During this conversation, I grabbed a micron 9300 U.2 drive, pointed at its debug USB-Micro connector, and demanded of him a pair of USB-4 ports in the next hardware rev. He laughed and said he'd see what he could do. Pointed him at the AP33772 datasheet, suggested that bringup would be a lot easier if they had a method of direct-access for provisioning, bidirectional 12V PDOs for power exchange, displayport tunneling for their future in UEFI debugging, and a user-terminal. He got really excited about that. Apparently it took them something like nine months to figure out how to get a prototype dead-bugged with magnet wire just to get far enough to boot a kernel, "when it could have been as easy as plugging a portable monitor like yours in, and a keyboard." "... yep. Would have been even easier if the portable monitor had a bluetooth dongle hubbed into it."
All kinds of shenanigans went on to get sgabios stuffed into its firmware load, in order to control it via serial terminal. Silly rabbits.
*if* they eventually get buyin from AMD or get large enough to start ordering customized silicon, they'll probably try moving to chiplet memory instead of discrete DRAM packages on the same carrier PCB. At that point I expect "most" memory visible to the system to be available over the CXL fabric. It's not too insane to think of a shoebox with four EDSFF bay slots handling whole-home compute when smart-tvs are already in wide use. A pair of processing elements, a memory element, and a storage element.
How such a system is to be managed by an enduser versus an enterprise, on the other hand, was/is still an open question to me.
My suggestion to him on that front was to start simple: Just throw an ASpeed BMC in any chassis, handling the fans and CXL switch. Load it with openbmc, then push an industry standard network boot configuration to any blade inserted. There's already existing JBOD boards for ATX chassis that do this, as well as existing blade chassis like supermicro's old 4 node C6100 design for dell.
"A great little lunch meeting" I had in january, ironically enough, started over how their end-of-life should be handled. I think I successfully convinced him that reuse after initial deployment is a positive factor, not a negative one. Hopefully he got positive responses from the rest of his team, but I've not had a chance to catch back up with him since then.
Posted Apr 13, 2025 8:17 UTC (Sun)
by gmprice (subscriber, #167884)
[Link]
IIRC the idea is basically you can route more CXL lanes to the chip than you can DDR, so you can get more bandwidth in exchange for latency.
Not sure I buy their assessments completely, but it's interesting I suppose.
Posted Apr 12, 2025 22:02 UTC (Sat)
by snajpa (subscriber, #73467)
[Link]
Reducing the whole CXL potential/debate to "memory recycling" technology, is also heavily underwhelming. It's also for CPU-to-CPU cache coherent comms, so I was expecting to get fiber bridges between systems which fit in PCIe slots, so we can finally build out rackscale NUMA machines, as mere mortals, without own R&D...
Please don't kill the technology with such a weak momentum by coming at it from a wrong angle, it can be far more than glorified swap - there's also nothing novel for us by thinking in that direction, I think.