kmalloc

Posted Apr 10, 2025 23:17 UTC (Thu) by cesarb (subscriber, #6266)
In reply to: kmalloc by willy
Parent article: Management of volatile CXL devices

It wouldn't surprise me if one day we start seeing systems where most or even all of the memory is on CXL. Limiting its use sounds to me like a return of the old 32-bit HIGHMEM split.



kmalloc

Posted Apr 10, 2025 23:34 UTC (Thu) by willy (subscriber, #9762) [Link] (13 responses)

It would shock me. Latency of CXL memory is guaranteed to be 10ns worse than DIMMs, just on encode/decode overhead. In practice, I expect it to be significantly higher. What economic driver will cause consumers to want to buy slower computers?

kmalloc

Posted Apr 11, 2025 14:29 UTC (Fri) by edeloget (subscriber, #88392) [Link] (9 responses)

I think the whole idea is to be able to largely augment the RAM of a server even if it was not built to support that much memory. That would indeed allow you to add more low-cost VMs on the same server, provided that the existing VMs are not eating your processing power.

Other than that, I too have some difficulty understanding the need, especially since existing CXL modules seem to be quite expensive when compared to normal RAM modules.

Uses for CXL

Posted Apr 11, 2025 16:37 UTC (Fri) by farnz (subscriber, #17727) [Link] (8 responses)

One use for CXL is to support large memory configurations on servers that can't handle that much RAM natively. Another is to have a set of "small memory" servers with (say) 64 GiB of local RAM, attached to a 1 TiB CXL DRAM unit. Then, if you need more than 64 GiB on a single server, it allocates a chunk of CXL-attached DRAM, and reconfigures a "64 GiB" server as a 256 GiB or even a 512 GiB server, getting you more efficiency from your data center. Both of these still have the DRAM dedicated to the host that's using it - all CXL memory is either idle, or assigned exclusively to a host.

And then there's fun ideas like having VMs allocated entirely in CXL, so that migration to a different host can be done quickly (since the CXL memory is shared), or using CXL memory as a communication channel.

All of these use cases have to account for the increased latency of CXL, and come up with a good way of using it in conjunction with host memory, somehow. And it's entirely plausible that the cost savings of CXL as opposed to more DRAM in the hosts will be entirely wiped out by the increased costs from managing the different pools of memory.

Uses for CXL

Posted Apr 15, 2025 17:07 UTC (Tue) by marcH (subscriber, #57642) [Link] (7 responses)

> One use for CXL is ... Another is ... And then there's fun ideas like ...

It's always funny when hardware engineers forget that hardware is bought to run software, not micro-benchmarks. Yes, software is obsessed with getting the best FLOPS per watt but... only software that _already exists_ actually obsesses about performance. Software that does not exist does not care. Isn't that what killed 3D XP despite magnificent hardware specs?

To survive, _one_ "killer app" is enough. Good engineers love generic code and doing more with less (and rightly so) but this is not the time. Find the most promising use case (not easy, granted) and put all efforts behind that one. Once you've escaped the chicken and egg problem and have hardware widely available on the shelves for a reasonable price, all other potential use cases will follow eventually. Then, the initial killer app code will be rewritten and made more generic. A few times over.

If you struggle to find even just one great use case, then... goodbye? That does not necessarily mean the idea was bad. It could just mean that hardware companies (and everyone in general) were not willing to pay enough for software - especially not open software that helps the competition. Clearly, managing memory is hard, really hard. It would be neither the first time nor the last that software is the bottleneck for hardware.

Uses for CXL memory

Posted Apr 15, 2025 17:42 UTC (Tue) by farnz (subscriber, #17727) [Link] (6 responses)

The hyperscalers love the "pool of memory that can be attached to any server in the rack" application; that's the one use case for CXL DRAM that I think is most likely to get somewhere, since it lets you have fewer different types of rack in the data centre. Instead of having "general purpose", "compute-optimized" and "memory-optimized" systems in the rack, every rack is a set of "compute-optimized" servers with a hefty chunk of CXL-attached memory available, and "general purpose" and "memory-optimized" servers are formed by using a "compute-optimized" server and attaching a chunk of CXL memory to it from the lump in the rack.

The other applications don't feel like they have a strong economic incentive attached to them; more memory than the base system can support was one of 3DXP's selling points, which failed, and the "fun" ideas are ones that feel more like "we can do this now, let's see if it's saleable".

That said, it's worth noting what CXL actually is; it's a set of cache coherency protocols layered atop PCIe, with some of the more esoteric options for the PCIe layer banned by the CXL spec. Anything that supports CXL can also trivially support PCIe, just losing cache coherency when attached via PCIe. I therefore wouldn't be hugely surprised if the long-term outcome is that CXL-attached DRAM is a flash in the pan, but CXL survives because it allows CPUs and GPUs/NPUs to be connected such that each device's local DRAM is visible cache-coherently to the other, making it simpler for software to get peak performance from CXL attached devices.

Uses for CXL memory

Posted Apr 15, 2025 21:52 UTC (Tue) by marcH (subscriber, #57642) [Link] (3 responses)

> The hyperscalers love the "pool of memory that can be attached to any server in the rack" application; that's the one use case for CXL DRAM that I think is most likely to get somewhere, since...

Thanks, then the next questions are: how far from this is software? And: is this use case big enough to sustain the whole CXL ecosystem? At least for a while, because Small Fish Eat Big Fish etc.

"Killer app", or just an "app" that dies young like most technologies?

> but CXL survives because it allows CPUs and GPUs/NPUs to be connected such that each device's local DRAM is visible cache-coherently to the other, making it simpler for software to get peak performance from CXL attached devices.

Mmmm... I'm afraid there could be "too much" software there! GPUs are already talking to each other using NVLink or whatever, and the relevant software frameworks already know how to manage communications without hardware-provided coherence. So, what will coherence bring to the table? Potentially better performance? Not easy to demonstrate when the raw link rate is much higher in the first place...

There's a saying that goes something like "if you have too many excuses, it's probably because none of them is good". There are many potential use cases for CXL that make sense in theory, but AFAIK none that you can just pick up and leverage in production yet. We'll see.

Uses for CXL memory

Posted Apr 16, 2025 10:31 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

CXL is basically a less proprietary variant on what NVLink offers for GPU→GPU comms, and thus supports more device types (like NICs, SSDs, and memory). If CXL is dead on arrival, NVLink should also have been dead on arrival.

Instead, I expect that CXL will gradually replace PCIe as the interface of choice for GPUs, higher speed NICs, SSDs etc, since it's backwards-compatible with PCIe (so you're not cutting off part of your market by putting CXL on your device instead of PCIe), but is a net improvement if the rest of the system supports CXL. And as it's mostly the same as PCIe, it's not a significant extra cost to support CXL as well as PCIe.

And CXL memory support as needed for the hyperscaler application is there already today; this is not a case of "write software to make it happen", this is a case of "if we don't improve software, then this application is less efficient than it might be", since from the host OS's point of view, CXL memory might as well be IMC-attached DDR, just with higher latency than the IMC-attached DRAM. There's wins if software can make use of the fact that 64 GiB of RAM has lower latency than the other 192 GiB, 448 GiB or 960 GiB of RAM, but you can meet the requirement with CXL-unaware software today. In this respect, it's like NUMA; there's wins on offer if you are NUMA-aware, but you still run just fine if you're not.
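
To make the NUMA comparison concrete, here is a minimal userspace sketch of what "NUMA-aware" can look like in such a setup - assuming libnuma, and assuming the CXL pool shows up as its own CPU-less NUMA node; the way the nodes are picked below is purely illustrative, not how you'd discover the topology in production:

/*
 * Minimal sketch: treat CXL-attached memory as a separate, CPU-less NUMA
 * node, keep a latency-sensitive buffer on the local node and let a bulk
 * buffer live on the far node.  Build with: gcc numa_sketch.c -lnuma
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this system\n");
		return 1;
	}

	int local_node = numa_node_of_cpu(0);	/* node backing CPU 0 */
	int far_node = numa_max_node();		/* assumption: highest-numbered node is the CXL pool */

	/* Hot data: keep it on the low-latency, IMC-attached node. */
	size_t hot_sz = 64UL << 20;		/* 64 MiB */
	void *hot = numa_alloc_onnode(hot_sz, local_node);

	/* Warm/bulk data: place it on the CXL node; it is still ordinary memory. */
	size_t bulk_sz = 1UL << 30;		/* 1 GiB */
	void *bulk = numa_alloc_onnode(bulk_sz, far_node);

	if (!hot || !bulk) {
		fprintf(stderr, "allocation failed\n");
		return 1;
	}
	printf("hot buffer on node %d, bulk buffer on node %d\n",
	       local_node, far_node);

	numa_free(hot, hot_sz);
	numa_free(bulk, bulk_sz);
	return 0;
}

Nothing here is CXL-specific; it's the same libnuma machinery you'd use on any multi-socket box, which is rather the point.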

In particular, you can support CXL memory by rebooting to add or remove it - it's a quicker version of "turn the machine off, plug in more DIMMs, turn the machine on", since you're instead doing "boot to management firmware, claim/release a CXL chunk, boot to OS". It'd be nicer if you could do that without a reboot (by hotplugging CXL memory), but that's a nice-to-have, not a requirement for making the product viable.
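
And if the nice-to-have ever arrives, the last step looks like ordinary memory hotplug. A hypothetical sketch of onlining one hot-added memory block through sysfs - the block number is made up for illustration, real tooling would walk /sys/devices/system/memory/:

/*
 * Hypothetical sketch: once firmware (or the CXL driver stack) has hot-added
 * a region, each resulting memory block appears under
 * /sys/devices/system/memory/ and can be onlined without a reboot.
 */
#include <stdio.h>

static int online_block(const char *state_path)
{
	FILE *f = fopen(state_path, "w");

	if (!f)
		return -1;
	/*
	 * "online_movable" keeps kernel allocations off the block, so the
	 * region can be offlined again later if it is released to the pool.
	 */
	fputs("online_movable", f);
	return fclose(f);
}

int main(void)
{
	/* Made-up block number, for illustration only. */
	const char *path = "/sys/devices/system/memory/memory512/state";

	if (online_block(path) != 0) {
		perror(path);
		return 1;
	}
	printf("onlined %s\n", path);
	return 0;
}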

Uses for CXL memory

Posted Apr 16, 2025 19:29 UTC (Wed) by marcH (subscriber, #57642) [Link] (1 responses)

> CXL is basically a less proprietary variant on what NVLink offers for GPU→GPU comms,

I don't know NVLink but it does not seem to offer hardware coherence. Does it?

> If CXL is dead on arrival, NVLink should also have been dead on arrival.

There are many intermediate possibilities between "dead on arrival" and "commercial success", notably: losing to the competition, "interesting idea but no thanks", "Embrace, Extend and Extinguish", etc.

> since it's backwards-compatible with PCIe

That's a big advantage, yes.

> it's not a significant extra cost to support CXL as well as PCIe.

I think it really depends on what you're looking at. From a pure hardware, CPU-development perspective, you could argue most of the development work is done - but is it really? You only know for sure when entering actual production, and I'm not aware of that yet. Moreover, "developers" tend to ignore everything outside development, notably testing and ongoing validation costs.

From a hardware _device_ perspective I'm not so sure. I guess "it depends". CXL smart NICs anyone? A lot of that stuff is obviously confidential. If CXL devices are not commercially successful, CXL support on the CPU side will "bitrot" and could die.

From a software cost perspective, this looks very far from "done": https://docs.kernel.org/driver-api/cxl/maturity-map.html

> And CXL memory support as needed for the hyperscaler application is there already today;

Is it really? Genuine question, I really don't know enough but what I see and read here and there does not give a lot of confidence. I understand there are many different use cases and this seems like the simplest one.

> In this respect, it's like NUMA; there's wins on offer if you are NUMA-aware, but you still run just fine if you're not.

Good!

Uses for CXL memory

Posted Apr 17, 2025 8:52 UTC (Thu) by farnz (subscriber, #17727) [Link]

NVLink is a brand name for multiple different (and incompatible) things. Some variants on NVLink do support cache coherency between GPUs, some don't (it depends on the generation of GPU you're using it with); the current generation does, in part because "AI" workloads need so much GPU memory that Nvidia is using NVLink to support attaching a large chunk of slightly higher latency RAM to a processing board.

And yes, CXL is basically done and ready to use if you're happy using it as "just" cache-coherent PCIe (which is what the AI accelerator world wants from it). The software stuff you've linked there is the stuff you need to do if you want to do more than cache-coherent PCIe - online reallocation of memory ownership, standardised EDAC (rather than per-board EDAC like in PCIe), multi-host support (rather than single-host), and so on. A lot of this is stuff that exists on an ad-hoc basis in various GPUs, NICs and SSDs already; the difference CXL makes is that instead of doing it differently in each driver, you're doing it in the CXL subsystem.

The specific thing that works today is booting systems with a mix of CXL and IMC memory, and rebooting to change the CXL memory configuration. That's enough for the basic hyperscaler application of "memory pool in a rack"; everything else is enhancements to make it better (e.g. being able to assign CXL memory at runtime, having shared CXL memory between two hosts in a rack and more).

Uses for CXL memory

Posted Apr 17, 2025 14:47 UTC (Thu) by gmprice (subscriber, #167884) [Link] (1 responses)

> The hyperscalers love the "pool of memory that can be attached to any server in the rack" application;

Do they though? Is this actually deployed anywhere in any reasonable capacity, or is it some idea some business wonk loves because he can make the numbers look pretty?

I ran the numbers at one point, and pooling seems like a mistake unless you have a large read-only shared memory use case (like FAMFS is trying to service). All you get with pooling is a giant failure domain waiting to blow up and cost you millions upon millions of dollars in downtime. The flexibility might let you provide more flexible VM shapes, but the question is how valuable such a service would be.

There are papers that say it "solves stranded memory" and other papers that say "Actually, you have a bin-packing problem, get good". Hard to say who has it right, but I can't say definitively that CXL provides a novel and useful solution to that problem.

Remember that this all takes rack space and power. Every 1U of memory-only space has to be balanced against 1U of additional compute space. The numbers don't work out the way you think they do - the opportunity costs are real.

Uses for CXL memory

Posted Apr 17, 2025 15:27 UTC (Thu) by farnz (subscriber, #17727) [Link]

I can't comment in detail, because of NDAs, but yes, they do, because they already have the bin-packing problem, and CXL moves the point at which you deal with it from "building the DC" to "operating the DC".

Today, you typically build a rack of either compute-optimized, general purpose or memory-optimized servers, and you get the most active servers per rack if they're compute-optimized (since you can't actually power up all 42U, or whatever height of rack you have, at once, due to cooling and power constraints), and the fewest if they're memory-optimized. This forces you into a balancing act; you want to bias towards compute-optimized servers, but you need enough general purpose and memory-optimized servers to handle workloads that need more RAM than a compute-optimized server has.

The CXL memory promise is that you have only compute-optimized racks with powered-down CXL memory devices in the rack. If you need general purpose or memory-optimized servers, you power down some compute-optimized servers to power up some CXL memory, and change the rack configuration from (numbers invented) 10 compute-optimized servers to 3 memory-optimized servers and 3 compute-optimized on the fly. When the workload balance changes (and you pressure your internal teams to get the balance to change if at all possible, because of the aforementioned power and cooling limits), you switch the rack back to compute-optimized servers.

What systems will have CXL?

Posted Apr 11, 2025 21:02 UTC (Fri) by epa (subscriber, #39769) [Link] (1 responses)

I believe Apple’s systems — at least their laptops — have CPU and memory in a single package. Apple manufactures versions with more RAM in the package, and sells them dearer; but eventually they must reach the point where it’s not feasible to cram more memory into the same package. I can imagine them starting to sell systems with ‘hybrid’ memory, some faster and some not quite as fast, just as for a time Apple marketed computers with a large hard disk and a small SSD.

What systems will have CXL?

Posted Apr 11, 2025 21:08 UTC (Fri) by willy (subscriber, #9762) [Link]

Yes, I can imagine all kinds of possible systems. The problem is that the right answer for "how do we use this" is very different between 8GB of HBM + 64GB of DRAM vs 256GB of DRAM + 128GB of CXL.

The relative latency, the relative bandwidth, the relative capacities all drive towards different solutions. That's why it's so unhelpful when the CXL advocates bring up HBM as an example.

kmalloc

Posted Apr 12, 2025 2:59 UTC (Sat) by hnaz (subscriber, #67104) [Link]

> What economic driver will cause consumers to want to buy slower computers?

I would call it the "warm set problem". Not all data that a workload references needs the same access speed, and there is more and more data between the hottest and the coldest bits for which storage is too slow, but first-class RAM is a bit of overkill and too expensive in terms of capex and power. Compression is a pretty great solution for this, which is why it's used by pretty much every phone, laptop and tablet currently sold, and widely used by hyperscalers. It's kind of insane how far you can stretch a few gigabytes of RAM with compression, for workloads that would otherwise be dog slow or routinely OOM with nothing in between RAM and storage.

But compression is still using first-class DRAM, specced, clocked and powered for serving the hottest pages to the higher-level caches.

And folks who are worried about CXL access latencies likely won't be excited about the cycles spent on faults and decompression.

I'm not a fan of dumping the placement problem on the OS/userspace. This may kill it before any sort of market for second-tier DIMMs can establish itself. And I'm saying all this despite having been critical of CXL adoption at my company, based on the (current) implications for the software stack, the longevity of that design, and the uncertainty around sourcing hardware for that role long-term.

But I can definitely see how it's very attractive to provision less of the high-performance memory for that part of the data that doesn't really need it. Or would only need it if the alternative is compression or storage, which would waste or strand much more CPU. That's just a memory/cache hierarchy thing: the shittier level L is, the more you need of level L-1 to keep the CPU busy.

So I don't quite understand the argument to restrict it to certain types of memory, like zswap. If the latency is good enough for access AND decompression, why wouldn't it be good enough for access? Somewhat slower page cache seems better than the, what, thousand-fold cost of a miss. It's not just about the performance gap to first-class DRAM, but also about the gap to the next best thing.

kmalloc

Posted Apr 10, 2025 23:38 UTC (Thu) by gmprice (subscriber, #167884) [Link]

You really only want to limit its use in a heterogeneous environment where it's significantly slower. Relegate it to demotion in reclaim and fallback allocations under pressure, and you'll get reliable high performance under normal conditions and better performance where you might otherwise have had to use swap.
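
As a concrete (and simplified) example of that "fallback under pressure" behaviour, a process can ask for exactly that with a preferred-node memory policy - here node 0 stands in for the fast, IMC-attached DRAM, which is an assumption about the topology:

/*
 * Sketch: prefer the fast node for new allocations, and spill to the slower
 * (e.g. CXL) nodes only when the preferred node is under pressure.
 * Build with: gcc prefer_fast.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>

#define FAST_NODE 0	/* assumption: node 0 is the IMC-attached DRAM */

int main(void)
{
	unsigned long nodemask = 1UL << FAST_NODE;

	if (set_mempolicy(MPOL_PREFERRED, &nodemask,
			  sizeof(nodemask) * 8) != 0) {
		perror("set_mempolicy");
		return 1;
	}
	puts("prefer the fast node; fall back to slower nodes under pressure");
	return 0;
}

With MPOL_PREFERRED the kernel allocates from the preferred node while it can, and spills over to the remaining nodes (which could include a CXL node) when it can't - which is the fallback behaviour described above.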

Keep things you KNOW can cause issues out and it limits the complexity when you go hunting performance regressions.

kmalloc

Posted Apr 11, 2025 5:03 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Things you can't change:

1 light-nanosecond is 30 centimeters (around 1ft in Freedom Units).

kmalloc

Posted Apr 11, 2025 15:47 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Put some glass in its path and light will slow right down.

Oh you wanted your light to go faster? Picky, picky! ;-)

kmalloc

Posted Apr 13, 2025 1:08 UTC (Sun) by Kamilion (guest, #42576) [Link]

Such systems are already in the works; "someone I know" has been working on getting AMD APUs onto E3.S blades roughly the size of a 2.5" drive, in an EDSFF backplane with CXL fabric support. Each blade only has 8GB currently, and I've heard there are plans to "figure out" how to attach CXL memory to run larger workloads, but so far they've just gotten fabric storage working, which is where I came in with nvme-cli-fu.

During this conversation, I grabbed a Micron 9300 U.2 drive, pointed at its debug USB-Micro connector, and demanded of him a pair of USB-4 ports in the next hardware rev. He laughed and said he'd see what he could do. Pointed him at the AP33772 datasheet, suggested that bringup would be a lot easier if they had a method of direct access for provisioning, bidirectional 12V PDOs for power exchange, DisplayPort tunneling for their future in UEFI debugging, and a user terminal. He got really excited about that. Apparently it took them something like nine months to figure out how to get a prototype dead-bugged with magnet wire just to get far enough to boot a kernel, "when it could have been as easy as plugging in a portable monitor like yours, and a keyboard." "... yep. Would have been even easier if the portable monitor had a bluetooth dongle hubbed into it."

All kinds of shenanigans went on to get sgabios stuffed into its firmware load, in order to control it via serial terminal. Silly rabbits.

*if* they eventually get buy-in from AMD or get large enough to start ordering customized silicon, they'll probably try moving to chiplet memory instead of discrete DRAM packages on the same carrier PCB. At that point I expect "most" memory visible to the system to be available over the CXL fabric. It's not too insane to think of a shoebox with four EDSFF bay slots handling whole-home compute when smart TVs are already in wide use. A pair of processing elements, a memory element, and a storage element.

How such a system is to be managed by an enduser versus an enterprise, on the other hand, was/is still an open question to me.

My suggestion to him on that front was to start simple: just throw an ASpeed BMC in any chassis, handling the fans and the CXL switch. Load it with OpenBMC, then push an industry-standard network boot configuration to any blade inserted. There are already existing JBOD boards for ATX chassis that do this, as well as existing blade chassis like Supermicro's old 4-node C6100 design for Dell.

"A great little lunch meeting" I had in January, ironically enough, started over how their end-of-life should be handled. I think I successfully convinced him that reuse after initial deployment is a positive factor, not a negative one. Hopefully he got positive responses from the rest of his team, but I've not had a chance to catch back up with him since then.

kmalloc

Posted Apr 13, 2025 8:17 UTC (Sun) by gmprice (subscriber, #167884) [Link]

This has already been proposed. https://arxiv.org/pdf/2305.05033

IIRC the idea is basically you can route more CXL lanes to the chip than you can DDR, so you can get more bandwidth in exchange for latency.

Not sure I buy their assessments completely, but it's interesting I suppose.

