kmalloc

Posted Apr 10, 2025 23:09 UTC (Thu) by gmprice (subscriber, #167884)
In reply to: kmalloc by willy
Parent article: Management of volatile CXL devices

I'm not sure there's is a good way to keep page cache out easily.

kmalloc

Posted Apr 10, 2025 23:36 UTC (Thu) by willy (subscriber, #9762) [Link] (9 responses)

Sure there is. Don't add CXL memory to the page allocator. Now all you need is a way to allow anon memory to allocate from the CXL allocator (if userspace has configured that)

kmalloc

Posted Apr 10, 2025 23:50 UTC (Thu) by gmprice (subscriber, #167884) [Link] (8 responses)

Asking workloads to use anything other than standard system allocators is basically saying you want the technology to die - because good luck getting some random runtime (python, go, rust, c++, etc) to "understand CXL".

Which, fair enough, we can disagree.

But people buy systems with HBM and DDR, and why would I buy a system with DDR when I could buy a system with HBM? At a certain point, the $ matters more than the latency lol.

kmalloc

Posted Apr 11, 2025 2:27 UTC (Fri) by jedix (subscriber, #116933) [Link] (4 responses)

The only thing that makes remote sense is to use your slower memory under memory pressure. In other words, use your new technology for a glorified swap. Otherwise you underutilized the faster memory, which I think we can agree, is suboptimal?

If you want some fancy 'pay for faster memory on my cloud', then yes you need your application to know how to allocate from a special interface.

I also don't think your argument of HBM vs DDR is valid, what you are asking is for someone to buy a system with both HBM and DDR and with no extra work (python, go, rust, c++, etc), make it run faster. Another way to look at it is you are asking someone to buy EXTRA hardware to run slower; $ == latency.

kmalloc

Posted Apr 11, 2025 6:12 UTC (Fri) by gmprice (subscriber, #167884) [Link] (3 responses)

The problem is exactly comparing it to swap - swap implies faults. There's no faults here, and in most cases there's no real reason to treat it as anything other than plain old RAM (except keeping certain classes of data out of it by default).

Approaching it purely as a swap-ish tier is missing the boat.

I like data, so here's some data from a real system.

FIO benchmark w/ a file loaded up into the page cache - since that was a concern.

100% DRAM (128GB file)
bw=85.8GiB/s (92.1GB/s)
lat (usec) : 50=98.34%, 100=1.61%, 250=0.05%, 500=0.01%

100% CXL (128GB file)
bw=41.1GiB/s (44.2GB/s) - note this test actually caps the usable bandwidth.
lat (usec) : 100=86.55%, 250=13.45%, 500=0.01%

DRAM+CXL (~900GB file cache, >128GB in CXL via demotions)
bw=64.5GiB/s (69.3GB/s)
lat (usec) : 50=69.26%, 100=13.79%, 250=16.95%, 500=0.01%

DRAM+Disk (~900GB, but that >128GB now faults from disk)
bw=10.2GiB/s (10.9GB/s)
lat (usec) : 50=39.73%, 100=31.34%, 250=4.71%, 500=0.01%, 750=0.01%
lat (usec) : 1000=1.78%
lat (msec) : 2=21.63%, 4=0.81%, 10=0.01%, 20=0.01%

This is a fairly naive test, you'd really right-size your workloads to ensure projected hot capacity fits (mostly) in DRAM, and let CXL act as zswap-ish tier (compress if you need to, but otherwise leave it as pages!) to avoid the degenerate scenario (faulting from disk).

What we have now (ZONE_MOVABLE + demotion in reclaim + zswap) is already sufficient to do this - we just need to add some tweaks to deal with corner cases (transitory case 2). It'd also be nice to get better levels to allow opt-in/opt-out for this capacity (both for kernel and userland), but having pagecache use it seems perfectly fine in most every real-world test I've run.

I'm guessing I'll die on this hill - the latencies aren't as scary as people make them out to be, and approaching it with solutions built with disk-latencies in mind really misses the boat on how decent a memory solution this is. This isn't some weird proprietary memory technology behind a weird proprietary bus, it's literally DDR with extra steps.

It's a neat trick to point at the DRAM-only number and say "Look how much you're losing!" but hard admit you'd triple your server cost to get there in the real world. So no, you're not paying more money to go slower. You're paying significantly less money to go quite a bit faster, just not as fast as possible - and that's ok!

kmalloc

Posted Apr 11, 2025 18:12 UTC (Fri) by willy (subscriber, #9762) [Link]

Again, people made these arguments for 3dxp, and nobody was convinced.

Latency is about 2.5x higher for CXL than for DDR, which is honestly even worse than I thought. I guess part of that can be attributed to the immaturity of the technology, but I've demonstrated that it will always be slower than DDR.

Using it for zswap seems like a great idea. Using it for slab, stack, page-cache; all of these seem like terrible ideas.

kmalloc

Posted Apr 11, 2025 19:45 UTC (Fri) by jedix (subscriber, #116933) [Link] (1 responses)

Right now, the biggest bottleneck is getting things into memory. Superscalar has shifted what is expensive, time wise. Your benchmark shows that dram is 50% faster than cxl. This will cause more CPU stalls.

Trying to 'not miss the boat' is like predicting the future - you don't know what's going to be needed in memory - otherwise we'd not need CLX at all. If you could solve the issue of knowing what is needed to be fast vs slow, then you could just page the slow stuff out and move on. So, I think we either make the (z)swap approach work or spend the money on dram and not developer effort.

Look at your numbers, it's a solid win against disk. I don't think that's your point, but the tail is a _lot_ better. And that is using the swap code as it is today.

I mean, reusing your old ram is still stupid because you will spend the money you saved on dram on an army of people walking the data center floor replacing dead dimms (and probably killing some servers by accident while they go).

But let's say you are making new shiny tech from 2007 (ddr3) or 2014 (ddr4) to use in your 2025 servers, then you can have faster zswap. If you don't feel old enough yet (I do..): DDR3 is old enough to have voted in the last election. Although DDR4 is newer, it came out 11 years ago, the same year as Shake It Off topped the charts, so.. not yesterday (but, haters gonna hate).

The thing that is really starting to bother me is the amount of time we are spending trying to solve a self-made issue and burning good developers on it.

kmalloc

Posted Apr 11, 2025 19:58 UTC (Fri) by gmprice (subscriber, #167884) [Link]

> And that is using the swap code as it is today.

It's actually not. The pages are pages. The tail latency is avoided precisely because the page is still present and mapped, it's just been demoted to CXL instead of removed from the page tables.

So it's "zswap-ish", because reclaim migrates the page somewhere else - but in this case that somewhere else is another plain old page on another node.

If you enable zswap on top of this and give it the same size as your CXL tier, you'll end up with CXL consumed like zswap as you suggested - but the tail latencies will be higher (maybe not multi-millisecond, but i haven't tested this yet).

kmalloc

Posted Apr 11, 2025 3:20 UTC (Fri) by willy (subscriber, #9762) [Link] (2 responses)

Remember that I've been through this stupidity once before already with 3dxp. You're not saying anything new.

And yes, the notion absolutely was that "enlightening" the interpreter / runtime for managed languages was a viable approach. I mean ... maybe? But we're back to the question of "why would I deliberately choose to use slow memory".

What I do see as a realistic workload is "This is a cheap VM that doesn't get to use DRAM". So that's a question of enlightening the hypervisor to give CXL memory to the guest.

HBM is a distraction. That's never been available for CPUs in large quantities; it's always been used as L4 cache. I see it's available in real quantities for GPUs now, but I'm not seeing it on any CPU roadmap.

And, yes, I do want the technology to die. It's the same mistake as ATM and InfiniBand.

kmalloc

Posted Apr 11, 2025 6:37 UTC (Fri) by gmprice (subscriber, #167884) [Link] (1 responses)

Framing the discussion as "why would i deliberately use slower memory" is quite disingenuous considering zswap, zram, and swap all exist.

The answer is pretty clearly - to avoid having to use all of those things.

There is maybe a good argument to formalize a zswap backend allocator that can consume dax-memory, and hide that from the page allocator - but there is still clear value in just exposing it as a page. It's literally DDR behind a controller, not some pseudo-nand or page-fetch interconnect.

kmalloc

Posted Apr 11, 2025 18:29 UTC (Fri) by willy (subscriber, #9762) [Link]

> It's literally DDR behind a controller, not some pseudo-nand or page-fetch interconnect.

That's actually worse than the 3dxp business case. Because in the future where 3dxp had worked, the argument was that it came in VAST quantities. If I remember correctly, the argument at the time was for a 2 socket machine with 128GB of DRAM, 8TB of 3dxp and some amount of NVMe storage.

So you might reasonably want to say "Hey, give me 4TB of slow memory" because you'd designed your algorithm to be tolerant of that latency.

And then 3dxp also had the persistence argument for it; maybe there really was a use-case for storage-presented-as-memory. Again, CXL as envisaged by you doesn't provide that either. It's just DRAM behind a slow interconnect.