LWN: Comments on "CXL 1: Management and tiering"
https://lwn.net/Articles/894598/
A feed of the comments posted to the LWN article "CXL 1: Management and tiering".

CXL 1: Management and tiering
https://lwn.net/Articles/895479/
sdalley (Tue, 17 May 2022 08:27:19 +0000):

Personally, I'm more than happy to trade a little more memory for a little more sanity.

And message passing needn't consume more memory if the message buffer simply changes hands between owners rather than being copied. A good interface to such a message-passing scheme can also ensure, under the hood, that references to buffers no longer owned are always NULLed (see the sketch below).

CXL 1: Management and tiering
https://lwn.net/Articles/895404/
ballombe (Mon, 16 May 2022 16:30:16 +0000):

> Message passing requires an extra programming effort but, unlike shared memory, performance and correctness issues can be traced and debugged in a reasonable amount of time.

But message passing often requires more total memory for the same task.
And often the performance issue is traced to "not enough memory per node".
So...

CXL 1: Management and tiering
https://lwn.net/Articles/895324/
willy (Sun, 15 May 2022 21:58:40 +0000):

Oh yes, the CXL boosters have a future where everything becomes magically cheap. I don't believe that world will come to pass. I think the future of HPC will remain one- or two-socket CPU boxes with one or two GPUs, much more closely connected over CXL, but the scale-out interconnect isn't going away, and I doubt the scale-out interconnect will be CXL. Maybe it will; I've been wrong before.

I have no faith in disaggregated systems. You want your memory close to your [CG]PU. If you upgrade your [CG]PU, you normally want to upgrade your interconnect and memory at the same time. The only way this makes sense is if we've actually hit a limit in bandwidth and latency, and that doesn't seem to have happened yet, despite the sluggish adoption of PCIe 4.

The people who claim "oh, you want a big pool of memory on the fabric behind a switch connected to lots of front-end systems" have simply not thought about the reality of firmware updates on the switch or the memory controller. How do you schedule downtime for the N customers using the [CG]PUs? "Tuesday mornings 7-9am are network maintenance windows" just isn't a thing any more.

CXL 1: Management and tiering
https://lwn.net/Articles/895315/
Paf (Sun, 15 May 2022 19:46:09 +0000):

This makes more sense; it's mostly a way to make the clean cases easier and well supported by hardware.

Looking it up, I'm seeing a lot of stuff about disaggregated systems, which just seems crazy. But marketing doesn't have to match the actual intent of the main implementers.
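The ownership hand-off sdalley describes above is straightforward to sketch. The queue, message structure, and msg_send()/msg_recv() helpers below are hypothetical, not from any particular library; the only point being illustrated is that the send call consumes the caller's reference and NULLs it, so a stale pointer to a buffer that has changed hands fails loudly instead of silently racing with the receiver.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical message type: a heap-allocated buffer plus its length. */
    struct msg {
        size_t len;
        char   data[];
    };

    /* A toy single-slot "queue" standing in for a real message-passing API. */
    struct queue {
        struct msg *slot;
    };

    /*
     * Hand the message over to the queue.  Ownership moves with it: the
     * caller's pointer is NULLed, so the buffer is never copied and the
     * sender can no longer touch memory it has given away.
     */
    static void msg_send(struct queue *q, struct msg **msgp)
    {
        q->slot = *msgp;
        *msgp = NULL;
    }

    /* Receiving transfers ownership back out of the queue the same way. */
    static struct msg *msg_recv(struct queue *q)
    {
        struct msg *m = q->slot;
        q->slot = NULL;
        return m;
    }

    int main(void)
    {
        struct queue q = { NULL };
        struct msg *m = malloc(sizeof(*m) + 6);

        m->len = 6;
        memcpy(m->data, "hello", 6);

        msg_send(&q, &m);      /* m is NULL from here on; no copy was made */

        struct msg *r = msg_recv(&q);
        printf("%s\n", r->data);
        free(r);
        return 0;
    }

A real implementation would of course need locking or a lock-free ring between the two ends, but the ownership hand-off is the part that keeps shared-memory races out of the programming model while still avoiding ballombe's extra-copy cost.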
CXL 1: Management and tiering
https://lwn.net/Articles/895296/
marcH (Sun, 15 May 2022 03:17:23 +0000):

> instead of getting surprised when memory access is magically slow.

Well, it's not like single-thread performance is deterministic either. However, I agree shared memory really is crossing a line: it's the biggest programming footgun ever invented by hardware engineers.

Message passing requires an extra programming effort but, unlike shared memory, performance and correctness issues can be traced and debugged in a reasonable amount of time.

https://queue.acm.org/detail.cfm?id=3212479

> Caches are large, but their size isn't the only reason for their complexity. The cache coherency protocol is one of the hardest parts of a modern CPU to make both fast and correct. Most of the complexity involved comes from supporting a language in which data is expected to be both shared and mutable as a matter of course.

CXL 1: Management and tiering
https://lwn.net/Articles/895284/
willy (Sat, 14 May 2022 20:16:18 +0000):

I don't think we're going to see 2048-node clusters built on top of CXL. The physics just doesn't support it.

The use cases I'm seeing are:

- Memory-only devices. Sharing (and cache coherency) is handled by the CPUs that access them. Basically CXL as a replacement for the DDR bus.

- GPU/similar devices. They can access memory coherently, but if you have any kind of contention between the CPU and the GPU, performance will tank. Programs are generally written to operate in phases of GPU-only and CPU-only access, but want migration handled for them.

Maybe there are other uses, but there's no getting around physics.

CXL 1: Management and tiering
https://lwn.net/Articles/895279/
Paf (Sat, 14 May 2022 16:01:42 +0000):

As someone who's worked in HPC and watched the shared-memory machines be displaced by true clustered systems, despite the intense and remarkable engineering that went into keeping those shared-memory machines coherent at scale... yeah.

So this is the thing we're doing again this week.

That doesn't mean it's not worth it - those machines made a lot of sense for a while, and shifting trends may make that true again - but the cost of coherency across links like these is *intense*. I wonder where and how much use this will see, what cases it will be fast enough for, and so on.

CXL 1: Management and tiering
https://lwn.net/Articles/895225/
MattBBaker (Fri, 13 May 2022 20:37:35 +0000):

Will there be an option to bypass the kernel and program the memory directly? It's nice when the kernel can hide the details, but for some applications it's really better if the application is aware of the different tiers of memory access and can explicitly pass messages instead of getting surprised when memory access is magically slow.
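The explicit tier awareness MattBBaker asks for can already be approximated from user space when the slower memory is exposed as its own (typically CPU-less) NUMA node, which is how Linux currently presents CXL memory expanders. What follows is only a minimal sketch using libnuma, not a way to bypass the kernel; the node number is a placeholder for whatever node the CXL memory shows up as on a given system.

    #include <stdio.h>
    #include <stdlib.h>
    #include <numa.h>        /* libnuma; link with -lnuma */

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        /* Placeholder assumption: the CXL-attached memory is node 1. */
        int cxl_node = 1;
        size_t size = 64UL << 20;   /* 64 MiB */

        /* Ask explicitly for memory on the far (CXL) node... */
        void *cold = numa_alloc_onnode(size, cxl_node);
        /* ...and keep hot data on the local node instead. */
        void *hot = numa_alloc_local(size);

        if (!cold || !hot) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }

        /*
         * The application, not the kernel's tiering logic, now decides
         * which data lives in which tier, so there are no surprise
         * accesses to slow memory.
         */

        numa_free(cold, size);
        numa_free(hot, size);
        return 0;
    }

The pages are still managed by the kernel's VM, so this is placement control rather than a true bypass, but combined with explicit policies such as mbind() it gives an application the predictable, tier-aware behaviour the comment is asking about.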