Kernel-text replication on NUMA systems

Posted Jan 5, 2024 16:29 UTC (Fri) by willy (subscriber, #9762)
Parent article: Kernel-text replication on NUMA systems

The motivation on arm64 is not for phones but for servers. There are several arm64 vendors targeting the server space, although I'm not sure which ones have NUMA support. A quick search suggests both Ampere and Cavium have NUMA chips while Graviton does not.

Kernel-text replication on NUMA systems

Posted Jan 5, 2024 17:29 UTC (Fri) by MattBBaker (guest, #28651) [Link]

In the HPC space, the A64FX from Fujitsu is a NUMA ARM system. There isn't a choice either: the chip has 4 HBM links for 4 separate L2 caches (and no L3). It probably won't help Fugaku's Top500 placement, since the name of the HPC game is kernel bypass.

Kernel-text replication on NUMA systems

Posted Jan 6, 2024 15:33 UTC (Sat) by snajpa (subscriber, #73467) [Link] (6 responses)

Graviton seems to have low core-to-core latency as a design goal; others do not (while Ampere isn't so bad, it's nowhere near what Graviton can do). Graviton is also single-socket only; dual-socket systems will always be NUMA these days.

Kernel-text replication on NUMA systems

Posted Jan 6, 2024 21:37 UTC (Sat) by willy (subscriber, #9762) [Link] (5 responses)

One caveat is that we sometimes see multiple NUMA nodes in a single socket. I believe Intel can do this: statically partition the L3 between two or more groups of cores and present each partition as a NUMA node. There's a latency benefit, as you don't have to traverse as many ring stops to get to the L3 slice that holds your data. I think they call it Cluster On Die (COD) because at Intel everything has to have an acronym.

I haven't seen this done on ARM yet, but I haven't been looking terribly hard.

Kernel-text replication on NUMA systems

Posted Jan 7, 2024 10:48 UTC (Sun) by snajpa (subscriber, #73467) [Link] (1 responses)

yup; btw I find that our workloads (hundreds of OS-level containers on a single kernel, with even more containers such as Docker and k8s nested in there) work better with NPS=4 on AMD, which splits the chip into 4 quadrants. That's interesting, b/c I'd have guessed it'd actually be "NUMA node per CCX" that would have the best results in local core-to-core latency :-D

Kernel-text replication on NUMA systems

Posted Jan 7, 2024 10:51 UTC (Sun) by snajpa (subscriber, #73467) [Link]

oh sorry I should have worded it better - ofc node per CCX has the best core to core latencies, but those are low core counts to low core counts, which is why I think the 4 quadrants are doing better in our setup (originally when I started experimenting with this I thought the main problem is saving infinity fabric bandwidth, but that doesn't seem to be as much of a problem after all...)

Kernel-text replication on NUMA systems

Posted Jan 7, 2024 11:53 UTC (Sun) by wtarreau (subscriber, #51152) [Link] (2 responses)

Ampere Altra can be configured to run as 1, 2 or 4 NUMA nodes. While I don't really understand what it does internally, I did notice significant differences in core-to-core latency when configured as >1 node, where threads spread across nodes would interact as badly as on platforms with partitioned L3 caches.
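
For anyone wanting to check how many nodes a given firmware setting actually presents, here is a minimal sketch (not from the Ampere docs, just an illustration) that reads the standard Linux sysfs files /sys/devices/system/node/nodeN/cpulist and /sys/devices/system/node/nodeN/distance. The function names parse_cpulist and numa_topology are made up for this example.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as "0-3,8-11" into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            # Memory-only nodes have an empty cpulist.
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(root="/sys/devices/system/node"):
    """Return {node_id: {"cpus": [...], "distance": [...]}} for each node under root.

    Hypothetical helper for illustration; returns an empty dict when the
    directory does not exist (non-Linux systems, kernels without NUMA support).
    """
    topo = {}
    base = Path(root)
    if not base.is_dir():
        return topo
    for node_dir in base.glob("node[0-9]*"):
        node_id = int(node_dir.name[len("node"):])
        cpus = parse_cpulist((node_dir / "cpulist").read_text())
        # One relative distance per node, as reported by firmware (SLIT).
        distance = [int(d) for d in (node_dir / "distance").read_text().split()]
        topo[node_id] = {"cpus": cpus, "distance": distance}
    return topo


if __name__ == "__main__":
    for node_id, info in sorted(numa_topology().items()):
        print(f"node{node_id}: {len(info['cpus'])} CPUs, distances {info['distance']}")
```

Running it before and after changing the node setting should show the CPU groups and the reported inter-node distances change; it says nothing about the real latencies, which still have to be measured.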

Kernel-text replication on NUMA systems

Posted Jan 7, 2024 12:52 UTC (Sun) by snajpa (subscriber, #73467) [Link] (1 responses)

as with AMD, it seems to couple a quadrant or a half with the nearest respective memory controllers - https://amperecomputing.com/assets/Altra_Max_UM_v1_15_202... page 36

Kernel-text replication on NUMA systems

Posted Jan 8, 2024 4:16 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

Indeed, and the cache is actually behind the memory controllers:

2.4 System Level Cache (SLC)

The SLC is a memory-side cache that is mostly exclusive with the L2 caches. The SLC is used for processor evictions and caches large data and instruction structures to improve system performance. The SLC is not a traditional processor-side Last Level Cache (LLC), sometimes called an L3 or L4 cache.

This can explain why the inter-core performance differs with NUMA setup.

Kernel-text replication on NUMA systems

Posted Jan 8, 2024 14:06 UTC (Mon) by neggles (subscriber, #153254) [Link]

The just-released Graviton4 is a 2-socket system, actually; this may not be entirely a coincidence.