
memory mirroring?

Posted Mar 24, 2012 4:37 UTC (Sat) by jzbiciak (guest, #5246)
In reply to: memory mirroring? by dlang
Parent article: Toward better NUMA scheduling

I was thinking about this earlier. Take a page from a shared library (e.g. libc): if it's hot enough to be truly important, then multiple tasks will have pulled it into at least the shared L3 on a modern processor, and you don't need further duplication at the NUMA-node level.

If the page is shared but not hot, then the cost of missing on it won't register very strongly in the app's performance, because it accounts for only a small portion of its run time.

So, that leaves us with these weird middle-ground pages: shared and moderately used (i.e. neither hot nor cold, or hot only in sporadic bursts), but with users so spread out and diffuse that they can't manage to keep copies resident in the on-chip caches. It seems like those would truly benefit from duplication.

All that said, the crossover thresholds that determine the size and impact of this weird middle ground are a function of the cost of a remote fetch (higher latency and lower bandwidth widen the window) and the size of the last level of cache before NUMA (a smaller cache widens it too). Modern systems seem to be closing this gap from both sides, with growing L3 sizes and an emphasis on moderating chip-to-chip latency while increasing chip-to-chip bandwidth.
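
For concreteness, here's a back-of-envelope sketch of that crossover in C. All the numbers are invented for illustration, not measurements from any real machine; the point is just the shape of the trade-off: replicating a shared page onto a node pays off once the accumulated remote-miss penalty exceeds the one-time cost of making the copy.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative latencies only -- substitute your own measurements. */
        double local_ns  = 80.0;        /* assumed local DRAM access        */
        double remote_ns = 140.0;       /* assumed remote-node DRAM access  */
        double copy_ns   = 4096 * 2.0;  /* rough cost to copy one 4KB page  */

        /* Remote misses to the page needed before replication breaks even. */
        double breakeven = copy_ns / (remote_ns - local_ns);
        printf("replication pays after ~%.0f remote misses\n", breakeven);
        return 0;
    }

A bigger remote penalty (or a smaller cache filtering fewer of those misses) lowers the bar for duplication, which is the same conclusion as above.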

Or am I thinking about this wrongly?



memory mirroring?

Posted Mar 24, 2012 8:47 UTC (Sat) by dlang (guest, #313)

you are missing cases where the working set does not all fit in the cache. On large systems (which most NUMA systems tend to be), it's very common for apps to use lots of memory and exceed the cache size with their data working set: they may have their hot code fit in the cache, but not all the data that it's manipulating. You can do a lot of processing on each memory address while waiting for the system to prefetch the next hunk of memory, without taking any more wall-clock time.
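
A minimal sketch of that overlap, using the GCC/Clang __builtin_prefetch builtin (process() and the chunk size here are just illustrative stand-ins; real prefetch-distance tuning is messier than this):

    #include <stddef.h>

    /* stand-in for the real per-element work */
    static double process(const double *chunk, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += chunk[i] * chunk[i];
        return sum;
    }

    double walk(const double *data, size_t n, size_t chunk)
    {
        double total = 0.0;
        for (size_t i = 0; i < n; i += chunk) {
            if (i + chunk < n)
                __builtin_prefetch(&data[i + chunk]);  /* start fetching the next hunk */
            size_t len = (n - i < chunk) ? n - i : chunk;
            total += process(&data[i], len);           /* compute while it's in flight */
        }
        return total;
    }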

memory mirroring?

Posted Mar 24, 2012 15:46 UTC (Sat) by jzbiciak (guest, #5246)

I think you discount LRU action. "Working set" in my mind implies read-write, and either private to a process or at least private to a tree of closely related processes. (I know it should also include all the code pages involved, but typically those wouldn't be the thrashy bits.) That large working set is most likely private, not one of these shared, read-only things. A large working set will definitely thrash the cache, but will it really thrash all of the cache equally?

That said, library/shared pages will still get referenced at least somewhat regularly by all of the processors on the NUMA node, and so the LRU will prevent the hottest lines from getting evicted. If you assume non-random replacement (which, unfortunately, you can't with certain recent processors), the hot library pages remain near the front of the LRU, so only the back of the LRU gets cycled.

(The "unfortunately you can't" comment applies to recent ARM Cortex-A series processors, which have a highly associative shared L2 ("That's good!") with random replacement in lieu of an LRU ("That's bad!").)

