
memory mirroring?

Posted Mar 17, 2012 1:43 UTC (Sat) by martinfick (subscriber, #4455)
Parent article: Toward better NUMA scheduling

I wonder if it wouldn't make sense for the kernel to mirror certain pages across the memory of multiple nodes? In particular I am thinking of FS caches for files opened for read only. This way commonly used shared libraries, or output files during compiles could end up getting mirrored in memory across all the nodes making the read only accesses much faster.
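
The mainline kernel has no such mechanism for page-cache pages, but the pattern is easy to sketch from userspace with libnuma (replicate_readonly() below is a made-up helper for illustration; a real reader would also have to pick the copy for its own node, e.g. via numa_node_of_cpu()):

    /* Sketch: copy a read-only buffer onto every NUMA node.
       Build with: gcc -O2 replicate.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void **replicate_readonly(const void *src, size_t len, int *nnodes)
    {
        int n = numa_max_node() + 1;
        void **copies = calloc(n, sizeof(void *));
        for (int node = 0; node < n; node++) {
            copies[node] = numa_alloc_onnode(len, node); /* node-local pages */
            if (copies[node])
                memcpy(copies[node], src, len);
        }
        *nnodes = n;
        return copies;
    }

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }
        static const char data[] = "some shared read-only data";
        int nnodes;
        void **copies = replicate_readonly(data, sizeof data, &nnodes);
        /* a reader running on node N would use copies[N] */
        for (int i = 0; i < nnodes; i++)
            if (copies[i])
                numa_free(copies[i], sizeof data);
        free(copies);
        return 0;
    }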



memory mirroring?

Posted Mar 17, 2012 2:03 UTC (Sat) by dlang (subscriber, #313) [Link]

under some conditions yes, under others, no :-)

which makes your system faster, having these RO pages be accessed a little faster, or having more data cached in ram so you do less disk I/O?

if your working set will fit in ram with the duplication, then it's a pretty clear (but still smallish) win, if the duplication forces pages of your working set out of ram, then it's a clear loss.

I think you may be overestimating how much of a win this is. There is an improvement in the speed of accessing memory, and even if we say that it's a 2x improvement, it still may not make much practical difference.

remember that accessing memory is already orders of magnitude slower than accessing the data in the CPU cache, so if the memory is a little slower it frequently has less of an effect than you would expect.
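
a simple average-access-time calculation shows the dilution; all numbers here are illustrative assumptions, not measurements:

    /* Effective slowdown from a 2x remote-DRAM penalty, diluted
       by an assumed 95% last-level cache hit rate. */
    #include <stdio.h>

    int main(void)
    {
        const double hit = 0.95;        /* assumed L3 hit rate      */
        const double l3_ns = 15.0;      /* assumed L3 hit latency   */
        const double local_ns = 80.0;   /* assumed local DRAM       */
        const double remote_ns = 160.0; /* assumed 2x remote DRAM   */
        double local  = hit * l3_ns + (1 - hit) * local_ns;
        double remote = hit * l3_ns + (1 - hit) * remote_ns;
        printf("local %.2f ns, remote %.2f ns, slowdown %.2fx\n",
               local, remote, remote / local);
        return 0;
    }

with those (made-up) numbers, a 2x DRAM penalty shrinks to roughly a 1.2x effective penalty once the cache absorbs most accesses.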

memory mirroring?

Posted Mar 17, 2012 8:40 UTC (Sat) by khim (subscriber, #9252) [Link]

> I think you may be overestimating how much of a win this is. There is an improvement in the speed of accessing memory, and even if we say that it's a 2x improvement, it still may not make much practical difference.

Note that the 2x limit is a recent development (when you have 8-16 cores in a single CPU you can build a quite capable NUMA system with very few CPUs). It's not uncommon for older NUMA systems to have a 10x or even 100x difference between access to local memory and remote memory. I'm not sure anyone still builds such systems (SGI used to, but it's dead now).

memory mirroring?

Posted Mar 17, 2012 12:24 UTC (Sat) by dlang (subscriber, #313) [Link]

True, current NUMA machines are almost all in the category of multi-socket AMD64 systems, which have fast enough interconnects that the machine is usable even if you ignore NUMA and just treat it as an SMP machine.
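
how "flat" a given box really is can be read from the firmware's SLIT table; a minimal libnuma sketch (10 means local, 20 means a nominal 2x remote cost; real output depends on the machine):

    /* Print the node-to-node distance matrix.  gcc dist.c -lnuma */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;
        int n = numa_max_node() + 1;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                printf("%4d", numa_distance(i, j)); /* 10 = local */
            printf("\n");
        }
        return 0;
    }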

Historic NUMA machines had MUCH slower interconnects; the best comparison in modern systems would be connecting your CPU nodes together with high-speed networks.

There are still some people building such machines (I think the current Cray systems are in this category), but when you get to interconnects that expensive, you are frequently better off segmenting the system and running it as if it were a cluster of systems, or (the more common case) just building a cluster of commodity systems instead of the monster NUMA system in the first place.

I think that if AMD hadn't introduced NUMA to the commodity desktop/server with the Opteron, NUMA would be something so rare that the overhead and complexity of its logic wouldn't be acceptable in the kernel.

There are some applications that really are hard to split into a multi-machine cluster, and for those NUMA (including RDMA setups that tie multiple commodity machines together) is the right tool for the task. But they are pretty rare; it's almost always worth re-architecting the application to avoid this requirement.

memory mirroring?

Posted Mar 17, 2012 23:07 UTC (Sat) by davecb (subscriber, #1574) [Link]

Quite large systems pay larger penalties for using distant nodes than small ones do: bus/backplane latency is not your friend (;-)). Smaller systems with bus lengths in the millimeters don't pay so great a penalty.

For one modern architecture there is a big hit after 32 sockets even when using a backplane derived from Cray's lowest-latency design. The speed of light needs improvement!
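
For scale, a back-of-the-envelope propagation estimate (assuming signals travel at roughly 0.6c in a copper trace, an illustrative figure):

    /* Round-trip wire delay for a few (hypothetical) path lengths. */
    #include <stdio.h>

    int main(void)
    {
        const double mm_per_ns = 0.6 * 300.0;       /* ~0.6c in copper */
        const double lengths_mm[] = { 50, 500, 2000 };
        for (int i = 0; i < 3; i++)
            printf("%5.0f mm path: %5.1f ns round trip\n",
                   lengths_mm[i], 2.0 * lengths_mm[i] / mm_per_ns);
        return 0;
    }

A two-meter backplane path costs roughly 22 ns round trip in wire delay alone, before any switching or coherency logic: already a sizable fraction of a DRAM access.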

Ancient mainframes used a radial design to avoid having to be NUMA, at the expense of having an exceedingly complicated, multi-ported "system controller" where we'd put a bus.

--dave

memory mirroring?

Posted Mar 18, 2012 0:56 UTC (Sun) by dlang (subscriber, #313) [Link]

> For one modern architecture there is a big hit after 32 sockets

32 sockets * 6 (true) cores/socket = 192 core system

at that sort of scale, I'll bet that locking overhead is at least as big a problem as the memory access times.

now the 'commodity' NUMA keeps creeping up the scale; what is it now, 8 sockets * 6 cores = 48-core systems (*2 or more if you want to count hyperthread 'cores')?
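
at high core counts even a single shared cache line can become the bottleneck; a minimal sketch of the classic fix, per-thread counters padded out to their own cache lines (NTHREADS standing in for "many cores"; all sizes assumed for illustration):

    /* Contended global counter vs. per-thread shards.
       gcc -O2 counters.c -lpthread */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS 8
    #define ITERS    1000000

    static atomic_long shared_counter;                    /* one hot line  */
    static struct { atomic_long n; char pad[56]; } shard[NTHREADS];

    static void *contended(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++)
            atomic_fetch_add(&shared_counter, 1); /* line bounces around */
        return NULL;
    }

    static void *sharded(void *arg)
    {
        long id = (long)arg;
        for (int i = 0; i < ITERS; i++)
            atomic_fetch_add(&shard[id].n, 1);    /* stays in local cache */
        return NULL;
    }

    static double run(void *(*fn)(void *))
    {
        pthread_t t[NTHREADS];
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, fn, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        double c = run(contended), s = run(sharded);
        printf("contended: %.3fs  sharded: %.3fs\n", c, s);
        return 0;
    }

the contended version typically runs far slower, purely from the counter's cache line bouncing between cores, and the gap widens with core count.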

you're a bit low....

Posted Mar 30, 2012 21:55 UTC (Fri) by cbf123 (guest, #74020) [Link]

Current high-end xeons have 8 "real" cores.

memory mirroring?

Posted Mar 17, 2012 17:13 UTC (Sat) by arjan (subscriber, #36785) [Link]

also note that while latency might be 2x longer due to distance, there are cases where it's more beneficial to aggregate the memory bandwidth of local + remote rather than staying local-only...
single-threaded, highly memory-bandwidth-bound apps come to mind in this regard.
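
that aggregation is what interleaved allocation provides (numactl --interleave=all does it for a whole process); a minimal libnuma sketch:

    /* Stripe a buffer's pages round-robin across all nodes so a
       single streaming reader can draw on every memory controller.
       gcc interleave.c -lnuma */
    #include <numa.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;
        size_t len = 1UL << 30;                /* 1 GiB */
        char *buf = numa_alloc_interleaved(len);
        if (!buf)
            return 1;
        memset(buf, 1, len);  /* touching faults pages in, interleaved */
        numa_free(buf, len);
        return 0;
    }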

memory mirroring?

Posted Apr 16, 2012 19:34 UTC (Mon) by adavid (guest, #42044) [Link]

SGI might be dead, but sgi lives on after Rackable's 'switcheroo'. SGI NUMA is certainly alive in their Ultraviolet range.

memory mirroring?

Posted Mar 24, 2012 4:37 UTC (Sat) by jzbiciak (subscriber, #5246) [Link]

I was thinking about this earlier. If a page from a shared library (e.g. libc) is hot enough to be truly important, then multiple tasks will have pulled it into at least the shared L3 on a modern processor, and you don't need further duplication at the NUMA-node level.

If the page is shared, but not hot, then the cost of missing on it won't register very highly on the performance of the app, because it's a small portion of its run time.

So that leaves us with these weird middle-ground pages that are shared and moderately used (i.e. neither hot nor cold, or only hot in sporadic bursts), but whose users are so spread out and diffuse that they can't manage to keep copies resident in the on-chip caches. It seems like those would truly benefit from duplication.

All that said, the crossover thresholds that determine the size and impact of this weird middle ground are a function of the cost of the remote fetch (larger latency/less bandwidth makes this middle-ground window larger) and the size of the last-level-of-cache-before-NUMA (smaller size makes this middle-ground window larger). Modern systems seem to be working to close this gap from both sides, with increasing L3 sizes, and an emphasis on moderating the chip-to-chip latency while increasing the chip-to-chip bandwidth.
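
A toy model of that window (all latencies assumed, purely for illustration): at low access rates the remote penalty is noise, and a page touched often enough tends to stay in the L3 anyway, so only the middle band gains from duplication:

    /* Time lost per second to remote fetches, per remote node,
       as a function of how often the page is touched. */
    #include <stdio.h>

    int main(void)
    {
        const double penalty_ns = 160.0 - 80.0;   /* assumed remote - local */
        const double rates[] = { 1e2, 1e4, 1e6 }; /* accesses/sec */
        for (int i = 0; i < 3; i++)
            printf("%8.0f acc/s: %9.1f us/s lost to remote fetches\n",
                   rates[i], rates[i] * penalty_ns / 1000.0);
        return 0;
    }

At 1e2 accesses/s the loss is ~8 us/s (noise); at 1e6 it's ~80 ms/s, which is real money; much hotter than that and the page tends to stay cached, closing the window from the other side.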

Or am I thinking about this wrongly?

memory mirroring?

Posted Mar 24, 2012 8:47 UTC (Sat) by dlang (subscriber, #313) [Link]

you are missing cases where the working set does not all fit in the cache. On large systems (which most NUMA systems tend to be), it's very common for apps to use lots of memory and exceed the cache size with their data working set (they may have their hot code fit in the cache, but not all the data it's manipulating). You can do a lot of processing on each memory address while waiting for the system to prefetch the next hunk of memory, without taking any more wall-clock time.

memory mirroring?

Posted Mar 24, 2012 15:46 UTC (Sat) by jzbiciak (subscriber, #5246) [Link]

I think you discount LRU action. "Working set" in my mind implies read-write, and either private to a process, or at least private to a tree of closely related processes. (I know that it also should include all the code pages involved, but typically those wouldn't be the thrashy bits.) That large working set is most likely private and not one of these shared, read-only things. A large working set will definitely thrash the cache, but will it really thrash all the cache equally?

That said, library / shared pages still will get referenced at least somewhat regularly by all of the processors on the NUMA node, and so the LRU will prevent the hottest lines from getting evicted. If you assume non-random replacement (which, unfortunately, you can't with certain recent processors), the hot library pages will remain near the front of the LRU, so only the back of the LRU gets cycled.

(The "unfortunately you can't" comment applies to recent ARM Cortex-A series processors, which have a highly associative shared L2 ("That's good!") with random replacement in lieu of an LRU ("That's bad!").)

