User-data replication on NUMA systems
The latest step in that direction is this patch from Dave Hansen. Dave notes that one source of cross-node traffic is shared user text - things like shared libraries and executable images. Once a particular page from, say, glibc has been faulted into memory, it will reside in one particular node's memory. Every other node will have to reach across the system to run code out of that page (though processor caches also figure into this picture, of course). In some cases, such as with the C library, it may well make sense to make a local copy of each page as needed.
To that end, Dave's patch makes some fundamental changes to the kernel's page cache, which must now be able to hold more than one memory page for each page of a file. So the page cache now contains a set of page_cache_leaf structures, the main component of which is a per-node array of struct page pointers. A page cache lookup will preferentially return a node-local copy of the page if one exists; depending on the situation, it can return a page on a remote node if that is all that is available.
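To make the idea concrete, here is a minimal, self-contained sketch of such a per-node page cache entry. It is illustration only, not the patch's actual code; apart from the page_cache_leaf name, everything here (the node count, the simplified struct page, the leaf_lookup() helper) is assumed for the example:

    #include <stddef.h>

    #define MAX_NUMNODES 4                 /* assumed node count for the example */

    struct page {                          /* stand-in for the kernel's struct page */
        int node;                          /* node holding this copy */
        char data[4096];                   /* page contents */
    };

    struct page_cache_leaf {
        struct page *pages[MAX_NUMNODES];  /* per-node copies of one file page */
    };

    /* Return the node-local copy if one exists, otherwise any remote copy. */
    static struct page *leaf_lookup(struct page_cache_leaf *leaf, int local_node)
    {
        struct page *page = leaf->pages[local_node];
        int node;

        if (page)
            return page;                   /* fast path: node-local copy */

        for (node = 0; node < MAX_NUMNODES; node++)
            if (leaf->pages[node])
                return leaf->pages[node];  /* settle for a remote copy */

        return NULL;                       /* page not cached at all */
    }

In the real kernel the page cache is indexed by file and offset; the leaf here simply stands in for whatever is found at one such slot.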
When the kernel handles a page fault for a mapped text page, it insists on a local copy of the page. If no such copy exists, and memory is available, a local copy will be made and added to the page cache. The processor then continues with its work, using the local version of the shared page. The results, from a set of quick benchmarks posted with the patch, show throughput of 109% to 143% of the unpatched kernel's figure - an improvement of 9% to 43%. In other words, it may well be worth the trouble.
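Building on the toy structures above, the text-fault path might look roughly like the following; alloc_page_on_node() and copy_page_data() are invented stand-ins for node-aware allocation and page copying, not functions from the actual patch:

    #include <stdlib.h>
    #include <string.h>

    /* Invented helpers standing in for node-aware allocation and page copying. */
    static struct page *alloc_page_on_node(int node)
    {
        struct page *page = calloc(1, sizeof(*page));
        if (page)
            page->node = node;
        return page;
    }

    static void copy_page_data(struct page *dst, const struct page *src)
    {
        memcpy(dst->data, src->data, sizeof(dst->data));
    }

    /* Fault in a mapped text page, insisting on a node-local copy when possible. */
    static struct page *fault_in_text_page(struct page_cache_leaf *leaf, int local_node)
    {
        struct page *local = leaf->pages[local_node];
        struct page *remote;

        if (local)
            return local;                       /* already replicated on this node */

        remote = leaf_lookup(leaf, local_node); /* any existing (likely remote) copy */
        if (!remote)
            return NULL;                        /* nothing cached yet; read from disk */

        local = alloc_page_on_node(local_node);
        if (!local)
            return remote;                      /* no memory: run from the remote copy */

        copy_page_data(local, remote);          /* make the node-local replica */
        leaf->pages[local_node] = local;
        return local;
    }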
This patch is not quite ready for prime time, however.
The current code punts on a couple of important issues. When a process tries to write to a file with replicated pages, for example, those pages must be collapsed down to a single copy before the write can be allowed, or inconsistent copies will result. Similarly, if the last writer closes a file, that file suddenly becomes a candidate for replication. The patch, as posted, detects these situations but does not fully implement their resolution; a speculative sketch of the collapse step appears below. A production-ready patch would also certainly have a mechanism for freeing replicated pages when memory gets tight. Given that this patch is clearly not 2.6 material, however, Dave has a long time to work out those details.
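To illustrate the first of those issues, here is a speculative sketch - again using the toy structures from the earlier examples rather than anything from the patch itself - of what collapsing a leaf back to a single copy before a write might involve:

    /* Free all but one copy of the page so a write sees a single, consistent page. */
    static struct page *collapse_replicas(struct page_cache_leaf *leaf, int writer_node)
    {
        struct page *keep = leaf_lookup(leaf, writer_node);  /* prefer the writer's node */
        int node;

        for (node = 0; node < MAX_NUMNODES; node++) {
            if (leaf->pages[node] && leaf->pages[node] != keep)
                free(leaf->pages[node]);        /* drop the extra replica */
            leaf->pages[node] = NULL;
        }

        if (keep)
            leaf->pages[keep->node] = keep;     /* the one remaining copy */
        return keep;
    }

A real implementation would also have to catch new write mappings and handle the reverse transition - the last writer going away, making the file replicable again - which is exactly the part the posted patch leaves unfinished.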
Numbers are off
Posted Aug 21, 2003 15:19 UTC (Thu) by pflugstad (subscriber, #224)

I think you're looking at the numbers wrong. The posting has:

SDET Average Throughput (NUMA-Q):
  2.6.0-test3        100.0%
  2.6.0-test3+urepl  143.1%

SDET Average Throughput (16-way P4):
  2.6.0-test3        100.0%
  2.6.0-test3+urepl  108.8%

The baseline is 100%, so that's an *improvement* of 43.1% or 8.8%, not 143.2/108.8. While these are good numbers, one wonders how the "production ready" patch would change them.

Numbers are off
Posted Aug 21, 2003 15:52 UTC (Thu) by corbet (editor, #1)

I read the numbers right, I just wrote poorly. It was getting late in a long week... In any case, you're right; the new performance is as good as 143% of the old value, the improvement is 43%.

User data Page replication
Posted Sep 2, 2003 10:32 UTC (Tue) by balbir-singh (guest, #887)

I think only executable text pages are replicated. As far as I can remember, data is never replicated (I worked on such systems about two years ago). The title can be misleading if what I am saying is correct.

Balbir