User-data replication on NUMA systems
The latest step in that direction is this patch from Dave Hansen. Dave notes that one source of cross-node traffic is shared user text - things like shared libraries and executable images. Once a particular page from, say, glibc has been faulted into memory, it will reside in one particular node's memory. Every other node will have to reach across the system to run code out of that page (though processor caches also figure into this picture, of course). In some cases, such as with the C library, it may well make sense to make a local copy of each page as needed.
To that end, Dave's patch makes some fundamental changes to the kernel's page cache, which must now be able to hold more than one memory page for each page of a file. So the page cache now contains a set of page_cache_leaf structures, the main component of which is a per-node array of struct page pointers. A page cache lookup will preferentially return a node-local copy of the page if one exists; depending on the situation, it can return a page on a remote node if that is all that is available.
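To make the idea concrete, here is a minimal, self-contained sketch of such a per-node page cache entry. It is illustration only, not the patch's actual code; apart from the page_cache_leaf name, everything here (the node count, the simplified struct page, the leaf_lookup() helper) is assumed for the example:

    #include <stddef.h>

    #define MAX_NUMNODES 4                 /* assumed node count for the example */

    struct page {                          /* stand-in for the kernel's struct page */
        int node;                          /* node holding this copy */
        char data[4096];                   /* page contents */
    };

    struct page_cache_leaf {
        struct page *pages[MAX_NUMNODES];  /* per-node copies of one file page */
    };

    /* Return the node-local copy if one exists, otherwise any remote copy. */
    static struct page *leaf_lookup(struct page_cache_leaf *leaf, int local_node)
    {
        struct page *page = leaf->pages[local_node];
        int node;

        if (page)
            return page;                   /* fast path: node-local copy */

        for (node = 0; node < MAX_NUMNODES; node++)
            if (leaf->pages[node])
                return leaf->pages[node];  /* settle for a remote copy */

        return NULL;                       /* page not cached at all */
    }

In the real kernel the page cache is indexed by file and offset; the leaf here simply stands in for whatever is found at one such slot.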
When the kernel handles a page fault for a mapped text page, it insists on a local copy of the page. If no such copy exists, and memory is available, a local copy will be made and added to the page cache. The processor then continues with its work, using the local version of the shared page. The results, from a set of quick benchmarks posted with the patch, show throughput of 109% to 143% of the unpatched kernel's figure - an improvement of 9% to 43%. In other words, it may well be worth the trouble.
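Building on the toy structures above, the text-fault path might look roughly like the following; alloc_page_on_node() and copy_page_data() are invented stand-ins for node-aware allocation and page copying, not functions from the actual patch:

    #include <stdlib.h>
    #include <string.h>

    /* Invented helpers standing in for node-aware allocation and page copying. */
    static struct page *alloc_page_on_node(int node)
    {
        struct page *page = calloc(1, sizeof(*page));
        if (page)
            page->node = node;
        return page;
    }

    static void copy_page_data(struct page *dst, const struct page *src)
    {
        memcpy(dst->data, src->data, sizeof(dst->data));
    }

    /* Fault in a mapped text page, insisting on a node-local copy when possible. */
    static struct page *fault_in_text_page(struct page_cache_leaf *leaf, int local_node)
    {
        struct page *local = leaf->pages[local_node];
        struct page *remote;

        if (local)
            return local;                       /* already replicated on this node */

        remote = leaf_lookup(leaf, local_node); /* any existing (likely remote) copy */
        if (!remote)
            return NULL;                        /* nothing cached yet; read from disk */

        local = alloc_page_on_node(local_node);
        if (!local)
            return remote;                      /* no memory: run from the remote copy */

        copy_page_data(local, remote);          /* make the node-local replica */
        leaf->pages[local_node] = local;
        return local;
    }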
This patch is not quite ready for prime time, however.
The current code punts on a couple of important issues. When a process tries to write to a file with replicated pages, for example, those pages must be collapsed down to a single copy before the write can be allowed, or inconsistent copies will result. Similarly, if the last writer closes a file, that file suddenly becomes a candidate for replication. The patch, as posted, detects these situations but does not fully implement their resolution; a speculative sketch of the collapse step appears below. A production-ready patch would also certainly have a mechanism for freeing replicated pages when memory gets tight. Given that this patch is clearly not 2.6 material, however, Dave has a long time to work out those details.
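To illustrate the first of those issues, here is a speculative sketch - again using the toy structures from the earlier examples rather than anything from the patch itself - of what collapsing a leaf back to a single copy before a write might involve:

    /* Free all but one copy of the page so a write sees a single, consistent page. */
    static struct page *collapse_replicas(struct page_cache_leaf *leaf, int writer_node)
    {
        struct page *keep = leaf_lookup(leaf, writer_node);  /* prefer the writer's node */
        int node;

        for (node = 0; node < MAX_NUMNODES; node++) {
            if (leaf->pages[node] && leaf->pages[node] != keep)
                free(leaf->pages[node]);        /* drop the extra replica */
            leaf->pages[node] = NULL;
        }

        if (keep)
            leaf->pages[keep->node] = keep;     /* the one remaining copy */
        return keep;
    }

A real implementation would also have to catch new write mappings and handle the reverse transition - the last writer going away, making the file replicable again - which is exactly the part the posted patch leaves unfinished.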
Numbers are off
Posted Aug 21, 2003 15:19 UTC (Thu) by pflugstad (subscriber, #224)

I think you're looking at the numbers wrong. The posting has:

SDET Average Throughput (NUMA-Q):
  2.6.0-test3        100.0%
  2.6.0-test3+urepl  143.1%

SDET Average Throughput (16-way P4):
  2.6.0-test3        100.0%
  2.6.0-test3+urepl  108.8%

The baseline is 100%, so that's an *improvement* of 43.1% or 8.8%, not 143.2/108.8. While these are good numbers, one wonders how the "production ready" patch would change them.

Numbers are off
Posted Aug 21, 2003 15:52 UTC (Thu) by corbet (editor, #1)

I read the numbers right, I just wrote poorly. It was getting late in a long week... In any case, you're right; the new performance is as good as 143% of the old value, the improvement is 43%.

User data Page replication
Posted Sep 2, 2003 10:32 UTC (Tue) by balbir-singh (guest, #887)

I think only executable text pages are replicated. As far as I can remember, data is never replicated (I worked on such systems about two years ago). The title can be misleading if what I am saying is correct.

Balbir