User-data replication on NUMA systems
[Posted August 20, 2003 by corbet]
Non-Uniform Memory Access (NUMA) systems have the interesting feature that
access times to memory vary from one node (group of one or more processors)
to another. Each node has local memory, which is relatively fast, but
access to another node's memory will be slower. So performance work on
NUMA systems tends to emphasize getting rid of cross-node memory traffic.
The latest step in that direction is this patch from Dave Hansen. Dave
notes that one source of cross-node traffic is shared user text - things
like shared libraries and executable
images. Once a particular page from, say, glibc has been faulted into
memory, it will exist in a particular node's range. Every other node will
have to reach across the system to run code out of that page (though
processor caches also figure into this picture, of course). In some cases,
such as with the C library, it may well make sense to make a local copy of
each page as needed.
To that end, Dave's patch makes some fundamental changes to the kernel's
page cache. This change is required, since the cache can now contain more
than one memory page for each corresponding file page. So the page cache
now contains a set of page_cache_leaf structures, the main
component of which is a per-node array of struct page pointers. A
page cache lookup will preferentially return a node-local copy of the page
if it exists; depending on the situation, it can return a page on a remote
node if that's all that is available.
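The lookup preference described above can be sketched in a few lines of
user-space C. This is only an illustration, not code from Dave's patch:
the name page_cache_leaf comes from the patch description, but the field
names, the MAX_NODES constant, the leaf_lookup() helper and its
allow_remote flag are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4             /* hypothetical node count for this sketch */

struct page {
    int node;                   /* node whose memory holds this copy */
};

/* One leaf per file page, holding at most one copy per node. */
struct page_cache_leaf {
    struct page *pages[MAX_NODES];
};

/*
 * Return the node-local copy if there is one. Otherwise, if the caller
 * can tolerate a remote page, return the first copy found on any node.
 * Returns NULL when nothing suitable is cached.
 */
static struct page *leaf_lookup(const struct page_cache_leaf *leaf,
                                int local_node, int allow_remote)
{
    assert(local_node >= 0 && local_node < MAX_NODES);

    if (leaf->pages[local_node])
        return leaf->pages[local_node];
    if (!allow_remote)
        return NULL;

    for (int n = 0; n < MAX_NODES; n++)
        if (leaf->pages[n])
            return leaf->pages[n];
    return NULL;
}
```

A caller that can live with a remote page passes a nonzero allow_remote;
one that insists on a local copy passes zero and treats a NULL return as
a miss.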
When the kernel handles a page fault for a mapped text page, it insists on
a local copy of the page. If no such copy exists, and memory is available,
a local copy will be made and added to the page cache. The processor then
continues with its work, using the local version of the shared page. The
result, in a set of quick benchmarks posted with the patch, is a
performance improvement of 109% to 143%. In other words, it may well be
worth the trouble.
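The replicate-on-fault behavior might look something like the following
user-space sketch. Everything here is hypothetical: alloc_page_on_node()
is a malloc()-based stand-in for a node-local allocator, and
fault_in_text_page() is an invented name. Falling back to the remote copy
when allocation fails is an assumption of this example; the article only
says that a copy is made when memory is available.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 4             /* hypothetical node count */
#define PAGE_SIZE 4096

struct page {
    int node;                   /* node whose memory holds this copy */
    unsigned char data[PAGE_SIZE];
};

struct page_cache_leaf {
    struct page *pages[MAX_NODES];
};

/* Stand-in for a node-local allocator; returns NULL if memory is short. */
static struct page *alloc_page_on_node(int node)
{
    struct page *p = malloc(sizeof(*p));

    if (p)
        p->node = node;
    return p;
}

/* Handle a fault on a shared text page taken on local_node. */
static struct page *fault_in_text_page(struct page_cache_leaf *leaf,
                                       int local_node)
{
    struct page *src = NULL, *copy;

    assert(local_node >= 0 && local_node < MAX_NODES);

    if (leaf->pages[local_node])        /* already replicated here */
        return leaf->pages[local_node];

    for (int n = 0; n < MAX_NODES; n++) {
        if (leaf->pages[n]) {
            src = leaf->pages[n];
            break;
        }
    }
    if (!src)
        return NULL;    /* not cached anywhere; would be read from disk */

    copy = alloc_page_on_node(local_node);
    if (!copy)
        return src;     /* assumption: run from the remote copy instead */

    memcpy(copy->data, src->data, PAGE_SIZE);
    leaf->pages[local_node] = copy;     /* add the replica to the cache */
    return copy;
}
```

A real implementation would also have to deal with locking, reference
counts and page-table updates, all of which this sketch leaves out.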
This patch is not quite ready for prime time, however; Dave notes:
This is still pretty experimental, so don't give it to your bank or
anything. I've lightly corrupted data playing with it, although
not in at least a week :)
The current code punts on a couple of important issues. When a process
tries to write to a file with replicated pages, for example, those pages
must be collapsed down to a single copy before the write can be allowed -
or inconsistent copies will result. Similarly, if the last writer closes a
file, that file suddenly becomes a candidate for replication. The patch,
as posted, detects these situations but does not fully implement their
resolution. A production-ready patch would also certainly have a
mechanism for freeing
replicated pages when memory gets tight. Given that this patch is clearly
not 2.6 material, however, Dave has a long time to work out those details.
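To make the collapse step concrete, here is a minimal user-space sketch
of reducing a leaf to a single copy before a write is allowed. It does not
come from the patch: collapse_replicas(), count_copies() and the
preferred_node parameter are invented for illustration, and a real kernel
version would also need to unmap the discarded copies from any page
tables and take the appropriate locks.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NODES 4             /* hypothetical node count */

struct page {
    int node;                   /* node whose memory holds this copy */
};

struct page_cache_leaf {
    struct page *pages[MAX_NODES];
};

/* Count how many per-node copies a leaf currently holds. */
static int count_copies(const struct page_cache_leaf *leaf)
{
    int count = 0;

    for (int n = 0; n < MAX_NODES; n++)
        if (leaf->pages[n])
            count++;
    return count;
}

/*
 * Free every replica but one, keeping the copy on preferred_node if it
 * exists and the first copy found otherwise. Since replicated pages are
 * read-only until this point, all copies are identical and any one of
 * them can serve as the survivor.
 */
static struct page *collapse_replicas(struct page_cache_leaf *leaf,
                                      int preferred_node)
{
    struct page *keep;

    assert(preferred_node >= 0 && preferred_node < MAX_NODES);
    keep = leaf->pages[preferred_node];

    for (int n = 0; n < MAX_NODES; n++) {
        struct page *p = leaf->pages[n];

        if (!p)
            continue;
        if (!keep) {            /* no preferred copy: keep the first one */
            keep = p;
            continue;
        }
        if (p != keep) {
            free(p);
            leaf->pages[n] = NULL;
        }
    }
    return keep;
}
```

The same loop, run without a pending write, is roughly what a
memory-pressure path could use to give replicated pages back to the
system.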