From: Stefan Lankes <firstname.lastname@example.org>
To: email@example.com
Subject: [RFC PATCH 0/4]: affinity-on-next-touch
Date: Mon, 11 May 2009 10:27:18 +0200

I wrote a patch to support the adaptive data distribution strategy
"affinity-on-next-touch" for NUMA architectures. The patch is in an early
state and I am interested in your comments.
The basic idea of "affinity-on-next-touch" is this: Via some runtime
mechanism, a user-level process activates "affinity-on-next-touch" for a
certain region of its virtual memory space. Afterwards, each page in this
region will be migrated to that node which next tries to access it.
Noordergraaf and van der Pas [1] have proposed to extend the OpenMP standard
to support this strategy. Since version 9, the "affinity-on-next-touch"
mechanism is available in Solaris and can be triggered via the madvise
system call. Löf and Holmgren [2] and Terboven et al. [3] have described
their encouraging experiences with this implementation.
Linux does not yet support "affinity-on-next-touch". Terboven et al. [3]
have presented a user-level implementation of this strategy for Linux. To
realize "affinity-on-next-touch" in user space, they protect a specific
memory area from read and write accesses and install a signal handler to
catch access violations. If a thread accesses a page in the protected memory
area, the signal handler migrates the page to the node which handled the
access violation. Afterwards, the signal handler clears the page protection
and the interrupted thread is resumed.
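For illustration, the user-level scheme described above can be sketched roughly as follows. This is not the code of Terboven et al.; it is a minimal sketch that assumes a single page-aligned region, and all nt_* names are hypothetical. The migration step uses the raw getcpu and move_pages system calls on a best-effort basis, so on a non-NUMA or sandboxed system the page simply stays where it is.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NT_MPOL_MF_MOVE (1 << 1) /* same value as MPOL_MF_MOVE in <linux/mempolicy.h> */

/* Hypothetical bookkeeping for one protected region. */
static uintptr_t nt_base;
static size_t nt_len;
static size_t nt_pagesize;
static volatile sig_atomic_t nt_faults; /* demo counter only, not thread-safe */

/* SIGSEGV handler: migrate the faulting page, then clear its protection. */
static void nt_handler(int sig, siginfo_t *si, void *ctx)
{
    int saved_errno = errno;
    uintptr_t addr = (uintptr_t)si->si_addr;
    void *page;
    unsigned cpu = 0, node = 0;
    int target, status = 0;

    (void)sig;
    (void)ctx;
    if (addr < nt_base || addr >= nt_base + nt_len)
        _exit(1); /* a genuine segfault outside the region */

    page = (void *)(addr & ~((uintptr_t)nt_pagesize - 1));

    /* Best effort: move the page to the node of the faulting CPU. */
    if (syscall(SYS_getcpu, &cpu, &node, NULL) == 0) {
        target = (int)node;
        syscall(SYS_move_pages, 0, 1UL, &page, &target, &status,
                NT_MPOL_MF_MOVE);
    }

    /* Clear the protection; the interrupted access is retried on return. */
    if (mprotect(page, nt_pagesize, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
    nt_faults++;
    errno = saved_errno;
}

/* Arm next-touch for a page-aligned region: install handler, revoke access. */
static int nt_arm(void *addr, size_t len)
{
    struct sigaction sa;

    nt_pagesize = (size_t)sysconf(_SC_PAGESIZE);
    nt_base = (uintptr_t)addr;
    nt_len = len;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = nt_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return mprotect(addr, len, PROT_NONE);
}
```

Every first access to a page costs a signal delivery plus up to three system calls in this sketch, which gives a feeling for where the overhead of the user-level approach comes from.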
Unfortunately, the overhead of this solution is very high. For instance, to
distribute 512 MByte via "affinity-on-next-touch" the user-level solution
needs 2518ms on our dual-socket, quad-core Opteron 2376 system with the
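current kernel (2.6.30-rc4). I evaluated this overhead with the following
benchmark:

    madvise(array, sizeof(int) * SIZE, MADV_ACCESS_LWP);
    start = omp_get_wtime();
    /* touch one int per page, so that every page is accessed exactly once */
    #pragma omp parallel for
    for (j = 0; j < SIZE; j += pagesize/sizeof(int))
        array[j]++;  /* loop body is missing in this copy; assumed here */
    end = omp_get_wtime();
    printf("time: %lf ms\n", (end - start) * 1000.0);
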
The benchmark uses 8 threads and each thread is bound to one core.
The patch is divided into the following four parts:
[Patch 1/4]: Extend the system call madvise with a new parameter
MADV_ACCESS_LWP (the same as used in Solaris). The specified memory area
then uses "affinity-on-next-touch". In this case, madvise_access_lwp
protects the memory area from read and write access. To avoid unnecessary
list operations, the patch changes the permissions only in the page table
entries and does not update the list of VMAs. In addition, madvise sets the
new "untouched" bit in the "page" record.
[Patch 2/4]: The pte fault handler detects, via a new "untouched bit" inside
of the "page" record, that the page which the thread tried to access uses
"affinity-on-next-touch". Afterwards, the kernel reads the original
permissions from vm_area_struct, restores them in the page tables and
migrates the page to the current node. To accelerate page migration, the
patch avoids unnecessary calls to migrate_prep().
[Patch 3/4]: If the "untouched" bit is set, mprotect isn't permitted to
change the permissions in the page table entry. While
"affinity-on-next-touch" is active, the access permissions are set by the
pte fault handler.
[Patch 4/4]: This part of the patch adds some counters to detect migration
errors and publishes these counters via /proc/vmstat. Besides this, the
Kconfig file is extended with the parameter CONFIG_AFFINITY_ON_NEXT_TOUCH.
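To make the interplay of the first three parts easier to follow, here is a small user-space model of the "untouched" bit. It is not kernel code and not taken from the patch: the toy_* types and functions are hypothetical stand-ins for the "page" record, the VMA and the page table entries. It also assumes that mprotect still records the new permissions in the VMA, so that the fault handler can restore them later.

```c
#include <stdbool.h>
#include <stddef.h>

enum { PERM_NONE = 0, PERM_READ = 1, PERM_WRITE = 2 };

struct toy_page {           /* stands in for the "page" record */
    int node;               /* NUMA node currently holding the page */
    bool untouched;         /* the new "untouched" bit */
};

struct toy_vma {            /* stands in for vm_area_struct plus its PTEs */
    int vm_perms;           /* permissions recorded in the VMA */
    int *pte_perms;         /* per-page permissions in the page table */
    struct toy_page *pages;
    size_t npages;
};

/* Part 1: revoke access in the PTEs only and mark every page untouched. */
static void toy_madvise_access_lwp(struct toy_vma *vma)
{
    for (size_t i = 0; i < vma->npages; i++) {
        vma->pte_perms[i] = PERM_NONE;   /* vm_perms stays as it was */
        vma->pages[i].untouched = true;
    }
}

/* Part 2: on a fault, restore the PTE from the VMA and migrate the page.
 * Returns 1 if the page changed its node, 0 otherwise. */
static int toy_fault(struct toy_vma *vma, size_t i, int cur_node)
{
    struct toy_page *pg = &vma->pages[i];

    if (!pg->untouched)
        return 0;                        /* ordinary fault, not modelled */
    pg->untouched = false;
    vma->pte_perms[i] = vma->vm_perms;
    if (pg->node == cur_node)
        return 0;
    pg->node = cur_node;
    return 1;
}

/* Part 3: mprotect must not change the PTE of an untouched page. */
static void toy_mprotect(struct toy_vma *vma, int perms)
{
    vma->vm_perms = perms;               /* assumption: the VMA is updated */
    for (size_t i = 0; i < vma->npages; i++)
        if (!vma->pages[i].untouched)
            vma->pte_perms[i] = perms;
}
```

In this model, an mprotect call between madvise and the first touch cannot re-enable access behind the fault handler's back; the new permissions only become visible in the PTE once the page has been touched and migrated.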
With this patch, the kernel reduces the overhead of page distribution via
"affinity-on-next-touch" from 2518ms to 366ms compared to the user-level
approach. Currently, I'm evaluating the performance of the patch with some
other benchmarks and test applications (stream benchmark, Jacobi solver, PDE
I am very interested in your comments!
[1] Noordergraaf, L., van der Pas, R.: Performance Experiences on Sun's
WildFire Prototype. In: Proceedings of the 1999 ACM/IEEE Conference on
Supercomputing, Portland, Oregon, USA (November 1999)
[2] Löf, H., Holmgren, S.: affinity-on-next-touch: Increasing the
Performance of an Industrial PDE Solver on a cc-NUMA System. In: Proceedings
of the 19th Annual International Conference on Supercomputing, Cambridge,
Massachusetts, USA (June 2005) 387-392
[3] Terboven, C., an Mey, D., Schmidl, D., Jin, H., Reichstein, T.: Data
and Thread Affinity in OpenMP Programs. In: Proceedings of the 2008 Workshop
on Memory Access on future Processors: A solved problem?, ACM International
Conference on Computing Frontiers, Ischia, Italy (May 2008) 377-384