
[patch 2.6.0-test1] node affine NUMA scheduler extension

From:  Erich Focht <efocht@hpce.nec.com>
To:  LSE <lse-tech@lists.sourceforge.net>, "linux-kernel" <linux-kernel@vger.kernel.org>
Subject:  [Lse-tech] [patch 2.6.0-test1] node affine NUMA scheduler extension
Date:  Fri, 18 Jul 2003 18:29:43 +0200

No real change compared to the previous version; the patch was only
adapted to apply to 2.6.0-test1. The description from my previous
posting is appended below.

The patch shows 5-8% gain in the numa_test benchmark on a TX7 Itanium2
machine with 8 CPUs/4 nodes. The interesting numbers are ElapsedTime
and TotalUserTime. In numa_test I changed the PROBLEMSIZE from 1000000
to 2000000 in order to get longer execution/test times. The results
are averages over 10 measurements; the standard deviation is given in
brackets.

2.6.0-test1 kernel: original NUMA scheduler

Tasks   AverageUserTime   ElapsedTime   TotalUserTime    TotalSysTime
  4      52.67(3.51)      61.30(8.04)   210.70(14.05)     0.16(0.02)	
  8      50.29(1.85)      55.19(6.36)   402.38(14.78)     0.34(0.02)	
 16      53.27(2.30)     115.30(5.40)   852.40(36.75)     0.62(0.02)	
 32      51.92(1.13)     215.98(5.95)  1661.66(36.08)     1.21(0.04)	


2.6.0-test1 kernel: node affine NUMA scheduler

Tasks   AverageUserTime   ElapsedTime   TotalUserTime    TotalSysTime
  4      50.13(2.09)      56.72(8.46)   200.55(8.34)      0.15(0.01)	
  8      49.78(1.29)      54.43(4.90)   398.26(10.31)     0.34(0.02)	
 16      50.37(0.96)     110.79(8.46)   806.01(15.33)     0.63(0.03)	
 32      51.10(0.51)     210.18(3.27)  1635.40(16.16)     1.23(0.04)	
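
As a rough cross-check, the relative change per row can be recomputed
from the mean values in the two tables above. The C sketch below is not
part of the patch or of numa_test; the function and array names are made
up for illustration, and it only compares the means, ignoring the quoted
standard deviations.

```c
#include <assert.h>
#include <stdio.h>

/* Relative improvement, in percent, of a patched time over a baseline
 * time. Positive values mean the patched kernel needed less time. */
double gain_percent(double baseline, double patched)
{
    assert(baseline > 0.0);
    return 100.0 * (baseline - patched) / baseline;
}

/* Mean values copied from the two tables (rows: 4, 8, 16, 32 tasks). */
static const int    tasks[4]          = { 4, 8, 16, 32 };
static const double elapsed_orig[4]   = { 61.30, 55.19, 115.30, 215.98 };
static const double elapsed_affine[4] = { 56.72, 54.43, 110.79, 210.18 };
static const double user_orig[4]      = { 210.70, 402.38, 852.40, 1661.66 };
static const double user_affine[4]    = { 200.55, 398.26, 806.01, 1635.40 };

/* Print the per-row gain for ElapsedTime and TotalUserTime. */
void print_gains(void)
{
    for (int i = 0; i < 4; i++)
        printf("%2d tasks: ElapsedTime %.1f%%  TotalUserTime %.1f%%\n",
               tasks[i],
               gain_percent(elapsed_orig[i], elapsed_affine[i]),
               gain_percent(user_orig[i], user_affine[i]));
}
```

Calling print_gains() from a small main() prints one line per task
count, which makes it easy to see which rows carry most of the gain.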

To see the user time per CPU, one needs an additional patch that
restores the per-CPU times in /proc/pid/cpu. That patch comes in a
separate post.

> This patch is an adaptation of the earlier work on the node affine
> NUMA scheduler to the NUMA features that have since been integrated
> into 2.5. Compared to the patch posted for 2.5.39 this one is much
> simpler and easier to understand.
> 
> The main idea is (still) that tasks are assigned a homenode to which
> they are preferentially scheduled. They not only stick to a node as
> much as possible (as in the current 2.5 NUMA scheduler) but are also
> attracted back to their homenode if they had to be scheduled away.
> Therefore the tasks can be called "affine" to the homenode.
> 
> The implementation is straightforward:
> - Tasks have an additional element in their task structure (node).
> - The scheduler keeps track of the homenodes of the tasks running in
>   each node and on each runqueue.
> - At cross-node load balance time, nodes/runqueues that run tasks
>   originating from the stealer node are preferred. They get a weight
>   bonus for each task with the homenode of the stealer.
> - When stealing from a remote node, the stealer first tries to take
>   its own tasks (if any), then tasks from other nodes (if any). This
>   way tasks are kept on their homenode as long as possible.
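
The points in the list above can be illustrated with a small userspace
model. This is only a sketch and not code from the patch: struct
rq_sketch, HOMENODE_BONUS, MAX_NODES and all function names are
hypothetical, the bonus value of 1 is an arbitrary assumption, and
reading "tasks from other nodes" as "tasks whose homenode is neither the
stealer nor the victim node" is an interpretation. The final fallback to
tasks that are at home on the victim node is likewise assumed.

```c
#include <assert.h>

#define MAX_NODES      4  /* hypothetical; same node count as the test machine */
#define HOMENODE_BONUS 1  /* assumed weight bonus per task; not from the patch */

/* Hypothetical per-runqueue bookkeeping: total runnable tasks and how
 * many of them have each node as their homenode. */
struct rq_sketch {
    int node;                    /* node this runqueue belongs to */
    int nr_running;              /* all runnable tasks */
    int nr_homenode[MAX_NODES];  /* runnable tasks per homenode */
};

/* Weight of a remote runqueue as seen from this_node: its load plus a
 * bonus for every task whose homenode is the stealer's node. */
int steal_weight(const struct rq_sketch *rq, int this_node)
{
    assert(this_node >= 0 && this_node < MAX_NODES);
    return rq->nr_running + HOMENODE_BONUS * rq->nr_homenode[this_node];
}

/* Return the index of the remote runqueue with the highest weight,
 * or -1 if no remote runqueue has any runnable task. */
int find_victim(const struct rq_sketch *rqs, int nr_rqs, int this_node)
{
    int best = -1, best_weight = 0;

    for (int i = 0; i < nr_rqs; i++) {
        if (rqs[i].node == this_node)
            continue;            /* only cross-node balancing is modelled */
        int w = steal_weight(&rqs[i], this_node);
        if (w > best_weight) {
            best_weight = w;
            best = i;
        }
    }
    return best;
}

/* Decide which homenode's tasks to take from the victim runqueue:
 * first the stealer's own tasks, then tasks of a third node, and only
 * then (assumption) tasks that are at home on the victim node.
 * Returns -1 if the victim has nothing to give. */
int pick_homenode_to_steal(const struct rq_sketch *victim, int this_node)
{
    if (victim->nr_homenode[this_node] > 0)
        return this_node;
    for (int n = 0; n < MAX_NODES; n++)
        if (n != victim->node && victim->nr_homenode[n] > 0)
            return n;
    if (victim->nr_homenode[victim->node] > 0)
        return victim->node;
    return -1;
}
```

With equal load on two remote runqueues, the one holding a task whose
homenode is the stealer's node wins the comparison in find_victim(),
and that task is the first candidate to be pulled back.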
> 
> The selection of the homenode is currently done at initial load
> balancing, i.e. at exec(). A smarter selection method might be needed
> for improving the situation for multithreaded processes. An option is
> the dynamic_homenode patch I posted for 2.5.39 or some other scheme
> based on an RSS/node measure. But that's another story...
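
For illustration, homenode selection at exec() time could look roughly
like the following userspace sketch. It assumes that "initial load
balancing" means choosing the least loaded node, which the description
above does not spell out; struct task_sketch, the node_nr_running array
and both function names are hypothetical, and only the field name node
is taken from the description.

```c
#include <assert.h>

/* Hypothetical stand-in for the task structure; only the homenode
 * field (called "node" in the description) is modelled. */
struct task_sketch {
    int node;  /* homenode of the task */
};

/* Return the index of the node with the fewest runnable tasks.
 * Ties go to the lowest node number. */
int least_loaded_node(const int *node_nr_running, int nr_nodes)
{
    assert(nr_nodes > 0);
    int best = 0;

    for (int n = 1; n < nr_nodes; n++)
        if (node_nr_running[n] < node_nr_running[best])
            best = n;
    return best;
}

/* Assign the homenode once, at exec() time, and account for the task
 * on that node (assumed bookkeeping, not taken from the patch). */
void set_homenode_at_exec(struct task_sketch *t, int *node_nr_running,
                          int nr_nodes)
{
    t->node = least_loaded_node(node_nr_running, nr_nodes);
    node_nr_running[t->node]++;
}
```

Because the choice is made only once per exec(), threads created later
inherit whatever was chosen then, which is why a smarter scheme may be
needed for multithreaded processes.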

Regards,
Erich


[2. text/x-diff; node_affine_sched-2.6.0t1-23.diff]...


Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.