libnuma/numactl and NUMA API for 2.6 released

From:  Andi Kleen <ak@suse.de>
To:  lse-tech@lists.sourceforge.net
Subject:  [Lse-tech] libnuma/numactl and NUMA API for 2.6 released
Date:  Mon, 12 Jan 2004 17:09:16 +0100


An implementation of a NUMA policy API for Linux 2.6 has been released. It consists
of an implementation of the Linux kernel NUMA policy API discussed at the last kernel summit,
a higher-level library named libnuma for applications, a user-space policy tool
numactl, and some test programs. The libnuma interface is still very similar to
the older specification I posted some time ago (there were only a few minor
changes in it). numactl is also largely unchanged.

This version has been tested on x86-64. It should be portable to other architectures,
although you may first need to get system call numbers allocated for them and add those
to the user library and the kernel code.

This is still quite a rough release, but I think it is good enough now for some
wider testing and review.

It can be downloaded from:

ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.5.tar.gz
User space tools and libraries and manpages

ftp://ftp.suse.com/pub/people/ak/numa/numa-2.6.1-4.gz
Kernel patch for 2.6.1, with support for x86-64

The new kernel API supports several memory policies for NUMA systems:
MPOL_BIND       only allocate on a specific set of nodes
MPOL_PREFERRED  allocate preferably on a specific node, but fall back to others if that fails
MPOL_DEFAULT    (standard policy) allocate preferably on the current node and fall back to others
MPOL_INTERLEAVE interleave allocations across a specific set of nodes

It allows policies to be set for a process or for a memory area.
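
As a rough illustration of how the four policies listed above differ, here is a toy model in Python. It is not kernel code: the Policy class, pick_node() and the free_pages dictionary are made up for this sketch, and the fallback order (lowest-numbered node with free memory) is an assumption chosen only to keep the example short.

```python
# Toy model of the four policies; names and fallback order are assumptions.
MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE = range(4)


class Policy:
    def __init__(self, mode, nodes=()):
        self.mode = mode
        self.nodes = sorted(nodes)  # node set, or a single preferred node
        self.next_index = 0         # round-robin position (INTERLEAVE only)


def pick_node(policy, current_node, free_pages):
    """Return the node for the next page, or None if the allocation fails.

    free_pages maps node number -> number of free pages on that node.
    """
    def first_with_memory(candidates):
        for node in candidates:
            if free_pages.get(node, 0) > 0:
                return node
        return None

    all_nodes = sorted(free_pages)
    if policy.mode == MPOL_BIND:
        # Only the given nodes are eligible; no fallback outside the set.
        return first_with_memory(policy.nodes)
    if policy.mode == MPOL_PREFERRED:
        # Try the preferred node first, then any other node.
        return first_with_memory(policy.nodes[:1] + all_nodes)
    if policy.mode == MPOL_INTERLEAVE:
        # Rotate through the node set, one page per node in turn.
        # (This toy version ignores how much memory each node has left.)
        node = policy.nodes[policy.next_index % len(policy.nodes)]
        policy.next_index += 1
        return node
    # MPOL_DEFAULT: current node first, then any other node.
    return first_with_memory([current_node] + all_nodes)
```

With node 0 out of memory, a BIND policy on node 0 fails in this model while a PREFERRED policy on node 0 falls back to another node, which is the distinction the list draws.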

It adds three new system calls: 
mbind to set a policy for a specific memory area
See http://www.firstfloor.org/~andi/mbind.html

set_mempolicy to set the process policy for the current process
See http://www.firstfloor.org/~andi/set_mempolicy.html

get_mempolicy to get the memory policy for a memory area or a process.

This kernel API should normally not be used directly by programs; instead they
should use the higher-level libnuma. libnuma has a lot of functions to allocate
memory with various policies and discover the NUMA topology, and also some wrapper functions
for other system calls (e.g. for controlling scheduler affinity). You 

See http://www.firstfloor.org/~andi/numa.html for details

numactl is a command-line utility that allows you to run programs and their
children with a specific policy. You can use it like

	numactl --interleave=0-2 memhog 100m

to set an interleaving policy for nodes 0 to 2 for memhog. All memory
that memhog allocates will be interleaved across these nodes.
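
A node list such as 0-2 ends up as a set of node numbers. The following Python sketch shows one way such a string could be turned into a bit mask with one bit per node. The function name parse_nodes is hypothetical, and accepting comma-separated entries in addition to ranges is an assumption, not something taken from the numactl source.

```python
def parse_nodes(spec):
    """Turn a node list such as "0-2" or "0,3-4" into a bit mask (bit n = node n)."""
    mask = 0
    for part in spec.split(","):
        first, _, last = part.partition("-")
        lo = int(first)
        hi = int(last) if last else lo  # a single number is a one-node range
        if lo > hi:
            raise ValueError("bad node range: " + part)
        for node in range(lo, hi + 1):
            mask |= 1 << node
    return mask
```

For the command above, parse_nodes("0-2") gives a mask with bits 0, 1 and 2 set.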

There is also a program, numastat, that prints the new NUMA statistics from sysfs.

There are some test programs, especially a program called numademo that attempts
to benchmark most possible policy combinations on your machine.

Any feedback welcome, especially from bigger machines.

Some design issues in the kernel implementation:

All policy is always applied at fault time. This means that when you set a process
policy you have to fault pages in for it to take any effect. The higher-level API
takes care of that.
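
To make "fault pages in" concrete, the sketch below maps anonymous memory and writes one byte to every page, so that each page is actually allocated at that moment. It only shows the touching step: Python's standard library has no wrapper for set_mempolicy, so no policy is set here, and the helper name touch_pages is made up for this example.

```python
import mmap


def touch_pages(length):
    """Map `length` bytes of anonymous memory and write one byte to every page."""
    region = mmap.mmap(-1, length)
    touched = 0
    for offset in range(0, length, mmap.PAGESIZE):
        region[offset] = 0  # the write faults the page in
        touched += 1
    return region, touched
```

Whatever policy is in force when the loop runs is the one that decides where each page lands; changing the policy afterwards does not move pages that were already touched.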

Process policy is not persistent over swapping. This is not easily fixable. If you
need that persistency, use mbind().

Currently the interleaving state is per VMA. This implies that, for example, when you
set an interleave policy for a shared memory VMA, each process accessing
it does its own interleaving, which may end up with the object not being
very evenly interleaved. It would be better to share the interleaving
state for VMAs pointing to the same object between processes.
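
The effect can be seen in a small simulation. This is a toy model, not the kernel's algorithm: it assumes every process starts its own round-robin position at the first node of the set, and place_pages() is a name invented for this sketch.

```python
from collections import Counter


def place_pages(faults, nodes, shared_state):
    """Count pages per node for a sequence of page faults on one shared object.

    faults is a list of process ids, one per fault, in fault order.
    With shared_state=False each process keeps its own round-robin position;
    with shared_state=True all processes share a single position.
    """
    positions = {}
    placement = Counter()
    for pid in faults:
        key = "object" if shared_state else pid
        position = positions.get(key, 0)
        placement[nodes[position % len(nodes)]] += 1
        positions[key] = position + 1
    return placement
```

If four processes each fault one page of an object interleaved over nodes 0 and 1, the per-process variant puts all four pages on node 0 in this model, while the shared variant places two pages on each node.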

There should be a way to set a global policy for a file (especially in
hugetlbfs) or a shared memory object (related to the previous item).
It would also be useful for all files, to control the page cache.

Only the highest zone in the zone hierarchy of each node is policied. This
implies that on 32-bit systems there is no policy for the lowmem zone if
there is highmem, only for highmem. If the system doesn't have highmem,
the lowmem zone will be policied. The DMA zone cannot be policied.
On 64-bit systems it doesn't make any difference (except for DMA).

Known problems:

Needs more testing (especially all the corner cases in mbind and large
page support).

The sysfs cpu parser may not be completely up to date with the ever-changing
cpumap format. It works on a 4-node Opteron, but that is easy because the cpu
mask there fits into a single word.
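
For illustration, here is a sketch of a parser that also handles masks wider than one word. It assumes the format used by later kernels, comma-separated 32-bit hexadecimal words with the most significant word first; whether the 2.6.1 files look exactly like this is not confirmed here, and parse_cpumap is a hypothetical name, not the function in the library.

```python
def parse_cpumap(text):
    """Return the sorted list of CPU numbers set in a cpumap string."""
    mask = 0
    for word in text.strip().split(","):
        # Each word contributes 32 bits; earlier words are more significant.
        mask = (mask << 32) | int(word, 16)
    return [cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1]
```

A single-word mask such as "f" yields CPUs 0 to 3, and a two-word mask such as "00000001,00000000" yields CPU 32.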

The user space tools and libraries still have quite a few rough edges and need
more polishing.

The man pages need proofreading and cleaning up, especially get_mempolicy.2,
which is currently quite bad.

-Andi


Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds