From: Andi Kleen <ak@suse.de>
To: lse-tech@lse.sourceforge.net
Subject: [Lse-tech] Simple NUMA library for AMD64
Date: Sun, 18 May 2003 10:56:11 +0200
Cc: discuss@x86-64.org
Hello,
For some tuning on AMD64 I implemented simple prctl-based commands for a 2.4
kernel. They allow page interleaving memory over nodes, allocating memory
locally, or allocating memory only on a specific set of nodes.
To make it easier to use and shield the application from changing kernel
APIs (2.5 is already different) I implemented a simple higher level library
on top of it.
Here is the current specification of it. I would appreciate comments.
To be honest I'm not very interested in feature requests (e.g. if you
want a fully visible directed graph of node distances or similar, the simple
NUMA library is probably not the right place). I am more interested in
possible simplifications in places where this library would be hard to port
to other NUMA architectures.
There is a matching numactl utility to set all this on the command line.
I also attached its manpage.
Design notes:
It aims to be simple, not a great design.
In my implementation all NUMA policy is applied at fault time, not map time.
This currently shows through in the library and is a peculiarity of my
current implementation. 2.5 seems to be going more towards per-VMA policy. I
decided to keep it like this for now.
The reason I exposed this in the library is that I didn't want to add a new
numa_alloc_* family for shared memory. There are multiple ways to get
shared memory (mmap, shmat etc.), and adding them all to the library would
be quite complicated. Keeping the allocation of shared memory to the application
and just providing "police" functions to change the policy seemed simpler.
In an implementation with per-VMA policy this could possibly be done
by adding a new system call that changes the policy for an existing
VMA.
The way to deal with changing kernel APIs is very simple: the library is only
built as a shared library. If the kernel API changes, the shared library
can hopefully simply be replaced for existing applications.
It only deals with nodes, not CPUs. One reason for this is that it is
AMD64 centric, where CPU equals node, but even on other architectures with
multiple CPUs per node, settings more fine-grained than nodes do not seem to
be commonly used. Inside a node conventional SMP tunings can be used; there
is no need for a NUMA library.
The only possible exception is the CPU binding (numa_run_on_node*), but
node granularity seems to be enough for that too. If it should be a problem
the application can call sched_setaffinity directly.
Possible distance between nodes is ignored. On current AMD64 it doesn't
exist and it seems like a very big complication for little gain even
on other architectures. If it should be needed it can be read from
sysfs in 2.5.
The set of nodes is defined as unsigned long. I did this because I don't see
Linux breaking this limit any time soon (note this is talking about nodes, not
CPUs again; e.g. on a 64-bit machine with 4 CPUs per node the CPU limit is
256 CPUs). On AMD64 it allows up to 64 CPUs. I expect this to be
controversial, but the alternatives (defining bitset types etc.) seemed too ugly.
Homenode is a specific concept from my NUMA scheduler that may not exist
in others (e.g. it doesn't in 2.5). I decided to show it in the library for now,
but it's only a hint and could be ignored. The main reason I did this is
that the automatic balancing has some nasty corner cases where it may make
sense for the application or numactl to override it. The concept of a homenode is
different from just memory binding because it implies order.
Leaving aside the scheduler changes for the homenode hint, the kernel changes
needed to implement this are all rather simple. Basically they consist of only
a couple of prctls and some simple changes to the page allocation function.
The patch for the homenode NUMA scheduler is still in development and not
released yet.
-Andi
NUMA(3) Linux Programmer's Manual NUMA(3)
NAME
numa - NUMA policy library
SYNOPSIS
#include <numa.h>
cc ... -lnuma
int numa_available(void)
int numa_max_node(void)
int numa_homenode(void)
void numa_set_interleave_mask(unsigned long mask)
unsigned long numa_get_interleave_mask(void)
void numa_set_homenode(int node)
void numa_set_localalloc(int flag)
void numa_set_membind(unsigned long mask)
unsigned long numa_get_membind(void)
void *numa_alloc_stripped_subset(size_t size, unsigned long mask)
void *numa_alloc_stripped(size_t size)
void *numa_alloc_onnode(size_t size, int node)
void *numa_alloc_local(size_t size)
void *numa_alloc(size_t size)
void numa_free(void *mem, size_t size)
int numa_run_on_node_mask(unsigned long mask)
int numa_run_on_node(int node)
void numa_interleave_memory(void *mem, size_t size, unsigned long mask)
void numa_tonode_memory(void *mem, size_t size, int node)
void numa_setlocal_memory(void *mem, size_t size)
void numa_police_memory(void *mem, size_t size)
DESCRIPTION
libnuma offers a simple programming interface to the NUMA
policy supported by the Linux kernel. Available policies
are page interleaving, home node allocation, and local
allocation. It also allows binding threads to specific
nodes.
All policy is per thread, but inherited by children. If
you just want to set the global policy per process,
consider using the numactl(8) utility. Otherwise this
library offers fine-grained choices for applications.
All NUMA memory allocation policy takes effect only when a
page is actually faulted into the address space of a
process by accessing it. The numa_alloc_* functions take
care of this automatically.
A node is defined as an area where memory has the same
speed as seen from a particular CPU.
The mapping of nodes to CPUs depends on the architecture.
On the AMD64 architecture each CPU is its own node. This
library is concerned only with nodes.
Some of these functions accept a node mask. A node mask
is an unsigned long with a bit set for each node number
that is in the mask. Bits above numa_max_node are
undefined.
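As a minimal sketch of the node mask convention described above, the
following helpers build and check masks in plain C. The names
nodemask_of, nodemask_all, nodemask_valid and nodemask_isset are
hypothetical and not part of libnuma; they only illustrate the "one bit
per node number" encoding and the rule that no bit above numa_max_node
should be set.

```c
#include <assert.h>
#include <limits.h>

/* Number of nodes an unsigned long mask can describe (64 on AMD64). */
#define NODEMASK_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical helper: mask with only `node` set. */
unsigned long nodemask_of(int node)
{
    assert(node >= 0 && node < NODEMASK_BITS);
    return 1UL << node;
}

/* Hypothetical helper: mask with nodes 0..max_node set. */
unsigned long nodemask_all(int max_node)
{
    assert(max_node >= 0 && max_node < NODEMASK_BITS);
    /* Shifting by the full width of the type is undefined in C. */
    if (max_node == NODEMASK_BITS - 1)
        return ~0UL;
    return (1UL << (max_node + 1)) - 1;
}

/* Nonzero if no bit above max_node is set in mask. */
int nodemask_valid(unsigned long mask, int max_node)
{
    return (mask & ~nodemask_all(max_node)) == 0;
}

/* Nonzero if `node` is a member of mask. */
int nodemask_isset(unsigned long mask, int node)
{
    assert(node >= 0 && node < NODEMASK_BITS);
    return (int)((mask >> node) & 1UL);
}
```

With these, a mask for nodes 0 and 2 would be nodemask_of(0) |
nodemask_of(2), and max_node would be the value returned by
numa_max_node.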
Before any other calls in this library can be used,
numa_available must be called. When it returns a negative
value, all other functions in this library are undefined.
numa_max_node returns the highest node number available on
the current system. When a node number or a node mask with
a bit set above the value returned by this function is
passed to a libnuma function, the result is undefined.
numa_homenode returns the homenode of the current thread.
It is the node the kernel preferably allocates memory on,
unless some other policy overrides this.
numa_set_interleave_mask sets a memory interleave mask for
the current thread. All new memory allocations are page
interleaved over all nodes in the interleave mask. The
page interleaving only occurs on the actual page fault
that puts a new page into the current address space, not
during mmap. This is a low-level function; it may be more
convenient to use the higher-level functions like
numa_alloc_stripped or numa_alloc_stripped_subset.
numa_get_interleave_mask returns the current interleave
mask.
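To make the effect of an interleave mask more concrete, here is a small
model in C. It assumes pages are handed out round robin over the nodes in
the mask in ascending node order, one node per page fault. That ordering
is an assumption for illustration only, and the function names are
hypothetical; the kernel patch may choose nodes differently.

```c
#include <assert.h>
#include <limits.h>

/* Count the nodes present in a mask. */
int mask_weight(unsigned long mask)
{
    int n = 0;

    for (; mask != 0; mask &= mask - 1)  /* clears the lowest set bit */
        n++;
    return n;
}

/* Model only: node that receives the fault_index-th faulted page when
 * pages are distributed round robin over the nodes set in mask. */
int interleave_node(unsigned long mask, unsigned long fault_index)
{
    unsigned long skip;
    int node;

    /* An empty mask has no node to place a page on. */
    assert(mask != 0);

    skip = fault_index % (unsigned long)mask_weight(mask);
    for (node = 0; node < (int)(sizeof(mask) * CHAR_BIT); node++) {
        if ((mask >> node) & 1UL) {
            if (skip == 0)
                return node;
            skip--;
        }
    }
    return -1; /* not reached for a non-empty mask */
}
```

In this model, a mask containing nodes 0, 1 and 3 places successive page
faults on nodes 0, 1, 3, 0, 1, 3 and so on, which is why interleaving only
shows an effect for areas spanning many pages.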
numa_set_homenode sets the homenode for the current thread
to node. Homenode is the node memory is preferably allo-
cated from.
numa_set_localalloc sets a local memory allocation policy
for the current thread. When flag is nonzero, memory is
preferably allocated from the current node. Otherwise it
is allocated from the homenode. These are normally
identical, but can differ in some special situations.
numa_set_membind sets a memory allocation mask. Memory is
allocated only from the nodes set in mask. A mask of 0 or
-1UL turns membinding off.
numa_get_membind returns the current node mask from which
memory can be allocated. 0 or -1UL means all nodes.
numa_alloc_stripped allocates size bytes of memory, page
striped across all nodes. This function is relatively slow
and should only be used for large areas consisting of
multiple pages. The interleaving works on page level and
will only show an effect when the area is large. The
memory must be freed with numa_free.
numa_alloc_stripped_subset is like numa_alloc_stripped
except that it also accepts a mask of the nodes to inter-
leave on.
numa_alloc_onnode allocates memory on a specific node.
This function is relatively slow and allocations are
rounded to pagesize. The memory must be freed with
numa_free.
numa_alloc_local allocates memory on the local node. This
function is relatively slow and allocations are rounded to
pagesize. The memory must be freed with numa_free.
numa_alloc allocates memory with the current NUMA policy.
This function is relatively slow and allocations are
rounded to pagesize. The memory must be freed with
numa_free.
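Since the numa_alloc_* functions round allocations to the page size, a
small request still consumes at least one full page. The following sketch
shows the usual round-up arithmetic. It assumes the rounding is upwards to
the next multiple of the system page size, queried here with
sysconf(_SC_PAGESIZE); the helper names are hypothetical and not part of
libnuma.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Round size up to a multiple of page_size.
 * page_size must be a power of two. */
size_t round_up_to_page(size_t size, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (size + page_size - 1) & ~(page_size - 1);
}

/* Bytes a request of `size` would occupy on this system under the
 * round-up assumption stated above. */
size_t page_rounded_size(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(page > 0);
    return round_up_to_page(size, (size_t)page);
}
```

With a 4096 byte page, a 1 byte request occupies 4096 bytes and a 4097
byte request occupies 8192 bytes, so these functions are a poor fit for
many small objects.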
numa_free frees memory allocated by the numa_alloc_*
functions above.
numa_run_on_node runs the current thread on a specific
node. The thread will not migrate to other nodes until
this is reset with numa_run_on_node_mask with a -1UL
argument.
numa_run_on_node_mask runs the current thread only on a
specific node mask.
numa_interleave_memory is a lower level function to inter-
leave not yet faulted in, but allocated memory. Not yet
faulted in means the memory is allocated using mmap(2) or
shmat(2), but has not been accessed by the current process
yet. The memory is page interleaved to all nodes specified
in mask. Normally numa_alloc_stripped should be used for
private memory instead, but this function is useful to
handle shared memory areas. To be useful the memory area
should be significantly larger than a page.
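The ordering that matters for shared memory is: map the area, set the
policy, and only then touch the pages. The sketch below shows that order
with mmap(2). The policy step appears only as a comment, because the
library described here is not assumed to be installed; the function names
are hypothetical, and the mapping flags assume a Linux system that
supports MAP_SHARED | MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Step 1: map shared anonymous memory without touching it, so no page
 * has been faulted in yet. Returns NULL on failure. */
void *map_shared_untouched(size_t size)
{
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    return mem == MAP_FAILED ? NULL : mem;
}

/* Map, (set policy), touch, verify, unmap. Returns 0 on success. */
int demo_shared_area(size_t size)
{
    unsigned char *mem = map_shared_untouched(size);
    size_t i;

    if (mem == NULL)
        return -1;

    /* Step 2 (not compiled here): apply the policy before first access,
     * for example numa_interleave_memory(mem, size, mask). */

    /* Step 3: the first write faults every page in; this is the point
     * where a fault-time policy would pick the node for each page. */
    memset(mem, 0xAB, size);

    for (i = 0; i < size; i++) {
        if (mem[i] != 0xAB) {
            munmap(mem, size);
            return -1;
        }
    }
    return munmap(mem, size);
}
```

If the area is touched before the policy call, those pages have already
been placed, which is the constraint the paragraph above describes.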
numa_tonode_memory places memory on a specific node. The
constraints described for numa_interleave_memory apply
here too.
numa_setlocal_memory places memory on the current node.
The constraints described for numa_interleave_memory apply
here too.
numa_police_memory places memory according to the current
NUMA policy. The constraints described for
numa_interleave_memory apply here too.
NOTES
The kernel internal interface for libnuma is subject to
change. For this reason it is recommended to use libnuma
only as a shared library so that it can be easily replaced
for a new kernel.
BUGS
The library and the kernel interface used by it currently
assume internally that each CPU is its own node. This is
the case on the AMD64 architecture.
The maximum number of nodes supported by this API is
limited to 64 on 64-bit systems and 32 on 32-bit systems.
SEE ALSO
getpagesize(2), mmap(2), shmat(2)
AUTHOR
libnuma and the manpage were written by Andi Kleen.
SuSE Labs May 2003 NUMA(3)
NUMACTL(8) Linux Administrator's Manual NUMACTL(8)
NAME
numactl - Control NUMA policy for processes
SYNOPSIS
numactl [ --interleave=nodes ] [ --homenode=node ] [
--cpubind=cpus ] [ --membind=nodes ] [ --localalloc ]
command {arguments ...}
numactl [ -i nodes ] [ -h node ] [ -m nodes ] [ -b cpus ]
[ -l ] command {arguments ...}
numactl --show
DESCRIPTION
numactl runs processes with a specific NUMA scheduling or
memory placement policy. The policy is set for command
and inherited by all of its children.
Policy settings are:
--interleave=nodes, -i nodes
Set a memory interleave policy. Memory will be
allocated using round robin on nodes.
--homenode=node, -h node
Set the homenode to node. The homenode is the node the
process first tries to allocate memory from. Normally
it is assigned dynamically at exec(2) or fork(2) /
clone(2) time (the latter only when the
kernel.homenode_balance_threads sysctl is set). In
addition the scheduler gives strong preference to the
homenode to schedule the process near its memory. If the
memory allocation does not succeed on the homenode, the
allocation is tried on other nodes.
--membind=nodes, -m nodes
Only allocate memory from nodes.
--cpubind=cpus, -b cpus
Only execute process on cpus. The syntax for cpus
is the same as for node specifiers.
--localalloc, -l
Always allocate locally on the current node.
This overrides the homenode and interleave settings.
It is also the default when the vm.node_local_alloc
sysctl is set.
--show, -s
Show current NUMA policy settings.
Valid node specifiers
all                  All nodes.
number               Node containing CPU number.
number1{,number2}    Set of nodes containing the CPUs number1 and number2.
number1-number2      Nodes containing CPUs from number1 to number2.
! nodes              Invert selection of the following specification.
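As an illustration of the specifier syntax above, here is a minimal
parser sketch that turns a specifier string into an unsigned long node
mask. It is not numactl's actual code. It assumes that a CPU number is
the same as its node number (the AMD64 case), that lists and ranges may
be combined as in "0,2-3", and that max_node comes from something like
numa_max_node. The name parse_nodes is hypothetical.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Parse "all", "N", "N,M", "N-M" or a leading "!" inversion into a mask.
 * Returns 0 and stores the mask in *out, or -1 on a malformed or
 * out-of-range specifier. */
int parse_nodes(const char *spec, int max_node, unsigned long *out)
{
    const int bits = (int)(sizeof(unsigned long) * CHAR_BIT);
    unsigned long all, mask = 0;
    int invert = 0;

    if (max_node < 0 || max_node >= bits)
        return -1;
    all = (max_node == bits - 1) ? ~0UL : (1UL << (max_node + 1)) - 1;

    if (*spec == '!') {            /* "! nodes": invert what follows */
        invert = 1;
        spec++;
        while (*spec == ' ')
            spec++;
    }

    if (strcmp(spec, "all") == 0) {
        mask = all;
    } else {
        for (;;) {
            char *end;
            long lo, hi;

            if (*spec < '0' || *spec > '9')
                return -1;
            lo = strtol(spec, &end, 10);
            hi = lo;
            if (*end == '-') {     /* range: number1-number2 */
                spec = end + 1;
                if (*spec < '0' || *spec > '9')
                    return -1;
                hi = strtol(spec, &end, 10);
            }
            if (lo > hi || hi > max_node)
                return -1;
            for (; lo <= hi; lo++)
                mask |= 1UL << lo;
            if (*end == '\0')
                break;
            if (*end != ',')       /* list: number1,number2 */
                return -1;
            spec = end + 1;
        }
    }

    *out = invert ? (all & ~mask) : mask;
    return 0;
}
```

On a four node machine (max_node 3) this maps "all" to 0xF, "0,2" to 0x5,
"1-3" to 0xE and "! 0" to 0xE.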
EXAMPLES
numactl --interleave=all bigdatabase arguments
Run big database with its memory interleaved on all nodes.
numactl --homenode=0 --membind=0,1 process
Run process preferably on node 0 with memory allocated on
nodes 0 and 1.
SYSCTLS
kernel.homenode_balance_threads
Balance the homenode on fork and clone for all threads.
Otherwise it is only balanced at execve(2) time.
vm.node_local_alloc
Enable local node allocation policy for all processes.
This disables the homenode and NUMA policy settings except
CPU and memory binding.
NOTES
Requires a NUMA-aware kernel with the homenode scheduling
/ NUMA policy patch applied.
Command is not executed using a shell. If you want to use
shell metacharacters in the child use sh -c as wrapper.
FILES
/proc/cpuinfo for the listing of active CPUs. See proc(5)
for details.
/proc/numa for NUMA memory hit statistics.
BUGS
Currently only works on architectures where each node is
exactly one CPU (in particular AMD64).
SEE ALSO
fork(2), execve(2), clone(2), sched_setaffinity(2),
sched_getaffinity(2), proc(5)
AUTHOR
numactl was written by Andi Kleen.
SuSE Labs May 2003 NUMACTL(8)