Simple NUMA library for AMD64

From:  Andi Kleen <ak@suse.de>
To:  lse-tech@lse.sourceforge.net
Subject:  [Lse-tech] Simple NUMA library for AMD64
Date:  Sun, 18 May 2003 10:56:11 +0200
Cc:  discuss@x86-64.org


Hallo,

For some tunings on AMD64 I implemented simple prctl based commands for a 2.4
kernel. They allow page interleaving memory over nodes, allocating memory
locally, or allocating memory only on a specific set of nodes.

To make it easier to use and shield the application from changing kernel
APIs (2.5 is already different) I implemented a simple higher level library
on top of it.

Here is the current specification of it. I would appreciate comments.

To be honest I'm not very interested in feature requests (e.g. if you
want a fully directed visible graph of node distances or similar, the simple
NUMA library is probably not the right place). I am more interested in possible
simplifications where this library would be hard to port to other NUMA
architectures.

There is a matching numactl utility to set all this on the command line.
I also attached its manpage.

Design notes:

It aims to be simple, not a great design. 

In my implementation all NUMA policy is applied at fault time, not map time.
This currently shines through in the library. This is a particularity of my 
current implementation.  2.5 seems to go more towards per VMA policy. I 
decided to keep it like this for now.

The reason I exposed this in the library is that I didn't want to add a new 
numa_alloc_* family for shared memory. There are multiple ways to get
shared memory (mmap, shmat etc.), and adding them all to the library would 
be quite complicated. Leaving the allocation of shared memory to the application
and just providing "police" functions to change the policy seemed simpler.

In an implementation with per VMA policy this could possibly be done
by adding a new system call that changes the policy for an existing
VMA.

The way to deal with changing kernel APIs is very simple: it's only
compiled as a shared library. If the kernel API changes, the shared library
can hopefully simply be replaced for existing applications.

It only deals with nodes, not CPUs. One reason for this is that it is
AMD64 centric where CPU equals node, but even on other architectures with
multiple CPUs per node, settings more fine-grained than nodes do not seem to be
commonly used. Inside a node conventional SMP tunings can be used; there is no
need for a NUMA library.

The only possible exception is the CPU binding (numa_run_on_node*), but
node granularity seems to be enough for that too. If it should be a problem
the application can call sched_setaffinity directly.

Possible distance between nodes is ignored. On current AMD64 it doesn't 
exist and it seems like a very big complication for little gain even
on other architectures. If it should be needed it can be read from 
sysfs in 2.5.

The set of nodes is defined as unsigned long. I did this because I don't see
Linux breaking this limit any time soon (note this is talking about nodes, not
CPUs again; e.g. on a 64bit machine with 4 CPUs per node the CPU limit is 256
CPUs). On AMD64 it allows up to 64 CPUs. I expect this to be controversial, but
the alternatives (defining bitset types etc.) seemed too ugly.
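
To make the "unsigned long as a set of nodes" idea concrete, here is a minimal
sketch of the bit arithmetic involved. The helper names (nodemask_add,
nodemask_has, nodemask_count) and the NODEMASK_BITS macro are hypothetical and
are not part of the library described in this post; only the representation,
one bit per node number, comes from the text above.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helpers for illustration; not part of libnuma. */

/* Number of nodes one unsigned long can describe (64 on AMD64). */
#define NODEMASK_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Return mask with the bit for node added. */
static unsigned long nodemask_add(unsigned long mask, int node)
{
    assert(node >= 0 && node < NODEMASK_BITS);
    return mask | (1UL << node);   /* 1UL, not 1: the shift must be long-wide */
}

/* Return non-zero if node is a member of mask. */
static int nodemask_has(unsigned long mask, int node)
{
    assert(node >= 0 && node < NODEMASK_BITS);
    return (int)((mask >> node) & 1UL);
}

/* Count the nodes in mask. */
static int nodemask_count(unsigned long mask)
{
    int n = 0;

    while (mask) {
        mask &= mask - 1;          /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

With helpers like these, the mask for nodes 0 and 2 would be built as
nodemask_add(nodemask_add(0, 0), 2), which is 0x5.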

Homenode is a specific concept from my NUMA scheduler that may not exist
in others (e.g. it doesn't in 2.5). I decided to show it in the library for now,
but it's only a hint and could be ignored. The main reason I did this is
that the automatic balancing has some nasty corner cases where it may make
sense for the application or numactl to override it. The concept of a homenode
is different from just memory binding because it implies order.

Ignoring the homenode scheduler changes, the changes in the kernel to implement
this are all rather simple. Basically they only consist of a couple
of prctls and some simple changes to the page allocation function.

The patch for the homenode NUMA scheduler is still in development and not 
released yet.

-Andi


NUMA(3)             Linux Programmer's Manual             NUMA(3)



NAME
       numa - NUMA policy library

SYNOPSIS
       #include <numa.h>

       cc ... -lnuma

       int numa_available(void)

       int numa_max_node(void)
       int numa_homenode(void)

       void numa_set_interleave_mask(unsigned long mask)
       unsigned long numa_get_interleave_mask(void)
       void numa_set_homenode(int node)
       void numa_set_localalloc(int flag)
       void numa_set_membind(unsigned long mask)
       unsigned long numa_get_membind(void)

       void *numa_alloc_stripped_subset(size_t size, unsigned long mask)
       void *numa_alloc_stripped(size_t size)
       void *numa_alloc_onnode(size_t size, int node)
       void *numa_alloc_local(size_t size)
       void *numa_alloc(size_t size)
       void numa_free(void *mem, size_t size)

       int numa_run_on_node_mask(unsigned long mask)
       int numa_run_on_node(int node)

       void numa_interleave_memory(void *mem, size_t size, unsigned long mask)
       void numa_tonode_memory(void *mem, size_t size, int node)
       void numa_setlocal_memory(void *mem, size_t size)
       void numa_police_memory(void *mem, size_t size)

DESCRIPTION
       libnuma offers a simple programming interface to the NUMA
       policy supported by the Linux kernel. The available policies
       are page interleaving, home node allocation, and local
       allocation. It also allows binding threads to specific nodes.
       All policy is per thread, but is inherited by children. If
       you just want to set the global policy per process, consider
       using the numactl(8) utility. Otherwise this library offers
       fine-grained choices for applications.

       NUMA memory allocation policy only takes effect when a page
       is actually faulted into the address space of a process by
       accessing it. The numa_alloc_* functions take care of this
       automatically.

       A node is defined as an area where memory has the same
       speed as seen from a particular CPU.

       The mapping of nodes to CPUs depends on the architecture.
       On the AMD64 architecture each CPU is its own node. This
       library is only concerned with nodes.

       Some of these functions accept a node mask.  A  node  mask
       is  an  unsigned  long with a bit set for each node number
       that  is  in  the  mask.  Bits  above  numa_max_node   are
       undefined.

       numa_available must be called before any other call in this
       library can be used. When it returns a negative value, the
       behaviour of all other functions in this library is
       undefined.

       numa_max_node returns the highest node number available on
       the current system. When a node number above the value
       returned by this function, or a node mask with a bit set
       above it, is passed to a libnuma function, the result is
       undefined.
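
       Since masks with bits above the highest node give undefined
       results, a caller may want to check a mask first. The following is
       a minimal sketch; nodemask_valid is a hypothetical helper, and its
       max_node parameter stands for the value an application would get
       from numa_max_node.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper for illustration; not part of libnuma.
   Return non-zero if mask has no bit set above max_node. */
static int nodemask_valid(unsigned long mask, int max_node)
{
    int bits = (int)(sizeof(unsigned long) * CHAR_BIT);

    assert(max_node >= 0 && max_node < bits);
    if (max_node == bits - 1)
        return 1;                  /* every bit names an existing node */
    /* Shift out all valid bits; anything left is above max_node. */
    return (mask >> (max_node + 1)) == 0;
}
```

       On a four node system (highest node 3), the mask 0x7 passes the
       check and the mask 0x10 does not.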

       numa_homenode returns the homenode of the current thread.
       It is the node the kernel preferably allocates memory on,
       unless some other policy overrides this.

       numa_set_interleave_mask sets a memory interleave mask for
       the current thread. All new memory allocations are page
       interleaved over all nodes in the interleave mask. The
       page interleaving only occurs on the actual page fault
       that puts a new page into the current address space, not
       during mmap. This is a low level function; it may be more
       convenient to use the higher level functions like
       numa_alloc_stripped or numa_alloc_stripped_subset.
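
       Page interleaving over a mask can be modelled as handing out
       successive pages to the nodes in the mask in turn. The sketch
       below is a user-space model only: interleave_node is a
       hypothetical function, and the assumption that the rotation
       starts at the lowest node in the mask is made for illustration.
       The kernel's actual starting point and ordering are not specified
       here.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical model for illustration; not part of libnuma.
   Return the node that receives the page with index page_index when
   pages are distributed round robin over the nodes set in mask,
   starting with the lowest node (an assumption of this sketch). */
static int interleave_node(unsigned long mask, unsigned long page_index)
{
    int bits = (int)(sizeof(unsigned long) * CHAR_BIT);
    unsigned long count = 0;
    unsigned long skip;

    assert(mask != 0);             /* an empty mask has no node to pick */
    for (unsigned long m = mask; m; m &= m - 1)
        count++;                   /* number of nodes in the mask */

    skip = page_index % count;     /* position within one rotation */
    for (int node = 0; node < bits; node++) {
        if ((mask >> node) & 1UL) {
            if (skip == 0)
                return node;
            skip--;
        }
    }
    return -1;                     /* not reached for a non-empty mask */
}
```

       With the mask 0xb (nodes 0, 1 and 3), pages 0, 1, 2 and 3 map to
       nodes 0, 1, 3 and 0 in this model, which also shows why an area
       of only one or two pages gains little from interleaving.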

       numa_get_interleave_mask  returns  the  current interleave
       mask.

       numa_set_homenode sets the homenode for the current thread
       to  node.  Homenode is the node memory is preferably allo-
       cated from.

       numa_set_localalloc sets a local memory allocation policy
       for the current thread. When flag is non-zero, memory is
       preferably allocated from the current node. Otherwise it
       is allocated from the homenode. These are normally
       identical, but can differ in some special situations.

       numa_set_membind sets a memory allocation mask. Memory is
       only allocated from the nodes set in mask. A mask of 0 or
       -1UL turns membinding off.

       numa_get_membind returns the current node mask from  which
       memory can be allocated.  0 or -1UL means all nodes.
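
       The special meaning of 0 and -1UL is easy to get wrong when
       interpreting a membind mask, because a mask of 0 would otherwise
       read as "no nodes". A minimal sketch of the interpretation
       described above, using a hypothetical helper membind_allows:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper for illustration; not part of libnuma.
   Return non-zero if memory may be allocated on node under the given
   membind mask. Masks of 0 and -1UL both mean "all nodes". */
static int membind_allows(unsigned long mask, int node)
{
    assert(node >= 0 && node < (int)(sizeof(unsigned long) * CHAR_BIT));
    if (mask == 0 || mask == -1UL)
        return 1;                  /* membinding is off */
    return (int)((mask >> node) & 1UL);
}
```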

       numa_alloc_stripped allocates size bytes of memory, page
       interleaved over all nodes. This function is relatively slow
       and should only be used for large areas consisting of
       multiple pages. The interleaving works at page level and
       will only show an effect when the area is large. The memory
       must be freed with numa_free.

       numa_alloc_stripped_subset  is  like   numa_alloc_stripped
       except  that it also accepts a mask of the nodes to inter-
       leave on.

       numa_alloc_onnode allocates memory on a specific node.
       This function is relatively slow and allocations are
       rounded to pagesize. The memory must be freed with
       numa_free.

       numa_alloc_local  allocates memory on the local node. This
       function is relatively slow and allocations are rounded to
       pagesize. The memory must be freed with numa_free.

       numa_alloc  allocates memory with the current NUMA policy.
       This function  is  relatively  slow  and  allocations  are
       rounded  to  pagesize.  The  memory  must  be  freed  with
       numa_free.
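
       Rounding to pagesize means that even a one byte request occupies
       a whole page. The arithmetic can be sketched as follows;
       round_to_pagesize is a hypothetical helper, and rounding up
       (rather than down) is an assumption, since the text above only
       says "rounded". The page size would normally come from
       getpagesize(2).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration; not part of libnuma.
   Round size up to the next multiple of pagesize. pagesize must be a
   power of two. Overflow for sizes near SIZE_MAX is not handled. */
static size_t round_to_pagesize(size_t size, size_t pagesize)
{
    assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
    return (size + pagesize - 1) & ~(pagesize - 1);
}
```

       With 4096 byte pages, requests of 1, 4096 and 4097 bytes become
       4096, 4096 and 8192 bytes, which is one reason these functions
       are a poor fit for many small allocations.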

       numa_free frees memory allocated by the numa_alloc_*
       functions above.

       numa_run_on_node runs the current thread on a specific
       node. The thread will not migrate to other nodes until
       this is reset by calling numa_run_on_node_mask with a -1UL
       argument.

       numa_run_on_node_mask runs the current thread  only  on  a
       specific node mask.

       numa_interleave_memory is a lower level function to inter-
       leave not yet faulted in, but allocated  memory.  Not  yet
       faulted  in means the memory is allocated using mmap(2) or
       shmat(2), but has not been accessed by the current process
       yet. The memory is page interleaved to all nodes specified
       in mask.  Normally numa_alloc_stripped should be used  for
       private  memory  instead,  but  this function is useful to
       handle shared memory areas. To be useful the  memory  area
       should be significantly larger than a page.

       numa_tonode_memory  locates memory on a specific node. The
       constraints  described  for  numa_interleave_memory  apply
       here too.

       numa_setlocal_memory  locates  memory on the current node.
       The constraints described for numa_interleave_memory apply
       here too.

       numa_police_memory  locates  memory  with the current NUMA
       policy. The constraints described for numa_interleave_mem-
       ory apply here too.


NOTES
       The kernel internal interface for libnuma is subject to
       change. For this reason it is recommended to only use
       libnuma as a shared library so that it can be easily
       replaced for a new kernel.


BUGS
       The library and the kernel interface used by it currently
       assume internally that each CPU is its own node. This is
       the case on the AMD64 architecture.

       The maximum number of nodes supported by this API is
       limited to 64 on 64bit systems and 32 on 32bit systems.


SEE ALSO
       getpagesize(2), mmap(2), shmat(2)


AUTHOR
       libnuma and the manpage were written by Andi Kleen.



SuSE Labs                    May 2003                     NUMA(3)


NUMACTL(8)         Linux Administrator's Manual        NUMACTL(8)



NAME
       numactl - Control NUMA policy for processes

SYNOPSIS
       numactl [ --interleave=nodes ] [ --homenode=homenode ] [
       --cpubind=cpus ] [ --membind=nodes ] [ --localalloc ]
       command {arguments ...}
       numactl [ -i nodes ] [ -h homenode ] [ -m nodes ] [ -b
       cpus ] [ -l ] command {arguments ...}
       numactl --show

DESCRIPTION
       numactl runs processes with a specific NUMA scheduling  or
       memory  placement  policy.   The policy is set for command
       and inherited by all of its children.

       Policy settings are:

       --interleave=nodes, -i nodes
              Set a memory interleave policy. Memory will be
              allocated using round robin on nodes.

       --homenode=node, -h node
              Set the homenode to node. The homenode is the node
              the process first tries to allocate memory from.
              Normally it is assigned dynamically at exec(2) or
              fork(2) / clone(2) time (the latter only when the
              kernel.homenode_balance_threads sysctl is set). In
              addition the scheduler gives strong preference to
              the homenode to schedule the process near its
              memory. If the memory allocation does not succeed
              on the homenode, the allocation is tried on other
              nodes.

       --membind=nodes, -m nodes
              Only allocate memory from nodes.

       --cpubind=cpus, -b cpus
              Only  execute  process on cpus. The syntax for cpus
              is the same as for node specifiers.

       --localalloc, -l
              Always allocate locally on the current node. This
              overrides the homenode and interleave settings. It
              is also the default when the vm.node_local_alloc
              sysctl is set.

       --show, -s
              Show current NUMA policy settings.

       Valid node specifiers
              all                 All nodes
              number              Node containing CPU number.
              number1{,number2}   Set of nodes containing the CPUs number1 and number2
              number1-number2     Nodes containing CPUs from number1 to number2
              ! nodes             Invert selection of the following specification.
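
       How such a specifier turns into a node mask can be sketched as
       follows. This is not the numactl source: parse_nodes is a
       hypothetical function, it assumes the AMD64 case where CPU number
       and node number coincide, and it takes the highest valid node as a
       parameter instead of querying the system.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parser for illustration; not the numactl implementation.
   Parse "all", "n", "n1,n2", "n1-n2" or "!spec" into a node mask.
   Assumes CPU number == node number. max_node is the highest valid
   node. Returns 0 on success, -1 on a syntax or range error. */
static int parse_nodes(const char *s, int max_node, unsigned long *out)
{
    int bits = (int)(sizeof(unsigned long) * CHAR_BIT);
    unsigned long all, mask = 0;
    int invert = 0;

    assert(max_node >= 0 && max_node < bits);
    all = (max_node == bits - 1) ? ~0UL : (1UL << (max_node + 1)) - 1;

    if (strcmp(s, "all") == 0) {
        *out = all;
        return 0;
    }
    if (*s == '!') {               /* invert the specification that follows */
        invert = 1;
        s++;
        while (*s == ' ')
            s++;
    }
    for (;;) {
        char *end;
        unsigned long lo, hi;

        if (!isdigit((unsigned char)*s))
            return -1;
        lo = hi = strtoul(s, &end, 10);
        s = end;
        if (*s == '-') {           /* range number1-number2 */
            s++;
            if (!isdigit((unsigned char)*s))
                return -1;
            hi = strtoul(s, &end, 10);
            s = end;
        }
        if (lo > hi || hi > (unsigned long)max_node)
            return -1;
        for (unsigned long n = lo; n <= hi; n++)
            mask |= 1UL << n;
        if (*s == '\0')
            break;
        if (*s != ',')             /* only a comma may separate items */
            return -1;
        s++;
    }
    *out = invert ? (all & ~mask) : mask;
    return 0;
}
```

       On a four node system (max_node 3), "0,2" gives the mask 0x5,
       "1-3" gives 0xe and "!0" also gives 0xe in this sketch.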

EXAMPLES
       numactl --interleave=all bigdatabase arguments
              Run big database with its memory interleaved on all
              nodes.

       numactl --homenode=0 --membind=0,1 process
              Run process preferably on node 0 with memory
              allocated on nodes 0 and 1.


SYSCTLS
       kernel.homenode_balance_threads
              Balance the homenode on fork and clone for all
              threads. Otherwise it is only balanced at execve(2)
              time.

       vm.node_local_alloc
              Enable local node allocation policy for all
              processes. This disables the homenode and NUMA
              policy settings except CPU and memory binding.


NOTES
       Requires a NUMA aware kernel with the homenode scheduling
       / NUMA policy patch applied.

       Command is not executed using a shell. If you want to use
       shell metacharacters in the child, use sh -c as a wrapper.


FILES
       /proc/cpuinfo for the listing of active CPUs. See  proc(5)
       for details.

       /proc/numa for NUMA memory hit statistics.

BUGS
       Currently only works on architectures where each node is
       a single CPU (in particular on AMD64).


SEE ALSO
       fork(2), execve(2), clone(2), sched_setaffinity(2),
       sched_getaffinity(2), proc(5)

AUTHOR
       numactl was written by Andi Kleen.



SuSE Labs                    May 2003                  NUMACTL(8)



Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds