
Huge pages part 3: Administration

March 3, 2010

This article was contributed by Mel Gorman

[Editor's note: this is the third part in Mel Gorman's series on the use of huge pages in Linux. For those who missed them, a look at part 1 and part 2 is recommended before diving into this installment.]

In this chapter, the setup and administration of huge pages within the system is addressed. Part 2 discussed the different interfaces between user and kernel space, such as hugetlbfs and shared memory. For an application to use these interfaces, though, the system must first be properly configured. Use of hugetlbfs requires only that the filesystem be mounted; shared memory needs additional tuning, and in both cases huge pages must be allocated. Huge pages can be statically allocated as part of a pool early in the lifetime of the system, or the pool can be allowed to grow dynamically as required. Libhugetlbfs provides a hugeadm utility that removes much of the tedium involved in these tasks.

1 Identifying Available Page Sizes

Since kernel 2.6.27, Linux has supported more than one huge page size if the underlying hardware does. There will be one directory per page size supported in /sys/kernel/mm/hugepages and the "default" huge page size will be stored in the Hugepagesize field in /proc/meminfo.
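For example, on an x86-64 machine supporting both 2M and 1G pages, the supported sizes and the default might appear as follows (the output shown is illustrative):

    $ ls /sys/kernel/mm/hugepages
    hugepages-1048576kB  hugepages-2048kB
    $ grep Hugepagesize /proc/meminfo
    Hugepagesize:       2048 kB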

The default huge page size can be important. While hugetlbfs can specify the page size at mount time, the same option is not available for shared memory or MAP_HUGETLB. This matters, for example, when using 1G pages on AMD or 16G pages on Power 5+ and later. The default huge page size can be set either with the last hugepagesz= option on the kernel command line (see below) or explicitly with default_hugepagesz=.
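As a hypothetical illustration, the following command line fragment would make 1G the default huge page size on an x86-64 machine that supports it, while also allocating four such pages at boot:

    default_hugepagesz=1G hugepagesz=1G hugepages=4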

Libhugetlbfs provides two means of identifying the huge page sizes. The first is the pagesize utility: the -H switch prints the available huge page sizes and -a shows all page sizes. The programming equivalents are the gethugepagesizes() and getpagesizes() calls.
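For example (the sizes printed depend entirely on the machine; this output assumes an x86-64 system with 2M huge pages):

    $ pagesize -H
    2097152
    $ pagesize -a
    4096
    2097152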

2 Huge Page Pool

Due to the inability to swap huge pages, none are allocated by default, so a pool must be configured with either a static or a dynamic size. The static size is the number of huge pages that are pre-allocated and guaranteed to be available for use by applications. Where it is known in advance how many huge pages are required, the static size should be set.

The size of the static pool may be set in a number of ways. First, it may be set at boot-time using the hugepages= kernel boot parameter. If there are multiple huge page sizes, the hugepagesz= parameter must be used and interleaved with hugepages= as described in Documentation/kernel-parameters. For example, Power 5+ and later support multiple page sizes including 64K and 16M; both could be configured with:

    hugepagesz=64k hugepages=128 hugepagesz=16M hugepages=4

Second, the pool for the default huge page size can be resized at runtime with the vm.nr_hugepages sysctl. Third, the pool for any supported page size may be set via sysfs by writing to the appropriate nr_hugepages virtual file below /sys/kernel/mm/hugepages.
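As an illustration, the following resizes the default pool to 128 pages with sysctl, then sets a pool of four 16M pages (as in the Power example above) through sysfs; the sysfs directory name encodes the page size in kilobytes and the figures here are arbitrary:

    $ sysctl -w vm.nr_hugepages=128
    $ echo 4 > /sys/kernel/mm/hugepages/hugepages-16384kB/nr_hugepages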

Knowing the exact huge page requirements in advance may not be possible. For example, the huge page requirements may be expected to vary throughout the lifetime of the system. In this case, the maximum number of additional huge pages that may be allocated is specified with the vm.nr_overcommit_hugepages sysctl. When the static pool does not have sufficient pages to satisfy a request, the kernel attempts to allocate up to nr_overcommit_hugepages additional huge pages dynamically. If that allocation fails, mmap() fails rather than risking a fatal page fault later, as described in Huge Page Fault Behaviour in part 1.
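The overcommit limit is set in the same fashion; for example, to allow up to 32 additional default-sized huge pages to be allocated on demand (an arbitrary figure):

    $ sysctl -w vm.nr_overcommit_hugepages=32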

It is easiest to tune the pools with hugeadm. The --pool-pages-min argument specifies the minimum number of huge pages that are guaranteed to be available. The --pool-pages-max argument specifies the maximum number of huge pages that will exist in the system, whether statically or dynamically allocated. The page size can be specified explicitly or simply given as DEFAULT. The amount to allocate can be expressed either as a number of huge pages or as a size requirement.

In the following example, run on an x86 machine, the 4M huge page pool is being tuned. As 4M also happens to be the default huge page size, the two pool sizes could equally have been specified as DEFAULT:32M and DEFAULT:64M respectively.

    $ hugeadm --pool-pages-min 4M:32M
    $ hugeadm --pool-pages-max 4M:64M
    $ hugeadm --pool-list
          Size  Minimum  Current  Maximum  Default
       4194304        8        8       16        *

To confirm the huge page pools are tuned to the satisfaction of requirements, hugeadm --pool-list will report on the minimum, maximum and current usage of huge pages of each size supported by the system.

3 Mounting HugeTLBFS

To access the special filesystem described in HugeTLBFS in part 2, it must first be mounted. What may be less obvious is that mounting is also required to benefit from the allocation API or to automatically back segments with huge pages (as also described in part 2). The default huge page size is used for the mount if the pagesize= option is not specified. The following example mounts two filesystem instances with different page sizes, as supported on Power 5+.

  $ mount -t hugetlbfs none /mnt/hugetlbfs-default
  $ mount -t hugetlbfs none /mnt/hugetlbfs-64k -o pagesize=64K

Ordinarily it would be the responsibility of the administrator to set the permissions on this filesystem appropriately. hugeadm provides a range of options for creating mount points with different permissions. The options, listed below, are largely self-explanatory.

--create-mounts
Creates a mount point for each available huge page size on this system under /var/lib/hugetlbfs.

--create-user-mounts <user>
Creates a mount point for each available huge page size under /var/lib/hugetlbfs/<user> usable by user <user>.

--create-group-mounts <group>
Creates a mount point for each available huge page size under /var/lib/hugetlbfs/<group> usable by group <group>.

--create-global-mounts
Creates a mount point for each available huge page size under /var/lib/hugetlbfs/global usable by anyone.

It is up to the discretion of the administrator whether to call hugeadm from a system initialization script or to create appropriate fstab entries. If it is unclear what mount points already exist, use --list-all-mounts to list all current hugetlbfs mounts and the options used.
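As a sketch, an fstab entry for a default-sized mount writable by group 3000 might look like the following (the mount point and GID are examples only):

    none /mnt/hugetlbfs-default hugetlbfs mode=1770,gid=3000 0 0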

3.1 Quotas

A little-used feature of hugetlbfs is quota support, which limits the number of huge pages that a filesystem instance can use even if more huge pages are available in the system. The expected use case is to limit the number of huge pages available to a user or group. While quotas are not currently supported by hugeadm, one can be set with the size= option at mount time.
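For example, the following hypothetical mount limits the filesystem instance to 256MB of huge pages, however many are available system-wide (the value is rounded to a multiple of the huge page size):

    $ mount -t hugetlbfs none /mnt/hugetlbfs-quota -o size=256M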

4 Enabling Shared Memory Use

There are two tunables relevant to the use of huge pages with shared memory. The first is kernel.shmmax, which limits the maximum size of a shared memory segment; it can be configured permanently in /etc/sysctl.conf or temporarily via /proc/sys/kernel/shmmax. The second is vm.hugetlb_shm_group, which stores the group ID (GID) that is allowed to create shared memory segments backed by huge pages. For example, if a JVM using shared memory with huge pages ran as the user jvm with UID 1500 and GID 3000, then this tunable should be set to 3000.
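Continuing that hypothetical JVM example, both tunables could be set by hand as follows (the shmmax value of 4GB is arbitrary, but it must be at least as large as the intended segment):

    $ sysctl -w kernel.shmmax=4294967296
    $ sysctl -w vm.hugetlb_shm_group=3000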

Again, hugeadm is able to tune both of these parameters with the switches --set-recommended-shmmax and --set-shm-group. As the recommended value is calculated based on the size of the static and dynamic huge page pools, this should be called after the pools have been configured.
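For the same example, the equivalent hugeadm invocations would be (the GID is again the illustrative 3000; a group name is also accepted):

    $ hugeadm --set-recommended-shmmax
    $ hugeadm --set-shm-group 3000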

5 Huge Page Allocation Success Rates

If the huge page pool is statically allocated at boot-time, then this section will not be relevant as the huge pages are guaranteed to exist. In the event the system needs to dynamically allocate huge pages throughout its lifetime, then external fragmentation may be a problem. "External fragmentation" in this context refers to the inability of the system to allocate a huge page even if enough memory is free overall because the free memory is not physically contiguous. There are two means by which external fragmentation can be controlled, greatly increasing the success rate of huge page allocations.

The first means is tuning vm.min_free_kbytes to a higher value, which helps the kernel's fragmentation-avoidance mechanism. The exact value depends on the type of system, the number of NUMA nodes and the huge page size, but hugeadm can calculate and set it with the --set-recommended-min_free_kbytes switch. If necessary, the effectiveness of this step can be measured using the mm_page_alloc_extfrag tracepoint and ftrace, although doing so is beyond the scope of this article.
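No value needs to be supplied, as hugeadm computes the recommendation itself; the result can be checked afterward (the figure shown is illustrative):

    $ hugeadm --set-recommended-min_free_kbytes
    $ cat /proc/sys/vm/min_free_kbytes
    67584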

While the static huge page pool is guaranteed to be available because it has already been allocated, tuning min_free_kbytes improves the success rate when dynamically growing the pool beyond its minimum size. The static pool sets the lower bound, but there is no guaranteed upper bound on the number of huge pages that are available. For example, an administrator might request a minimum pool of 1G and a maximum pool of 8G, but fragmentation may mean that the real upper bound is 4G.

If a guaranteed upper bound is required, a memory partition can be created using either the kernelcore= or movablecore= switch at boot time. These switches create a “Movable” zone that can be seen in /proc/zoneinfo or /proc/buddyinfo. Only pages that the kernel can migrate or reclaim exist in this zone. By default, huge pages are not allocated from this zone, but allocation can be enabled either by setting the vm.hugepages_treat_as_movable sysctl or by using the hugeadm --enable-zone-movable switch.
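As an illustration, booting with the following parameter would restrict the kernel's own allocations to 1G of memory, placing the remainder in the Movable zone (the split chosen here is arbitrary):

    kernelcore=1G

Allocation of huge pages from that zone can then be enabled at runtime:

    $ hugeadm --enable-zone-movable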

6 Summary

In this chapter, four sets of system tunables were described. These relate to the allocation of huge pages, use of hugetlbfs filesystems, the use of shared memory, and simplifying the allocation of huge pages when dynamic pool sizing is in use. Once the administrator has made a choice, it should be implemented as part of a system initialization script. In the next chapter, it will be shown how some common benchmarks can be easily converted to use huge pages.



Huge pages part 3: Administration

Posted Mar 4, 2010 16:59 UTC (Thu) by nix (subscriber, #2304)

I wonder how much of this will be rendered less important by transparent hugepages, when they're accepted? Will hugetlbfs ever be considered a legacy thing for the sake of those rare apps that need it?

Huge pages part 3: Administration

Posted Mar 4, 2010 17:32 UTC (Thu) by mel (subscriber, #5484)

> I wonder how much of this will be rendered less important by transparent hugepages, when
> they're accepted? Will hugetlbfs ever be considered a legacy thing for the sake of those rare
> apps that need it?

Not much of it is rendered unimportant in the event transparent support
gets merged. Transparent support solves one particular problem well but it is far from covering
all the bases. There are some important limitations to transparent support including:

1. backing text/data with huge pages (important for scientific apps written in Fortran).
Whatever about backing data, backing text would require significant changes to how the
page cache lookup is implemented in Linux, which is not something done lightly
2. shared memory (databases being the biggest consumer)
3. using huge pages beyond what the page allocator provides - 1G on AMD or 16G on powerpc
4. not all architectures can implement transparent huge page support. X86 32-bit cannot because of
a page flag limitation (could be worked around), powerpc has limitations on where huge
pages can be placed (technically could be worked around but it would be of immense
difficulty), and IA-64 would need major rearchitecting to change its pagetable format with
corresponding changes to the core before it could even consider transparent support.
Only X86-64 works right now but conceivably SPARC and sh support could be added
without too much hair loss

Call me biased, but the stuff covered in these articles will be important for a long time yet.

Huge pages part 3: Administration

Posted Mar 5, 2010 0:18 UTC (Fri) by nix (subscriber, #2304)

I see what you mean. I hadn't spotted that the current transparent
hugepage support was restricted to data pages, and item 4 is a killer.
(Well, it's a killer as long as Linux cares about 'minority'
architectures, which I hope it always will, although the really old ones
like sparc32 and voyager do eventually die off because there's nobody left
who cares about them).

Huge pages part 3: Administration

Posted Mar 11, 2010 19:28 UTC (Thu) by quartz (guest, #37351)

Is there any place where we can see the page sizes and TLB cache sizes for
different processors in tabular form?

I've been trying to look for it since a commenter on a previous article
mentioned that some x86-32 processors only had 4 pages in the TLB cache...

Huge pages part 3: Administration

Posted Mar 11, 2010 22:26 UTC (Thu) by PaXTeam (subscriber, #24616)

no longer up-to-date: http://sandpile.org/

Huge pages part 3: Administration

Posted Jun 21, 2011 13:20 UTC (Tue) by ingomueller.net (guest, #75817)

It took me some time to enable 1GB pages on my x86_64 machine. There are two essential steps:

1) Make sure the processor supports 1GB pages. If the CPU supports 2MB pages, it has the PSE cpuinfo flag; for 1GB pages it has the PDPE1GB flag. /proc/cpuinfo shows whether the two flags are set.

    % grep pdpe1gb /proc/cpuinfo | uniq
    flags           : [...] pdpe1gb [...]

At this point, they are not yet activated:

    % hugeadm --pool-list
          Size  Minimum  Current  Maximum  Default
       2097152        0        0        0        *

2) Enable the support at boot time. The following kernel boot parameters enable 1GB pages and create a pool of one 1GB page:

    hugepagesz=1GB hugepages=1

Now, after booting with the above options, we have a 1GB page pool with one page:

    % hugeadm --pool-list
          Size  Minimum  Current  Maximum  Default
       2097152        0        0        0        *
    1073741824        1        1        1

See here for more on the kernel options: http://www.fogproject.org/wiki/index.php?title=Kernel_Parameters
