
Huge pages part 2: Interfaces

February 24, 2010

This article was contributed by Mel Gorman

In an ideal world, the operating system would automatically use huge pages where appropriate, but there are a few problems. First, the operating system must decide when it is appropriate to promote base pages to huge pages, which requires maintaining metadata whose cost may or may not be offset by the benefits of huge pages. Second, there can be architectural limitations that prevent a different page size from being used within an address range once one page has been inserted. Finally, differences in TLB structure make it problematic to predict how many huge pages can be used while still being of benefit.

For these reasons, with one notable exception, operating systems provide a more explicit interface for huge pages to user space. It is up to application developers and system administrators to decide how they are best used. This article will cover the interfaces that exist for Linux.

1 Shared Memory

One of the oldest interfaces backs shared memory segments created by shmget() with huge pages. Today, it is commonly used due to its simplicity and the length of time it has been supported. Huge pages are requested by specifying the SHM_HUGETLB flag and ensuring the size is huge-page-aligned. Examples of how to do this are included in the kernel source tree under Documentation/vm/hugetlbpage.txt.
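For illustration, below is a minimal sketch along the lines of the kernel documentation example. The 16MB request size is arbitrary (it must be a multiple of the huge page size), and SHM_HUGETLB is defined by hand in case the C library headers of the day do not yet expose it:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000	/* value from the kernel headers */
#endif

/* 16MB: a multiple of the default huge page size on most systems */
#define LENGTH (16UL * 1024 * 1024)

int main(void)
{
	unsigned long i;
	char *shmaddr;
	int shmid;

	shmid = shmget(IPC_PRIVATE, LENGTH,
			SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
	if (shmid < 0) {
		perror("shmget");
		exit(EXIT_FAILURE);
	}

	shmaddr = shmat(shmid, NULL, 0);
	if (shmaddr == (char *)-1) {
		perror("shmat");
		shmctl(shmid, IPC_RMID, NULL);
		exit(EXIT_FAILURE);
	}

	/* Touch the segment so the huge pages are actually faulted in */
	for (i = 0; i < LENGTH; i += 4096)
		shmaddr[i] = 1;

	shmdt(shmaddr);
	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}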

A limitation of this interface is that only the default huge page size (as indicated by the Hugepagesize field in /proc/meminfo) will be used. If one wanted to use 16GB pages as supported on later versions of POWER, for example, the default_hugepagesz= parameter must be specified on the kernel command line, as documented in Documentation/kernel-parameters.txt in the kernel source.

The maximum amount of memory that can be committed to shared-memory huge pages is controlled by the shmmax sysctl parameter. This parameter will be discussed further in the next installment.

2 HugeTLBFS

For the creation of shared or private mappings, Linux provides a RAM-based filesystem called "hugetlbfs." Every file on this filesystem is backed by huge pages and is accessed with mmap() or read(). If no options are specified at mount time, the default huge page size is used to back the files. To use a different page size, specify pagesize=.

    $ mount -t hugetlbfs none /mnt/hugetlbfs -o pagesize=64K

There are two ways to control the amount of memory which can be consumed by huge pages attached to a mount point. The size= mount parameter specifies (in bytes; the "K," "M," and "G" suffixes are understood) the maximum amount of memory which will be used by this mount. The size is rounded down to the nearest huge page size. It can also be specified as a percentage of the static huge page pool, though this option appears to be rarely used. The nr_inodes= parameter limits the number of files that can exist on the mount point which, in effect, limits the number of possible mappings. In combination, these options can be used to divvy up the available huge pages to groups or users in a shared system.
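The same limits can also be applied from within a program by calling mount(2) directly, as in the sketch below (which must run as root); the mount point and the particular limits in the option string are invented for the example:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Cap this mount at 512MB of 2MB pages and 16 files; the
	 * path and limits here are illustrative only */
	if (mount("none", "/mnt/hugetlbfs", "hugetlbfs", 0,
			"pagesize=2M,size=512M,nr_inodes=16") < 0) {
		perror("mount");
		return 1;
	}
	return 0;
}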

Hugetlbfs is a bare interface to the huge page capabilities of the underlying hardware; taking advantage of it requires application awareness or library support. Libhugetlbfs makes heavy use of this interface when automatically backing regions with huge pages.

For an application wishing to use the interface, the initial step is to discover the mount point by either reading /proc/mounts or using libhugetlbfs. Finding the mount point manually is relatively straightforward and already well known for debugfs but, for completeness, a very simple example program is shown below:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/param.h>

char *find_hugetlbfs(char *fsmount, int len)
{
	char format[256];
	char fstype[256];
	char *ret = NULL;
	FILE *fd;

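	/*
	 * Build a scanf() format that skips the device name, reads the
	 * mount point (at most len characters) and filesystem type, and
	 * skips the remaining fields of each /proc/mounts line.
	 */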
	snprintf(format, 255, "%%*s %%%ds %%255s %%*s %%*d %%*d", len);

	fd = fopen("/proc/mounts", "r");
	if (!fd) {
		perror("fopen");
		return NULL;
	}

	while (fscanf(fd, format, fsmount, fstype) == 2) {
		if (!strcmp(fstype, "hugetlbfs")) {
			ret = fsmount;
			break;
		}
	}

	fclose(fd);
	return ret;
}

int main(void)
{
	char buffer[PATH_MAX+1];

	if (find_hugetlbfs(buffer, PATH_MAX))
		printf("hugetlbfs mounted at %s\n", buffer);
	else
		printf("hugetlbfs not mounted\n");
	return 0;
}

When there are multiple mount points (to make different page sizes available), matters get more complicated; libhugetlbfs provides a number of functions to help with these mount points. hugetlbfs_find_path() returns a mount point similarly to the example program above, while hugetlbfs_find_path_for_size() returns a mount point for a specific huge page size. To test whether a particular path is on hugetlbfs, use hugetlbfs_test_path().
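A short sketch of how these helpers might be used follows; the 64K page size is just an example, and the program must be linked with -lhugetlbfs:

#include <stdio.h>
#include <hugetlbfs.h>

int main(void)
{
	const char *path;

	/* A mount point backed by the default huge page size */
	path = hugetlbfs_find_path();
	if (path)
		printf("default mount: %s (hugetlbfs: %d)\n",
			path, hugetlbfs_test_path(path));

	/* A mount point for 64K pages specifically, if one exists */
	path = hugetlbfs_find_path_for_size(64 * 1024);
	if (path)
		printf("64K mount: %s\n", path);

	return 0;
}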

3 Anonymous mmap()

As of kernel 2.6.32, anonymous mappings can be backed by huge pages by passing the flags MAP_ANONYMOUS|MAP_HUGETLB to mmap(). These mappings can be private or shared. It is something of an oversight that the amount of memory that can be pinned this way is limited only by huge page availability; this potential problem may be addressed in future kernel releases.
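A minimal sketch of such a mapping is shown below; MAP_HUGETLB is defined by hand because C libraries contemporary with 2.6.32 may not yet expose it:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000	/* x86 value from the kernel headers */
#endif

/* Length must be a multiple of the default huge page size */
#define LENGTH (16UL * 1024 * 1024)

int main(void)
{
	void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (addr == MAP_FAILED) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}

	/* The mapping is usable like any other anonymous mapping */
	munmap(addr, LENGTH);
	return 0;
}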

4 libhugetlbfs Allocation APIs

It is recognised that a number of applications simply want a buffer backed by huge pages. To facilitate this, libhugetlbfs has provided two APIs since release 2.0: get_hugepage_region() and get_huge_pages(), with corresponding free functions free_hugepage_region() and free_huge_pages(). All four are documented in manual pages distributed with the libhugetlbfs package.

get_huge_pages() is intended for use in the development of custom allocators and not as a drop-in replacement for malloc(). The size parameter to this API must be hugepage-aligned; the huge page size can be discovered with the function gethugepagesize().

If an application wants to allocate a number of very large buffers but is not concerned with alignment or some wastage, it should use get_hugepage_region(). The calling convention for this function is much more relaxed and it will optionally fall back to using small pages if necessary.
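A brief sketch contrasting the two allocation calls is shown below; the buffer sizes are arbitrary, and GHP_DEFAULT and GHR_FALLBACK are flags from the libhugetlbfs 2.0 API:

#include <stdio.h>
#include <hugetlbfs.h>

int main(void)
{
	long hpage_size = gethugepagesize();
	void *buf, *region;

	/* get_huge_pages() insists on a hugepage-aligned length... */
	buf = get_huge_pages(hpage_size, GHP_DEFAULT);
	if (buf) {
		/* ... use the buffer ... */
		free_huge_pages(buf);
	}

	/* ... while get_hugepage_region() rounds the request up and,
	 * with GHR_FALLBACK, uses base pages if huge pages are scarce */
	region = get_hugepage_region(1000000, GHR_FALLBACK);
	if (region) {
		/* ... use the region ... */
		free_hugepage_region(region);
	}

	return 0;
}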

It is possible that applications need very tight control over how the mapping is placed in memory. If this is the case, libhugetlbfs provides hugetlbfs_unlinked_fd() and hugetlbfs_unlinked_fd_for_size() to create a file descriptor representing an unlinked file on a suitable hugetlbfs mount point. Using the file descriptor, the application can call mmap() with the appropriate parameters for accurate placement.
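As a sketch, the descriptor might be used something like this; the address hint is purely illustrative, and MAP_FIXED could be used instead when exact placement is required:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <hugetlbfs.h>

int main(void)
{
	long hpage_size = gethugepagesize();
	void *addr;
	int fd;

	/* A file descriptor for an unlinked file on a suitable mount */
	fd = hugetlbfs_unlinked_fd();
	if (fd < 0) {
		perror("hugetlbfs_unlinked_fd");
		exit(EXIT_FAILURE);
	}

	/* Control placement with an address hint (illustrative value) */
	addr = mmap((void *)0x60000000UL, hpage_size,
			PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		perror("mmap");
	else
		munmap(addr, hpage_size);

	close(fd);
	return 0;
}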

Converting existing applications and libraries to use the API where applicable should be straightforward, but basic examples of how to do it with the STREAM memory bandwidth benchmark suite are available [gorman09a].

5 Automatic Backing of Memory Regions

While applications can be modified to use any of the interfaces, doing so imposes a significant burden on the application developer. To make life easier, libhugetlbfs can back a number of memory region types automatically when it is either pre-linked or pre-loaded. This process is described in the HOWTO documentation and manual pages that come with libhugetlbfs.

Once loaded, libhugetlbfs's behaviour is determined by environment variables described in the libhugetlbfs.7 manual page. As manipulating environment variables is time-consuming and error-prone, a hugectl utility exists that does much of the configuration automatically and will report the steps it would take if the --dry-run switch is specified.

To determine whether huge pages are really being used, /proc can be examined, but libhugetlbfs will also warn if the verbosity is set sufficiently high and insufficient huge pages are available. Below is an example of running a simple program that backs a 32MB segment with huge pages. Note how the first attempt to use huge pages failed, and some configuration was required, as no huge pages had previously been configured on this system.

The manual pages are quite comprehensive, so this section gives only a brief introduction to how different sections of memory can be backed by huge pages without modification.

  $ ./hugetlbfs-shmget-test 
  shmid: 0x2130007
  shmaddr: 0xb5e37000
  Starting the writes: ................................
  Starting the Check...Done.

  $ hugectl --shm ./hugetlbfs-shmget-test
  libhugetlbfs: WARNING: While overriding shmget(33554432) to add
                         SHM_HUGETLB: Cannot allocate memory
  libhugetlbfs: WARNING: Using small pages for shmget despite
                         HUGETLB_SHM
  shmid: 0x2128007
  shmaddr: 0xb5d57000
  Starting the writes: ................................
  Starting the Check...Done.

  $ hugeadm --pool-pages-min 4M:32M
  $ hugectl --shm ./hugetlbfs-shmget-test 
  shmid: 0x2158007
  shmaddr: 0xb5c00000
  Starting the writes: ................................
  Starting the Check...Done.

5.1 Shared Memory

When libhugetlbfs is preloaded or linked and the environment variable HUGETLB_SHM is set to yes, libhugetlbfs will override all calls to shmget(). Alternatively, launch the application with hugectl --shm. Once set up, all shmget() requests will be aligned to a huge page boundary and backed with huge pages if possible. If the system configuration does not allow huge pages to be used, the original request is honoured.

5.2 Heap

Glibc defines a __morecore hook that is called when the heap size needs to be increased; libhugetlbfs uses this hook to create regions of memory backed by huge pages. As with shared memory, base pages are used when huge pages are not available.

When libhugetlbfs is preloaded or linked and the environment variable HUGETLB_MORECORE is set to yes, libhugetlbfs will configure the __morecore hook, causing malloc() requests to use huge pages. Alternatively, launch the application with hugectl --heap.

Unlike shared memory, the page size can also be specified if more than one page size is supported by the system. The first example below uses the default page size (e.g. 16M on Power5+) and the second explicitly overrides the default, using 64K pages.

    $ hugectl --heap ./target-application
    $ hugectl --heap=64k ./target-application

If the application has already been linked with libhugetlbfs, it may be necessary to specify --no-preload when using --heap so that an attempt is not made to load the library twice.

By using the __morecore hook and setting the mallopt() option M_MMAP_MAX to zero, libhugetlbfs prevents glibc from making use of brk() to expand the heap. An application that calls brk() directly will be using base pages.

If a custom memory allocator is being used, it must support the __morecore hook to use huge pages. An alternative would be to provide a wrapper around malloc() that calls either the real underlying malloc() or get_hugepage_region(), depending on the size of the buffer; a rough sketch follows. A heavyweight solution would be to provide a fully-fledged implementation of malloc() with libhugetlbfs that uses huge pages where appropriate, but this is currently unavailable due to the lack of a demonstrable use case.
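To illustrate the wrapper approach, here is a rough sketch; the threshold and the one-word header used to remember which allocator supplied a buffer are invented for the example:

#include <stdlib.h>
#include <hugetlbfs.h>

/* Size above which huge pages are attempted; an arbitrary choice */
#define HUGE_THRESHOLD	(4UL * 1024 * 1024)

struct alloc_header {
	int from_hugetlbfs;	/* records which free function to use */
};

void *wrapped_malloc(size_t size)
{
	struct alloc_header *h = NULL;

	if (size >= HUGE_THRESHOLD)
		h = get_hugepage_region(size + sizeof(*h), GHR_FALLBACK);

	if (h) {
		h->from_hugetlbfs = 1;
	} else {
		h = malloc(size + sizeof(*h));
		if (!h)
			return NULL;
		h->from_hugetlbfs = 0;
	}

	/* Hand back the memory just past the bookkeeping header */
	return h + 1;
}

void wrapped_free(void *ptr)
{
	struct alloc_header *h;

	if (!ptr)
		return;

	h = (struct alloc_header *)ptr - 1;
	if (h->from_hugetlbfs)
		free_hugepage_region(h);
	else
		free(h);
}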

5.3 Text and Data

Backing text or data is more involved because the application must first be relinked to align the sections to a huge page boundary. This is accomplished by linking against libhugetlbfs and specifying -Wl,--hugetlbfs-align, assuming the installed version of binutils is sufficiently recent. More information on relinking applications can be found in the libhugetlbfs HOWTO. Once the application is relinked, control is through environment variables or hugectl as before.

    $ hugectl --text --data --bss ./target-application

When backing text or data with huge pages, the relevant sections are copied to files on the hugetlbfs filesystem and mapped with mmap(). The files are then unlinked so that the memory is freed on application exit. If the application is to be invoked multiple times, it is worth sharing that data by specifying the --share-text switch. The consequence is that the memory remains in use when the application exits and must be manually deleted.

If it is not possible to relink the application, the loading of segments backed by huge pages can be forced by setting the environment variable HUGETLB_FORCE_ELFMAP to yes. This is not the preferred option, as the method is not guaranteed to work: segments must be large enough to overlap with a huge page and, on architectures with limitations on where segments can be placed, it can be particularly problematic.

5.4 Stack

Currently, the stack cannot be backed by huge pages. Support was implemented in the past, but the vast majority of applications did not use the stack aggressively, and many distributions set ulimits on stack size that are smaller than a huge page. Upon investigation, only the bwaves test from the SPEC CPU 2006 benchmark benefited from huge-page-backed stacks, and only then when using a commercial compiler; when compiled with gcc, there was no benefit, so support was dropped.

6 Summary

Linux provides a small number of interfaces for accessing huge pages. While developing applications against them directly can be cumbersome, a programming API is available through libhugetlbfs, and segments of memory can be automatically backed by huge pages without application modification. The next installment will discuss how the system should be tuned.



Huge pages part 2: Interfaces

Posted Feb 25, 2010 18:03 UTC (Thu) by madscientist (subscriber, #16861)

Mel, I just wanted to say "thanks" for these and please keep them coming: I'm greatly enjoying them. I have a particular interest in these issues with some of the work I'm doing so this is very timely.

Can't wait for next week!

Huge pages part 2: Interfaces

Posted Feb 26, 2010 3:48 UTC (Fri) by skissane (subscriber, #38675)

Why did the "commercial compiler" (whichever one it is) produce code that worked better with hugepage stacks but GCC didn't? Could GCC be enhanced to produce code that would take advantage of them as that other compiler seems capable of doing?

Huge pages part 2: Interfaces

Posted Feb 26, 2010 18:16 UTC (Fri) by cesarb (subscriber, #6266)

It is possible that the "commercial compiler" is using the stack more than gcc, and thus is more affected by TLB slowdowns on the stack area.

Huge pages part 2: Interfaces

Posted Feb 26, 2010 6:21 UTC (Fri) by dkg (subscriber, #55359)

The article says:
If one wanted to use 16GB pages as supported on later versions of POWER for example
But surely it means 16MB?

hugetlbpage.txt makes no mention of an arch with 16GB page sizes.

Huge pages part 2: Interfaces

Posted Feb 26, 2010 9:12 UTC (Fri) by hppnq (guest, #14462)

No, you can actually have 16GB pages on recent power systems. See here for instance.

Huge pages part 2: Interfaces

Posted Mar 5, 2010 5:54 UTC (Fri) by glennewton (guest, #64085)

I've collected a number of excellent huge page resources for mostly Linux, but also a few for Solaris, AIX, as well as its impact on architectures (AMD, Sparc) and software: Java, MySql, Oracle. http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds