User: Password:
|
|
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

The 2.6.33 kernel is out, released on February 24. Linus says:

The most noticeable features in 2.6.33 are likely the Nouveau and DRBD integration (and a _lot_ more people will notice the Nouveau part of that). And the Radeon KMS parts aren't considered experimental any more. Oh, and the AS IO scheduler is gone, since keeping it around and just causing confusion seemed to not be worth it any more. You're supposed to use CFQ instead.

Other interesting stuff in 2.6.33 includes dynamic tracing, the block I/O bandwidth controller, and the compressed cache mechanism.

See the KernelNewbies 2.6.33 page for more information on this release.

The current stable kernel is 2.6.32.9, released on February 23. There are 93 fixes in this update, many of which are security-related. See below for our detailed look at this release.

Comments (4 posted)

Quotes of the week

Course this is all completely useless, but it would be if the locks were inline (which is actually an askable question now). There was just so much awesomeness going on with the 64-bit rwsem constructs I felt I had to add even more awesomeness to the plate. For some definition of awesomeness.
-- Zachary Amsden

So I'm going to stop being so predictable that people can tell that exactly two weeks after the last release is where the merge window closes, and if people want to make sure their stuff merged, I had better have a merge request in my inbox earlier than thirteen days after the release.
-- Linus Torvalds

Comments (none posted)

Open by handle

By Jonathan Corbet
February 23, 2010
Most Linux users never deal directly with file handles; indeed, most may not even know they exist. Of the rest, the bulk will have an experience limited to the cheery "stale file handle" message seen by NFS users at horribly inopportune times. In fact, a file handle is just a means by which a file can be uniquely identified within a filesystem. Handles are used in NFS, for example, to represent an open file in a way which allows the server to be almost entirely stateless. Handles are not normally used by, or even available to user-space applications.

Aneesh Kumar is trying to change that situation with a short patch series adding two new system calls:

    int name_to_handle(const char *name, struct file_handle *handle);
    int open_by_handle(struct file_handle *handle, int flags);

The first takes the given name and looks up the associated file handle, which is returned in the handle structure. That handle can then be passed to open_by_handle() to get an open file descriptor for the file. Only privileged users can call open_by_handle(); otherwise it could be possible for a malicious local user to bypass the normal permission checks on the directories in the path to a specific file.

Why would an application developer want to open a file in two steps instead of just calling open()? It comes down to the ability to write filesystem servers that run in user space. Such a server could use name_to_handle() to generate handles for files on the underlying filesystem; those handles are then passed to the filesystem's clients. At some future time, the client can pass the handle back to actually open the file. This type of feature is also already used with the XFS filesystem for backup and restore operations and with a hierarchical storage management system.

Discussion of these system calls has been minimal, thus far. It does seem that some work will be needed still to better describe what a file handle really is, and, in particular, what its expected lifetime will be. Without some clarity in that area, it will be hard to write applications which can make proper use of file handles.

Comments (6 posted)

Reserved network ports

By Jonathan Corbet
February 24, 2010
It is not all that uncommon to have a network application which needs to be able to bind to a specific port. Often, such requirements result from a firewall configuration allowing incoming connections only to a specific port, but there can be other reasons as well. When running such an application, it can be unpleasant to discover that somebody else's long-running ssh connection happened to stumble onto the required port. It would be nice to be able to avoid this kind of conflict if at all possible.

This patch set from Octavian Purdila aims to make it possible. It adds a new sysctl knob (called ip_local_reserved_ports) under /proc/sys/net/ipv4. Should the system administrator write a comma-separated list of ports (or ranges of ports denoted by a hyphen) to this parameter, the networking layer will avoid those ports whenever it picks a port number for a new socket. Reserving ports in this manner will not interfere with any application which binds to those ports explicitly.

This patch has been through a surprising number of revisions; chances seem good that it will show up in the mainline once the 2.6.34 merge window opens.

Comments (16 posted)

Kernel development news

A Checkpoint/restart update

By Jonathan Corbet
February 24, 2010
It has been exactly one year since LWN last checked up on the checkpoint/restart patch set. This code has just been reposted with a request for inclusion into the -mm tree, so it seems like an opportune time to restart our coverage of it. A lot of progress has been made on this front over the last year, but checkpoint/restart remains a difficult task which can probably never be implemented completely.

"Checkpointing" refers to the act of saving the state of a group of processes to a file, with the intent of restarting those processes at some future time. For many years, checkpointing has been used to save the state of long-running calculations to avoid losing work should the system fail. More recently, it has become a desired part of the virtualization toolkit, enabling the live migration of processes between physical hosts. The checkpoint/restart developers also see other potential advantages, such as the ability to quickly launch a set of processes on demand from a checkpoint image.

This patch set addresses checkpoint/restart in the containers context. In the context of full virtualization, checkpointing is relatively easy; the system just needs to save the entire memory image associated with the virtual machine and a bit of associated data. The "containers" model of virtualization tends to be messier in almost every way, and checkpointing is no exception. There is no memory image to be saved in one big chunk; instead, the kernel must track down every bit of state associated with the checkpointed processes and save it independently. When it works, it can be faster and more efficient than full virtual machine checkpointing; the checkpoint image will be much smaller. But getting it to work is a challenge. The complexity of this task can be seen in the current checkpoint/restart tree, which, despite being far from a complete solution of the problem, is a 27,000-line diff from 2.6.33-rc8.

Checkpointing

To checkpoint a group of processes, the following new system call is used:

    int checkpoint(pid_t pid, int fd, unsigned long flags, int logfd);

The pid parameter identifies the top-level process to be checkpointed; all children of that process will also be included in the checkpoint image, which will be written to the file indicated by fd. There is currently only one possible flag value, CHECKPOINT_SUBTREE, which turns off the normal requirement that an entire container be checkpointed as a whole. Checkpointing just a subtree is a bit riskier than checkpointing a full container because it is harder to ensure that all needed resources have been saved. The logfd parameter is file descriptor open for writing; the kernel will write relevant logging information there. There are vast numbers of possible ways for a checkpoint to fail; the log file is intended to help users figure out what is happening when a checkpoint refuses to work. If logging is not desired, logfd can be -1.

The set of processes to be checkpointed should be frozen prior to the call to checkpoint(). One exception to that rule is a process running in checkpoint() itself; this exception allows processes to checkpoint themselves.

Internally, the checkpointing process is implemented as a two-phase operation:

  • The kernel traverses the tree of processes and "collects" every object which is to be a part of the checkpoint image. Essentially, "collecting" means building a hash table with an entry for every process, every open file, every virtual memory area, every open socket, etc. which must be saved. Scanning the tree in this way helps the kernel to abort the checkpoint process early if something which cannot be checkpointed is encountered. Just as importantly, the collecting process also lets the system track objects which have multiple references and ensure that they are only written to the image file once.

  • The second pass then iterates over the collected objects and causes each to be serialized and written to the image file.

Once this is done, the checkpoint is finished. The just-checkpointed processes can either go on with their business or be killed, depending on the reason for the checkpoint.

These two phases are reflected in the changes made to the lower levels of the system. For example, the none-too-svelte file_operations structure gains two new operations:

    int collect(struct ckpt_ctx *ctx, struct file *filp);
    int checkpoint(struct ckpt_ctx *ctx, struct file *filp);

The collect() operation should identify every object which must be saved, eventually passing each to ckpt_obj_collect() (or one of several higher-level interfaces) for tracking. Later, a call to checkpoint() is made to request that the given filp be serialized for saving to the checkpoint image. Similar methods have been added to a number of other structure types, including vm_operations_struct and proto_ops.

The serialization process requires copying data from kernel data structures into a series of special structures intended to be written to the image file. So, for example, a file descriptor finds its way from struct fdtable into one of these:

    struct ckpt_hdr_file_desc {
	struct ckpt_hdr h;
	__s32 fd_objref;
	__s32 fd_descriptor;
	__u32 fd_close_on_exec;
    } __attribute__((aligned(8)));

Doing this copy requires a 75-line function which grabs the requisite information and very carefully checks that everything can be checkpointed successfully. In this case, the presence of locks on the file or an owner (to be notified with SIGIO) will cause the checkpoint to fail. In the absence of such roadblocks, the completed structure is handed to the checkpoint code for saving to the image file.

This serialization process is one of the scarier parts of the whole checkpoint/restart concept. Any changes to struct fdtable will almost certainly break this serialization, quite possibly in ways which will not be detected until some user runs into a problem. Even if a VFS developer cared about checkpointing, they might not think to look at the code in checkpoint/files.c to see if anything might require changing. Similar dependencies are created for every other kernel data structure which must be checkpointed. The whole setup looks like it could be a little fragile; keeping it working is almost certain to require significant ongoing maintenance.

Restarting

On the restart side, the application performing the restart is first expected to create a set of processes to be animated with the checkpointed information. That creation will be done with the much-reviewed "extended clone()" system call, which, in this iteration, looks like:

    int eclone(u32 flags_low, struct clone_args *cargs, int cargs_size,
	       pid_t *pids);

With eclone(), the processes can be created with specific pids and with an extended set of flags.

Once the process hierarchy exists, the restart() system call can be used:

    int restart(pid_t pid, int fd, unsigned long flags, int logfd);

The checkpoint image found at fd will be restored into the process hierarchy starting at pid. Once again, logfd can be used to gain information on how the process went. There are a number of flags defined: RESTART_TASKSELF (a single task is being restarted on top of the process calling restart()), RESTART_FROZEN (causes the restarted processes to be left frozen at the end), RESTART_GHOST (appears to be a debugging feature), RESTART_KEEP_LSM (restore security labels too), and RESTART_CONN_RESET (force the closing of open sockets). On a successful return from restart(), the process hierarchy should be ready to go.

Once again, restart requires support at the lower levels of the kernel. So our long-suffering file_operations structure gains another function:

    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *);

This function (along with its analogs elsewhere in the kernel) is charged with reanimating the given object from the checkpoint file.

Security

It is not hard to imagine that these new system calls could have any of a number of security-related consequences, so it is surprising to see that, in the current implementation, both checkpoint() and restart() are unprivileged operations. This decision was made deliberately, with the idea of forcing the developers to think about security issues from the outset.

The biggest potential problem with checkpoint() is probably information disclosure. To avoid this problem, checkpoint() is only able to checkpoint processes which the caller would be able to call ptrace() on. So there should be no way for a hostile user to gain information from a checkpoint image which would not be available anyway.

The restart side is a little more frightening; it allows the caller to load vast amounts of potentially arbitrary data into kernel data structures. This risk is, one hopes, mitigated by causing all operations to be done in the context of the calling process. If the caller cannot open a file directly, that file cannot be opened via a corrupted checkpoint image either. Doing things this way will break certain use cases, such as checkpointing a setuid program which has since dropped its privileges, but there is probably no way to make that case work securely for unprivileged users.

For an added challenge, the checkpoint/restart developers have also implemented the checkpointing of security labels on objects. By default, these labels will not be used during the restart process, but the RESTART_KEEP_LSM flag can change that. Again, the labels are created in the context of the calling process, so the active security module should prevent the attachment of labels which compromise the security of the system.

Even with these measures in place, one still has to wonder about the security of the process as a whole. The kernel is populating a wide array of data structures from input which may be under the control of a hostile user; it is not hard to imagine that, somewhere in tens of thousands of lines of checkpoint/restart code, an important check has not been made. Perhaps as a result of this concern, the patch set adds a sysctl knob which can be set to disallow unprivileged checkpoint/restart operations.

Where things stand

According to the most recent patch posting:

This one is able to checkpoint/restart screen and vnc sessions, and live-migrate network servers between hosts. It also adds support for x86-64 (in addition to x86-32, s390x and powerpc).

So the patch set appears to be sufficiently functional to be minimally useful. There are, however, a lot of things which can stil prevent the creation of a successful checkpoint; they are summarized on this page. Problem areas include private filesystem mounts, network sockets in some states, open-but-unlinked files, use of any of the file event notification interfaces, open files on network or FUSE filesystems, use of netlink, ptrace(), asynchronous I/O, and more. There are patches in the works for some of these problems; others look hard.

As of this writing, there has been no response to the developers' request for inclusion in the -mm kernel. In the past, there have been concerns about how much work would be required to finish the job. Over the last year, much of that work is done, but checkpoint/restart looks like a job which is never truly finished. It's mostly a matter of whether what has been done so far appears to be good enough for real work, and whether the maintenance cost of this code is deemed to be worth paying.

Comments (10 posted)

2.6.32.9 Release notes

By Jonathan Corbet
February 21, 2010
Stable kernel update announcements posted on LWN have a certain tendency to be followed by complaints about the amount of information which is made available. It seems that there is a desire for a description of the changes which is more accessible than the patches themselves, and for attention to be drawn to the security-relevant fixes. As an exercise in determining what kind of effort is being asked of the kernel maintainers, your editor decided to make a pass through the proposed 2.6.32.9 update and attempt to describe the impact of each of the changes - all 93 of them. The results can be found below.

Disclaimers: there is no way to review 93 patches in a finite time and fully understand each of them. So there are probably certainly errors in what follows. The simple truth of the matter is that it is very hard to say which fixes have security implications; a determined attacker can find a way to exploit some very obscure bugs.

Your editor would also like to discourage anybody from thinking that this will become a regular LWN feature. The amount of work required is considerable; it's not something we're able to commit to doing for every release.

That said, here's a look at what's in this update.

Security-related fixes

Other bug fixes

Enhancements

Conclusions

Out of 93 patches, 18 struck your editor as having clear security implications. Quite a few other patches fix crashes which could possibly be security problems; if they are not listed as such, it's because there was no immediately evident way that a user could trigger the problem. Doubtless people with more imagination will figure out ways to take advantage of some of these bugs.

What it comes down to is that the identification of security problems is often hard. In the kernel environment, almost any bug could potentially create some kind of vulnerability. So it is not surprising to see developers "silently fix" security bugs; they simply fix bugs without realizing the implications. It is also not surprising that some developers are reluctant to call attention to security-related fixes. The list above almost certainly includes "security fixes" for bugs that nobody can exploit while classifying true vulnerabilities as mere bug fixes. Any list of security-relevant patches is sure to be an incomplete and partially deceptive thing.

That said, it may well be that fixes which are truly known to have security implications should be marked as such. Attackers will make the effort to figure that out anyway; it's not clear that making life harder for everybody else has any benefits. Still, those who would complain about how the stable tree is managed would do well to remember that, a few years ago, we had no such tree. It came into being because people stepped up to do the work of maintaining it. There can be no doubt that a better job could be done here (as is the case almost everywhere else too); its just a matter of somebody finding the time and the energy to do it.

Comments (95 posted)

Huge pages part 2: Interfaces

February 24, 2010

This article was contributed by Mel Gorman

In an ideal world, the operating system would automatically use huge pages where appropriate, but there are a few problems. First, the operating system must decide when it is appropriate to promote base pages to huge pages requiring the maintenance of metadata which, itself, has an associated cost which may or may not be offset by the use of huge pages. Second, there can be architectural limitations that prevent a different page size being used within an address range once one page has been inserted. Finally, differences in TLB structure make predicting how many huge pages can be used and still be of benefit problematic.

For these reasons, with one notable exception, operating systems provide a more explicit interface for huge pages to user space. It is up to application developers and system administrators to decide how they best be used. This chapter will cover the interfaces that exist for Linux.

1 Shared Memory

One of the oldest interfaces backs shared memory segments created by shmget() with huge pages. Today, it is commonly used due to its simplicity and the length of time it has been supported. Huge pages are requested by specifying the SHM_HUGETLB flag and ensuring the size is huge-page-aligned. Examples of how to do this are included in the kernel source tree under Documentation/vm/hugetlbpage.txt.

A limitation of this interface is that only the default huge page size (as indicated by the Hugepagesize field in /proc/meminfo) will be used. If one wanted to use 16GB pages as supported on later versions of POWER for example, the default_hugepagesz= field must be used on the kernel command line as documented in Documentation/kernel-parameters.txt in the kernel source.

The maximum amount of memory that can be committed to shared-memory huge pages is controlled by the shmmax sysctl parameter. This parameter will be discussed further in the next installment.

2 HugeTLBFS

For the creation of shared or private mappings, Linux provides a RAM-based filesystem called "hugetlbfs." Every file on this filesystem is backed by huge pages and is accessed with mmap() or read(). If no options are specified at mount time, the default huge page size is used to back the files. To use a different page size, specify pagesize=.

    $ mount -t hugetlbfs none /mnt/hugetlbfs -o pagesize=64K

There are two ways to control the amount of memory which can be consumed by huge pages attached to a mount point. The size= mount parameter specifies (in bytes; the "K," "M," and "G" suffixes are understood) the maximum amount of memory which will be used by this mount. The size is rounded down to the nearest huge page size. It can also be specified as a percentage of the static huge page pool, though this option appears to be rarely used. The nr_inodes= parameter limits the number of files that can exist on the mount point which, in effect, limits the number of possible mappings. In combination, these options can be used to divvy up the available huge pages to groups or users in a shared system.

Hugetlbfs is a bare interface to the huge page capabilities of the underlying hardware; taking advantage of it requires application awareness or library support. Libhugetlbfs makes heavy use of this interface when automatically backing regions with huge pages.

For an application wishing to use the interface, the initial step is to discover the mount point by either reading /proc/mounts or using libhugetlbfs. Finding the mount point manually is relatively straightforward and already well known for debugfs but, for completeness, a very simple example program is shown below:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/param.h>

char *find_hugetlbfs(char *fsmount, int len)
{
	char format[256];
	char fstype[256];
	char *ret = NULL;
	FILE *fd;

	snprintf(format, 255, "%%*s %%%ds %%255s %%*s %%*d %%*d", len);

	fd = fopen("/proc/mounts", "r");
	if (!fd) {
		perror("fopen");
		return NULL;
	}

	while (fscanf(fd, format, fsmount, fstype) == 2) {
		if (!strcmp(fstype, "hugetlbfs")) {
			ret = fsmount;
			break;
		}
	}

	fclose(fd);
	return ret;
}

int main() {
	char buffer[PATH_MAX+1];
	printf("hugetlbfs mounted at %s\n", find_hugetlbfs(buffer, PATH_MAX));
	return 0;
}

When there are multiple mount points (to make different page sizes available), it gets more complicated; libhugetlbfs also provides a number of functions to help with these mount points. hugetlbfs_find_path() returns a mount point similar to the example program above, while hugetlbfs_find_path_for_size() will return a mount point for a specific huge page size. If the developer wishes to test a particular path to see if it hugetlbfs or not, use hugetlbfs_test_path().

3 Anonymous mmap()

As of kernel 2.6.32, support is available that allows anonymous mappings to be created backed by huge pages with mmap() by specifying the flags MAP_ANONYMOUS|MAP_HUGETLB. These mappings can be private or shared. It is somewhat of an oversight that the amount of memory that can be pinned for anonymous mmap() is limited only by huge page availability. This potential problem may be addressed in future kernel releases.

4 libhugetlbfs Allocation APIs

It is recognised that a number of applications want to simply get a buffer backed by huge pages. To facilitate this, libhugetlbfs provides two APIs since release 2.0, get_hugepage_region() and get_huge_pages() with corresponding free functions called free_hugepage_region() and free_huge_pages(). These are all provided with manual pages distributed with the libhugetlbfs package.

get_huge_pages() is intended for use with the development of custom allocators and not as a drop-in replacement for malloc(). It is required that the size parameter to this API be hugepage-aligned which can be discovered with the function gethugepagesize().

If an application wants to allocate a number of very large buffers but is not concerned with alignment or some wastage, it should use get_hugepage_region(). The calling convention to this function is much more relaxed and will optionally fallback to using small pages if necessary.

It is possible that applications need very tight control over how the mapping is placed in memory. If this is the case, libhugetlbfs provides hugetlbfs_unlinked_fd() and hugetlbfs_unlinked_fd_for_size() to create a file descriptor representing an unlinked file on a suitable hugetlbfs mount point. Using the file-descriptor, the application can mmap() with the appropriate parameters for accurate placement.

Converting existing applications and libraries to use the API where applicable should be straightforward, but basic examples of how to do it with the STREAM memory bandwidth benchmark suite are available [gorman09a].

5 Automatic Backing of Memory Regions

While applications can be modified to use any of the interfaces, it imposes a significant burden on the application developer. To make life easier, libhugetlbfs can back a number of memory region types automatically when it is either pre-linked or pre-loaded. This process is described in the HOWTO documentation and manual pages that come with libhugetlbfs.

Once loaded, libhugetlbfs's behaviour is determined by environment variables described in the libhugetlbfs.7 manual page. As manipulating environment variables is time-consuming and error-prone, a hugectl utility exists that does much of the configuring automatically and will output what steps it took if the --dry-run switch is specified.

To determine if huge pages are really being used, /proc can be examined, but libhugetlbfs will also warn if the verbosity is set sufficiently high and sufficient numbers of huge pages are not available. See below for an example of using a simple program that backs a 32MB segment with huge pages. Note how the first attempt to use huge pages failed and some configuration was required as no huge pages were previously configured on this system.

The manual pages are quite comprehensive so this section will only give a brief introduction as to how different sections of memory can be backed by huge pages without modification.

  $ ./hugetlbfs-shmget-test 
  shmid: 0x2130007
  shmaddr: 0xb5e37000
  Starting the writes: ................................
  Starting the Check...Done.

  $ hugectl --shm ./hugetlbfs-shmget-test
  libhugetlbfs: WARNING: While overriding shmget(33554432) to add
                         SHM_HUGETLB: Cannot allocate memory
  libhugetlbfs: WARNING: Using small pages for shmget despite
                         HUGETLB_SHM shmid: 0x2128007
  shmaddr: 0xb5d57000
  Starting the writes: ................................
  Starting the Check...Done.

  $ hugeadm --pool-pages-min 4M:32M
  $ hugectl --shm ./hugetlbfs-shmget-test 
  shmid: 0x2158007
  shmaddr: 0xb5c00000
  Starting the writes: ................................
  Starting the Check...Done.

5.1 Shared Memory

When libhugetlbfs is preloaded or linked and the environment variable HUGETLB_SHM is set to yes, libhugetlbfs will override all calls to shmget(). Alternatively, launch the application with hugectl $--$shm. On setup, all shmget() requests will become aligned to a hugepage boundary and backed with huge pages if possible. If the system configuration does not allow huge pages to be used, the original request is honoured.

5.2 Heap

Glibc defines a __morecore hook that is is called when the heap size needs to be increased; libhugetlbfs uses this hook to create regions of memory backed by huge pages. Similar to shared memory, base pages are used when huge pages are not available.

When libhugetlbfs is preloaded or linked and the environment variable HUGETLB_MORECORE set to yes, libhugetlbfs will configure the __morecore hook, causing malloc() requests will use huge pages. Alternatively, launch the application with hugectl --heap.

Unlike shared memory, the page size can also be specified if more than one page size is supported by the system. The first example below uses the default page size (e.g. 16M on Power5+) and the second example explicitly overrides a default, using 64K pages.

    $ hugectl --heap ./target-application
    $ hugectl --heap=64k ./target-application

If the application has already been linked with libhugetlbfs, it may be necessary to specify --no-preload when using --heap so that an attempt is not made to load the library twice.

By using the __morecore hook and setting the mallopt() option M_MMAP_MAX to zero, libhugetlbfs prevents glibc from making use of brk() to expand the heap. An application that calls brk() directly will be using base pages.

If a custom memory allocator is being used, it must support the __morecore hook to use huge pages. An alternative may be to provide a wrapper around malloc() that called the real underlying malloc() or get_hugepage_region() depending on the size of the buffer. A heavy solution would be to provide a fully-fledged implementation of malloc() with libhugetlbfs that uses huge pages where appropriate, but this is currently unavailable due to the lack of a demonstrable use case.

5.3 Text and Data

Backing text or data is more involved as the application should first be relinked to align the sections to a huge page boundary. This is accomplished by linking against libhugetlbfs and specifying -Wl,--hugetlbfs-align -- assuming the version of binutils installed is sufficiently recent. More information on relinking applications is described in the libhugetlbfs HOWTO. Once the application is relinked, as before control is with environment variables or with hugectl.

    $ hugectl --text --data --bss ./target-application

When backing text or data by text, the relevant sections are copied to files on the hugetlbfs filesystem and mapped with mmap(). The files are then unlinked so that the memory is freed on application exit. If the application is to be invoked multiple times, it is worth sharing that data by specifying the --share-text switch. The consequence is that the memory remains in use when the application exits and must be manually deleted.

If it is not possible to relink the application, it is possible to force the loading of segments backed by huge pages by setting the environment variable HUGETLB_FORCE_ELFMAP to yes. This is not the preferred option as the method is not guaranteed to work. Segments must be large enough to overlap with a huge page and on architectures with limitations on where segments can be placed, it can be particularly problematic.

5.4 Stack

Currently, the stack cannot be backed by huge pages. Support was implemented in the past but the vast majority of applications did not aggressively use the stack. In many distributions, there are ulimits on the size of the stack that are smaller than a huge page size. Upon investigation, only the bwaves test from the SPEC CPU 2006 benchmark benefited from stacks being backed by huge pages and only then when using a commercial compiler. When compiled with gcc, there was no benefit, hence support was dropped.

6 Summary

There are a small number of interfaces provided by Linux to access huge pages. While cumbersome to develop applications against, there is a programming API available with libhugetlbfs and it is possible to automatically back segments of memory with huge pages without application modification. In the next section, it will be discussed how the system should be tuned.

Comments (6 posted)

Patches and updates

Kernel trees

Architecture-specific

Build system

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet
Next page: Distributions>>


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds