Brief items
The 2.6.33 kernel is out,
released on February 24.
Linus says:
The most noticeable features in 2.6.33 are likely the Nouveau and
DRBD integration (and a _lot_ more people will notice the Nouveau
part of that). And the Radeon KMS parts aren't considered
experimental any more. Oh, and the AS IO scheduler is gone, since
keeping it around and just causing confusion seemed to not be worth
it any more. You're supposed to use CFQ instead.
Other interesting stuff
in 2.6.33 includes dynamic
tracing, the block I/O
bandwidth controller, and the compressed cache mechanism.
See the KernelNewbies 2.6.33
page for more information on this release.
The current stable kernel is 2.6.32.9, released on February 23.
There are 93 fixes in this update, many of which are security-related. See
below for our detailed look at this release.
Course this is all completely useless, but it would be if the locks
were inline (which is actually an askable question now). There was
just so much awesomeness going on with the 64-bit rwsem constructs
I felt I had to add even more awesomeness to the plate. For some
definition of awesomeness.
--
Zachary Amsden
So I'm going to stop being so predictable that people can tell that
exactly two weeks after the last release is where the merge window
closes, and if people want to make sure their stuff merged, I had
better have a merge request in my inbox earlier than thirteen days
after the release.
--
Linus Torvalds
By Jonathan Corbet
February 23, 2010
Most Linux users never deal directly with file handles; indeed, most may
not even know they exist. Of the rest, the bulk will have an experience
limited to the cheery "stale file handle" message seen by NFS users at
horribly inopportune times. In fact, a file handle is just a means by
which a file can be uniquely identified within a filesystem. Handles are
used in NFS, for example, to represent an open file in a way which allows
the server to be almost entirely stateless. Handles are not normally used
by, or even available to, user-space applications.
Aneesh Kumar is trying to change that situation with a short patch series adding two
new system calls:
int name_to_handle(const char *name, struct file_handle *handle);
int open_by_handle(struct file_handle *handle, int flags);
The first takes the given name and looks up the associated file
handle, which is returned in the handle structure. That handle
can then be passed to open_by_handle() to get an open file
descriptor for the file. Only privileged users can call
open_by_handle(); otherwise it could be possible for a malicious
local user to bypass the normal permission checks on the directories in the
path to a specific file.
Why would an application developer want to open a file in two steps instead
of just calling open()? It comes down to the ability to write
filesystem servers that run in user space. Such a server could use
name_to_handle() to generate handles for files on the underlying
filesystem; those handles are then passed to the filesystem's clients. At
some future time, the client can pass the handle back to actually open the
file. This type of feature is also already
used with the XFS filesystem
for backup and restore operations and with a hierarchical storage
management system.
Discussion of these system calls has been minimal, thus far. It does seem
that some work will be needed still to better describe what a file handle
really is, and, in particular, what its expected lifetime will be. Without
some clarity in that area, it will be hard to write applications which can
make proper use of file handles.
By Jonathan Corbet
February 24, 2010
It is not all that uncommon to have a network application which needs to be
able to bind to a specific port. Often, such requirements result from a
firewall configuration allowing incoming connections only to a specific
port, but there can be other reasons as well. When running such an
application, it can be unpleasant to discover that somebody else's
long-running ssh connection happened to stumble onto the required port. It
would be nice to be able to avoid this kind of conflict if at all possible.
This patch set from Octavian
Purdila aims to make it possible. It adds a new sysctl knob (called
ip_local_reserved_ports) under /proc/sys/net/ipv4.
Should the system administrator write a comma-separated list of ports (or
ranges of ports denoted by a hyphen) to this parameter, the networking
layer will avoid those
ports whenever it picks a port number for a new socket. Reserving ports in
this manner will not interfere with any application which binds to those
ports explicitly.
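The behavior described above can be illustrated with a small user-space sketch. This is not the kernel patch; it only mimics the idea of parsing a list such as "8080,9000-9010" and skipping those ports during automatic selection. The names port_set, parse_reserved_ports() and pick_local_port() are invented for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PORT_COUNT 65536

/* One bit per port; a set bit means "never hand this port out automatically". */
struct port_set {
    uint8_t bits[PORT_COUNT / 8];
};

static void port_set_add(struct port_set *s, unsigned port)
{
    s->bits[port / 8] |= (uint8_t)(1u << (port % 8));
}

static bool port_set_has(const struct port_set *s, unsigned port)
{
    return (s->bits[port / 8] & (1u << (port % 8))) != 0;
}

/* Parse a list like "8080,9000-9010". Returns 0 on success, -1 on bad input. */
int parse_reserved_ports(const char *list, struct port_set *out)
{
    const char *p = list;

    memset(out, 0, sizeof(*out));
    while (*p) {
        char *end;
        unsigned long lo = strtoul(p, &end, 10);
        unsigned long hi = lo;

        if (end == p)
            return -1;              /* no digits where a port was expected */
        if (*end == '-') {          /* a range: read its upper bound */
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (lo > hi || hi >= PORT_COUNT)
            return -1;
        for (unsigned long port = lo; port <= hi; port++)
            port_set_add(out, (unsigned)port);
        if (*end == ',')
            end++;
        else if (*end != '\0')
            return -1;
        p = end;
    }
    return 0;
}

/* Scan [lo, hi] circularly from 'start' and return the first port that is
 * not reserved, or -1 if the arguments are invalid or every port is reserved. */
int pick_local_port(const struct port_set *reserved,
                    unsigned lo, unsigned hi, unsigned start)
{
    if (lo > hi || hi >= PORT_COUNT || start < lo || start > hi)
        return -1;
    unsigned range = hi - lo + 1;
    for (unsigned i = 0; i < range; i++) {
        unsigned port = lo + (start - lo + i) % range;
        if (!port_set_has(reserved, port))
            return (int)port;
    }
    return -1;
}
```

Only the automatic path consults the set; an explicit bind to a reserved port would bypass pick_local_port() entirely, which mirrors the rule that explicit binds are not affected.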
This patch has been through a surprising number of revisions; chances seem
good that it will show up in the mainline once the 2.6.34 merge window
opens.
Kernel development news
By Jonathan Corbet
February 24, 2010
It has been exactly one year since LWN last
checked up on the checkpoint/restart
patch set. This code has just been
reposted with a request for
inclusion into the -mm tree, so it seems like an opportune time to restart
our coverage of it. A lot of progress has been made on this front over the
last year, but checkpoint/restart remains a difficult task which can
probably never be implemented completely.
"Checkpointing" refers to the act of saving the state of a group of
processes to a file, with the intent of restarting those processes at some
future time. For many years, checkpointing has been used to save the state
of long-running calculations to avoid losing work should the system fail.
More recently, it has become a desired part of the virtualization toolkit,
enabling the live migration of processes between physical hosts. The
checkpoint/restart developers also see other potential advantages, such as
the ability to quickly launch a set of processes on demand from a
checkpoint image.
This patch set addresses checkpoint/restart in the containers context.
In the context of full virtualization, checkpointing is relatively easy;
the system just needs to save the entire memory image associated with the
virtual machine and a bit of associated data. The "containers" model of
virtualization tends to be messier in almost every way, and checkpointing
is no exception. There is no memory image to be saved in one big chunk;
instead, the kernel must track down every bit of state associated with the
checkpointed processes and save it independently. When it works, it can be
faster and more efficient than full virtual machine checkpointing; the
checkpoint image will be much smaller. But getting it to work is a
challenge. The complexity of this task can be seen in the
current checkpoint/restart tree, which, despite being far from a complete
solution of the problem, is a 27,000-line diff from
2.6.33-rc8.
Checkpointing
To checkpoint a group of processes, the following new system call is used:
int checkpoint(pid_t pid, int fd, unsigned long flags, int logfd);
The pid parameter identifies the top-level process to be
checkpointed; all children of that process will also be included in the
checkpoint image, which will be written to the file indicated by
fd. There is currently only one possible flag value,
CHECKPOINT_SUBTREE, which turns off the normal requirement that an
entire container be checkpointed as a whole. Checkpointing just a subtree
is a bit riskier than checkpointing a full container because it is harder to
ensure that all needed resources have been saved. The logfd
parameter is a file descriptor open for writing;
the kernel will write relevant logging information there. There are vast
numbers of possible ways for a checkpoint to fail; the log file is intended
to help users figure out what is happening when a checkpoint refuses to
work. If logging is not desired, logfd can be -1.
The set of processes to be checkpointed should be frozen prior to the call
to checkpoint(). One exception to that rule is a process running
in checkpoint() itself; this exception allows processes to
checkpoint themselves.
Internally, the checkpointing process is implemented as a two-phase
operation:
- The kernel traverses the tree of processes and "collects" every
object which is to be a part of the checkpoint image. Essentially,
"collecting" means building a hash table with an entry for every
process, every open file, every virtual memory area, every open
socket, etc. which must be saved. Scanning the tree in this way helps
the kernel to abort the checkpoint process early if something which
cannot be checkpointed is encountered. Just as importantly, the collecting process
also lets the system track objects which have multiple references
and ensure that they are only written to the image file once.
- The second pass then iterates over the collected objects and causes
each to be serialized and written to the image file.
Once this is done, the checkpoint is finished. The just-checkpointed
processes can either go on with their business or be killed, depending on
the reason for the checkpoint.
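The two-phase structure can be sketched in user space. The following is a toy model only, with nothing taken from the actual patch set: toy_obj, toy_ctx, toy_collect() and toy_write_image() are invented names, and a linear array stands in for the hash table the article mentions. It shows the two properties described above: the collect pass can abort early, and a shared object is recorded (and later written) only once.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_OBJS 64

/* A toy "kernel object": a name plus a flag saying whether it can be saved. */
struct toy_obj {
    const char *name;
    int checkpointable;
};

/* Toy context holding every object collected so far. */
struct toy_ctx {
    const struct toy_obj *objs[MAX_OBJS];
    size_t count;
};

/* Phase 1: record obj unless it is already known. Returns the object's
 * index (its "objref"), or -1 if it cannot be checkpointed or the table
 * is full, so the caller can give up before anything is written. */
int toy_collect(struct toy_ctx *ctx, const struct toy_obj *obj)
{
    if (!obj->checkpointable)
        return -1;
    for (size_t i = 0; i < ctx->count; i++)
        if (ctx->objs[i] == obj)
            return (int)i;          /* shared object: reuse the entry */
    if (ctx->count == MAX_OBJS)
        return -1;
    ctx->objs[ctx->count] = obj;
    return (int)ctx->count++;
}

/* Phase 2: serialize each collected object exactly once.
 * Returns the number of objects written, or -1 on a write error. */
int toy_write_image(const struct toy_ctx *ctx, FILE *image)
{
    for (size_t i = 0; i < ctx->count; i++)
        if (fprintf(image, "%zu %s\n", i, ctx->objs[i]->name) < 0)
            return -1;
    return (int)ctx->count;
}
```

A file shared by two processes would be passed to toy_collect() twice but would appear in the image once, with both processes referring to it by the same index.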
These two phases are reflected in the changes made to the lower levels of the
system. For example, the none-too-svelte file_operations
structure gains two new operations:
int collect(struct ckpt_ctx *ctx, struct file *filp);
int checkpoint(struct ckpt_ctx *ctx, struct file *filp);
The collect() operation should identify every object which must
be saved, eventually passing each to ckpt_obj_collect() (or
one of several higher-level interfaces) for tracking. Later, a call to
checkpoint() is made to request that the given filp be serialized for
saving to the checkpoint image. Similar methods have been added to a
number of other structure types, including vm_operations_struct and
proto_ops.
The serialization process requires copying data from kernel data structures
into a series of special structures intended to be written to the image
file. So, for example, a file descriptor finds its way from
struct fdtable into one of these:
struct ckpt_hdr_file_desc {
    struct ckpt_hdr h;
    __s32 fd_objref;
    __s32 fd_descriptor;
    __u32 fd_close_on_exec;
} __attribute__((aligned(8)));
Doing this copy requires a 75-line function which grabs the requisite
information and very carefully checks that everything can be checkpointed
successfully. In this case, the presence of locks on the file or an owner
(to be notified with SIGIO) will cause the checkpoint to fail. In
the absence of such roadblocks, the completed structure is handed to the
checkpoint code for saving to the image file.
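A much-reduced sketch of that copy-and-check step might look like the following. It is an illustration under stated assumptions, not the 75-line kernel function: the header layout (toy_ckpt_hdr), the record type value, and the toy_open_file view of a descriptor are all hypothetical; only the three fd_* fields and the two failure conditions come from the description above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the record header; the real struct ckpt_hdr
 * layout is not shown in the article. */
struct toy_ckpt_hdr {
    uint32_t type;
    uint32_t len;
};

/* Mirrors the fd_* fields of ckpt_hdr_file_desc using user-space types. */
struct toy_file_desc_rec {
    struct toy_ckpt_hdr h;
    int32_t  fd_objref;
    int32_t  fd_descriptor;
    uint32_t fd_close_on_exec;
} __attribute__((aligned(8)));

/* Hypothetical, simplified view of the live state of one open descriptor. */
struct toy_open_file {
    int fd;
    int objref;
    int close_on_exec;
    int has_locks;          /* file locks are held on the file */
    int has_sigio_owner;    /* an owner is registered for SIGIO */
};

#define TOY_HDR_FILE_DESC 1u   /* made-up record type value */

/* Fill rec from src. Returns 0 on success, or -1 if the descriptor is in a
 * state that (per the article) makes the checkpoint fail. */
int toy_fill_file_desc(const struct toy_open_file *src,
                       struct toy_file_desc_rec *rec)
{
    if (src->has_locks || src->has_sigio_owner)
        return -1;
    memset(rec, 0, sizeof(*rec));   /* no stray padding bytes in the image */
    rec->h.type = TOY_HDR_FILE_DESC;
    rec->h.len = sizeof(*rec);
    rec->fd_objref = src->objref;
    rec->fd_descriptor = src->fd;
    rec->fd_close_on_exec = src->close_on_exec ? 1 : 0;
    return 0;
}
```

Fixed-width fields and explicit alignment keep the on-disk record independent of the compiler's choices, which is presumably why the kernel structure uses __s32/__u32 and an aligned attribute.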
This serialization process is one of the scarier parts of the whole
checkpoint/restart concept. Any changes to struct fdtable will
almost certainly break this serialization, quite possibly in ways which
will not be detected until some user runs into a problem. Even if a VFS
developer cared about checkpointing, they might not think to look
at the code in checkpoint/files.c to see if anything might require
changing. Similar dependencies are created for every other kernel data
structure which must be checkpointed.
The whole setup looks like it could be a little fragile; keeping
it working is almost certain to require significant ongoing maintenance.
Restarting
On the restart side, the application performing the restart is first expected to create a set
of processes to be animated with the checkpointed information. That
creation will be done with the much-reviewed "extended clone()"
system call, which, in this iteration, looks like:
int eclone(u32 flags_low, struct clone_args *cargs, int cargs_size,
pid_t *pids);
With eclone(), the processes can be created with specific
pids and with an extended set of flags.
Once the process hierarchy exists, the restart() system call can
be used:
int restart(pid_t pid, int fd, unsigned long flags, int logfd);
The checkpoint image found at fd will be restored into the process
hierarchy starting at pid. Once again, logfd can be used
to gain information on how the process went. There are a number of
flags defined: RESTART_TASKSELF (a single task is being
restarted on top of the process calling restart()),
RESTART_FROZEN (causes the restarted processes to be left frozen
at the end), RESTART_GHOST (appears to be a debugging feature),
RESTART_KEEP_LSM (restore security labels too), and
RESTART_CONN_RESET (force the closing of open sockets). On a
successful return from restart(), the process hierarchy should be
ready to go.
Once again, restart requires support at the lower levels of the kernel. So
our long-suffering file_operations structure gains another
function:
int restore(struct ckpt_ctx *, struct ckpt_hdr_file *);
This function (along with its analogs elsewhere in the kernel) is charged
with reanimating the given object from the checkpoint file.
Security
It is not hard to imagine that these new system calls could have any of a
number of security-related consequences, so it is surprising to see that,
in the current implementation, both checkpoint() and
restart() are unprivileged operations. This decision was made
deliberately, with the idea of forcing the developers to think about
security issues from the outset.
The biggest potential problem with checkpoint() is probably
information disclosure. To avoid this problem, checkpoint() is
only able to checkpoint processes which the caller would be able to call
ptrace() on. So there should be no way for a hostile user to gain
information from a checkpoint image which would not be available anyway.
The restart side is a little more frightening; it allows the caller to load
vast amounts of potentially arbitrary data into kernel data structures.
This risk is, one hopes, mitigated by causing all operations to be done in
the context of the calling process. If the caller cannot open a file
directly, that file cannot be opened via a corrupted checkpoint image
either. Doing things this way will break certain use cases, such as
checkpointing a setuid program which has since dropped its privileges, but
there is probably no way to make that case work securely for unprivileged
users.
For an added challenge, the checkpoint/restart developers have also
implemented the checkpointing of security labels on objects. By default,
these labels will not be used during the restart process, but the
RESTART_KEEP_LSM flag can change that. Again, the labels are
created in the context of the calling process, so the active security
module should prevent the attachment of labels which compromise the
security of the system.
Even with these measures in place, one still has to wonder about the security of
the process as a whole. The kernel is populating a wide array of data
structures from input which may be under the control of a hostile user; it
is not hard to imagine that, somewhere in tens of thousands of lines of
checkpoint/restart code, an important check has not been made. Perhaps as
a result of this concern, the patch set adds a sysctl knob which can be set
to disallow unprivileged checkpoint/restart operations.
Where things stand
According to the most recent patch posting:
This one is able to checkpoint/restart screen and vnc sessions, and
live-migrate network servers between hosts. It also adds support
for x86-64 (in addition to x86-32, s390x and powerpc).
So the patch set appears to be sufficiently functional to be minimally
useful. There are, however, a lot of things which can still prevent the
creation of a successful checkpoint; they are summarized on this page.
Problem areas include private filesystem mounts, network sockets in some
states, open-but-unlinked files, use of any of the file event notification
interfaces, open files on network or FUSE filesystems, use of netlink,
ptrace(), asynchronous I/O, and more. There are patches in the
works for some of these problems; others look hard.
As of this writing, there has been no response to the developers' request
for inclusion in the -mm kernel. In the past, there have been concerns
about how much work would be required to finish the job. Over the last
year, much of that work has been done, but checkpoint/restart looks like a job
which is never truly finished. It's mostly a matter of whether what has
been done so far appears to be good enough for real work, and whether the
maintenance cost of this code is deemed to be worth paying.
By Jonathan Corbet
February 21, 2010
Stable kernel update announcements posted on LWN have a certain tendency to be
followed by complaints about the amount of information which is made
available. It seems that there is a desire for a description of the
changes which is more accessible than the patches themselves, and for
attention to be drawn to the security-relevant fixes.
As an exercise in determining what kind of effort is being asked
of the kernel maintainers, your editor decided to make a pass
through the
proposed 2.6.32.9 update and
attempt to describe the impact of
each of the changes - all 93 of them. The results can be found below.
Disclaimers: there is no way to review 93 patches in a finite time and
fully understand each of them. So there are almost
certainly errors in what follows. The simple truth of the matter is that
it is very hard to say which fixes have security implications; a determined
attacker can find a way to exploit some very obscure bugs.
Your editor would also like to discourage anybody from thinking
that this will become a regular LWN feature. The amount of work required
is considerable; it's not something we're able to commit to doing for every
release.
That said, here's a look at what's in this update.
Security-related fixes
Other bug fixes
- #1: Fix potential crash with
sys_move_pages. Fix an unreliable test which could cause a crash
in the page migration code. [Update: as has been pointed out
in the comments, this one is exploitable
and should have been in the
security list above.]
- #6: hwmon: (w83781d) Request I/O ports
individually for probing. More robust access to hardware
monitoring ports.
- #7: hwmon: (lm78) Request I/O ports
individually for probing. More robust access to hardware
monitoring ports.
- #8: hwmon: (adt7462) Wrong
ADT7462_VOLT_COUNT. Fixes a bug which could cause one voltage
measurement to be passed over.
- #9: ALSA: ctxfi - fix PTP address
initialization. Fixes an alignment bug in the ctxfi sound driver.
- #10: drm/i915: disable hotplug detect
before Ironlake CRT detect. Fixes a possible hang in the monitor
detection code.
- #12: drm/i915: Disable SR when more than
one pipe is enabled. Fixes a flicker-causing i915 bug.
- #13: drm/i915: Fix DDC on some systems by
clearing BIOS GMBUS setup. Fixes a bug which can cause failure to
detect some monitors.
- #15: drm/i915: Fix the incorrect DMI
string for Samsung SX20S laptop. Incorrect identification
information was returned to user space.
- #17: usb: r8a66597-hcd: Flush the D-cache
for the pipe-in transfer buffers. Fixes a cache consistency
problem.
- #18: i2c-tiny-usb: Fix on big-endian
systems. An endianness bug in i2c-tiny-usb caused incorrect
information to be returned to user space.
- #19: drm/i915: handle FBC and self-refresh
better. Eliminates an i915 flicker problem.
- #20: drm/i915: Increase fb alignment to
64k. Fixes an obscure error in the i915 driver.
- #24: CPUFREQ: Fix use after free of struct
powernow_k8_data. Fixes a use-after-free bug in the cpufreq code;
does not appear to be user-triggerable.
- #25: freeze_bdev: dont deactivate
successfully frozen MS_RDONLY sb. Fixes a boot-time crash in the block
layer.
- #27: ioat: fix infinite timeout checking
in ioat2_quiesce. Fixes a typo in the IOAT code.
- #29: fs/exec.c: restrict initial stack
space expansion to rlimit. Fixes a bug which could cause process
creation failures in the presence of tight stack limits.
- #30: cifs: fix length calculation for
converted unicode readdir names. Fixes a CIFS data consistency
bug.
- #31: NFS: Fix a reference leak in
nfs_wb_cancel_page(). Fixes a reference leak in the NFS
cancellation code.
- #32: NFS: Try to commit unstable writes in
nfs_release_page(). Looks like a fix for a potential data loss
problem in the NFS code.
- #33: NFSv4: Dont allow posix locking
against servers that dont support it. Be sure to notice if a
server does not support POSIX locking.
- #34: NFSv4: Ensure that the NFSv4 locking
can recover from stateid errors. Fix an NFSv4 locking problem
with unknown effects.
- #37: NFS: Fix a bug in
nfs_fscache_release_page(). Removes a spurious BUG_ON()
call.
- #38: NFS: Fix the mapping of the
NFSERR_SERVERFAULT error. Fix an incorrect error value returned
to user space.
- #39: md: fix degraded calculation when
starting a reshape. Some old code can cause the MD subsystem to
be unclear on whether a given array is running in a degraded mode or
not after a reshape.
- #42: kvmclock: count total_sleep_time when
updating guest clock. Fix an error which could lead to incorrect
wall clock time in KVM guests.
- #43: KVM: PIT: control word is
write-only. Prevent attempts to read a write-only register.
- #44: tpm_infineon: fix suspend/resume
handler for pnp_driver. Fixes a hang-on-suspend bug.
- #45: amd64_edac: Do not falsely trigger
kerneloops. Remove a spurious warning in the amd64 EDAC code.
- #46: netfilter: nf_conntrack: fix memory
corruption with multiple namespaces. Fixes a potential race
condition which could lead to memory corruption. Requires the
instantiation of a new namespace (and, thus, root privilege) to
trigger.
- #48: netfilter: nf_conntrack: restrict
runtime expect hashsize modifications. Don't allow the connection
tracking expect_hashsize attribute to be modified, since the
code isn't prepared to handle that.
- #49: netfilter: xtables: compat out of
scope fix. Fixes a potential stack-based dangling pointer bug.
- #51: drm/i915: remove full registers dump
debug. Removes an i915 debug option which could hang the machine.
- #52: drm/i915: add i915_lp_ring_sync
helper. Code and performance improvement in the i915 driver.
- #53: drm/i915: Dont wait interruptible for
possible plane buffer flush. The i915 DRM driver can corrupt the
hardware state if a signal comes in at the wrong time. Could be seen
as a denial of service problem, but that's a big stretch.
- #56: wmi: Free the allocated acpi objects
through wmi_get_event_data. Fixes a memory leak in the WMI code.
- #58: /dev/mem: introduce
size_inside_page(). Eliminates some duplicate code and fixes the
alignment logic for /dev/kmem, which was described simply as
"buggy." But who uses /dev/kmem anymore?
- #59: devmem: check vmalloc address on kmem
read/write. A missing test for addresses in the
vmalloc() space could cause an oops from the
/dev/kmem code. Probably not triggerable by ordinary users,
though, even on systems where /dev/kmem is enabled.
- #60: devmem: fix kmem write bug on memory
holes. An attempt to write data to /dev/kmem would get
confused if a memory hole is hit, causing incorrect data to be written
after the hole.
- #61: SCSI: mptfusion : mptscsih_abort
return value should be SUCCESS instead of value 0. The mptfusion
driver had an incorrect return value with unknown effects.
- #62: sh: Couple kernel and user write
page perm bits for CONFIG_X2TLB. The SuperH architecture had a
problem handling write faults for pages in the vmalloc()
space, which could cause problems with drivers that map such pages
into user space.
- #63: ALSA: hda - use WARN_ON_ONCE() for
zero-division detection. Avoid spamming the log files if the
hardware goes nuts.
- #64: dst: call cond_resched() in
dst_gc_task(). The network destination cache code can process
very long lists, leading to soft lockup warnings.
- #66: befs: fix leak. There is a
memory leak in the BeFS mount code; one would not normally expect it
to be user-triggerable.
- #67: rtc-fm3130: add missing braces.
Missing braces in the rtc-fm3130 would cause spurious warnings to be
emitted.
- #68: [libata] Call flush_dcache_page after
PIO data transfers in libata-sff.c. Fix a cache coherency bug in
the ATA code.
- #70: pktgen: Fix freezing problem.
The packet generator could prevent the system from suspending or
hibernating.
- #71: x86/amd-iommu: Fix IOMMU-API
initialization for iommu=pt. Fix a boot-time initialization error
in the IOMMU code.
- #72: x86/amd-iommu: Fix deassignment of a
device from the pt_domain. Fix a KVM device assignment failure.
- #73: x86: Re-get cfg_new in case
reuse/move irq_desc. Fix a bug in interrupt migration with
unknown effect.
- #74: Staging: fix rtl8187se compilation
errors with mac80211. Boring compilation problem fix.
- #76: serial: 8250: add serial transmitter
fully empty test. Fixes a serial driver problem which could cause
the loss of some transmitted data.
- #77: sysfs: sysfs_sd_setattr set iattrs
unconditionally. An omitted initialization can cause sysfs
attributes to have more restrictive permissions than desired.
- #78: class: Free the class private data in
class_release. Fix a memory leak in the sysfs class code.
Potentially user-exploitable if somebody were willing to dedicate a
month of their life to repeatedly plugging and unplugging a device.
- #80: USB: usbfs: properly clean up the as
structure on error paths. Fixes a memory leak in the usbfs error
recovery paths.
- #83: ACPI: fix High cpu temperature with
2.6.32. Fixes behavior on a couple of laptops with problematic
power management operation.
- #84: drm/radeon/kms: use udelay for short
delays. Use of schedule_timeout() for short delays was
slowing bootstrap considerably on some systems.
- #85: NFS: Too many GETATTR and ACCESS
calls after direct I/O. Fixes a performance regression in the NFS
code.
- #86: eCryptfs: Add getattr function.
The eCryptfs filesystem would show incorrect file sizes.
- #87: b43: Fix throughput regression.
Throughput on some BCM4311 devices is said to have dropped from 18Mb/s
to 0.7Mb/s, which is a bit more of a penalty than some users wanted to
pay.
- #88: ath9k: Fix sequence numbers for PAE
frames. Fixes a protocol error in the ath9k driver.
- #89: mac80211: Fix probe request filtering
in IBSS mode. The wireless code could reply to probe requests
directed at a different SSID.
- #90: iwlwifi: Fix to set correct ht
configuration. The iwlwifi driver was not configuring
associations correctly, leading to dropped connections.
- #91: dm stripe: avoid divide by zero with
invalid stripe count. Giving a bad stripe count to the device
mapper code would cause a division by zero.
- #93: dm mpath: fix stall when requeueing
io. Fixes a root-triggerable stall in the device mapper multipath
code.
Enhancements
Conclusions
Out of 93 patches, 18 struck your editor as having clear security
implications. Quite a few other patches fix crashes which could possibly
be security problems; if they are not listed as such, it's because there
was no immediately evident way that a user could trigger the problem.
Doubtless people with more imagination will figure out ways to take
advantage of some of these bugs.
What it comes down to is that the identification of security problems is
often hard. In the kernel environment, almost any bug could potentially
create some kind of vulnerability. So it is not surprising to see developers
"silently fix" security bugs; they simply fix bugs without realizing the
implications. It is also not surprising that some developers are reluctant
to call attention to security-related fixes. The list above almost
certainly includes "security fixes" for bugs that nobody can exploit while
classifying true vulnerabilities as mere bug fixes. Any list of
security-relevant patches is sure to be an incomplete and partially
deceptive thing.
That said, it may well be that fixes which are truly known to have security
implications should be marked as such. Attackers will make the effort to
figure that out anyway; it's not clear that making life harder for
everybody else has any benefits. Still, those who would complain about how
the stable tree is managed would do well to remember that, a few years ago,
we had no such tree. It came into being because people stepped up to do
the work of maintaining it. There can be no doubt that a better job could
be done here (as is the case almost everywhere else too); it's just a matter
of somebody finding the time and the energy to do it.
February 24, 2010
This article was contributed by Mel Gorman
In an ideal world, the operating system would automatically use huge pages
where appropriate, but there are a few problems. First, the operating system
must decide when it is appropriate to promote base pages to huge pages,
requiring the maintenance of metadata which, itself, has an associated cost
which may or may not be offset by the use of huge pages. Second, there
can be architectural limitations that prevent a different page size being
used within an address range once one page has been inserted. Finally,
differences in TLB structure make predicting how many huge pages can be
used and still be of benefit problematic.
For these reasons, with one notable exception, operating systems provide a
more explicit interface for huge pages to user space. It is up to application
developers and system administrators to decide how they can best be used. This
chapter will cover the interfaces that exist for Linux.
1 Shared Memory
One of the oldest interfaces backs shared memory segments created by
shmget() with huge pages. Today, it is commonly used due to its
simplicity and the length of time it has been supported. Huge pages are
requested by specifying the SHM_HUGETLB flag and ensuring the
size is huge-page-aligned. Examples of how to do this are included
in the kernel source tree under Documentation/vm/hugetlbpage.txt.
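As a minimal sketch of such a request (not a copy of the kernel's documentation example), the code below reads the default huge page size from /proc/meminfo, rounds the requested length up to it, and calls shmget() with SHM_HUGETLB. The helper names huge_align_up(), default_huge_page_size() and shm_create_huge() are invented here, and the 0600 permission bits are an arbitrary choice.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* SHM_HUGETLB is hidden in strict -std= modes */
#endif
#include <stddef.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000   /* fallback for older C library headers */
#endif

/* Round len up to a multiple of page_size, which must be a power of two. */
size_t huge_align_up(size_t len, size_t page_size)
{
    return (len + page_size - 1) & ~(page_size - 1);
}

/* Default huge page size in bytes, from the Hugepagesize field of
 * /proc/meminfo; returns 0 if the field cannot be found. */
size_t default_huge_page_size(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    unsigned long kb = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "Hugepagesize: %lu kB", &kb) == 1)
            break;
    fclose(f);
    return (size_t)kb * 1024;
}

/* Create a private segment of at least len bytes backed by huge pages.
 * Returns the segment identifier, or -1 on failure (for instance when
 * no huge pages have been configured). */
int shm_create_huge(size_t len)
{
    size_t hps = default_huge_page_size();

    if (hps == 0)
        return -1;
    return shmget(IPC_PRIVATE, huge_align_up(len, hps),
                  SHM_HUGETLB | IPC_CREAT | 0600);
}
```

The returned identifier would then be attached with shmat() and eventually removed with shmctl(id, IPC_RMID, NULL); failure is the expected result on a system with an empty huge page pool.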
A limitation of this interface is that only the default huge page size
(as indicated by the Hugepagesize field in
/proc/meminfo) will be used. If one wanted to use 16GB pages as supported on
later versions of POWER for example, the default_hugepagesz=
parameter must be used on the kernel command line as documented in
Documentation/kernel-parameters.txt in the kernel source.
The maximum amount of memory that can be committed to shared-memory huge
pages is controlled
by the shmmax sysctl parameter. This parameter will be discussed
further in the next installment.
2 HugeTLBFS
For the creation of shared or private mappings, Linux provides a RAM-based
filesystem called "hugetlbfs." Every file on this filesystem is
backed by huge pages and is accessed with mmap() or read().
If no options are specified at mount time, the default huge page size
is used to back the files. To use a different page size, specify
pagesize=.
$ mount -t hugetlbfs none /mnt/hugetlbfs -o pagesize=64K
There are two ways to control the amount of memory which can be consumed by
huge pages attached to a mount point. The size= mount parameter
specifies (in bytes; the "K," "M," and
"G" suffixes are understood) the maximum amount of memory which will be used
by this mount. The size is rounded down to the nearest huge page size. It
can also be specified as a percentage of the static huge page pool, though
this option appears to be rarely used. The nr_inodes= parameter
limits the
number of files that can exist on the mount point which, in effect, limits the
number of possible mappings. In combination, these options can be used to
divvy up the available huge pages to groups or users in a shared system.
Hugetlbfs is a bare interface to the huge page capabilities of the underlying
hardware; taking advantage of it requires application awareness or library
support. Libhugetlbfs makes heavy use of this
interface when automatically backing regions with huge pages.
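To give a rough idea of what that application awareness involves, the sketch below creates a file under a directory that is assumed to be a hugetlbfs mount and maps it. The function name map_hugetlbfs_file() and the file name pattern are invented for illustration; the fstatfs() check against HUGETLBFS_MAGIC is an added precaution so that a regular directory is rejected rather than mapped.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <linux/magic.h>   /* HUGETLBFS_MAGIC */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/vfs.h>       /* fstatfs() */
#include <unistd.h>

/* Map len bytes of a new file created under mount_point. len should be a
 * multiple of the mount's huge page size. Returns the mapping, or NULL if
 * the file cannot be created, the directory is not on hugetlbfs, or the
 * mapping fails (for example because the huge page pool is exhausted). */
void *map_hugetlbfs_file(const char *mount_point, size_t len)
{
    char path[4096];
    struct statfs sfs;
    void *addr = MAP_FAILED;
    int fd, n;

    n = snprintf(path, sizeof(path), "%s/sketch-%ld",
                 mount_point, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof(path))
        return NULL;

    fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    unlink(path);   /* the name is not needed once the file is open */

    if (fstatfs(fd, &sfs) == 0 &&
        (unsigned long)sfs.f_type == HUGETLBFS_MAGIC)
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    close(fd);      /* an established mapping keeps the file alive */
    return addr == MAP_FAILED ? NULL : addr;
}
```

Because the file is unlinked immediately, its huge pages go back to the pool when the mapping is removed with munmap() or the process exits. A server wanting to share the region between processes would keep the name instead.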
For an application wishing to use the interface, the initial step is
to discover the mount point by either reading /proc/mounts
or using libhugetlbfs. Finding the mount point manually is
relatively straightforward and already well known for debugfs
but, for completeness, a very simple example program is shown below:
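#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Scan /proc/mounts and copy the first hugetlbfs mount point into fsmount,
 * which must have room for len characters plus a terminating NUL.
 * Returns fsmount on success, or NULL if no hugetlbfs mount was found. */
char *find_hugetlbfs(char *fsmount, int len)
{
    char format[256];
    char fstype[256];
    char *ret = NULL;
    FILE *fd;

    /* Builds "%*s %<len>s %255s %*s %*d %*d": skip the device, read the
     * mount point and filesystem type, skip the remaining fields. */
    snprintf(format, sizeof(format), "%%*s %%%ds %%255s %%*s %%*d %%*d", len);

    fd = fopen("/proc/mounts", "r");
    if (!fd) {
        perror("fopen");
        return NULL;
    }

    while (fscanf(fd, format, fsmount, fstype) == 2) {
        if (!strcmp(fstype, "hugetlbfs")) {
            ret = fsmount;
            break;
        }
    }

    fclose(fd);
    return ret;
}

int main(void)
{
    char buffer[PATH_MAX+1];
    char *mount = find_hugetlbfs(buffer, PATH_MAX);

    /* Passing NULL to printf("%s") is undefined, so check first. */
    if (!mount) {
        fprintf(stderr, "no hugetlbfs mount found\n");
        return EXIT_FAILURE;
    }
    printf("hugetlbfs mounted at %s\n", mount);
    return EXIT_SUCCESS;
}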
When there are multiple mount points (to make different page sizes
available), it gets more complicated; libhugetlbfs also provides a number
of functions to help with these mount
points. hugetlbfs_find_path() returns a mount point similar
to the example program above, while hugetlbfs_find_path_for_size()
will return a mount point for a specific huge page size. If the developer
wishes to test a particular path to see if it is hugetlbfs or not,
use hugetlbfs_test_path().
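Once a mount point is known, using it directly takes only the usual file calls. The following is a minimal sketch under stated assumptions rather than a reference implementation: the function name map_hugetlbfs_file() and the file name "example-map" are made up for illustration, and the caller is assumed to pass a length that is a multiple of the huge page size of that mount.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Sketch: create a file under a hugetlbfs mount point, unlink it, and
 * map it shared. "mount" would come from a mount-point search such as
 * the example program above. Returns the mapping, or NULL on failure.
 */
void *map_hugetlbfs_file(const char *mount, size_t length)
{
        char path[4096];
        void *addr;
        int fd, n;

        n = snprintf(path, sizeof(path), "%s/example-map", mount);
        if (n < 0 || (size_t)n >= sizeof(path))
                return NULL;

        fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0)
                return NULL;

        /* Unlink now so nothing is left behind when the mapping goes away */
        unlink(path);

        addr = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);      /* the mapping keeps its own reference to the file */

        return addr == MAP_FAILED ? NULL : addr;
}
```

A caller would release the memory with munmap() using the same length. Passing a directory that is not on hugetlbfs is not detected here; a more careful version would check the path first, for example with hugetlbfs_test_path().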
3 Anonymous mmap()
As of kernel 2.6.32, support is available that allows anonymous
mappings to be created backed by huge pages with mmap() by specifying
the flags MAP_ANONYMOUS|MAP_HUGETLB. These mappings
can be private or shared.
It is somewhat of an oversight that the amount of memory that can be pinned
for anonymous mmap() is limited only by huge page availability.
This potential problem may be addressed in future kernel releases.
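A short sketch of such a mapping follows. The function name map_anon_prefer_huge() is hypothetical, and the fallback to base pages when the huge page request fails is a choice made for this example, not something the kernel does on its own. The length is assumed to be a multiple of the huge page size.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Sketch: map "length" bytes of private anonymous memory, preferring
 * huge pages. If the MAP_HUGETLB request fails (for example, because
 * the huge page pool is empty), retry with base pages.
 * *used_huge is set to 1 for a huge page mapping and 0 otherwise.
 * Returns NULL if both attempts fail.
 */
void *map_anon_prefer_huge(size_t length, int *used_huge)
{
        int flags = MAP_PRIVATE | MAP_ANONYMOUS;
        void *addr;

        addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                    flags | MAP_HUGETLB, -1, 0);
        *used_huge = (addr != MAP_FAILED);

        if (addr == MAP_FAILED)
                addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                            flags, -1, 0);

        return addr == MAP_FAILED ? NULL : addr;
}
```

Replacing MAP_PRIVATE with MAP_SHARED gives the shared variant; the mapping is released with munmap() in either case.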
4 libhugetlbfs Allocation APIs
It is recognised that a number of applications want to simply get a buffer
backed by huge pages. To facilitate this, libhugetlbfs
provides two APIs since release 2.0, get_hugepage_region()
and get_huge_pages() with corresponding free functions called
free_hugepage_region() and free_huge_pages(). These are
all provided with manual pages distributed with the libhugetlbfs
package.
get_huge_pages() is intended for use with the development of
custom allocators and not as a drop-in replacement for malloc().
The size parameter to this API must be a multiple of the huge page size,
which can be discovered with the function gethugepagesize().
If an application wants to allocate a number of very large buffers
but is not concerned with alignment or some wastage, it should use
get_hugepage_region(). The calling convention to this function
is much more relaxed and will optionally fall back to using small pages
if necessary.
It is possible that applications need very tight control
over how the mapping is placed in memory. If this is the case,
libhugetlbfs provides hugetlbfs_unlinked_fd() and
hugetlbfs_unlinked_fd_for_size() to create a file descriptor
representing an unlinked file on a suitable hugetlbfs mount
point. Using the file descriptor, the application can mmap()
with the appropriate parameters for accurate placement.
Converting existing applications and libraries to use the API where applicable
should be straightforward; basic examples of how to do it with
the STREAM memory
bandwidth benchmark suite are available [gorman09a].
5 Automatic Backing of Memory Regions
While applications can be modified to use any of the interfaces, doing so imposes a
significant burden on the application developer. To make life easier, libhugetlbfs can
back a number of memory region types automatically when it is either pre-linked or
pre-loaded. This process is described in the HOWTO documentation
and manual pages that come with libhugetlbfs.
Once loaded, libhugetlbfs's behaviour is determined by
environment variables described in the libhugetlbfs.7
manual page. As manipulating environment variables is time-consuming
and error-prone, a hugectl utility exists that does much of
the configuring automatically and will output what steps it took if the
--dry-run switch is specified.
To determine if huge pages are really being used, /proc can be
examined, but libhugetlbfs will also warn if the verbosity is
set sufficiently high and sufficient numbers of huge pages are not
available. See below for an example of using a simple
program that backs a 32MB segment with huge pages. Note how the first
attempt to use huge pages failed and some configuration was required as no
huge pages were previously configured on this system.
The manual pages are quite comprehensive so this section will only give a
brief introduction as to how different sections of memory can be backed by
huge pages without modification.
$ ./hugetlbfs-shmget-test
shmid: 0x2130007
shmaddr: 0xb5e37000
Starting the writes: ................................
Starting the Check...Done.
$ hugectl --shm ./hugetlbfs-shmget-test
libhugetlbfs: WARNING: While overriding shmget(33554432) to add
SHM_HUGETLB: Cannot allocate memory
libhugetlbfs: WARNING: Using small pages for shmget despite
HUGETLB_SHM
shmid: 0x2128007
shmaddr: 0xb5d57000
Starting the writes: ................................
Starting the Check...Done.
$ hugeadm --pool-pages-min 4M:32M
$ hugectl --shm ./hugetlbfs-shmget-test
shmid: 0x2158007
shmaddr: 0xb5c00000
Starting the writes: ................................
Starting the Check...Done.
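Examining /proc from within a program is a matter of picking a few fields out of /proc/meminfo. The following is a minimal sketch; the function names meminfo_parse() and meminfo_read() are hypothetical, and the 8KB buffer is an assumption that is comfortably larger than the file on typical systems.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sketch: find a "Key:   value" line in /proc/meminfo-style text.
 * Stores the number in *value and returns 0 on success, or returns -1
 * if the key is missing. A unit suffix such as "kB" is ignored.
 */
int meminfo_parse(const char *text, const char *key, unsigned long *value)
{
        size_t keylen = strlen(key);
        const char *line = text;

        while (line && *line) {
                if (strncmp(line, key, keylen) == 0 && line[keylen] == ':')
                        return sscanf(line + keylen + 1, "%lu", value) == 1 ? 0 : -1;
                line = strchr(line, '\n');
                if (line)
                        line++;         /* step past the newline */
        }
        return -1;
}

/* Read /proc/meminfo and look up one field, e.g. "HugePages_Free". */
int meminfo_read(const char *key, unsigned long *value)
{
        char buf[8192];
        size_t n;
        FILE *fp = fopen("/proc/meminfo", "r");

        if (!fp)
                return -1;
        n = fread(buf, 1, sizeof(buf) - 1, fp);
        fclose(fp);
        buf[n] = '\0';

        return meminfo_parse(buf, key, value);
}
```

Comparing HugePages_Free and HugePages_Rsvd before and during a run gives a rough indication of whether the application is really being backed by huge pages.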
5.1 Shared Memory
When libhugetlbfs is preloaded or linked and
the environment variable HUGETLB_SHM is set to
yes, libhugetlbfs will override all calls
to shmget(). Alternatively, launch the application with
hugectl --shm. On setup, all shmget() requests
will become aligned to a hugepage boundary and backed with huge pages if
possible. If the system configuration does not allow huge pages to be used,
the original request is honoured.
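The behaviour described here can be approximated in a few lines for an application that prefers to make the request itself. This is a sketch of the idea only, not the code libhugetlbfs uses; the names align_up() and shmget_prefer_huge() are hypothetical and the huge page size is supplied by the caller.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Round "size" up to a multiple of "align", which must be a power of two. */
size_t align_up(size_t size, size_t align)
{
        return (size + align - 1) & ~(align - 1);
}

/*
 * Sketch: try a huge-page-aligned request with SHM_HUGETLB first and,
 * if that fails, fall back to the caller's original request.
 * Returns a shmid, or -1 with errno set by the final shmget().
 */
int shmget_prefer_huge(key_t key, size_t size, int shmflg, size_t hpage_size)
{
        int id = shmget(key, align_up(size, hpage_size), shmflg | SHM_HUGETLB);

        if (id < 0)
                id = shmget(key, size, shmflg);
        return id;
}
```

A 32MB request with 2MB huge pages is already aligned and is passed through unchanged; a request one byte larger would be rounded up to 34MB before huge pages are tried.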
5.2 Heap
Glibc defines a __morecore hook that is
called when the heap size needs to be increased; libhugetlbfs
uses this hook to create regions of memory backed by huge pages. Similar to
shared memory, base pages are used when huge pages are not available.
When libhugetlbfs is preloaded or linked and the environment
variable HUGETLB_MORECORE is set to yes,
libhugetlbfs will configure the __morecore
hook, causing malloc() requests to use huge pages. Alternatively,
launch the application with hugectl --heap.
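For a launcher written in C, the environment can be prepared before executing the target program. The following is a rough sketch of that idea and is not taken from hugectl; the function names are hypothetical, and the preload name "libhugetlbfs.so" is an assumption about how the library is installed.

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Sketch: set the environment variables that ask libhugetlbfs to back
 * the heap with huge pages. The LD_PRELOAD value is an assumption; it
 * is unnecessary if the program is already linked with libhugetlbfs.
 * Returns 0 on success, -1 on failure.
 */
int setup_hugetlb_heap_env(void)
{
        if (setenv("LD_PRELOAD", "libhugetlbfs.so", 1) != 0)
                return -1;
        return setenv("HUGETLB_MORECORE", "yes", 1);
}

/* Replace the current process with argv[0], run under that environment. */
int run_with_hugetlb_heap(char *const argv[])
{
        if (setup_hugetlb_heap_env() != 0)
                return -1;
        execvp(argv[0], argv);
        return -1;      /* only reached if execvp() failed */
}
```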
Unlike shared memory, the page size can also be specified if more than
one page size is supported by the system. The first example below uses the
default page size (e.g. 16M on Power5+) and the second example explicitly
overrides a default, using 64K pages.
$ hugectl --heap ./target-application
$ hugectl --heap=64k ./target-application
If the application has already been linked with libhugetlbfs,
it may be necessary to specify --no-preload when using
--heap so that an attempt is not made to load the library twice.
By using the __morecore hook, libhugetlbfs prevents glibc from making
use of brk() to expand the heap; setting the mallopt() option
M_MMAP_MAX to zero additionally stops malloc() from satisfying large
requests with mmap() of base pages. An
application that calls brk() directly will be using base pages.
If a custom memory allocator is being used, it must support the
__morecore hook to use huge pages. An alternative may be to
provide a wrapper around malloc() that calls the real underlying
malloc() or get_hugepage_region() depending on the
size of the buffer. A heavy solution would be to provide a fully-fledged
implementation of malloc() with libhugetlbfs that
uses huge pages where appropriate, but this is currently unavailable due to
the lack of a demonstrable use case.
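The wrapper idea can be sketched as follows. To keep the example free of library dependencies, an anonymous MAP_HUGETLB mapping stands in for get_hugepage_region(); that substitution, the names wrapped_malloc() and wrapped_free(), the 2MB huge page size and the cut-over threshold are all assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HPAGE_SIZE      (2UL * 1024 * 1024)     /* assumed huge page size */
#define HUGE_THRESHOLD  HPAGE_SIZE              /* arbitrary cut-over point */

/* Header placed before every buffer so the free path knows its origin. */
union hdr {
        size_t map_len;         /* non-zero: length of the huge page mapping */
        max_align_t align;      /* keeps the user pointer suitably aligned */
};

void *wrapped_malloc(size_t size)
{
        union hdr *h = NULL;

        if (size >= HUGE_THRESHOLD) {
                /* Round up to whole huge pages, including the header */
                size_t len = (size + sizeof(*h) + HPAGE_SIZE - 1) &
                             ~(HPAGE_SIZE - 1);
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                               -1, 0);
                if (p != MAP_FAILED) {
                        h = p;
                        h->map_len = len;
                }
        }
        if (!h) {
                /* Small request, or no huge pages: use the real malloc() */
                h = malloc(size + sizeof(*h));
                if (!h)
                        return NULL;
                h->map_len = 0;
        }
        return h + 1;
}

void wrapped_free(void *ptr)
{
        union hdr *h;

        if (!ptr)
                return;
        h = (union hdr *)ptr - 1;
        if (h->map_len)
                munmap(h, h->map_len);
        else
                free(h);
}
```

The header costs a few bytes per allocation but lets the free path choose between munmap() and free() without a separate lookup table. A real wrapper would also need realloc() and calloc() counterparts and overflow checks on the size arithmetic.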
5.3 Text and Data
Backing text or data is more involved as the application should first
be relinked to align the sections to a huge page boundary. This
is accomplished by linking against libhugetlbfs and
specifying -Wl,--hugetlbfs-align -- assuming the version of
binutils installed is sufficiently recent. More information
on relinking applications can be found in the libhugetlbfs
HOWTO. Once the application is relinked, control is, as before, exercised through
environment variables or with hugectl.
$ hugectl --text --data --bss ./target-application
When backing text or data with huge pages, the relevant sections are copied to files on
the hugetlbfs filesystem and mapped with mmap(). The files
are then unlinked so that the memory is freed on application exit. If the
application is to be invoked multiple times, it is worth sharing that data by
specifying the --share-text switch. The consequence is that the
memory remains in use when the application exits and must be manually deleted.
If it is not possible to relink the application, it is possible to force the
loading of segments backed by huge pages by setting the environment variable
HUGETLB_FORCE_ELFMAP to yes. This is not the
preferred option as the method is not guaranteed to work. Segments must be
large enough to overlap with a huge page and, on architectures with limitations on
where segments can be placed, forcing the mapping can be particularly problematic.
5.4 Stack
Currently, the stack cannot be backed by huge pages. Support was implemented
in the past but the vast majority of applications did not aggressively use
the stack. In many distributions, there are ulimits on the size
of the stack that are smaller than a huge page size. Upon investigation,
only the bwaves test from the SPEC CPU 2006 benchmark benefited from
stacks being backed by huge pages and only then when using a commercial
compiler. When compiled with gcc, there was no benefit, hence
support was dropped.
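The observation about ulimits is easy to check on a given system. The following is a small sketch; the function name stack_limit_below_hugepage() is hypothetical and the huge page size is supplied by the caller.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Sketch: report whether the soft stack limit is smaller than one huge
 * page of "hpage_size" bytes. Returns 1 if it is, 0 if it is not (or
 * the stack is unlimited), and -1 if the limit could not be read.
 */
int stack_limit_below_hugepage(size_t hpage_size)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_STACK, &rl) != 0)
                return -1;
        if (rl.rlim_cur == RLIM_INFINITY)
                return 0;
        return rl.rlim_cur < hpage_size;
}
```

With a soft limit of 8MB, for instance, the check would report a limit below a 16MB huge page but not below a 2MB one.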
6 Summary
There are a small number of interfaces provided by Linux to access huge pages.
While cumbersome to develop applications against, there is a programming API
available with libhugetlbfs and it is possible to automatically
back segments of memory with huge pages without application modification.
The next installment will discuss how the system should be tuned.
Page editor: Jonathan Corbet