Kernel development
Brief items
Kernel release status
The 2.6.33 kernel is out, released on February 24. Linus says:
Other interesting stuff in 2.6.33 includes dynamic tracing, the block I/O bandwidth controller, and the compressed cache mechanism.
See the KernelNewbies 2.6.33 page for more information on this release.
The current stable kernel is 2.6.32.9, released on February 23. There are 93 fixes in this update, many of which are security-related. See below for our detailed look at this release.
Quotes of the week
Open by handle
Most Linux users never deal directly with file handles; indeed, most may not even know they exist. Of the rest, the bulk will have an experience limited to the cheery "stale file handle" message seen by NFS users at horribly inopportune times. In fact, a file handle is just a means by which a file can be uniquely identified within a filesystem. Handles are used in NFS, for example, to represent an open file in a way which allows the server to be almost entirely stateless. Handles are not normally used by, or even available to user-space applications.
Aneesh Kumar is trying to change that situation with a short patch series adding two new system calls:
int name_to_handle(const char *name, struct file_handle *handle);
int open_by_handle(struct file_handle *handle, int flags);
The first takes the given name and looks up the associated file handle, which is returned in the handle structure. That handle can then be passed to open_by_handle() to get an open file descriptor for the file. Only privileged users can call open_by_handle(); otherwise it could be possible for a malicious local user to bypass the normal permission checks on the directories in the path to a specific file.
Why would an application developer want to open a file in two steps instead of just calling open()? It comes down to the ability to write filesystem servers that run in user space. Such a server could use name_to_handle() to generate handles for files on the underlying filesystem; those handles are then passed to the filesystem's clients. At some future time, the client can pass the handle back to actually open the file. This type of feature is also already used with the XFS filesystem for backup and restore operations and with a hierarchical storage management system.
Discussion of these system calls has been minimal, thus far. It does seem that some work will be needed still to better describe what a file handle really is, and, in particular, what its expected lifetime will be. Without some clarity in that area, it will be hard to write applications which can make proper use of file handles.
Reserved network ports
It is not all that uncommon to have a network application which needs to be able to bind to a specific port. Often, such requirements result from a firewall configuration allowing incoming connections only to a specific port, but there can be other reasons as well. When running such an application, it can be unpleasant to discover that somebody else's long-running ssh connection happened to stumble onto the required port. It would be nice to be able to avoid this kind of conflict if at all possible.
This patch set from Octavian Purdila aims to make it possible. It adds a new sysctl knob (called ip_local_reserved_ports) under /proc/sys/net/ipv4. Should the system administrator write a comma-separated list of ports (or ranges of ports denoted by a hyphen) to this parameter, the networking layer will avoid those ports whenever it picks a port number for a new socket. Reserving ports in this manner will not interfere with any application which binds to those ports explicitly.
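The behavior described above can be illustrated with a small user-space sketch. This is not the code from the patch set; the names port_set, parse_reserved_ports() and pick_local_port() are invented for illustration, and the kernel's actual port selection is more involved. The sketch parses a list in the same comma-and-hyphen format into a bitmap, then picks an automatic port while skipping the reserved ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define PORT_COUNT 65536

/* One bit per port; a set bit means "reserved". (Hypothetical type.) */
typedef struct {
    unsigned char bits[PORT_COUNT / 8];
} port_set;

void port_set_add(port_set *s, unsigned port)
{
    s->bits[port / 8] |= (unsigned char)(1u << (port % 8));
}

bool port_set_has(const port_set *s, unsigned port)
{
    return (s->bits[port / 8] >> (port % 8)) & 1u;
}

/*
 * Parse a list such as "8080,9000-9010" into *s.
 * Returns 0 on success, -1 on malformed input or out-of-range ports.
 */
int parse_reserved_ports(const char *spec, port_set *s)
{
    const char *p = spec;

    memset(s, 0, sizeof(*s));
    while (*p != '\0') {
        char *end;
        unsigned long lo = strtoul(p, &end, 10);
        unsigned long hi = lo;

        if (end == p)
            return -1;
        if (*end == '-') {              /* a range: "lo-hi" */
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (lo > hi || hi >= PORT_COUNT)
            return -1;
        for (unsigned long port = lo; port <= hi; port++)
            port_set_add(s, (unsigned)port);
        if (*end == ',')
            end++;
        else if (*end != '\0')
            return -1;
        p = end;
    }
    return 0;
}

/*
 * Pick an automatic port in [low, high], starting the search at `start`
 * and wrapping around, skipping reserved ports.
 * Returns -1 if every port in the range is reserved.
 */
int pick_local_port(const port_set *reserved, unsigned low,
                    unsigned high, unsigned start)
{
    unsigned span = high - low + 1;

    assert(low <= start && start <= high);
    for (unsigned i = 0; i < span; i++) {
        unsigned port = low + (start - low + i) % span;
        if (!port_set_has(reserved, port))
            return (int)port;
    }
    return -1;
}
```

Only the automatic choice consults the bitmap; an explicit bind() to a reserved port would bypass pick_local_port() entirely, which matches the article's note that explicit binds are unaffected.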
This patch has been through a surprising number of revisions; chances seem good that it will show up in the mainline once the 2.6.34 merge window opens.
Kernel development news
A Checkpoint/restart update
It has been exactly one year since LWN last checked up on the checkpoint/restart patch set. This code has just been reposted with a request for inclusion into the -mm tree, so it seems like an opportune time to restart our coverage of it. A lot of progress has been made on this front over the last year, but checkpoint/restart remains a difficult task which can probably never be implemented completely.
"Checkpointing" refers to the act of saving the state of a group of processes to a file, with the intent of restarting those processes at some future time. For many years, checkpointing has been used to save the state of long-running calculations to avoid losing work should the system fail. More recently, it has become a desired part of the virtualization toolkit, enabling the live migration of processes between physical hosts. The checkpoint/restart developers also see other potential advantages, such as the ability to quickly launch a set of processes on demand from a checkpoint image.
This patch set addresses checkpoint/restart in the containers context. In the context of full virtualization, checkpointing is relatively easy; the system just needs to save the entire memory image associated with the virtual machine and a bit of associated data. The "containers" model of virtualization tends to be messier in almost every way, and checkpointing is no exception. There is no memory image to be saved in one big chunk; instead, the kernel must track down every bit of state associated with the checkpointed processes and save it independently. When it works, it can be faster and more efficient than full virtual machine checkpointing; the checkpoint image will be much smaller. But getting it to work is a challenge. The complexity of this task can be seen in the current checkpoint/restart tree, which, despite being far from a complete solution of the problem, is a 27,000-line diff from 2.6.33-rc8.
Checkpointing
To checkpoint a group of processes, the following new system call is used:
int checkpoint(pid_t pid, int fd, unsigned long flags, int logfd);
The pid parameter identifies the top-level process to be checkpointed; all children of that process will also be included in the checkpoint image, which will be written to the file indicated by fd. There is currently only one possible flag value, CHECKPOINT_SUBTREE, which turns off the normal requirement that an entire container be checkpointed as a whole. Checkpointing just a subtree is a bit riskier than checkpointing a full container because it is harder to ensure that all needed resources have been saved. The logfd parameter is a file descriptor open for writing; the kernel will write relevant logging information there. There are vast numbers of possible ways for a checkpoint to fail; the log file is intended to help users figure out what is happening when a checkpoint refuses to work. If logging is not desired, logfd can be -1.
The set of processes to be checkpointed should be frozen prior to the call to checkpoint(). One exception to that rule is a process running in checkpoint() itself; this exception allows processes to checkpoint themselves.
Internally, the checkpointing process is implemented as a two-phase operation:
- The kernel traverses the tree of processes and "collects" every
object which is to be a part of the checkpoint image. Essentially,
"collecting" means building a hash table with an entry for every
process, every open file, every virtual memory area, every open
socket, etc. which must be saved. Scanning the tree in this way helps
the kernel to abort the checkpoint process early if something which
cannot be checkpointed is encountered. Just as importantly, the collecting process
also lets the system track objects which have multiple references
and ensure that they are only written to the image file once.
- The second pass then iterates over the collected objects and causes each to be serialized and written to the image file.
Once this is done, the checkpoint is finished. The just-checkpointed processes can either go on with their business or be killed, depending on the reason for the checkpoint.
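The two phases described above can be sketched in user space as follows. This is a toy model under stated assumptions, not the patch set's code: struct obj_table, struct toy_task, obj_collect(), collect_all() and write_all() are all invented names, and a linear scan stands in for the kernel's hash table to keep the example short. What it tries to show is how collecting first lets shared objects be recorded once and lets an unsupported object abort the operation before anything is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_OBJS 64

/* Hypothetical stand-in for the table of collected objects. */
struct obj_table {
    const void *objs[MAX_OBJS];
    size_t count;
};

/*
 * Record `obj` unless it is already present.
 * Returns the object's index (its "objref"), or -1 if the table is full.
 */
int obj_collect(struct obj_table *t, const void *obj)
{
    assert(t != NULL && obj != NULL);
    for (size_t i = 0; i < t->count; i++)
        if (t->objs[i] == obj)
            return (int)i;          /* shared object: already tracked */
    if (t->count == MAX_OBJS)
        return -1;
    t->objs[t->count] = obj;
    return (int)t->count++;
}

/* A toy process: a few pointers to (possibly shared) open files. */
struct toy_task {
    const char *files[4];
    size_t nfiles;
};

/*
 * Phase 1: walk every task and collect everything it references.
 * A NULL file pointer models "something which cannot be checkpointed",
 * so the whole operation fails before any output is produced.
 */
int collect_all(struct obj_table *t, const struct toy_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (obj_collect(t, &tasks[i]) < 0)
            return -1;
        for (size_t j = 0; j < tasks[i].nfiles; j++) {
            if (tasks[i].files[j] == NULL)
                return -1;
            if (obj_collect(t, tasks[i].files[j]) < 0)
                return -1;
        }
    }
    return 0;
}

/*
 * Phase 2: emit one record per collected object.
 * Returns the number of records written.
 */
size_t write_all(const struct obj_table *t, FILE *image)
{
    size_t written = 0;

    for (size_t i = 0; i < t->count; i++) {
        if (fprintf(image, "objref %zu\n", i) < 0)
            break;
        written++;
    }
    return written;
}
```

With two tasks sharing one file, collect_all() leaves four entries in the table (two tasks, two distinct files) rather than five, so the shared file is written to the image only once.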
These two phases are reflected in the changes made to the lower levels of the system. For example, the none-too-svelte file_operations structure gains two new operations:
int collect(struct ckpt_ctx *ctx, struct file *filp);
int checkpoint(struct ckpt_ctx *ctx, struct file *filp);
The collect() operation should identify every object which must be saved, eventually passing each to ckpt_obj_collect() (or one of several higher-level interfaces) for tracking. Later, a call to checkpoint() is made to request that the given filp be serialized for saving to the checkpoint image. Similar methods have been added to a number of other structure types, including vm_operations_struct and proto_ops.
The serialization process requires copying data from kernel data structures into a series of special structures intended to be written to the image file. So, for example, a file descriptor finds its way from struct fdtable into one of these:
struct ckpt_hdr_file_desc {
    struct ckpt_hdr h;
    __s32 fd_objref;
    __s32 fd_descriptor;
    __u32 fd_close_on_exec;
} __attribute__((aligned(8)));
Doing this copy requires a 75-line function which grabs the requisite information and very carefully checks that everything can be checkpointed successfully. In this case, the presence of locks on the file or an owner (to be notified with SIGIO) will cause the checkpoint to fail. In the absence of such roadblocks, the completed structure is handed to the checkpoint code for saving to the image file.
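A much-reduced sketch of that copy-and-check step might look like the following. The record's field names follow the structure shown above, but everything else is an assumption made for illustration: the layout of struct ckpt_hdr is not given in the article, and struct toy_open_file, TOY_HDR_FILE_DESC and fill_file_desc() are hypothetical. The real function is said to be about 75 lines and works from struct fdtable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout; the article does not show struct ckpt_hdr. */
struct ckpt_hdr {
    uint32_t type;
    uint32_t len;
};

#define TOY_HDR_FILE_DESC 1     /* hypothetical record type */

/* Field names as in the article; types are user-space equivalents. */
struct ckpt_hdr_file_desc {
    struct ckpt_hdr h;
    int32_t  fd_objref;
    int32_t  fd_descriptor;
    uint32_t fd_close_on_exec;
} __attribute__((aligned(8)));

/* Toy stand-in for the kernel state behind one file descriptor. */
struct toy_open_file {
    int objref;             /* index assigned during the collect phase */
    int has_locks;          /* file locks are held on the file */
    int has_sigio_owner;    /* an owner is set for SIGIO delivery */
    int close_on_exec;
};

/*
 * Fill *out from *f for descriptor number `fd`.
 * Returns 0 on success, or -1 if the file is in a state which
 * cannot be checkpointed (locks held or a SIGIO owner set).
 */
int fill_file_desc(const struct toy_open_file *f, int fd,
                   struct ckpt_hdr_file_desc *out)
{
    assert(f != NULL && out != NULL);
    if (f->has_locks || f->has_sigio_owner)
        return -1;

    /* Zero everything, padding included, so no stale bytes are saved. */
    memset(out, 0, sizeof(*out));
    out->h.type = TOY_HDR_FILE_DESC;
    out->h.len = (uint32_t)sizeof(*out);
    out->fd_objref = f->objref;
    out->fd_descriptor = fd;
    out->fd_close_on_exec = f->close_on_exec ? 1 : 0;
    return 0;
}
```

Using fixed-width types and an explicit alignment keeps the record's size independent of the compiler's defaults, which matters for any structure that is written to a file and read back later.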
This serialization process is one of the scarier parts of the whole checkpoint/restart concept. Any changes to struct fdtable will almost certainly break this serialization, quite possibly in ways which will not be detected until some user runs into a problem. Even if a VFS developer cared about checkpointing, they might not think to look at the code in checkpoint/files.c to see if anything might require changing. Similar dependencies are created for every other kernel data structure which must be checkpointed. The whole setup looks like it could be a little fragile; keeping it working is almost certain to require significant ongoing maintenance.
Restarting
On the restart side, the application performing the restart is first expected to create a set of processes to be animated with the checkpointed information. That creation will be done with the much-reviewed "extended clone()" system call, which, in this iteration, looks like:
int eclone(u32 flags_low, struct clone_args *cargs, int cargs_size,
           pid_t *pids);
With eclone(), the processes can be created with specific pids and with an extended set of flags.
Once the process hierarchy exists, the restart() system call can be used:
int restart(pid_t pid, int fd, unsigned long flags, int logfd);
The checkpoint image found at fd will be restored into the process hierarchy starting at pid. Once again, logfd can be used to gain information on how the process went. There are a number of flags defined: RESTART_TASKSELF (a single task is being restarted on top of the process calling restart()), RESTART_FROZEN (causes the restarted processes to be left frozen at the end), RESTART_GHOST (appears to be a debugging feature), RESTART_KEEP_LSM (restore security labels too), and RESTART_CONN_RESET (force the closing of open sockets). On a successful return from restart(), the process hierarchy should be ready to go.
Once again, restart requires support at the lower levels of the kernel. So our long-suffering file_operations structure gains another function:
int restore(struct ckpt_ctx *, struct ckpt_hdr_file *);
This function (along with its analogs elsewhere in the kernel) is charged with reanimating the given object from the checkpoint file.
Security
It is not hard to imagine that these new system calls could have any of a number of security-related consequences, so it is surprising to see that, in the current implementation, both checkpoint() and restart() are unprivileged operations. This decision was made deliberately, with the idea of forcing the developers to think about security issues from the outset.
The biggest potential problem with checkpoint() is probably information disclosure. To avoid this problem, checkpoint() is only able to checkpoint processes which the caller would be able to call ptrace() on. So there should be no way for a hostile user to gain information from a checkpoint image which would not be available anyway.
The restart side is a little more frightening; it allows the caller to load vast amounts of potentially arbitrary data into kernel data structures. This risk is, one hopes, mitigated by causing all operations to be done in the context of the calling process. If the caller cannot open a file directly, that file cannot be opened via a corrupted checkpoint image either. Doing things this way will break certain use cases, such as checkpointing a setuid program which has since dropped its privileges, but there is probably no way to make that case work securely for unprivileged users.
For an added challenge, the checkpoint/restart developers have also implemented the checkpointing of security labels on objects. By default, these labels will not be used during the restart process, but the RESTART_KEEP_LSM flag can change that. Again, the labels are created in the context of the calling process, so the active security module should prevent the attachment of labels which compromise the security of the system.
Even with these measures in place, one still has to wonder about the security of the process as a whole. The kernel is populating a wide array of data structures from input which may be under the control of a hostile user; it is not hard to imagine that, somewhere in tens of thousands of lines of checkpoint/restart code, an important check has not been made. Perhaps as a result of this concern, the patch set adds a sysctl knob which can be set to disallow unprivileged checkpoint/restart operations.
Where things stand
According to the most recent patch posting:
So the patch set appears to be sufficiently functional to be minimally useful. There are, however, a lot of things which can still prevent the creation of a successful checkpoint; they are summarized on this page. Problem areas include private filesystem mounts, network sockets in some states, open-but-unlinked files, use of any of the file event notification interfaces, open files on network or FUSE filesystems, use of netlink, ptrace(), asynchronous I/O, and more. There are patches in the works for some of these problems; others look hard.
As of this writing, there has been no response to the developers' request for inclusion in the -mm kernel. In the past, there have been concerns about how much work would be required to finish the job. Over the last year, much of that work has been done, but checkpoint/restart looks like a job which is never truly finished. It's mostly a matter of whether what has been done so far appears to be good enough for real work, and whether the maintenance cost of this code is deemed to be worth paying.
2.6.32.9 Release notes
Stable kernel update announcements posted on LWN have a certain tendency to be followed by complaints about the amount of information which is made available. It seems that there is a desire for a description of the changes which is more accessible than the patches themselves, and for attention to be drawn to the security-relevant fixes. As an exercise in determining what kind of effort is being asked of the kernel maintainers, your editor decided to make a pass through the proposed 2.6.32.9 update and attempt to describe the impact of each of the changes - all 93 of them. The results can be found below.
Disclaimers: there is no way to review 93 patches in a finite time and
fully understand each of them. So there are almost
certainly errors in what follows. The simple truth of the matter is that
it is very hard to say which fixes have security implications; a determined
attacker can find a way to exploit some very obscure bugs.
Your editor would also like to discourage anybody from thinking that this will become a regular LWN feature. The amount of work required is considerable; it's not something we're able to commit to doing for every release.
That said, here's a look at what's in this update.
Security-related fixes
- #2: futex_lock_pi() key refcnt fix.
It's possible to create dangling futex references, leading to a
user-triggerable BUG_ON() oops. It's thus a denial of
service vulnerability; it has been present since 2.6.28.
- #3: futex: Handle user space corruption
gracefully. Malicious programs can cause a null pointer
dereference or hijack somebody else's futex.
- #4: futex: Handle futex value corruption
gracefully. User-space processes can cause warning floods,
spamming the system logs.
- #5: Fix race in tty_fasync() properly.
Possible (if unlikely) deadlock, and thus denial of service.
- #22: regulator: Fix display of null
constraints for regulators. Fixes an information disclosure issue
in the regulator code.
- #23: ALSA: hda-intel: Avoid divide by zero
crash. Papers over a user-triggerable divide-by-zero crash; the
real cause of the problem remains unknown.
- #26: cciss: Make cciss_seq_show handle
holes in the h->drv[] array. Null pointer dereference in the
cciss driver; probably not triggerable without privilege.
- #35: NFS: Fix an Oops when truncating a
file. User-triggerable oops when truncating a file on an NFS
filesystem.
- #36: NFS: Fix a umount race. Dangling
pointer dereference on NFS filesystem unmount. Maybe triggerable in
situations where users can cause mounts and unmounts to happen.
- #40: V4L/DVB: dvb-core: fix initialization
of feeds list in demux filter. User-triggerable dereference of a
corrupted pointer, with an oops being the best-case scenario.
- #47: netfilter: nf_conntrack: per netns
nf_conntrack_cachep. Fixes a potential leak of information
between network namespaces. Probably very hard to exploit in any
useful way.
- #50: netfilter: nf_conntrack: fix hash
resizing with namespaces. Changing the conntrack hash size in one
namespace causes that size to be incorrect for all others, leading to
unsightly kernel oops issues.
- #54: [S390] dasd: remove strings from
s390dbf. Stale pointer dereference bugs in the S390 DASD driver.
- #57: dell-wmi, hp-wmi, msi-wmi: check
wmi_get_event_data() return value. Fix a potential null pointer
dereference on memory allocation failure.
- #75: ALSA: usb-audio - Avoid Oops after
disconnect. Fixes a user-triggerable oops in the USB audio
driver.
- #79: USB: usbfs: only copy the actual data
received. Usbfs was copying more data than actually existed in
some situations, leading to a potential information disclosure problem.
- #82: ACPI: Add NULL pointer check in
acpi_bus_start. A null pointer dereference in the ACPI code.
- #92: dm log: userspace fix overhead_size calculations. A couple of structure-size miscalculations make both buffer overruns and information disclosure possible, though it's not at all clear that either is readily exploitable.
Other bug fixes
- #1: Fix potential crash with
sys_move_pages. Fix an unreliable test which could cause a crash
in the page migration code. [Update: as has been pointed out
in the comments, this one is exploitable
and should have been in the
security list above.]
- #6: hwmon: (w83781d) Request I/O ports
individually for probing. More robust access to hardware
monitoring ports.
- #7: hwmon: (lm78) Request I/O ports
individually for probing. More robust access to hardware
monitoring ports.
- #8: hwmon: (adt7462) Wrong
ADT7462_VOLT_COUNT. Fixes a bug which could cause one voltage
measurement to be passed over.
- #9: ALSA: ctxfi - fix PTP address
initialization. Fixes an alignment bug in the ctxfi sound driver.
- #10: drm/i915: disable hotplug detect
before Ironlake CRT detect. Fixes a possible hang in the monitor
detection code.
- #12: drm/i915: Disable SR when more than
one pipe is enabled. Fixes a flicker-causing i915 bug.
- #13: drm/i915: Fix DDC on some systems by
clearing BIOS GMBUS setup. Fixes a bug which can cause failure to
detect some monitors.
- #15: drm/i915: Fix the incorrect DMI
string for Samsung SX20S laptop. Incorrect identification
information was returned to user space.
- #17: usb: r8a66597-hcd: Flush the D-cache
for the pipe-in transfer buffers. Fixes a cache consistency
problem.
- #18: i2c-tiny-usb: Fix on big-endian
systems. An endianness bug in i2c-tiny-usb caused incorrect
information to be returned to user space.
- #19: drm/i915: handle FBC and self-refresh
better. Eliminates an i915 flicker problem.
- #20: drm/i915: Increase fb alignment to
64k. Fixes an obscure error in the i915 driver.
- #24: CPUFREQ: Fix use after free of struct
powernow_k8_data. Fixes a use-after-free bug in the cpufreq code;
does not appear to be user-triggerable.
- #25: freeze_bdev: dont deactivate
successfully frozen MS_RDONLY sb. Fixes a boot-time crash in the block
layer.
- #27: ioat: fix infinite timeout checking
in ioat2_quiesce. Fixes a typo in the IOAT code.
- #29: fs/exec.c: restrict initial stack
space expansion to rlimit. Fixes a bug which could cause process
creation failures in the presence of tight stack limits.
- #30: cifs: fix length calculation for
converted unicode readdir names. Fixes a CIFS data consistency
bug.
- #31: NFS: Fix a reference leak in
nfs_wb_cancel_page(). Fixes a reference leak in the NFS
cancellation code.
- #32: NFS: Try to commit unstable writes in
nfs_release_page(). Looks like a fix for a potential data loss
problem in the NFS code.
- #33: NFSv4: Dont allow posix locking
against servers that dont support it. Be sure to notice if a
server does not support POSIX locking.
- #34: NFSv4: Ensure that the NFSv4 locking
can recover from stateid errors. Fix an NFSv4 locking problem
with unknown effects.
- #37: NFS: Fix a bug in
nfs_fscache_release_page(). Removes a spurious BUG_ON()
call.
- #38: NFS: Fix the mapping of the
NFSERR_SERVERFAULT error. Fix an incorrect error value returned
to user space.
- #39: md: fix degraded calculation when
starting a reshape. Some old code can cause the MD subsystem to
be unclear on whether a given array is running in a degraded mode or
not after a reshape.
- #42: kvmclock: count total_sleep_time when
updating guest clock. Fix an error which could lead to incorrect
wall clock time in KVM guests.
- #43: KVM: PIT: control word is
write-only. Prevent attempts to read a write-only register.
- #44: tpm_infineon: fix suspend/resume
handler for pnp_driver. Fixes a hang-on-suspend bug.
- #45: amd64_edac: Do not falsely trigger
kerneloops. Remove a spurious warning in the amd64 EDAC code.
- #46: netfilter: nf_conntrack: fix memory
corruption with multiple namespaces. Fixes a potential race
condition which could lead to memory corruption. Requires the
instantiation of a new namespace (and, thus, root privilege) to
trigger.
- #48: netfilter: nf_conntrack: restrict
runtime expect hashsize modifications. Don't allow the connection
tracking expect_hashsize attribute to be modified, since the
code isn't prepared to handle that.
- #49: netfilter: xtables: compat out of
scope fix. Fixes a potential stack-based dangling pointer bug.
- #51: drm/i915: remove full registers dump
debug. Removes an i915 debug option which could hang the machine.
- #52: drm/i915: add i915_lp_ring_sync
helper. Code and performance improvement in the i915 driver.
- #53: drm/i915: Dont wait interruptible for
possible plane buffer flush. The i915 DRM driver can corrupt the
hardware state if a signal comes in at the wrong time. Could be seen
as a denial of service problem, but that's a big stretch.
- #56: wmi: Free the allocated acpi objects
through wmi_get_event_data. Fixes a memory leak in the WMI code.
- #58: /dev/mem: introduce
size_inside_page(). Eliminates some duplicate code and fixes the
alignment logic for /dev/kmem, which was described simply as
"buggy." But who uses /dev/kmem anymore?
- #59: devmem: check vmalloc address on kmem
read/write. A missing test for addresses in the
vmalloc() space could cause an oops from the
/dev/kmem code. Probably not triggerable by ordinary users,
though, even on systems where /dev/kmem is enabled.
- #60: devmem: fix kmem write bug on memory
holes. An attempt to write data to /dev/mem would get
confused if a memory hole is hit, causing incorrect data to be written
after the hole.
- #61: SCSI: mptfusion : mptscsih_abort
return value should be SUCCESS instead of value 0. The mptfusion
driver had an incorrect return value with unknown effects.
- #62: sh: Couple kernel and user write
page perm bits for CONFIG_X2TLB. The SuperH architecture had a
problem handling write faults for pages in the vmalloc()
space, which could cause problems with drivers that map such pages
into user space.
- #63: ALSA: hda - use WARN_ON_ONCE() for
zero-division detection. Avoid spamming the log files if the
hardware goes nuts.
- #64: dst: call cond_resched() in
dst_gc_task(). The network destination cache code can process
very long lists, leading to soft lockup warnings.
- #66: befs: fix leak. There is a
memory leak in the BeFS mount code; one would not normally expect it
to be user-triggerable.
- #67: rtc-fm3130: add missing braces.
Missing braces in the rtc-fm3130 would cause spurious warnings to be
emitted.
- #68: [libata] Call flush_dcache_page after
PIO data transfers in libata-sff.c. Fix a cache coherency bug in
the ATA code.
- #70: pktgen: Fix freezing problem.
The packet generator could prevent the system from suspending or
hibernating.
- #71: x86/amd-iommu: Fix IOMMU-API
initialization for iommu=pt. Fix a boot-time initialization error
in the IOMMU code.
- #72: x86/amd-iommu: Fix deassignment of a
device from the pt_domain. Fix a KVM device assignment failure.
- #73: x86: Re-get cfg_new in case
reuse/move irq_desc. Fix a bug in interrupt migration with
unknown effect.
- #74: Staging: fix rtl8187se compilation
errors with mac80211. Boring compilation problem fix.
- #76: serial: 8250: add serial transmitter
fully empty test. Fixes a serial driver problem which could cause
the loss of some transmitted data.
- #77: sysfs: sysfs_sd_setattr set iattrs
unconditionally. An omitted initialization can cause sysfs
attributes to have more restrictive permissions than desired.
- #78: class: Free the class private data in
class_release. Fix a memory leak in the sysfs class code.
Potentially user-exploitable if somebody were willing to dedicate a
month of their life to repeatedly plugging and unplugging a device.
- #80: USB: usbfs: properly clean up the as
structure on error paths. Fixes a memory leak in the usbfs error
recovery paths.
- #83: ACPI: fix High cpu temperature with
2.6.32. Fixes behavior on a couple of laptops with problematic
power management operation.
- #84: drm/radeon/kms: use udelay for short
delays. Use of schedule_timeout() for short delays was
slowing bootstrap considerably on some systems.
- #85: NFS: Too many GETATTR and ACCESS
calls after direct I/O. Fixes a performance regression in the NFS
code.
- #86: eCryptfs: Add getattr function.
The eCryptfs filesystem would show incorrect file sizes.
- #87: b43: Fix throughput regression.
Throughput on some BCM4311 devices is said to have dropped from 18Mb/s
to 0.7Mb/s, which is a bit more of a penalty than some users wanted to
pay.
- #88: ath9k: Fix sequence numbers for PAE
frames. Fixes a protocol error in the ath9k driver.
- #89: mac80211: Fix probe request filtering
in IBSS mode. The wireless code could reply to probe requests
directed at a different SSID.
- #90: iwlwifi: Fix to set correct ht
configuration. The iwlwifi driver was not configuring
associations correctly, leading to dropped connections.
- #91: dm stripe: avoid divide by zero with
invalid stripe count. Giving a bad stripe size to the device
mapper code would cause a division by zero.
- #93: dm mpath: fix stall when requeueing io. Fixes a root-triggerable stall in the device mapper multipath code.
Enhancements
- #11: drm/i915: enable self-refresh on
965. Hardware feature enablement.
- #14: drm/i915: Add HP nx9020/SamsungSX20S
to ACPI LID quirk list. Adds a quirk entry for buggy hardware.
- #16: drm/i915: Add MALATA PC-81005 to ACPI
LID quirk list. Adds a quirk entry for more buggy hardware.
- #21: drm/i915: Update write_domains on
active list after flush. Performance improvement in the i915
driver.
- #28: resource: add helpers for fetching
rlimits. Adds helper functions to ensure that resource limit
values are not fetched multiple times.
- #41: Export the symbol of getboottime and
mmonotonic_to_bootbased. Adds a couple of symbol exports.
- #55: crypto: padlock-sha - Add
import/export support. Improve interoperation with some HMAC
code.
- #65: ALSA: hda - Improved MacBook (Pro)
5,1 / 5,2 support. Improves sound behavior on those systems.
- #69: ahci: add Acer G725 to broken suspend
list. Note that Acer G725 laptops with old firmware have buggy
suspend behavior.
- #81: rtl8187: Add new device ID. Recognize another device ID.
Conclusions
Out of 93 patches, 18 struck your editor as having clear security implications. Quite a few other patches fix crashes which could possibly be security problems; if they are not listed as such, it's because there was no immediately evident way that a user could trigger the problem. Doubtless people with more imagination will figure out ways to take advantage of some of these bugs.
What it comes down to is that the identification of security problems is often hard. In the kernel environment, almost any bug could potentially create some kind of vulnerability. So it is not surprising to see developers "silently fix" security bugs; they simply fix bugs without realizing the implications. It is also not surprising that some developers are reluctant to call attention to security-related fixes. The list above almost certainly includes "security fixes" for bugs that nobody can exploit while classifying true vulnerabilities as mere bug fixes. Any list of security-relevant patches is sure to be an incomplete and partially deceptive thing.
That said, it may well be that fixes which are truly known to have security implications should be marked as such. Attackers will make the effort to figure that out anyway; it's not clear that making life harder for everybody else has any benefits. Still, those who would complain about how the stable tree is managed would do well to remember that, a few years ago, we had no such tree. It came into being because people stepped up to do the work of maintaining it. There can be no doubt that a better job could be done here (as is the case almost everywhere else too); it's just a matter of somebody finding the time and the energy to do it.
Huge pages part 2: Interfaces
In an ideal world, the operating system would automatically use huge pages where appropriate, but there are a few problems. First, the operating system must decide when it is appropriate to promote base pages to huge pages, which requires the maintenance of metadata that itself has a cost which may or may not be offset by the use of huge pages. Second, there can be architectural limitations that prevent a different page size being used within an address range once one page has been inserted. Finally, differences in TLB structure make it difficult to predict how many huge pages can be used and still be of benefit.
For these reasons, with one notable exception, operating systems provide a more explicit interface for huge pages to user space. It is up to application developers and system administrators to decide how they can best be used. This chapter will cover the interfaces that exist for Linux.
1 Shared Memory
One of the oldest interfaces backs shared memory segments created by shmget() with huge pages. Today, it is commonly used due to its simplicity and the length of time it has been supported. Huge pages are requested by specifying the SHM_HUGETLB flag and ensuring the size is huge-page-aligned. Examples of how to do this are included in the kernel source tree under Documentation/vm/hugetlbpage.txt.
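As a rough illustration of the steps just described (this is not the example from the kernel's documentation), the sketch below reads the default huge page size from /proc/meminfo, rounds the requested length up to a multiple of it, and passes SHM_HUGETLB to shmget(). The helper names round_up_huge(), default_huge_page_size() and shm_create_huge() are invented here, and the fallback definition of SHM_HUGETLB is only for headers that lack it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000   /* Linux value, for headers that lack it */
#endif

/* Round `len` up to a multiple of `huge` (the huge page size in bytes). */
size_t round_up_huge(size_t len, size_t huge)
{
    assert(huge != 0);
    return (len + huge - 1) / huge * huge;
}

/*
 * Read the default huge page size, in bytes, from the Hugepagesize
 * field of /proc/meminfo. Returns 0 if the field cannot be found.
 */
size_t default_huge_page_size(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    unsigned long kb = 0;

    if (f == NULL)
        return 0;
    while (fgets(line, sizeof(line), f) != NULL)
        if (sscanf(line, "Hugepagesize: %lu kB", &kb) == 1)
            break;
    fclose(f);
    return (size_t)kb * 1024;
}

/*
 * Create a private shared memory segment of at least `len` bytes,
 * backed by huge pages. Returns the segment ID, or -1 on failure
 * (for example, when no huge pages are available in the pool).
 */
int shm_create_huge(size_t len)
{
    size_t huge = default_huge_page_size();

    if (huge == 0)
        return -1;
    return shmget(IPC_PRIVATE, round_up_huge(len, huge),
                  SHM_HUGETLB | IPC_CREAT | 0600);
}
```

A caller would attach the returned segment with shmat() and eventually remove it with shmctl() and IPC_RMID, exactly as for an ordinary shared memory segment; only the flag and the size alignment differ.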
A limitation of this interface is that only the default huge page size (as indicated by the Hugepagesize field in /proc/meminfo) will be used. If one wanted to use 16GB pages, as supported on later versions of POWER for example, the default_hugepagesz= parameter must be given on the kernel command line as documented in Documentation/kernel-parameters.txt in the kernel source.
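As a rough illustration of reading that field, the sketch below scans a meminfo-style stream for a named key. The helper name meminfo_field() is invented for this example, and it assumes the usual "Key: value" layout of /proc/meminfo; it is not taken from libhugetlbfs.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the number that follows "key:" in a
 * meminfo-style stream, or -1 if the key is not present.
 */
long meminfo_field(FILE *f, const char *key)
{
    char line[256];
    size_t klen = strlen(key);

    while (fgets(line, sizeof(line), f)) {
        long val;

        /* Match "key:" at the start of the line, then parse the number. */
        if (!strncmp(line, key, klen) && line[klen] == ':' &&
            sscanf(line + klen + 1, "%ld", &val) == 1)
            return val;
    }
    return -1;
}
```

Given a stream opened on /proc/meminfo and the key "Hugepagesize", this should yield the default huge page size in kB. The same helper can read the HugePages_Total and HugePages_Free counters.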
The maximum amount of memory that can be committed to shared-memory huge pages is controlled by the shmmax sysctl parameter. This parameter will be discussed further in the next installment.
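The shmget() usage described in this section might look roughly like the following. This is a minimal sketch, not the example from the kernel documentation: the helper names are made up, the huge page size is passed in by the caller, and the fallback definition of SHM_HUGETLB is only there for headers that do not expose it.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000   /* Linux value, for headers that hide it */
#endif

/* Round len up to the next multiple of the huge page size. */
size_t hugepage_round_up(size_t len, size_t hpage_size)
{
    return (len + hpage_size - 1) / hpage_size * hpage_size;
}

/*
 * Hypothetical helper: create a private segment backed by huge pages.
 * Returns the segment id, or -1 with errno set if the request fails,
 * for example when the huge page pool is empty.
 */
int shm_create_huge(size_t len, size_t hpage_size)
{
    return shmget(IPC_PRIVATE, hugepage_round_up(len, hpage_size),
                  SHM_HUGETLB | IPC_CREAT | 0600);
}
```

The returned segment would then be attached with shmat() and removed with shmctl() and IPC_RMID in the usual way.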
2 HugeTLBFS
For the creation of shared or private mappings, Linux provides a RAM-based filesystem called "hugetlbfs." Every file on this filesystem is backed by huge pages and is accessed with mmap() or read(). If no options are specified at mount time, the default huge page size is used to back the files. To use a different page size, specify pagesize=.
$ mount -t hugetlbfs none /mnt/hugetlbfs -o pagesize=64K
There are two ways to control the amount of memory which can be consumed by huge pages attached to a mount point. The size= mount parameter specifies (in bytes; the "K," "M," and "G" suffixes are understood) the maximum amount of memory which will be used by this mount. The size is rounded down to the nearest huge page size. It can also be specified as a percentage of the static huge page pool, though this option appears to be rarely used. The nr_inodes= parameter limits the number of files that can exist on the mount point which, in effect, limits the number of possible mappings. In combination, these options can be used to divvy up the available huge pages to groups or users in a shared system.
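The rounding applied to size= can be worked through with a small sketch. The function below is purely illustrative and is not the kernel's option parser; its name is made up, and it only handles the byte form with the "K," "M," and "G" suffixes, not the percentage form.

```c
#include <stdlib.h>

/*
 * Illustrative only: turn a size string such as "5M" into bytes,
 * understanding the K, M and G suffixes, then round it down to a
 * whole number of huge pages. Returns 0 on an unrecognised suffix.
 */
unsigned long long mount_size_limit(const char *arg,
                                    unsigned long long hpage_size)
{
    char *end;
    unsigned long long bytes = strtoull(arg, &end, 10);

    switch (*end) {
    case 'G': case 'g':
        bytes <<= 10;
        /* fall through */
    case 'M': case 'm':
        bytes <<= 10;
        /* fall through */
    case 'K': case 'k':
        bytes <<= 10;
        break;
    case '\0':
        break;
    default:
        return 0;
    }
    /* Drop any partial huge page at the end. */
    return bytes - bytes % hpage_size;
}
```

With 2MB huge pages, for example, a request of size=5M would leave room for two huge pages, or 4MB.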
Hugetlbfs is a bare interface to the huge page capabilities of the underlying hardware; taking advantage of it requires application awareness or library support. Libhugetlbfs makes heavy use of this interface when automatically backing regions with huge pages.
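For an application that talks to hugetlbfs directly, the core of the work is an open() followed by an mmap(). The sketch below shows one way this might be done; the helper name is hypothetical and the caller is assumed to supply a path on a hugetlbfs mount and a length that is a multiple of the huge page size.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map len bytes of a file that lives on a hugetlbfs
 * mount. Returns NULL on failure, for example when the file cannot be
 * created or the mapping is refused.
 */
void *map_hugetlbfs_file(const char *path, size_t len)
{
    void *addr;
    int fd = open(path, O_CREAT | O_RDWR, 0600);

    if (fd < 0)
        return NULL;
    addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return addr == MAP_FAILED ? NULL : addr;
}
```

The mapping is released with munmap(), and the file itself remains on the mount, holding its huge pages, until it is unlinked.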
For an application wishing to use the interface, the initial step is to discover the mount point by either reading /proc/mounts or using libhugetlbfs. Finding the mount point manually is relatively straightforward and already well known for debugfs but, for completeness, a very simple example program is shown below:
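#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/param.h>

/*
 * Scan /proc/mounts for the first hugetlbfs mount. The mount point is
 * copied into fsmount, which must hold at least len+1 bytes.
 */
char *find_hugetlbfs(char *fsmount, int len)
{
    char format[256];
    char fstype[256];
    char *ret = NULL;
    FILE *fd;

    /* Skip the device, read the mount point and type, skip the rest. */
    snprintf(format, sizeof(format), "%%*s %%%ds %%255s %%*s %%*d %%*d", len);
    fd = fopen("/proc/mounts", "r");
    if (!fd) {
        perror("fopen");
        return NULL;
    }
    while (fscanf(fd, format, fsmount, fstype) == 2) {
        if (!strcmp(fstype, "hugetlbfs")) {
            ret = fsmount;
            break;
        }
    }
    fclose(fd);
    return ret;
}

int main(void)
{
    char buffer[PATH_MAX+1];
    char *mount = find_hugetlbfs(buffer, PATH_MAX);

    /* Passing NULL to %s is undefined, so report the failure instead. */
    if (!mount) {
        fprintf(stderr, "hugetlbfs is not mounted\n");
        return 1;
    }
    printf("hugetlbfs mounted at %s\n", mount);
    return 0;
}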
When there are multiple mount points (to make different page sizes available), it gets more complicated; libhugetlbfs also provides a number of functions to help with these mount points. hugetlbfs_find_path() returns a mount point similar to the example program above, while hugetlbfs_find_path_for_size() will return a mount point for a specific huge page size. If the developer wishes to test a particular path to see if it is hugetlbfs or not, use hugetlbfs_test_path().
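For those who would rather not depend on the library, a path can also be checked by hand with statfs(). The sketch below is one possible approach and is not claimed to be how libhugetlbfs implements its helpers; the function name and the local macro are made up for this example.

```c
#include <sys/vfs.h>

/* Same value as HUGETLBFS_MAGIC in <linux/magic.h>. */
#define HUGETLBFS_MAGIC_VALUE 0x958458f6U

/*
 * Returns the huge page size in bytes if path is on hugetlbfs, 0 if it
 * is on some other filesystem, or -1 if statfs() fails.
 */
long hugetlbfs_page_size_of(const char *path)
{
    struct statfs sb;

    if (statfs(path, &sb) != 0)
        return -1;
    if ((unsigned int)sb.f_type != HUGETLBFS_MAGIC_VALUE)
        return 0;
    return (long)sb.f_bsize;    /* hugetlbfs reports its page size here */
}
```

Running this over each hugetlbfs entry in /proc/mounts would be enough to pick out the mount point for a particular page size.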
3 Anonymous mmap()
As of kernel 2.6.32, support is available that allows anonymous mappings to be created backed by huge pages with mmap() by specifying the flags MAP_ANONYMOUS|MAP_HUGETLB. These mappings can be private or shared. It is somewhat of an oversight that the amount of memory that can be pinned for anonymous mmap() is limited only by huge page availability. This potential problem may be addressed in future kernel releases.
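A minimal sketch of such a mapping follows. The helper name is hypothetical, a private mapping is chosen arbitrarily, and the caller is expected to pass a length that is a multiple of the huge page size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* for MAP_ANONYMOUS and MAP_HUGETLB */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: create a private anonymous mapping backed by
 * huge pages. Returns NULL if the kernel refuses the request, for
 * example because no huge pages are available.
 */
void *map_anon_huge(size_t len)
{
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    return addr == MAP_FAILED ? NULL : addr;
}
```

A caller that can tolerate base pages might retry without MAP_HUGETLB when NULL comes back; the mapping is released with munmap() using the same length.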
4 libhugetlbfs Allocation APIs
It is recognised that a number of applications want to simply get a buffer backed by huge pages. To facilitate this, libhugetlbfs provides two APIs since release 2.0, get_hugepage_region() and get_huge_pages() with corresponding free functions called free_hugepage_region() and free_huge_pages(). These are all provided with manual pages distributed with the libhugetlbfs package.
get_huge_pages() is intended for use with the development of custom allocators and not as a drop-in replacement for malloc(). It is required that the size parameter to this API be hugepage-aligned which can be discovered with the function gethugepagesize().
If an application wants to allocate a number of very large buffers but is not concerned with alignment or some wastage, it should use get_hugepage_region(). The calling convention for this function is much more relaxed and it will optionally fall back to using small pages if necessary.
It is possible that applications need very tight control over how the mapping is placed in memory. If this is the case, libhugetlbfs provides hugetlbfs_unlinked_fd() and hugetlbfs_unlinked_fd_for_size() to create a file descriptor representing an unlinked file on a suitable hugetlbfs mount point. Using the file-descriptor, the application can mmap() with the appropriate parameters for accurate placement.
Converting existing applications and libraries to use the API where applicable should be straightforward; basic examples of how to do it with the STREAM memory bandwidth benchmark suite are available [gorman09a].
5 Automatic Backing of Memory Regions
While applications can be modified to use any of the interfaces, doing so imposes a significant burden on the application developer. To make life easier, libhugetlbfs can back a number of memory region types automatically when it is either pre-linked or pre-loaded. This process is described in the HOWTO documentation and manual pages that come with libhugetlbfs.
Once loaded, libhugetlbfs's behaviour is determined by environment variables described in the libhugetlbfs.7 manual page. As manipulating environment variables is time-consuming and error-prone, a hugectl utility exists that does much of the configuring automatically and will output what steps it took if the --dry-run switch is specified.
To determine if huge pages are really being used, /proc can be examined, but libhugetlbfs will also warn if the verbosity is set sufficiently high and sufficient numbers of huge pages are not available. See below for an example of using a simple program that backs a 32MB segment with huge pages. Note how the first attempt to use huge pages failed and some configuration was required as no huge pages were previously configured on this system.
The manual pages are quite comprehensive so this section will only give a brief introduction as to how different sections of memory can be backed by huge pages without modification.
$ ./hugetlbfs-shmget-test
shmid: 0x2130007
shmaddr: 0xb5e37000
Starting the writes: ................................
Starting the Check...Done.
$ hugectl --shm ./hugetlbfs-shmget-test
libhugetlbfs: WARNING: While overriding shmget(33554432) to add
SHM_HUGETLB: Cannot allocate memory
libhugetlbfs: WARNING: Using small pages for shmget despite
HUGETLB_SHM
shmid: 0x2128007
shmaddr: 0xb5d57000
Starting the writes: ................................
Starting the Check...Done.
$ hugeadm --pool-pages-min 4M:32M
$ hugectl --shm ./hugetlbfs-shmget-test
shmid: 0x2158007
shmaddr: 0xb5c00000
Starting the writes: ................................
Starting the Check...Done.
5.1 Shared Memory
When libhugetlbfs is preloaded or linked and the environment variable HUGETLB_SHM is set to yes, libhugetlbfs will override all calls to shmget(). Alternatively, launch the application with hugectl --shm. On setup, all shmget() requests will become aligned to a hugepage boundary and backed with huge pages if possible. If the system configuration does not allow huge pages to be used, the original request is honoured.
5.2 Heap
Glibc defines a __morecore hook that is called when the heap size needs to be increased; libhugetlbfs uses this hook to create regions of memory backed by huge pages. Similar to shared memory, base pages are used when huge pages are not available.
When libhugetlbfs is preloaded or linked and the environment variable HUGETLB_MORECORE is set to yes, libhugetlbfs will configure the __morecore hook, causing malloc() requests to use huge pages. Alternatively, launch the application with hugectl --heap.
Unlike shared memory, the page size can also be specified if more than one page size is supported by the system. The first example below uses the default page size (e.g. 16M on Power5+) and the second example explicitly overrides a default, using 64K pages.
$ hugectl --heap ./target-application
$ hugectl --heap=64k ./target-application
If the application has already been linked with libhugetlbfs, it may be necessary to specify --no-preload when using --heap so that an attempt is not made to load the library twice.
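The environment-variable route for the heap can also be taken from a small launcher program. The sketch below is an assumption-laden illustration, not a description of what hugectl does internally: the function names are invented, the library is assumed to be installed under the name libhugetlbfs.so, and any existing LD_PRELOAD value is simply overwritten.

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Set the variables for a preloaded library with a huge-page heap.
 * Returns 0 on success, -1 on error. The library name is an assumption.
 */
int setup_hugepage_heap_env(void)
{
    if (setenv("LD_PRELOAD", "libhugetlbfs.so", 1) != 0)
        return -1;
    return setenv("HUGETLB_MORECORE", "yes", 1);
}

/* Replace the current process with the target program. */
int exec_with_hugepage_heap(char *const argv[])
{
    if (setup_hugepage_heap_env() != 0)
        return -1;
    execvp(argv[0], argv);
    return -1;  /* only reached if execvp() failed */
}
```

The variables only take effect in the program started by execvp(); the launcher itself is unaffected.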
By using the __morecore hook and setting the mallopt() option M_MMAP_MAX to zero, libhugetlbfs prevents glibc from using brk() to expand the heap or mmap() to satisfy large requests. An application that calls brk() directly will be using base pages.
If a custom memory allocator is being used, it must support the __morecore hook to use huge pages. An alternative may be to provide a wrapper around malloc() that calls the real underlying malloc() or get_hugepage_region() depending on the size of the buffer. A heavy solution would be to provide a fully-fledged implementation of malloc() with libhugetlbfs that uses huge pages where appropriate, but this is currently unavailable due to the lack of a demonstrable use case.
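One possible shape for such a wrapper is sketched below. Everything here is hypothetical: the type and function names are invented, and the large-buffer allocator is passed in as a pair of function pointers so that the sketch does not depend on libhugetlbfs. In practice big_alloc and big_free would be small adapters around get_hugepage_region() and free_hugepage_region(), whose exact arguments are described in the manual pages.

```c
#include <stddef.h>
#include <stdlib.h>

typedef void *(*region_alloc_fn)(size_t len);
typedef void (*region_free_fn)(void *ptr);

/* Routing policy: requests of at least threshold bytes use big_alloc. */
struct size_router {
    size_t threshold;
    region_alloc_fn big_alloc;
    region_free_fn big_free;
};

/* Header stored before each buffer so the matching free routine is used. */
union route_header {
    int from_region;
    max_align_t align;  /* keeps the user pointer suitably aligned */
};

void *routed_alloc(const struct size_router *r, size_t len)
{
    int big = len >= r->threshold;
    size_t total = len + sizeof(union route_header);
    union route_header *h;

    if (total < len)
        return NULL;    /* size overflow */
    h = big ? r->big_alloc(total) : malloc(total);
    if (!h)
        return NULL;
    h->from_region = big;
    return h + 1;
}

void routed_free(const struct size_router *r, void *ptr)
{
    union route_header *h;

    if (!ptr)
        return;
    h = (union route_header *)ptr - 1;
    if (h->from_region)
        r->big_free(h);
    else
        free(h);
}
```

The header costs a few bytes per allocation and shifts the buffer off the start of its huge page; a wrapper that cares about that could instead track which address ranges came from the huge page allocator.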
5.3 Text and Data
Backing text or data is more involved as the application should first be relinked to align the sections to a huge page boundary. This is accomplished by linking against libhugetlbfs and specifying -Wl,--hugetlbfs-align, assuming the version of binutils installed is sufficiently recent. More information on relinking applications can be found in the libhugetlbfs HOWTO. Once the application is relinked, control is, as before, through environment variables or hugectl.
$ hugectl --text --data --bss ./target-application
When backing text or data with huge pages, the relevant sections are copied to files on the hugetlbfs filesystem and mapped with mmap(). The files are then unlinked so that the memory is freed on application exit. If the application is to be invoked multiple times, it is worth sharing that data by specifying the --share-text switch. The consequence is that the memory remains in use when the application exits and the files must be manually deleted.
If it is not possible to relink the application, it is possible to force the loading of segments backed by huge pages by setting the environment variable HUGETLB_FORCE_ELFMAP to yes. This is not the preferred option as the method is not guaranteed to work. Segments must be large enough to overlap with a huge page and on architectures with limitations on where segments can be placed, it can be particularly problematic.
5.4 Stack
Currently, the stack cannot be backed by huge pages. Support was implemented in the past but the vast majority of applications did not aggressively use the stack. In many distributions, there are ulimits on the size of the stack that are smaller than a huge page size. Upon investigation, only the bwaves test from the SPEC CPU 2006 benchmark benefited from stacks being backed by huge pages and only then when using a commercial compiler. When compiled with gcc, there was no benefit, hence support was dropped.
6 Summary
There are a small number of interfaces provided by Linux to access huge pages. While they are cumbersome to develop applications against, there is a programming API available with libhugetlbfs and it is possible to automatically back segments of memory with huge pages without application modification. The next installment will discuss how the system should be tuned.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
