Kernel development

Brief items

Kernel release status

The 3.5 merge window remains open as of this writing, so there is no current development kernel. The flow of patches into the mainline continues; see the separate article below for a summary of what has been merged thus far.

Stable updates: there have been no stable updates released in the last week. The 3.0.33, 3.2.19, 3.3.8, and 3.4.1 updates are all in the review process as of this writing; they could all be released at any time.

Quotes of the week

Lumpy reclaim had a purpose but in the mind of some, it was to kick the system so hard it trashed. For others the purpose was to complicate vmscan.c. Over time it was giving softer shoes and a nicer attitude but memory compaction needs to step up and replace it so this patch sends lumpy reclaim to the farm.
Mel Gorman

Jiri [Kosina] is now also marked as the maintainer of floppy.c, I shall be publically branding his forehead with red hot iron at the next opportune moment.
Jens Axboe

As a kernel rights holder I question the legality of Matthew's proposal, and it would be amusingly unfortunate if the Software Conservancy ended up beginning some of its Linux enforcement against Fedora.
Alan Cox

klibc 2.0 released

Version 2.0 of the minimal C library klibc has been released. "A bit delayed due to kernel.org breakin, but development is kicking in again. The 2.0 branch saw boot time tests and deployments in Debian, so we are quite certain it should work for the most out of you, if not please let us know." The biggest change appears to be proper support for buffered I/O in the stdio functions.

Kernel development news

3.5 merge window part 2

By Jonathan Corbet
May 31, 2012
The 3.5 merge window continued in full force after last week's summary, with another 4,000 non-merge changesets pulled into the mainline since then; there have now been over 8,600 changes merged for 3.5, and the merge window is not done yet. The most significant user-visible changes are:

  • The autosleep patch set, implementing Android-style opportunistic suspend (with a different API), has been merged. Associated with this work is a new epoll flag (EPOLLWAKEUP) which causes a wakeup event to be activated, preventing suspend when an event is available for processing.

  • The user namespace rework gets the kernel closer to being able to safely run processes as root within a container.

  • RAID10 arrays managed by the MD layer can now be reshaped.

  • After years of attempts, the uprobes subsystem has been merged. See this article for more information on the version of uprobes that was merged for 3.5.

  • The tmpfs filesystem now supports hole punching and the SEEK_DATA and SEEK_HOLE lseek() options.

  • The removal of old code continues; victims include Microchannel bus support, legacy CRIS RTC drivers, the imxmmc driver, the swap token code, and the lumpy reclaim mechanism.

  • New drivers include:

    • Systems and processors: Renesas SH7264, SH7269, and SH7734 processors, ST SPEAr13xx processors, Atheros DB120 reference boards, and Lantiq FALCON processors.

    • Audio: Freescale MC13783 codecs, Cirrus Logic CS42L52 low power stereo codecs, LAPIS Semiconductor ML26124 codecs, TI LM49453 codecs, and ST Ericsson Ux500-based audio platform devices.

    • Block: Cirrus Logic EP93xx PATA controllers.

    • Graphics: Aspeed Technologies AST 2000, 2100, 2200, 2150 and 2300 chips, MGA G200 server engines, and QEMU-emulated Cirrus GPUs.

    • Input: I2C-based Wacom tablets, National Semiconductor LM8333 keypad controllers, Dialog DA9052/DA9053 touchscreen controllers, and Synaptics NavPoint touchpads on PXA27x SSP ports.

    • Miscellaneous: STA2X11 "ConneXt" I/O hubs, Power 7+ Nest crypto accelerators, Texas Instruments INA219 and INA226 power management chips, Intel Atom E6xx watchdogs, Intel MSIC mixed signal gpio controllers, RICOH RC5T583 GPIO controllers, Samsung Exynos I/O memory management units, and Dialog DA9052 watchdogs.

    • Multi-function chipsets: Maxim Semiconductor MAX77693 PMICs, Intel ICH LPC bridges, ST Microelectronics ConneXt (STA2X11) I/O hubs, and National Semiconductor / TI LM3533 lighting power chips.

    • Network: NXP PN544 NFC controllers.

    • Video4Linux: Infineon TUA 9001 silicon tuners, Afatech AF9033 DVB-T demodulators, Afatech AF9035 based DVB USB receivers, Fitipower FC0011 silicon tuners, LG Electronics LG216x-based ATSC-MH demodulators, Fitipower FC0012 and FC0013 silicon tuners, STA2x11 video input ports, and SMIA++/SMIA-compliant sensors.

Changes visible to kernel developers include:

  • The kernel's exception table can now be sorted at build time, speeding the boot process somewhat.

  • The ALSA core now supports "dynamic PCM" devices: audio devices split into front and back ends, with arbitrary routing of audio data between the front-end and back-end devices.

  • The contiguous memory allocator patch set, designed to make life easier on systems where large chunks of physically-contiguous memory are needed on occasion, has been merged at last. The same pull included a complete rework of the ARM DMA mapping subsystem, adding CMA support and support for I/O memory management units.

  • The DMA buffer sharing subsystem has gained support for mapping buffers into user space. Also added is a new dma_buf_vmap() function for mapping buffers (using the vmalloc() area) into kernel space.

  • <asm/word-at-a-time.h> has been significantly reworked (by Linus) to be more efficient on all architectures; strnlen_user() has then been rewritten to use it in a generic manner.

  • The LED subsystem now supports one-shot timed operation; see ledtrig-transient.txt for details.

  • The error detection and correction (EDAC) subsystem has been massively reworked to better handle contemporary processors and memory controllers.

As of this writing, the 3.5 merge window has a few more days left to run. The final article in this series will come out once the merge window has closed.

Uprobes in 3.5

By Jonathan Corbet
May 30, 2012
Uprobes is a kernel patch with a long story and many contentious discussions behind it. This code has its roots in utrace, a user-space tracing and debugging API that was first covered here in early 2007. Utrace ran into various types of opposition (only partly related to its own origin in SystemTap) and has never been merged, but a piece of it lives on in the form of uprobes, which is charged with the placement of probes into user-space code. After several mailing-list rounds of its own, uprobes was finally merged for the 3.5 kernel development cycle. Just how this facility will be used remains to be seen, however.

At the core of uprobes is this function:

    #include <linux/uprobes.h>

    int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);

The inode structure represents an executable file; the probe is to be placed at offset bytes from the beginning. The uprobe_consumer structure tells the kernel what is to be done when a process encounters the probe; it looks like:

    struct uprobe_consumer {
        int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
        bool (*filter)(struct uprobe_consumer *self, struct task_struct *task);
        struct uprobe_consumer *next;
    };

The filter() function is optional; if it exists, it determines whether handler() is called for each specific hit on the probe. The handler returns an int, but the return value is ignored in the current code.

Since probes are associated with files, they affect all processes that run code from those files. A special copy is made of the page to contain the probe; in that copy, the instruction at the specified offset is copied and replaced by a breakpoint. When the breakpoint is hit by a running process, filter() will be called if present, and handler() will be run unless the filter said otherwise. Then the displaced instruction is executed (using the "execute out of line" mechanism described in this article) and control returns to the instruction following the breakpoint.

Uprobes thus implements a mechanism by which a kernel function can be invoked whenever a process executes a specific instruction location. One could imagine a number of things that said kernel function could do; there has been talk, for example, of using uprobes (and, perhaps someday, something derived from utrace) as a replacement for the much-maligned ptrace() system call. Tools like GDB could place breakpoints with uprobes; it might even be possible to load simple filters for conditional breakpoints into the kernel, speeding their execution considerably. Uprobes could also someday be a component of a Dtrace-like dynamic tracing functionality. For now, though, the interfaces for that kind of feature have not been added to the kernel; none have even been proposed.

What the current implementation does have is integration with the perf events subsystem. New dynamic "events" can be added to any file location via an interface similar to that used for dynamic kernel tracepoints. In particular, there is a new file called uprobe_events in the tracing directory (/sys/kernel/debug/tracing/ on most systems) that is used to add and remove events. As an example, a line like:

    echo 'p:bashme /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events

would place a new event (called "bashme") at location 0x4245c0 in the bash executable. The event would then appear with all other events in /sys/kernel/debug/tracing/events, in the uprobes subdirectory. Like other events, it is not actually turned on until its enable attribute is set. See Documentation/trace/uprobetracer.txt for details on the interface at this level.

Placing uprobes is, by default, a privileged operation requiring the CAP_SYS_ADMIN capability. One can remove the privilege requirement by setting the perf_event_paranoid sysctl knob to -1, but doing so will allow the placement of dynamic tracepoints anywhere in the system, in kernel or user space. Thus, one need not be overly paranoid to leave perf_event_paranoid at its default setting.

The perf tool has been enhanced to make working with dynamic user-space tracepoints easy. One can, for example, set a tracepoint at the entry to the C library's malloc() implementation with:

    perf probe -x /lib64/libc.so.6 malloc

That tracepoint can then be treated like any other event understood by perf. See the explanatory text from Ingo Molnar's pull request for examples of what can be done.

Most kernel patches are conceived, implemented, reviewed, and merged into the mainline over a fairly short period of time. But some of them seem to languish for years without making much progress. Uprobes was such a patch set. It must have been frustrating for the developers to keep revising and posting this code, only to see it shot down over and over again. But the kernel community can be supportive of developers who show both persistence and a willingness to listen to criticism. The result, in this case, is a user-space probing mechanism that has been simplified, made more robust, and integrated into the existing events infrastructure. Hopefully it was worth the wait.

Atime and btrfs: a bad combination?

By Jonathan Corbet
May 31, 2012
Unix and Unix-like systems have traditionally recorded the time of last access for each file in the system. This practice has fallen partially out of favor over the last decade for a simple reason: writing the last-accessed time ("atime") takes up a lot of I/O bandwidth when lots of files are being read; see this article from 2007, for example. The worst of the atime-related problems have long since been mitigated by moving to the "relatime" mount option by default; relatime only updates atime a maximum of once per day for unchanging files. But now it seems that atime recording can be especially problematic with the btrfs filesystem, and relatime may not help much.

One of the core design features of btrfs is its copy-on-write nature. Blocks on disk are never modified in place; instead, when it becomes necessary to commit a change, the affected block is copied and rewritten into a new location. Copy-on-write applies to metadata as well as data; if a file's metadata (such as its last-accessed time) is changed, the block containing that metadata will be copied to a new spot. So, on btrfs, an operation that reads a lot of files (creating a tar archive, say, or a recursive grep) can, through atime updates, cause the copying and rewriting of a lot of metadata blocks.

Needless to say, performance is not improved this way, but that is not where the big problem comes in. As Alexander Block pointed out, the real problem has to do with the interaction between atime, copy-on-write, and snapshots.

Btrfs provides a fast snapshotting feature that can create a copy of the state of the filesystem at a specific time. When a snapshot is created, it shares all data and metadata with the "trunk" filesystem, so it requires little additional space. Should a file be changed, the resulting copy-on-write operation separates the trunk from the snapshot, keeping both versions of the data available. So snapshots can be thought of as being nearly free as long as the filesystem remains relatively quiet.

Atime updates change the situation, though. If somebody takes a snapshot of a filesystem, then performs a recursive grep on that filesystem, the last-access time of every file touched may be updated. That, in turn, can cause copy-on-write operations on each file's inode structure, with the result that many or all of the inodes in the filesystem may be duplicated. That can increase the space consumption of the filesystem considerably; Alexander posted an example where a recursive grep caused 2.2GB of free space to disappear. That is a surprising result for what is meant to be a read-only operation.

Once upon a time, when disk capacities were measured in megabytes, it was said that the only standard feature of Unix systems was the message of the day telling users to clean up their files. Atime was often used by harried system administrators trying to recover some disk space; they would scan for infrequently-accessed files and, depending on how desperate the situation was and how powerful their users were, either send lists of unused files to users or simply back those files up to tape and delete them. It is somewhat ironic that a feature meant (among other things) to help keep disk space free has now, on next-generation filesystems, become part of the problem.

It's worth noting that the relatime option (which only updates atime once per day unless the file has been modified since the last atime update) is of little help here. It only takes one atime update to force an inode to be rewritten and unshared with any snapshots. So the fact that such updates are limited to one per day offers little in the way of consolation.

Users are also unlikely to be consoled by one other aspect of the problem pointed out by Alexander: since reading data can consume space in the filesystem, read operations might fail with "no space available" errors on an overflowing filesystem. That may make it difficult or impossible to fix the problem by copying data out of a full filesystem. By the time that happens, a typical user could be forgiven for thinking that, perhaps, they don't need last-accessed time tracking at all.

Along those lines, Alexander suggested that it might be a good idea to default to "noatime" (which turns off atime recording entirely) for btrfs mounts, even if that means that btrfs would then behave differently than other filesystems. That idea was quickly shot down for a simple reason: there are still applications that actually need the atime information to function correctly. The classic example is the mutt email client which, in the absence of atime, cannot tell whether a mailbox contains unread mail. Programs that clean up temporary directories (tmpreaper or tmpwatch, for example) will fail without atime. There are also hierarchical storage systems that, like the Unix system administrator of old, use atime to determine when to move files to slower storage. So atime needs to stick around, lest users run into a different kind of unpleasant surprise.

For now, the only recourse for users who run into (or are worried about) this problem is to explicitly mount their filesystems with the "noatime" option. Further ahead, it might be possible to make some tweaks to btrfs to mitigate the problem; Boaz Harrosh suggested disabling atime updates when the free space falls below a certain threshold or moving the atime data into a separate data structure. But nobody appears to be working on such solutions now. So it may be that, as usage of btrfs grows, users will occasionally be surprised that reading a file can consume space in their filesystems.

Patches and updates

Device drivers

  • Alex Williamson: VFIO (May 30, 2012)

Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds