Brief items
The 3.5 merge window remains open as of this writing, so there is no
current development kernel. The flow of patches into the mainline
continues; see the separate article below for a summary of what has been
merged thus far.
Stable updates: there have been no stable updates released in the
last week.
The 3.0.33, 3.2.19, 3.3.8, and 3.4.1 updates are all in the review process
as of this writing; they could all be released at any time.
Lumpy reclaim had a purpose but in the mind of some, it was to kick
the system so hard it trashed. For others the purpose was to
complicate vmscan.c. Over time it was giving softer shoes and a
nicer attitude but memory compaction needs to step up and replace
it so this patch sends lumpy reclaim to the farm.
—
Mel Gorman
Jiri [Kosina] is now also marked as the maintainer of floppy.c, I
shall be publically branding his forehead with red hot iron at the
next opportune moment.
—
Jens Axboe
As a kernel rights holder I question the legality of Matthew's
proposal, and it would be amusingly unfortunate if the Software
Conservancy ended up beginning some of its Linux enforcement
against Fedora.
—
Alan Cox
Version 2.0 of the minimal C library klibc has been released. "A bit
delayed due to kernel.org breakin, but development is kicking in again.
The 2.0 branch saw boot time tests and deployments in Debian, so we are
quite certain it should work for the most out of you, if not please let us
know." The biggest change appears to be proper support for buffered I/O in
the stdio functions.
Kernel development news
By Jonathan Corbet
May 31, 2012
The 3.5 merge window continued in full force after
last week's summary, with another 4,000
non-merge changesets pulled into the mainline since then; there have now
been over 8,600 changes merged for 3.5, and the merge window is not done
yet. The most significant user-visible changes are:
- The autosleep patch set, implementing
Android-style opportunistic suspend (with a different API) has been
merged. Associated with this work is a new epoll flag
(EPOLLWAKEUP) which causes a wakeup event to be activated,
preventing suspend when an event is available for processing.
- The user namespace rework gets the
kernel closer to being able to safely run processes as root within a
container.
- RAID10 arrays managed by the MD layer can now be reshaped.
- After years of attempts, the uprobes
subsystem has been merged. See this
article for more information on the version of uprobes that was
merged for 3.5.
- The tmpfs filesystem now supports hole punching and the
SEEK_DATA and SEEK_HOLE lseek() options.
- The removal of old code continues; victims include Microchannel bus
support,
legacy CRIS RTC drivers,
the imxmmc driver,
the swap token code, and
the lumpy reclaim mechanism.
- New drivers include:
- Systems and processors:
Renesas SH7264, SH7269, and SH7734 processors,
ST SPEAr13xx processors,
Atheros DB120 reference boards, and
Lantiq FALCON processors.
- Audio:
Freescale MC13783 codecs,
Cirrus Logic CS42L52 low power stereo codecs,
LAPIS Semiconductor ML26124 codecs,
TI LM49453 codecs, and
ST Ericsson Ux500-based audio platform devices.
- Block:
Cirrus Logic EP93xx PATA controllers.
- Graphics:
Aspeed Technologies AST 2000, 2100, 2200, 2150 and 2300 chips,
MGA G200 server engines, and
QEMU-emulated Cirrus GPUs.
- Input:
I2C-based Wacom tablets,
National Semiconductor LM8333 keypad controllers,
Dialog DA9052/DA9053 touchscreen controllers, and
Synaptics NavPoint touchpads on PXA27x SSP ports.
- Miscellaneous:
STA2X11 "ConneXt" I/O hubs,
Power 7+ Nest crypto accelerators,
Texas Instruments INA219 and INA226 power management chips,
Intel Atom E6xx watchdogs,
Intel MSIC mixed signal gpio controllers,
RICOH RC5T583 GPIO controllers,
Samsung Exynos I/O memory management units, and
Dialog DA9052 watchdogs.
- Multi-function chipsets:
Maxim Semiconductor MAX77693 PMICs,
Intel ICH LPC bridges,
ST Microelectronics ConneXt (STA2X11) I/O hubs, and
National Semiconductor / TI LM3533 lighting power chips.
- Network:
NXP PN544 NFC controllers.
- Video4Linux:
Infineon TUA 9001 silicon tuners,
Afatech AF9033 DVB-T demodulators,
Afatech AF9035 based DVB USB receivers,
Fitipower FC0011 silicon tuners,
LG Electronics LG216x-based ATSC-MH demodulators,
Fitipower FC0012 and FC0013 silicon tuners,
STA2x11 video input ports, and
SMIA++/SMIA-compliant sensors.
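As a rough illustration of the EPOLLWAKEUP flag mentioned in the list above,
a user-space program might request it like any other epoll event bit. This
is only a sketch: watch_with_wakeup() is a hypothetical helper name, and the
fallback definition of the flag is there for C library headers that predate
it.

```c
#include <assert.h>
#include <sys/epoll.h>

/* Older C library headers may not define the flag; this is the bit the
 * kernel's epoll interface uses for it. */
#ifndef EPOLLWAKEUP
#define EPOLLWAKEUP (1u << 29)
#endif

/*
 * Hypothetical helper: register fd with the epoll instance epfd for
 * readability and ask that a wakeup event stay active while an event
 * is pending. Returns 0 on success, -1 on error (errno is set by
 * epoll_ctl()).
 */
int watch_with_wakeup(int epfd, int fd)
{
    struct epoll_event ev = { 0 };

    assert(epfd >= 0 && fd >= 0);
    ev.events = EPOLLIN | EPOLLWAKEUP;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

According to the epoll_ctl(2) manual page, the flag is silently ignored if
the caller lacks the CAP_BLOCK_SUSPEND capability, so a successful return
does not by itself prove that suspend will be blocked.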
Changes visible to kernel developers include:
- The kernel's exception table can now be sorted at build time,
speeding the boot process somewhat.
- The ALSA core now supports "dynamic PCM" devices: audio devices split
into front and back ends, with arbitrary routing of audio data between
the front-end and back-end devices.
- The contiguous memory allocator patch
set, designed to make life easier on systems where large chunks of
physically-contiguous memory are needed on occasion, has been merged
at last. The same pull included a complete rework of the ARM DMA
mapping subsystem, adding CMA support and support for I/O memory
management units.
- The DMA buffer sharing subsystem has gained support for mapping
buffers into user space. Also added is a new dma_buf_vmap()
function for mapping buffers (using the vmalloc() area) into
kernel space.
- <asm/word-at-a-time.h> has been significantly reworked
(by Linus) to be more efficient on all architectures;
strnlen_user() has then been rewritten to use it in a generic
manner.
- The LED subsystem now supports one-shot timed operation; see ledtrig-transient.txt for details.
- The error detection and correction (EDAC) subsystem has been massively
reworked to better handle contemporary processors and memory
controllers.
As of this writing, the 3.5 merge window has a few more days left to run.
The final article in this series will come out once the merge window has
closed.
By Jonathan Corbet
May 30, 2012
Uprobes is a kernel patch with a long story and many contentious
discussions behind it. This code has its roots in utrace, a user-space
tracing and debugging API that was first
covered here in early 2007. Utrace ran into
various types of opposition (only partly related to its own origin in
SystemTap) and has never been merged, but a piece of it
lives on in the form of uprobes, which is charged with the placement of
probes into user-space code. After several mailing-list rounds of its own,
uprobes was finally merged for the 3.5 kernel development cycle. Just how
this facility will be used remains to be seen, however.
At the core of uprobes is this function:
    #include <linux/uprobes.h>

    int uprobe_register(struct inode *inode, loff_t offset,
                        struct uprobe_consumer *uc);
The inode structure represents an executable file; the probe is to
be placed at offset bytes from the beginning. The
uprobe_consumer structure tells the kernel what is to be done when
a process encounters the probe; it looks like:
    struct uprobe_consumer {
        int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
        bool (*filter)(struct uprobe_consumer *self, struct task_struct *task);
        struct uprobe_consumer *next;
    };
The filter() function is optional; if it exists, it determines
whether handler() is called for each specific hit on the probe.
The handler returns an int, but the return value is ignored in the
current code.
Since probes are associated with files, they affect all processes that run
code from those files. A special copy is made of the page to contain the
probe; in that copy, the instruction at the specified offset is copied and
replaced by a breakpoint. When the breakpoint is hit by a running process,
filter() will be called if present, and handler() will be
run unless the filter said otherwise. Then the displaced instruction is
executed (using the "execute out of line" mechanism described in this article) and control returns to the
instruction following the breakpoint.
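The filter-then-handler sequence described above can be modeled in user
space. What follows is a toy sketch, not kernel code: probe_consumer,
fake_regs, fake_task, probe_hit(), count_hit(), and only_pid_42() are all
hypothetical names, and walking the "next" chain on each hit is an
assumption about how the list of consumers is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel's struct pt_regs and struct task_struct. */
struct fake_regs { unsigned long ip; };
struct fake_task { int pid; };

/* User-space mirror of the uprobe_consumer layout. */
struct probe_consumer {
    int  (*handler)(struct probe_consumer *self, struct fake_regs *regs);
    bool (*filter)(struct probe_consumer *self, struct fake_task *task);
    struct probe_consumer *next;
};

/*
 * Model of one breakpoint hit: a consumer whose filter exists and says
 * "no" is skipped; otherwise its handler runs. The handler's return
 * value is discarded. Returns the number of handlers that ran.
 */
int probe_hit(struct probe_consumer *list, struct fake_task *task,
              struct fake_regs *regs)
{
    int ran = 0;

    for (struct probe_consumer *uc = list; uc != NULL; uc = uc->next) {
        assert(uc->handler != NULL);
        if (uc->filter != NULL && !uc->filter(uc, task))
            continue;
        (void)uc->handler(uc, regs);
        ran++;
    }
    return ran;
}

/* Example callbacks: count every hit; accept only the task with PID 42. */
int hits;

int count_hit(struct probe_consumer *self, struct fake_regs *regs)
{
    (void)self;
    (void)regs;
    hits++;
    return 0;
}

bool only_pid_42(struct probe_consumer *self, struct fake_task *task)
{
    (void)self;
    return task->pid == 42;
}
```

In this model, a consumer without a filter sees every hit, while one using
only_pid_42() sees only hits from the matching task.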
Uprobes thus implements a mechanism by which a kernel function can be
invoked whenever a process executes a specific instruction location. One
could imagine a number of things that said kernel function could do; there
has been talk, for example, of using uprobes (and, perhaps someday,
something derived from utrace) as a replacement for the much-maligned
ptrace() system call. Tools like GDB could place breakpoints with
uprobes; it might even be possible to load simple filters for conditional
breakpoints into the kernel, speeding their execution considerably.
Uprobes could also someday be a component of a DTrace-like dynamic tracing
facility. For now, though, the interfaces for that kind of feature
have not been added to the kernel; none have even been proposed.
What the current implementation does have is integration with the
perf events subsystem. New dynamic "events" can be added to any file
location via an interface similar to that used for dynamic kernel tracepoints. In particular,
there is a new file called uprobe_events in the tracing directory
(/sys/kernel/debug/tracing/ on most systems) that is used to add
and remove events. As an example, a line like:
echo 'p:bashme /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
would place a new event (called "bashme") at location 0x4245c0 in the
bash executable. The event would then appear with all other events in
/sys/kernel/debug/tracing/events, in the uprobes subdirectory. Like other
events, it is not actually turned on until its enable attribute is set.
See Documentation/trace/uprobetracer.txt for details on the interface at
this level.
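A tool that generates such definitions might build the line with a small
formatting routine. The sketch below covers only the simple
"p:NAME PATH:OFFSET" form used in the example above; format_uprobe_event()
is a hypothetical helper, not part of any existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a probe definition of the form "p:NAME PATH:0xOFFSET" into buf.
 * Returns 0 on success, or -1 if the result did not fit in size bytes.
 */
int format_uprobe_event(char *buf, size_t size, const char *name,
                        const char *path, unsigned long offset)
{
    int n;

    assert(buf != NULL && name != NULL && path != NULL);
    n = snprintf(buf, size, "p:%s %s:0x%lx", name, path, offset);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The resulting string would then be written to the uprobe_events file, just
as the echo command does.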
Placing uprobes is, by default, a privileged operation requiring the
CAP_SYS_ADMIN capability. One can remove the privilege
requirement by setting the perf_event_paranoid sysctl knob to
-1, but doing so will allow the placement of dynamic tracepoints
anywhere in the system, in kernel or user space. Thus, one need not be
overly paranoid to leave perf_event_paranoid at its default setting.
The perf tool has been enhanced to make working with dynamic user-space
tracepoints easy. One can, for example, set a tracepoint at the entry to
the C library's malloc() implementation with:
perf probe -x /lib64/libc.so.6 malloc
That tracepoint can then be treated like any other event understood by
perf. See the
explanatory text from Ingo Molnar's pull request for examples of what
can be done.
Most kernel patches are conceived, implemented, reviewed, and merged into
the mainline over a fairly short period of time. But some of them seem to
languish for years without making much progress. Uprobes was such a patch
set. It must have been frustrating for the developers to keep revising and
posting this code, only to see it shot down over and over again. But the
kernel community can be supportive of developers who show both persistence
and a willingness to listen to criticism. The result, in this case, is a
user-space probing mechanism that has been simplified, made more robust,
and integrated into the existing events infrastructure. Hopefully it was
worth the wait.
By Jonathan Corbet
May 31, 2012
Unix and Unix-like systems have traditionally recorded the time of last
access for each file in the system. This practice has fallen partially out
of favor over the last decade for a simple reason: writing the
last-accessed time ("atime") takes up a lot of I/O bandwidth when lots of
files are being read; see
this article from
2007, for example. The worst of the atime-related problems have long
since been mitigated by moving to the "relatime" mount option by default;
relatime only updates atime a maximum of once per day for unchanging
files. But now it seems that atime recording can be especially problematic
with the btrfs filesystem, and relatime may not help much.
One of the core design features of btrfs is its copy-on-write nature.
Blocks on disk are never modified in place; instead, when it becomes
necessary to commit a change, the affected block is copied and rewritten
into a new location. Copy-on-write applies to metadata as well as data; if
a file's metadata (such as its last-accessed time) is changed, the block
containing that metadata will be copied to a new spot. So, on btrfs, an
operation that reads a lot of files (creating a tar archive, say, or a
recursive grep) can, through atime updates, cause the copying and rewriting
of a lot of metadata blocks.
Needless to say, performance is not improved this way, but that is not
where the big problem comes in. As Alexander Block pointed out, the real problem has to do with
the interaction between atime, copy-on-write, and snapshots.
Btrfs provides a fast snapshotting feature that can create a copy of the
state of the filesystem at a specific time. When a snapshot is created, it
shares all data and metadata with the "trunk" filesystem. Should a file be
changed, the resulting copy-on-write operation separates the trunk from the
snapshot, keeping both versions of the data available. So snapshots can be
thought of as being nearly free as long as the filesystem remains relatively
quiet. Snapshots will share data and metadata, so they do not require a
lot of additional space.
Atime updates change the situation, though. If somebody takes a snapshot
of a filesystem, then performs a recursive grep on that filesystem, the
last-access time of every file touched may be updated. That, in turn, can
cause copy-on-write operations on each file's inode structure, with the
result that many or all of the inodes in the filesystem may be duplicated.
That can increase the space consumption of the filesystem considerably;
Alexander
posted an example where a recursive grep
caused 2.2GB of free space to disappear. That is a surprising result for
what is meant to be a read-only operation.
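The space accounting behind that result can be illustrated with a toy
model. It is a deliberate simplification and not a description of the
actual btrfs on-disk format: it assumes one metadata block per inode, and
every name in it (meta_block, tree, tree_snapshot(), touch_atime(), and so
on) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define NFILES 1000   /* arbitrary number of files in the toy filesystem */

/* One metadata block per inode; real filesystems pack many per block. */
struct meta_block {
    int  refs;    /* number of trees (trunk, snapshots) pointing here */
    long atime;   /* the only inode field this model tracks */
};

struct tree {
    struct meta_block *inode[NFILES];
};

static long blocks_in_use;

static struct meta_block *alloc_block(long atime)
{
    struct meta_block *b = malloc(sizeof(*b));

    assert(b != NULL);
    b->refs = 1;
    b->atime = atime;
    blocks_in_use++;
    return b;
}

/* Create the trunk filesystem: one fresh block per file. */
void tree_init(struct tree *t)
{
    for (int i = 0; i < NFILES; i++)
        t->inode[i] = alloc_block(0);
}

/* A snapshot shares every block with its source and allocates nothing. */
void tree_snapshot(const struct tree *src, struct tree *snap)
{
    for (int i = 0; i < NFILES; i++) {
        snap->inode[i] = src->inode[i];
        snap->inode[i]->refs++;
    }
}

/*
 * An atime update on a shared block must copy it first, leaving the old
 * version to the snapshot. An unshared block is modeled as updated with
 * no net change in space (the old copy would simply be freed).
 */
void touch_atime(struct tree *t, int i, long now)
{
    struct meta_block *b = t->inode[i];

    if (b->refs > 1) {
        b->refs--;
        t->inode[i] = alloc_block(now);
    } else {
        b->atime = now;
    }
}

long blocks_used(void)
{
    return blocks_in_use;
}
```

In this model, taking a snapshot costs nothing, but touching the atime of
every file afterward doubles the number of metadata blocks in use; a second
pass over the same files allocates nothing further.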
Once upon a time, when disk capacities were measured in megabytes, it was
said that the only standard feature of Unix systems was the message of the
day telling users to clean up their files. Atime was often used by harried
system administrators trying to recover some disk space; they would scan
for infrequently-accessed files and, depending on how desperate the
situation was and how powerful their users were, either send lists of
unused files to users or simply back those files up to tape and delete
them. It is somewhat ironic that a feature meant (among other things) to
help keep disk space free has now, on next-generation filesystems, become
part of the problem.
It's worth noting that the relatime option (which only updates atime once
per day unless the file has been modified since the last atime update) is
of little help here. It only takes one atime update to force an inode to
be rewritten and unshared with any snapshots. So the fact that such
updates are limited to one per day offers little in the way of
consolation.
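The relatime rule can be written down as a small predicate. This is a
sketch based on the description above; the change-time check is an added
assumption about how the kernel treats metadata changes, and
relatime_needs_update() is a hypothetical user-space function rather than
the kernel's own code.

```c
#include <stdbool.h>
#include <time.h>

#define RELATIME_INTERVAL (24 * 60 * 60)   /* one day, in seconds */

/*
 * Decide whether a read at time "now" should update the access time.
 * atime, mtime, and chtime are the file's current access, modification,
 * and change times.
 */
bool relatime_needs_update(time_t atime, time_t mtime, time_t chtime,
                           time_t now)
{
    if (mtime >= atime)       /* modified since the last recorded access */
        return true;
    if (chtime >= atime)      /* metadata changed since then (assumption) */
        return true;
    if (now - atime >= RELATIME_INTERVAL)   /* stored atime is a day old */
        return true;
    return false;
}
```

For an unchanging file the predicate is true at most once per day, but on a
snapshotted btrfs filesystem that single update is already enough to
unshare the inode.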
Users are also unlikely to be consoled by one other aspect of the problem
pointed out by Alexander: since reading data can consume space in the
filesystem, read operations might fail with "no space available" errors on an
overflowing filesystem. That may make it difficult or impossible to fix
the problem by copying data out of a full filesystem.
By the time that happens, a typical user could be
forgiven for thinking that, perhaps, they don't need last-accessed time
tracking at all.
Along those lines, Alexander suggested that it might be a good idea to
default to "noatime" (which turns off atime recording entirely) for btrfs
mounts, even if that means that btrfs would then behave differently than
other filesystems. That idea was quickly shot down for a simple reason:
there are still applications that actually need the atime information to
function correctly. The classic example is the mutt email client which, in
the absence of atime, cannot tell whether a mailbox contains unread mail.
Programs that clean up temporary directories (tmpreaper or tmpwatch, for
example) will fail without atime. There are also hierarchical storage
systems that, like the Unix system administrator of old, use atime to
determine when to move files to slower storage. So atime needs to stick
around, lest users run into a different kind of unpleasant surprise.
For now, the only recourse for users who run into (or are worried about)
this problem is to explicitly mount their filesystems with the "noatime"
option. Further ahead, it might be possible to make some tweaks to btrfs
to mitigate the problem; Boaz Harrosh suggested disabling atime updates when the
free space falls below a certain threshold or moving the atime data into a
separate data structure. But nobody appears to be working on such
solutions now. So it may be that, as usage of btrfs grows, users will
occasionally be surprised that reading a file can consume space in their
filesystems.
Patches and updates
Device drivers
- Alex Williamson: VFIO. (May 30, 2012)
Page editor: Jonathan Corbet