Brief items
The current development kernel remains 2.6.31-rc5; there have been
no 2.6.31 prepatches released since July 31. Patches continue to flow
into the mainline repository (442 since 2.6.31-rc5, as of this writing) and
the 2.6.31-rc6 release can be expected at almost any time.
Kernel development news
Ok, so my definition of "plain C" is a bit odd. There's nothing
plain about it. It's disgusting C preprocessor misuse. But dang,
it's kind of fun to abuse the compiler this way.
--
Linus Torvalds
Can we add a consistent "--eatmydata" type of hurdle to jump over
before people are allowed to use either the so-far-less-tested
tools and/or options therein? [...]
I'm nervous about ext4 coming into wider use and people finding
some of the bits which aren't -quite- ready for prime time yet, and
winding up with a disaster.
--
Eric Sandeen
Got a SEGV, don't worry about it anymore! Just rescue an exception
and get on with life. Who cares about getting a SEGV anyway? It's
just memory. I mean, when I was in school, I didn't need 100% to
pass the class. Why should your memory need to be 100% correct to
get the job done? A little memory corruption here and there doesn't
hurt anyone.
--
NeverSayDie,
get your copy today
By Jonathan Corbet
August 12, 2009
Tux3. The once-noisy
Tux3 development community has
gone rather quiet in recent months. An inquiry into the status of the
project led to one of last week's
quotes of the week, wherein
developer Daniel Phillips pled a lack of time and expressed regrets at not
having merged the code into the mainline months ago. When asked (by Ted
Ts'o) for a description of what makes Tux3 interesting, Daniel
responded this way:
I think Tux3 fills an empty niche in our filesystem ecology where
a simple, clean and modern general purpose filesystem should exist
and there is none. In concrete terms, Tux3 implements a
single-pointer-per-extent model that Btrfs and ZFS do not. This
allows a very simple *physical* design, with much complexity
pushed to the *logical* level where things generally behave
better. A simple physical design offers many benefits, including
making it easier to take a run at that holiest of holy grails,
online check and repair.
What Tux3 needs, it seems, is some new development energy. It could be an
interesting project for developers looking to get started in
filesystem development.
Resource counters. The resource
counter mechanism is built into control groups; it is intended for use
by tools like the memory use controller. These counters contain, at their
core, a (believe it or not) counter value which tracks the current usage of
a resource by a given control group. This counter has run into the same
problem which afflicts any frequently-changed global variable: it scales
poorly due to cache line bouncing. The usage of some resources (pages of
memory, for example) can change frequently, causing the associated counter
to be a drag on the system as a whole.
Balbir Singh's scalable resource counters
patch aims to fix that situation. With this patch, the single "usage"
counter becomes an array of per-CPU counters. Since each processor works
with its own copy of the counter, there is no more cache line bouncing and
things run faster. The down side is that the count becomes approximate.
The per-CPU counters are summed occasionally to keep everything roughly in
sync, but keeping exact counts would take away much of the scalability that
this patch was meant to provide. The good news is that exact counts are
not really needed anyway; as long as the counter reflects something close
enough to reality, the system will work essentially as it did before - only
a little more quickly.
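The trade-off can be illustrated with a small user-space model. What follows is only a sketch under stated assumptions: the structure name, the fixed NR_CPUS value, and the BATCH threshold are all invented for illustration, the folding policy is the generic batched-counter idea rather than anything taken from Balbir's patch, and there is no locking or real per-CPU storage.

```c
#define NR_CPUS 4   /* hypothetical CPU count for this sketch */
#define BATCH   32  /* hypothetical local drift allowed before folding */

/* User-space model only; not the structure used by the patch. */
struct res_counter_sketch {
    long global;            /* shared, approximate total */
    long percpu[NR_CPUS];   /* per-CPU deltas not yet folded into global */
};

/* Charge (or uncharge, with a negative value) on one CPU. */
static void counter_add(struct res_counter_sketch *c, int cpu, long val)
{
    c->percpu[cpu] += val;
    /* Touch the shared total only when the local delta grows large. */
    if (c->percpu[cpu] >= BATCH || c->percpu[cpu] <= -BATCH) {
        c->global += c->percpu[cpu];
        c->percpu[cpu] = 0;
    }
}

/* Cheap read: ignores whatever is still sitting in the per-CPU deltas. */
static long counter_read_approx(const struct res_counter_sketch *c)
{
    return c->global;
}

/* Expensive read: walks every per-CPU delta to get an exact value. */
static long counter_read_exact(const struct res_counter_sketch *c)
{
    long sum = c->global;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += c->percpu[cpu];
    return sum;
}
```

In this model the cheap read can lag reality by up to BATCH - 1 per CPU; that bounded error is the price paid for keeping most updates off the shared cache line.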
Inline spinlocks. Once upon a time, spinlocks were implemented with
a series of inline functions, on the notion that such a
performance-critical primitive would need to be as fast as possible. That
changed in 2004, when
spinlocks were turned into normal functions. The function call overhead
hurt a bit, but moving spinlocks out-of-line made the kernel considerably
smaller, which has performance benefits of its own. And that's how
spinlocks have been ever since.
The pendulum may be about to swing the other way again, though, at least
for the S390 architecture. Heiko Carstens noted that function calls on
this architecture are quite expensive. He put together an inline spinlocks patch and
measured performance improvements of 1-5%. So he would like to put this
patch into the mainline, along with a configuration option allowing each
architecture to choose the best way to implement spinlocks. So far, there
has been little commentary for or against this idea.
Const seq_operations. James Morris has posted a patch making seq_operations structures
constant throughout the kernel. These structures are almost always
populated at compile time and never need to change; allowing the function
pointers therein to be overwritten can only be useful to those who would
like to subvert the kernel. A number of core VFS operations structures
have been made const over the years, but seq_operations
has not been addressed until now. James says: "This is derived from
the grsecurity patch, although generated
from scratch because it's simpler than extracting the changes
from there."
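For readers unfamiliar with the pattern, here is a minimal user-space sketch of a const operations table. The structure below is a simplified, hypothetical stand-in with a single callback; it is not the kernel's struct seq_operations, and the function names are invented for the example.

```c
#include <stdio.h>

/* Simplified, hypothetical stand-in for an operations structure. */
struct seq_ops_sketch {
    int (*show)(char *buf, int len, int item);
};

static int show_item(char *buf, int len, int item)
{
    return snprintf(buf, len, "item %d\n", item);
}

/*
 * Declared const: the compiler rejects any assignment to demo_ops.show,
 * and toolchains typically place the table in read-only memory.
 */
static const struct seq_ops_sketch demo_ops = {
    .show = show_item,
};

/* Callers take a pointer-to-const, so they cannot modify the table. */
static int run_show(const struct seq_ops_sketch *ops,
                    char *buf, int len, int item)
{
    return ops->show(buf, len, item);
}
```

Since the table is filled in at compile time, nothing is lost by making it const; a statement like demo_ops.show = NULL simply fails to build.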
data=guarded. Back in the middle of the discussion of crash robustness
and latency in the ext3 filesystem, Chris Mason came forward with a
proposal for a data=guarded
mode, which would delay metadata updates when files change size to
prevent the disclosure of unrelated information. Since then, the
data=guarded patch has disappeared from view. In response to a query from
Frans Pop, Chris confirmed that he is still
working on that code, and that he plans to get it merged for 2.6.32.
Among those welcoming the news was Andi Kleen, who remarked: "data=writeback already cost
me a few files after crashes here." The data=guarded mode may not
help with that particular problem, though: it is really meant to combine
the security benefits of data=ordered (not disclosing random data, in
particular) with the performance benefits of data=writeback. The worst
data-loss problems should have already been addressed by the robustness
fixes that went into ext3 for 2.6.30.
By Jonathan Corbet
August 12, 2009
Tracepoints are markers within the kernel source which, when enabled, can
be used to hook into a running kernel at the point where the marker is
located. They can be used by a number of tools for kernel debugging and
performance problem diagnosis. One of the advantages of the DTrace system
found in Solaris is the extensive set of well-documented tracepoints in the
kernel (and beyond); they allow administrators and developers to monitor
many aspects of system behavior without needing to know much about the
kernel itself. Linux, instead, is rather late to the tracepoint party;
mainline kernels currently feature only a handful of static tracepoints.
Whether that number will grow significantly is still a matter of debate
within the development community.
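The basic idea of a static tracepoint can be modeled in a few lines of user-space C. This is a toy sketch, not the kernel's tracepoint implementation: every name here (tracepoint_sketch, tp_attach(), trace_demo_alloc(), and so on) is invented for illustration, only one probe can be attached, and there is no protection against concurrent use.

```c
#include <stddef.h>

/* Hypothetical probe signature; real tracepoints carry typed arguments. */
typedef void (*probe_fn)(void *data, size_t bytes);

/* Toy model of one tracepoint: an enable flag plus one attached probe. */
struct tracepoint_sketch {
    int enabled;
    probe_fn probe;
    void *data;
};

static struct tracepoint_sketch tp_demo_alloc;

static void tp_attach(struct tracepoint_sketch *tp, probe_fn fn, void *data)
{
    tp->probe = fn;
    tp->data = data;
    tp->enabled = 1;
}

static void tp_detach(struct tracepoint_sketch *tp)
{
    tp->enabled = 0;
    tp->probe = NULL;
}

/* The marker itself: a flag test when disabled, a callback when enabled. */
static void trace_demo_alloc(size_t bytes)
{
    if (tp_demo_alloc.enabled)
        tp_demo_alloc.probe(tp_demo_alloc.data, bytes);
}

/* An instrumented function; the actual allocation work is elided. */
static void demo_alloc(size_t bytes)
{
    trace_demo_alloc(bytes);
}

/* Example probe: count calls and total bytes requested. */
struct alloc_stats {
    unsigned long calls;
    unsigned long bytes;
};

static void count_allocs(void *data, size_t bytes)
{
    struct alloc_stats *s = data;

    s->calls++;
    s->bytes += bytes;
}
```

A tool would attach count_allocs() with tp_attach(), let the workload run, then detach and read the statistics; while no probe is attached, the instrumented function pays only for the flag test.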
LWN last looked at the tracepoint
discussion in April. Since then, the disagreement has returned with
little change. The catalyst this time was Mel Gorman's page allocator tracepoints
patch, which further instruments the memory management layer. The
mainline kernel already contains tracepoints for calls to functions like
kmalloc(), kmem_cache_alloc(), and kfree().
Mel's patch adds tracepoints to the low-level page allocator, in places
like free_pages_bulk(), __rmqueue_fallback(), and
__free_pages(). These tracepoints give a view into how the page
allocator is performing; they'll inform a suitably clueful user if
fragmentation is growing or pages are being moved between processors. Also
included is a postprocessing script which uses the tracepoint data to
create a list of which processes on the system are putting the most stress
on the memory management code.
As has happened before, Andrew Morton questioned the value of these tracepoints. He
tends not to see the need for this sort of instrumentation, seeing it
instead as debugging code which is generally useful to a single developer.
Beyond that, Andrew asks, why can't the relevant information be added to
/proc/vmstat, which is an established interface for the provision
of memory management information to user space?
There are a couple of answers to that question. One is that
/proc/vmstat has a number of limitations; it cannot be used, for
example, to monitor the memory-management footprint of a specific set of
processes. It is, in essence, pre-cooked information about memory
management in the system as a whole; if a developer needs information which
cannot be found there, that information will be almost impossible to get.
Tracepoints, instead, provide much more specific information which can be
filtered to give more precise views of the system. Mel bashed out one demonstration: a SystemTap script which uses
the tracepoints to create a list of which processes are causing the most
page allocations.
Ingo Molnar posted a lengthy set of
examples of what could be done with tracepoints; some of these were
later taken by Mel and incorporated into a
document on simple tracepoint use. These examples merit a look; they
show just how quickly and how far the instrumentation of the Linux kernel
(and associated tools) has developed.
One of the key secrets for quick use of tracepoints is the perf
tool which is shipped with the kernel as of 2.6.31-rc1. This tool was written
as part of the performance monitoring subsystem; it can be used, for
example, to run a program and report on the number of cache misses
sustained during its execution. One of the features slipped into the
performance counter subsystem was the ability to treat tracepoint events
like performance counter events. One must set the
CONFIG_EVENT_PROFILE configuration option; after that,
perf can work with tracepoint events in exactly the same way it
manages counter events.
With that in place, and a working perf binary, one can start by
seeing which tracepoint events are available on the system:
$ perf list
...
ext4:ext4_sync_fs [Tracepoint event]
kmem:kmalloc [Tracepoint event]
kmem:kmem_cache_alloc [Tracepoint event]
kmem:kmalloc_node [Tracepoint event]
kmem:kmem_cache_alloc_node [Tracepoint event]
kmem:kfree [Tracepoint event]
kmem:kmem_cache_free [Tracepoint event]
ftrace:kmem_free [Tracepoint event]
...
How many kmalloc() calls are happening on a system? The question
can be answered with:
$ perf stat -a -e kmem:kmalloc sleep 10
Performance counter stats for 'sleep 10':
4119 kmem:kmalloc
10.001645968 seconds time elapsed
So your editor's mostly idle system was calling kmalloc() a little over
410 times per second (4119 calls in ten seconds). The -a option gives whole-system results,
but perf can also look at specific processes. Monitoring allocations
during the building of the perf tool gives:
$ perf stat -e kmem:kmalloc make
...
Performance counter stats for 'make':
5554 kmem:kmalloc
2.999255416 seconds time elapsed
More detail can be had by recording data and analyzing it afterward:
$ perf record -c 1 -e kmem:kmalloc make
...
$ perf report
# Samples: 6689
#
# Overhead Command Shared Object Symbol
# ........ ............... .................................... ......
#
19.43% make /lib64/libc-2.10.1.so [.] __getdents64
12.32% sh /lib64/libc-2.10.1.so [.] __execve
10.29% gcc /lib64/libc-2.10.1.so [.] __execve
7.53% cc1 /lib64/libc-2.10.1.so [.] __GI___libc_open
5.02% cc1 /lib64/libc-2.10.1.so [.] __execve
4.41% sh /lib64/libc-2.10.1.so [.] __GI___libc_open
3.45% sh /lib64/libc-2.10.1.so [.] fork
3.27% sh /lib64/ld-2.10.1.so [.] __mmap
3.11% as /lib64/libc-2.10.1.so [.] __execve
2.92% make /lib64/libc-2.10.1.so [.] __GI___vfork
2.65% gcc /lib64/libc-2.10.1.so [.] __GI___vfork
Conclusion: the largest source of kmalloc() calls in a simple
compilation process is getdents(), called from make,
followed by the execve() calls needed to run the compiler.
The perf tool can take things further; it can, for example,
generate call graphs and disassemble the code around specific
performance-relevant points. See Ingo's mail and Mel's document for more
information. Even then, we're just talking about statistics on
tracepoints; there is a lot more information available which can be used in
postprocessing scripts or tools like SystemTap. Suffice to say that
tracepoints open a lot of possibilities.
The obvious question is: was Andrew impressed by all this? Here's his answer:
So? The fact that certain things can be done doesn't mean that there's
a demand for them, nor that anyone will _use_ this stuff.
As usual, we're adding tracepoints because we feel we must add
tracepoints, not because anyone has a need for the data which they
gather.
He suggested that he would be happier if the new tracepoints could be used
to phase out /proc/vmstat and /proc/meminfo; that way
there would not be a steadily-increasing variety of memory management
instrumentation methods. Removing those files is problematic for a couple
of reasons, though. One is that they form part of the kernel ABI, which is
not easily broken. It would be a multi-year process to move applications
over to a different interface and be sure there were no more users of the
/proc files. Beyond that, though, tracepoints are good for
reporting events, but they are a bit less well-suited to reporting the
current state of affairs. One can use a tracepoint to see page allocation
events, but an interface like /proc/vmstat can be more
straightforward if one simply wishes to know how many pages are free.
There is space, in other words, for both styles of instrumentation.
As of this writing, nobody has made a final pronouncement on whether the
new tracepoints will be merged. Andrew has made it clear, though, that,
despite his concerns, he's not firmly opposing them. There is enough
pressure to get better instrumentation into the kernel, and enough useful
things to do with that instrumentation, that, one assumes, more of it will
go into the mainline over time.
By Jake Edge
August 12, 2009
As part of the changes to support application checkpoint and restart in the
kernel, Sukadev Bhattiprolu has proposed a new system call:
clone_with_pids(). When a process that was checkpointed gets
restarted, having the same process id (PID) as it had when the checkpoint
was done is important to some kinds of applications. Normally, the kernel
assigns an unused PID
when a new task is started (via clone()), but, for checkpointed
processes, that could lead to
processes' PIDs changing during their lifetime, which could be an
undesirable side effect. So, Bhattiprolu is looking for a way to avoid
that by allowing clone() callers to specify the
PID—or PIDs for processes in nested
namespaces—of the child.
The actual system call is fairly straightforward. It adds an additional
pid_set parameter to clone(), to contain a list of
process ids; pid_set has the obvious definition:
struct pid_set {
int num_pids;
pid_t *pids;
};
A pointer to a
pid_set is passed as the last parameter to
clone_with_pids(). Each of the PIDs is used to specify
which PID should be assigned at each level of namespace nesting.
The patch that actually implements
clone_with_pids() (as opposed
to the earlier patches in the patchset that prepare the way)
illustrates this with an example (slightly
edited for clarity):
pid_t pids[] = { 0, 77, 99 };
struct pid_set pid_set;
pid_set.num_pids = sizeof(pids) / sizeof(pids[0]);
pid_set.pids = pids;
clone_with_pids(flags, stack, NULL, NULL, NULL, &pid_set);
If a target-pid is 0, the kernel continues to assign a pid for the process in
that namespace. In the above example, pids[0] is 0, meaning the kernel will
assign next available pid to the process in init_pid_ns. But kernel will assign
pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either
77 or 99 are taken, the system call fails with -EBUSY.
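The per-level semantics described above can be modeled in user space. The following is a sketch under loud assumptions: pid_ns_sketch, alloc_pid_at_level(), and assign_pids() are hypothetical names, the tiny PID_MAX_SKETCH limit and the bitmap-style bookkeeping are invented, and the error codes other than -EBUSY are guesses rather than anything the patch specifies.

```c
#include <errno.h>
#include <stdbool.h>

#define PID_MAX_SKETCH 128  /* hypothetical, tiny pid space for the model */

/* Hypothetical model of one pid namespace: which pids are in use. */
struct pid_ns_sketch {
    bool used[PID_MAX_SKETCH];
    int last_pid;
};

/* target == 0: pick the next free pid; otherwise insist on target. */
static int alloc_pid_at_level(struct pid_ns_sketch *ns, int target)
{
    if (target == 0) {
        for (int pid = ns->last_pid + 1; pid < PID_MAX_SKETCH; pid++) {
            if (!ns->used[pid]) {
                ns->used[pid] = true;
                ns->last_pid = pid;
                return pid;
            }
        }
        return -EAGAIN;     /* assumed error for an exhausted namespace */
    }
    if (target < 0 || target >= PID_MAX_SKETCH)
        return -EINVAL;     /* assumed error for an out-of-range request */
    if (ns->used[target])
        return -EBUSY;      /* requested pid is already taken */
    ns->used[target] = true;
    return target;
}

/*
 * Assign one pid per nesting level; out[i] receives the pid chosen at
 * level i. On failure, pids already taken at outer levels are released.
 */
static int assign_pids(struct pid_ns_sketch *levels, const int *targets,
                       int n, int *out)
{
    for (int i = 0; i < n; i++) {
        int pid = alloc_pid_at_level(&levels[i], targets[i]);

        if (pid < 0) {
            while (--i >= 0)
                levels[i].used[out[i]] = false;
            return pid;
        }
        out[i] = pid;
    }
    return 0;
}
```

With targets of { 0, 77, 99 }, the outermost level gets whatever pid is free, while the two inner levels get exactly 77 and 99 or the whole operation fails.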
The patchset assumes that being able to set PIDs is desirable, but
Linus Torvalds was
not particularly in favor of that approach when it was first discussed on linux-kernel back
in March. His complaint was that there are far too many stateful
attributes of processes to ever be able to handle checkpointing in the
general case. His suggestion: "just teach the damn program
you're checkpointing that pids will change, and admit to everybody
that people who want to be checkpointed need to do work".
Others disagreed—no surprise—but it is unclear that
Torvalds has changed his mind. He was also concerned about the security
implications of processes being able to request PID assignments:
"But it also sounds like a _wonderful_ attack vector against badly
written user-land software that sends signals and has small races."
That particular concern should be alleviated by the requirement that a
process have the CAP_SYS_ADMIN capability (essentially root
privileges) in order to use clone_with_pids().
Requiring root to
handle restarts, which in practice means that root must manage the checkpoint
process as well, makes checkpoint/restart less useful, overall. But there
are a whole host of problems to solve before allowing users to arbitrarily
checkpoint and restore from their own, quite possibly maliciously crafted,
checkpoint images. Even with root handling the process, there are a number
of interesting applications.
There is an additional wrinkle that Bhattiprolu notes in the patch.
Currently, all of the available clone() flags are allocated. That
doesn't affect clone_with_pids() directly, as the flags it needs
are already present, but, when adding a system call, it is good to look
to the future. To that end, there are two proposed implementations of
a clone_extended() system call, which could be added instead of
clone_with_pids(), that would allow for more
clone() flags, while still supporting the restart case.
The first possibility is to turn the flags argument into a pointer
to an array of flag words that would be treated like signal
sets, including operations to test, set, and clear flags a la
sigsetops():
typedef struct {
unsigned long flags[CLONE_FLAGS_WORDS];
} clone_flags_t;
int clone_extended(clone_flags_t *flags, void *child_stack, int *unused,
int *parent_tid, int *child_tid, struct pid_set *pid_set);
In the proposal,
CLONE_FLAGS_WORDS would be set to 1 for 64-bit
architectures,
while on 32-bit architectures, it would be set to 2, thus doubling the
number of available flags to 64. Should the number of clone flags needed
grow, that could be expanded as required, though doing so in a
backward-compatible manner is not really possible.
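The sigsetops()-style operations on such a structure might look something like the following. This is a guess at the shape of the helpers, not code from the proposal: the cloneflags_*() names are hypothetical, CLONE_FLAGS_WORDS is fixed at 2 purely for the example, and flags are identified by bit number.

```c
#include <limits.h>
#include <string.h>

#define CLONE_FLAGS_WORDS 2  /* the value proposed for 32-bit systems */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define CLONE_FLAGS_BITS (CLONE_FLAGS_WORDS * BITS_PER_WORD)

typedef struct {
    unsigned long flags[CLONE_FLAGS_WORDS];
} clone_flags_t;

/* Hypothetical analogue of sigemptyset(): clear every flag. */
static void cloneflags_empty(clone_flags_t *set)
{
    memset(set, 0, sizeof(*set));
}

/* Hypothetical analogue of sigaddset(): set one flag by bit number. */
static int cloneflags_add(clone_flags_t *set, unsigned int bit)
{
    if (bit >= CLONE_FLAGS_BITS)
        return -1;
    set->flags[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
    return 0;
}

/* Hypothetical analogue of sigdelset(): clear one flag. */
static int cloneflags_del(clone_flags_t *set, unsigned int bit)
{
    if (bit >= CLONE_FLAGS_BITS)
        return -1;
    set->flags[bit / BITS_PER_WORD] &= ~(1UL << (bit % BITS_PER_WORD));
    return 0;
}

/* Hypothetical analogue of sigismember(): 1 if set, 0 if not, -1 on error. */
static int cloneflags_ismember(const clone_flags_t *set, unsigned int bit)
{
    if (bit >= CLONE_FLAGS_BITS)
        return -1;
    return (set->flags[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL;
}
```

Hiding the word layout behind accessors like these is what lets the array grow within the structure, though, as noted, the size of the structure itself is baked into the ABI.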
Another option is to split the flags into two parameters, keeping the
current flags parameter as it is, and adding a new
clone_info parameter that contains new flags along with the
pid_set:
struct clone_info {
int num_clone_high_words;
int *flags_high;
struct pid_set pid_set;
};
int clone_extended(int flags_low, void *child_stack, void *unused,
int *parent_tid, int *child_tid, struct clone_info *clone_info);
There are pros and cons to each approach, as Bhattiprolu points out. The
first requires a
copy_from_user() for the flags in all cases
(though 64-bit architectures might be able to avoid that for now), while
the second requires the awkward splitting of the flags, but avoids the
copy_from_user() for calls that don't use the new flags or
pid_sets.
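To make the "awkward splitting" concrete, here is a sketch of how a flag lookup might work under the second interface. The bit-numbering convention (flags below the width of an int live in flags_low, the rest in successive flags_high words) and the clone_flag_is_set() helper are assumptions made for illustration; the proposal does not spell them out. The pids member uses int rather than pid_t to keep the example portable.

```c
#include <limits.h>
#include <stddef.h>

struct pid_set {
    int num_pids;
    int *pids;      /* pid_t in the proposal */
};

struct clone_info {
    int num_clone_high_words;
    int *flags_high;
    struct pid_set pid_set;
};

#define FLAG_BITS_PER_WORD ((unsigned int)(sizeof(int) * CHAR_BIT))

/*
 * Hypothetical lookup: low-numbered flags come from flags_low and need
 * no clone_info at all; higher-numbered flags need the extra structure.
 */
static int clone_flag_is_set(int flags_low, const struct clone_info *info,
                             unsigned int bit)
{
    unsigned int word;

    if (bit < FLAG_BITS_PER_WORD)
        return ((unsigned int)flags_low >> bit) & 1u;
    if (info == NULL)
        return 0;   /* no extended flags were supplied */
    word = bit / FLAG_BITS_PER_WORD - 1;
    if (word >= (unsigned int)info->num_clone_high_words)
        return 0;
    return ((unsigned int)info->flags_high[word]
            >> (bit % FLAG_BITS_PER_WORD)) & 1u;
}
```

A caller that uses only the existing flags can pass a NULL clone_info, which is the case where the second approach avoids the extra copy from user space.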
It is hard to imagine that copying a bit of data from user space will
measurably impact a system call that is creating a process, though, so some
derivative of the first option would seem to be the better choice. It's
also a bit hard to see the need for more than 64 clone() flags,
but if that is truly desired, something with a path for compatibility is
needed.
There has been no objection to the implementation of
clone_with_pids(), but there have been few comments overall.
Pavel Machek wondered about the need for
setting the PID of anything but the inner-most namespace, but
Serge E. Hallyn noted that nested
namespaces require that ability: "we might be restarting an app
using a nested pid namespace, in which case restart would specify pids for
2 (or more) of the innermost containers".
Machek also thought there should be a documentation file that described the
new system call, and Bhattiprolu agreed, but is waiting to see what kind of
consensus on either clone_with_pids() or clone_extended()
(and which of the two interfaces for the latter) would emerge. So far, no
one has commented on that particular aspect.
This
is version 4 of the patchset, and the history shows that earlier comments
have been addressed. It is still at the RFC stage, or, as
Bhattiprolu puts it: "Its mostly an exploratory patch seeking
feedback on the interface". That feedback has yet to emerge,
however, and one might wonder whether Torvalds will still object to the
whole approach. It would seem, though, that there are too many important
applications for checkpoint and restart—including process migration
and the ability to upgrade kernels underneath long-running
processes—for some kind of solution not to make its way into the
kernel eventually.
By Jonathan Corbet
August 10, 2009
Network device drivers have been using the increasingly misnamed NAPI ("new
API") interface for some time now. NAPI allows a network driver to
turn off interrupts from an interface and go into a polling mode. Polling
is often seen as a bad thing, but it's really only a problem when poll
attempts turn up no useful work to do. With a busy network interface,
there will always be new packets to process; "polling," in this situation, really means
"going off to deal with the accumulated work." When there is always work
to do, interrupts informing the system of that fact are really just added
noise. Your editor likes to compare the situation to email notifications;
anybody who gets a reasonable volume of email is quite likely to turn such
notifications off. They are distracting, and there is probably always
email waiting whenever one gets around to checking.
NAPI is well suited to network drivers, since high packet rates can lead to
high interrupt rates, but it has not spread to other parts of the kernel,
where interrupt rates are lower. That situation could change
in 2.6.32, though, if Jens Axboe follows through with his plan to merge the
new blk-iopoll
infrastructure into the mainline. In short, blk-iopoll is NAPI for block
devices; indeed, some of the core code was borrowed from the NAPI
implementation.
Converting a block driver to blk-iopoll is straightforward. Each
interrupting device needs to have a struct blk_iopoll structure
defined for it, presumably in the structure which describes the device
within the driver. This structure should be initialized with:
#include <linux/blk-iopoll.h>
typedef int (blk_iopoll_fn)(struct blk_iopoll *, int);
void blk_iopoll_init(struct blk_iopoll *iop, int weight, blk_iopoll_fn *poll_fn);
The weight value describes the relative importance of the device;
a higher weight results in more requests being processed in each polling
cycle. As with NAPI, there is no definitive guidance as to what
weight should be; in Jens's initial patch, it is set to 32. The
poll_fn() will be called when the block subsystem decides that it's
time to poll for completed requests.
I/O polling for a device is controlled with:
void blk_iopoll_enable(struct blk_iopoll *iop);
void blk_iopoll_disable(struct blk_iopoll *iop);
A call to blk_iopoll_enable() must be made by the driver before
any polling of the device will happen. Enabling polling allows that
polling to occur, but does not cause it to happen. There is no
point in polling a device which is not doing any work, so the block layer
will not actually poll a given device until the driver informs it that
there may be a reason to do so.
That normally happens when the device is actually interrupting. The driver
can, in its interrupt handler, switch over to polling mode through a
three-step process. The first is to check the global variable
blk_iopoll_enabled; if it is zero, block I/O polling cannot be
used. Assuming polling is enabled, the driver should prepare the
blk_iopoll structure with:
int blk_iopoll_sched_prep(struct blk_iopoll *iop);
In the first version of the patch, a return value of zero means that the
preparation "failed," either because polling is disabled or because the
device is already in polling mode. In future versions, the sense of the
return value is likely to be inverted to the more standard "zero means
success" mode. If blk_iopoll_sched_prep() succeeds, the
driver can then call:
void blk_iopoll_sched(struct blk_iopoll *iop);
At this point, polling mode has been entered; the driver need only disable
interrupts from its device and return. The "disable interrupts" step
should, of course, be done at the device itself; masking the IRQ line would
be an antisocial act in a world where those lines are shared.
Later on, the block layer will call the poll_fn() which was
provided to blk_iopoll_init(). The prototype for this function
is:
typedef int (blk_iopoll_fn)(struct blk_iopoll *iop, int budget);
The polling function is called (in software interrupt context) with
iop being the related
blk_iopoll structure, and budget being the maximum number
of requests that the poll function should process. In normal usage, the
driver's device-specific structure can be obtained from iop with
container_of(). The budget value is just the
weight that was specified back at initialization time.
The return value should be the number of requests actually processed.
If the driver processes fewer requests than the given budget, it should turn
off further polling with:
void blk_iopoll_complete(struct blk_iopoll *iopoll);
Interrupts from the device should be re-enabled, since further polling
will not happen. Note that the block layer assumes that a driver will
not call blk_iopoll_complete() if it has consumed its
full budget. If it's necessary to return to interrupt mode despite having
exhausted the budget, the driver should either (1) use
blk_iopoll_disable(), or (2) lie about the number of requests
processed when returning from the polling function.
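Putting the pieces together, the flow through a converted driver might look like the sketch below. It is a user-space mock, not kernel code: the blk_iopoll structure and the blk_iopoll_*() functions here are simplified stand-ins written for this example (with blk_iopoll_sched_prep() returning nonzero on success, as in the first version of the patch), and demo_dev, demo_irq(), and demo_poll() are a hypothetical driver.

```c
#include <stddef.h>

/* ---- Mock block-layer API: stand-ins, not the kernel implementation ---- */
struct blk_iopoll;
typedef int (blk_iopoll_fn)(struct blk_iopoll *, int);

struct blk_iopoll {
    int enabled;        /* set by blk_iopoll_enable() */
    int scheduled;      /* nonzero while the device is in polling mode */
    int weight;
    blk_iopoll_fn *poll;
};

static int blk_iopoll_enabled = 1;  /* models the global switch */

static void blk_iopoll_init(struct blk_iopoll *iop, int weight,
                            blk_iopoll_fn *poll_fn)
{
    iop->enabled = 0;
    iop->scheduled = 0;
    iop->weight = weight;
    iop->poll = poll_fn;
}

static void blk_iopoll_enable(struct blk_iopoll *iop) { iop->enabled = 1; }

/* First-version semantics: zero means "failed", nonzero means go ahead. */
static int blk_iopoll_sched_prep(struct blk_iopoll *iop)
{
    return iop->enabled && !iop->scheduled;
}

static void blk_iopoll_sched(struct blk_iopoll *iop) { iop->scheduled = 1; }
static void blk_iopoll_complete(struct blk_iopoll *iop) { iop->scheduled = 0; }

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* ---- Hypothetical driver ---- */
struct demo_dev {
    int pending;        /* completed requests waiting to be reaped */
    int irq_enabled;    /* models the device's own interrupt-enable bit */
    struct blk_iopoll iopoll;
};

/* Poll function: reap up to budget requests, report how many were done. */
static int demo_poll(struct blk_iopoll *iop, int budget)
{
    struct demo_dev *dev = container_of(iop, struct demo_dev, iopoll);
    int done = 0;

    while (done < budget && dev->pending > 0) {
        dev->pending--;
        done++;
    }
    if (done < budget) {
        /* Out of work: leave polling mode and re-enable interrupts. */
        blk_iopoll_complete(iop);
        dev->irq_enabled = 1;
    }
    return done;
}

/* Interrupt handler: the three-step switch into polling mode. */
static void demo_irq(struct demo_dev *dev)
{
    if (blk_iopoll_enabled && blk_iopoll_sched_prep(&dev->iopoll)) {
        blk_iopoll_sched(&dev->iopoll);
        dev->irq_enabled = 0;   /* disable at the device, not the IRQ line */
        return;
    }
    dev->pending = 0;           /* fallback: reap everything right here */
}

static void demo_setup(struct demo_dev *dev)
{
    dev->pending = 0;
    dev->irq_enabled = 1;
    blk_iopoll_init(&dev->iopoll, 32, demo_poll);
    blk_iopoll_enable(&dev->iopoll);
}
```

In the mock, 40 pending requests with a weight of 32 take two polling passes: the first consumes the full budget and stays in polling mode, while the second finds only eight requests and drops back to interrupts.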
One might well wonder about the motivation behind all of this work. Block
device interrupt handling has not traditionally been a performance
bottleneck. The problem is the rapid improvement in solid-state storage
devices. It is expected that, before too long, these devices will be
operating in the range of 100,000 I/O operations per second - far beyond
anything that rotating storage can do. When dealing with that many I/O
operations, the kernel must take care to minimize the per-operation
overhead in any way possible. As others have observed, the block layer
needs to become more like the network layer, with the per-request cost
squeezed to a bare minimum. The blk-iopoll code is a step in that
direction.
How big a step? Jens has posted some
preliminary numbers showing significant reductions in system time on a
random-read disk benchmark. More testing will certainly be required; in
particular, some developers are concerned about the possibility of
increasing I/O latency. But the initial numbers suggest that this work has
improved the efficiency of the block subsystem under load.
Page editor: Jonathan Corbet