
Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.31-rc5; there have been no 2.6.31 prepatches released since July 31. Patches continue to flow into the mainline repository (442 since 2.6.31-rc5, as of this writing) and the 2.6.31-rc6 release can be expected at almost any time.

Comments (none posted)

Kernel development news

Quotes of the week

Ok, so my definition of "plain C" is a bit odd. There's nothing plain about it. It's disgusting C preprocessor misuse. But dang, it's kind of fun to abuse the compiler this way.
-- Linus Torvalds

Can we add a consistent "--eatmydata" type of hurdle to jump over before people are allowed to use either the so-far-less-tested tools and/or options therein? [...]

I'm nervous about ext4 coming into wider use and people finding some of the bits which aren't -quite- ready for prime time yet, and winding up with a disaster.

-- Eric Sandeen

Got a SEGV, don't worry about it anymore! Just rescue an exception and get on with life. Who cares about getting a SEGV anyway? It's just memory. I mean, when I was in school, I didn't need 100% to pass the class. Why should your memory need to be 100% correct to get the job done? A little memory corruption here and there doesn't hurt anyone.
-- NeverSayDie, get your copy today

Comments (6 posted)

In Brief

By Jonathan Corbet
August 12, 2009
Tux3. The once-noisy Tux3 development community has gone rather quiet in recent months. An inquiry into the status of the project led to one of last week's quotes of the week, wherein developer Daniel Phillips pled a lack of time and expressed regrets at not having merged the code into the mainline months ago. When asked (by Ted Ts'o) for a description of what makes Tux3 interesting, Daniel responded this way:

I think Tux3 fills an empty niche in our filesystem ecology where a simple, clean and modern general purpose filesystem should exist and there is none. In concrete terms, Tux3 implements a single-pointer-per-extent model that Btrfs and ZFS do not. This allows a very simple *physical* design, with much complexity pushed to the *logical* level where things generally behave better. A simple physical design offers many benefits, including making it easier to take a run at that holiest of holy grails, online check and repair.

What Tux3 needs, it seems, is some new development energy. It could be an interesting project for developers who are wanting to get started in filesystem development.

Resource counters. The resource counter mechanism is built into control groups; it is intended for use by tools like the memory use controller. These counters contain, at their core, a (believe it or not) counter value which tracks the current usage of a resource by a given control group. This counter has run into the same problem which afflicts any frequently-changed global variable: it scales poorly due to cache line bouncing. The usage of some resources (pages of memory, for example) can change frequently, causing the associated counter to be a drag on the system as a whole.

Balbir Singh's scalable resource counters patch aims to fix that situation. With this patch, the single "usage" counter becomes an array of per-CPU counters. Since each processor works with its own copy of the counter, there is no more cache line bouncing and things run faster. The down side is that the count becomes approximate. The per-CPU counters are summed occasionally to keep everything roughly in sync, but keeping exact counts would take away much of the scalability that this patch was meant to provide. The good news is that exact counts are not really needed anyway; as long as the counter reflects something close enough to reality, the system will work essentially as it did before - only a little more quickly.

Inline spinlocks. Once upon a time, spinlocks were implemented with a series of inline functions, on the notion that such a performance-critical primitive would need to be as fast as possible. That changed in 2004, when spinlocks were turned into normal functions. The function call overhead hurt a bit, but moving spinlocks out-of-line made the kernel considerably smaller, which has performance benefits of its own. And that's how spinlocks have been ever since.

The pendulum may be about to swing the other way again, though, at least for the S390 architecture. Heiko Carstens noted that function calls on this architecture are quite expensive. He put together an inline spinlocks patch and measured performance improvements of 1-5%. So he would like to put this patch into the mainline, along with a configuration option allowing each architecture to choose the best way to implement spinlocks. So far, there has been little commentary for or against this idea.

Const seq_operations. James Morris has posted a patch making seq_operations structures constant throughout the kernel. These structures are almost always populated at compile time and never need to change; allowing the function pointers therein to be overwritten can only be useful to those who would like to subvert the kernel. A number of core VFS operations structures have been made const over the years, but seq_operations has not been addressed until now. James says: "This is derived from the grsecurity patch, although generated from scratch because it's simpler than extracting the changes from there."

data=guarded. Back in the middle of the discussion of crash robustness and latency in the ext3 filesystem, Chris Mason came forward with a proposal for a data=guarded mode, which would delay metadata updates when files change size to prevent the disclosure of unrelated information. Since then, the data=guarded patch has disappeared from view. In response to a query from Frans Pop, Chris confirmed that he is still working on that code, and that he plans to get it merged for 2.6.32.

Among those welcoming the news was Andi Kleen, who remarked: "data=writeback already cost me a few files after crashes here." The data=guarded mode may not help with that particular problem, though: it is really meant to combine the security benefits of data=ordered (not disclosing random data, in particular) with the performance benefits of data=writeback. The worst data-loss problems should have already been addressed by the robustness fixes that went into ext3 for 2.6.30.

Comments (4 posted)

Fun with tracepoints

By Jonathan Corbet
August 12, 2009
Tracepoints are markers within the kernel source which, when enabled, can be used to hook into a running kernel at the points where the markers are located. They can be used by a number of tools for kernel debugging and performance problem diagnosis. One of the advantages of the DTrace system found in Solaris is the extensive set of well-documented tracepoints in the kernel (and beyond); they allow administrators and developers to monitor many aspects of system behavior without needing to know much about the kernel itself. Linux, instead, is rather late to the tracepoint party; mainline kernels currently feature only a handful of static tracepoints. Whether that number will grow significantly is still a matter of debate within the development community.

LWN last looked at the tracepoint discussion in April. Since then, the disagreement has returned with little change. The catalyst this time was Mel Gorman's page allocator tracepoints patch, which further instruments the memory management layer. The mainline kernel already contains tracepoints for calls to functions like kmalloc(), kmem_cache_alloc(), and kfree(). Mel's patch adds tracepoints to the low-level page allocator, in places like free_pages_bulk(), __rmqueue_fallback(), and __free_pages(). These tracepoints give a view into how the page allocator is performing; they'll inform a suitably clueful user if fragmentation is growing or pages are being moved between processors. Also included is a postprocessing script which uses the tracepoint data to create a list of which processes on the system are putting the most stress on the memory management code.

As has happened before, Andrew Morton questioned the value of these tracepoints. He tends not to see the need for this sort of instrumentation, seeing it instead as debugging code which is generally useful to a single developer. Beyond that, Andrew asks, why can't the relevant information be added to /proc/vmstat, which is an established interface for the provision of memory management information to user space?

There are a couple of answers to that question. One is that /proc/vmstat has a number of limitations; it cannot be used, for example, to monitor the memory-management footprint of a specific set of processes. It is, in essence, pre-cooked information about memory management in the system as a whole; if a developer needs information which cannot be found there, that information will be almost impossible to get. Tracepoints, instead, provide much more specific information which can be filtered to give more precise views of the system. Mel bashed out one demonstration: a SystemTap script which uses the tracepoints to create a list of which processes are causing the most page allocations.

Ingo Molnar posted a lengthy set of examples of what could be done with tracepoints; some of these were later taken by Mel and incorporated into a document on simple tracepoint use. These examples merit a look; they show just how quickly and how far the instrumentation of the Linux kernel (and the associated tools) has developed.

One of the key secrets for quick use of tracepoints is the perf tool which is shipped with the kernel as of 2.6.31-rc1. This tool was written as part of the performance monitoring subsystem; it can be used, for example, to run a program and report on the number of cache misses sustained during its execution. One of the features slipped into the performance counter subsystem was the ability to treat tracepoint events like performance counter events. One must set the CONFIG_EVENT_PROFILE configuration option; after that, perf can work with tracepoint events in exactly the same way it manages counter events.

With that in place, and a working perf binary, one can start by seeing which tracepoint events are available on the system:

    $ perf list
      ext4:ext4_sync_fs                        [Tracepoint event]
      kmem:kmalloc                             [Tracepoint event]
      kmem:kmem_cache_alloc                    [Tracepoint event]
      kmem:kmalloc_node                        [Tracepoint event]
      kmem:kmem_cache_alloc_node               [Tracepoint event]
      kmem:kfree                               [Tracepoint event]
      kmem:kmem_cache_free                     [Tracepoint event]
      ftrace:kmem_free                         [Tracepoint event]

How many kmalloc() calls are happening on a system? The question can be answered with:

    $ perf stat -a -e kmem:kmalloc sleep 10

     Performance counter stats for 'sleep 10':

           4119  kmem:kmalloc            

     10.001645968  seconds time elapsed

So your editor's mostly idle system was calling kmalloc() just over 410 times per second. The -a option gives whole-system results, but perf can also look at specific processes. Monitoring allocations during the building of the perf tool gives:

    $ perf stat -e kmem:kmalloc make

     Performance counter stats for 'make':

            5554  kmem:kmalloc

      2.999255416  seconds time elapsed

More detail can be had by recording data and analyzing it afterward:

    $ perf record -c 1 -e kmem:kmalloc make
    $ perf report
    # Samples: 6689
    # Overhead          Command                         Shared Object  Symbol
    # ........  ...............  ....................................  ......
      19.43%             make  /lib64/                 [.] __getdents64
      12.32%               sh  /lib64/                 [.] __execve
      10.29%              gcc  /lib64/                 [.] __execve
       7.53%              cc1  /lib64/                 [.] __GI___libc_open
       5.02%              cc1  /lib64/                 [.] __execve
       4.41%               sh  /lib64/                 [.] __GI___libc_open
       3.45%               sh  /lib64/                 [.] fork
       3.27%               sh  /lib64/                   [.] __mmap
       3.11%               as  /lib64/                 [.] __execve
       2.92%             make  /lib64/                 [.] __GI___vfork
       2.65%              gcc  /lib64/                 [.] __GI___vfork

Conclusion: the largest source of kmalloc() calls in a simple compilation process is getdents(), called from make, followed by the execve() calls needed to run the compiler.

The perf tool can take things further; it can, for example, generate call graphs and disassemble the code around specific performance-relevant points. See Ingo's mail and Mel's document for more information. Even then, we're just talking about statistics on tracepoints; there is a lot more information available which can be used in postprocessing scripts or tools like SystemTap. Suffice to say that tracepoints open a lot of possibilities.

The obvious question is: was Andrew impressed by all this? Here's his answer:

So? The fact that certain things can be done doesn't mean that there's a demand for them, nor that anyone will _use_ this stuff.

As usual, we're adding tracepoints because we feel we must add tracepoints, not because anyone has a need for the data which they gather.

He suggested that he would be happier if the new tracepoints could be used to phase out /proc/vmstat and /proc/meminfo; that way there would not be a steadily-increasing variety of memory management instrumentation methods. Removing those files is problematic for a couple of reasons, though. One is that they form part of the kernel ABI, which is not easily broken. It would be a multi-year process to move applications over to a different interface and be sure there were no more users of the /proc files. Beyond that, though, tracepoints are good for reporting events, but they are a bit less well-suited to reporting the current state of affairs. One can use a tracepoint to see page allocation events, but an interface like /proc/vmstat can be more straightforward if one simply wishes to know how many pages are free. There is space, in other words, for both styles of instrumentation.

As of this writing, nobody has made a final pronouncement on whether the new tracepoints will be merged. Andrew has made it clear, though, that, despite his concerns, he's not firmly opposing them. There is enough pressure to get better instrumentation into the kernel, and enough useful things to do with that instrumentation, that, one assumes, more of it will go into the mainline over time.

Comments (15 posted)


By Jake Edge
August 12, 2009

As part of the changes to support application checkpoint and restart in the kernel, Sukadev Bhattiprolu has proposed a new system call: clone_with_pids(). When a process that was checkpointed gets restarted, having the same process id (PID) as it had when the checkpoint was done is important to some kinds of applications. Normally, the kernel assigns an unused PID when a new task is started (via clone()), but, for checkpointed processes, that could lead to processes' PIDs changing during their lifetime, which could be an undesirable side effect. So, Bhattiprolu is looking for a way to avoid that by allowing clone() callers to specify the PID—or PIDs for processes in nested namespaces—of the child.

The actual system call is fairly straightforward. It adds an additional pid_set parameter to clone(), to contain a list of process ids; pid_set has the obvious definition:

    struct pid_set {
        int num_pids;
        pid_t *pids;
    };

A pointer to a pid_set is passed as the last parameter to clone_with_pids(). Each of the PIDs is used to specify which PID should be assigned at each level of namespace nesting. The patch that actually implements clone_with_pids() (as opposed to the earlier patches in the patchset that prepare the way) illustrates this with an example (slightly edited for clarity):
    pid_t pids[] = { 0, 77, 99 };
    struct pid_set pid_set;

    pid_set.num_pids = sizeof(pids) / sizeof(pids[0]);
    pid_set.pids = pids;

    clone_with_pids(flags, stack, NULL, NULL, NULL, &pid_set);

If a target PID is 0, the kernel continues to assign a PID for the process in that namespace. In the above example, pids[0] is 0, meaning that the kernel will assign the next available PID to the process in init_pid_ns. But the kernel will assign PID 77 in the first-level child pid namespace and PID 99 in the second-level namespace. If either 77 or 99 is already in use, the system call fails with -EBUSY.

The patchset assumes that being able to set PIDs is desirable, but Linus Torvalds was not particularly in favor of that approach when it was first discussed on linux-kernel back in March. His complaint was that there are far too many stateful attributes of processes to ever be able to handle checkpointing in the general case. His suggestion: "just teach the damn program you're checkpointing that pids will change, and admit to everybody that people who want to be checkpointed need to do work".

Others disagreed—no surprise—but it is unclear that Torvalds has changed his mind. He was also concerned about the security implications of processes being able to request PID assignments: "But it also sounds like a _wonderful_ attack vector against badly written user-land software that sends signals and has small races." That particular concern should be alleviated by the requirement that a process have the CAP_SYS_ADMIN capability (essentially root privileges) in order to use clone_with_pids().

Requiring root to handle restarts, which in practice means that root must manage the checkpoint process as well, makes checkpoint/restart less useful, overall. But there are a whole host of problems to solve before allowing users to arbitrarily checkpoint and restore from their own, quite possibly maliciously crafted, checkpoint images. Even with root handling the process, there are a number of interesting applications.

There is an additional wrinkle that Bhattiprolu notes in the patch. Currently, all of the available clone() flags are allocated. That doesn't affect clone_with_pids() directly, as the flags it needs are already present, but, when adding a system call, it is good to look to the future. To that end, there are two proposed implementations of a clone_extended() system call, which could be added instead of clone_with_pids(), that would allow for more clone() flags, while still supporting the restart case.

The first possibility is to turn the flags argument into a pointer to an array of flag words that would be treated like signal sets, including operations to test, set, and clear flags a la sigsetops():

    typedef struct {
	    unsigned long flags[CLONE_FLAGS_WORDS];
    } clone_flags_t;

    int clone_extended(clone_flags_t *flags, void *child_stack, int *unused,
	    int *parent_tid, int *child_tid, struct pid_set *pid_set);

In the proposal, CLONE_FLAGS_WORDS would be set to 1 for 64-bit architectures, while on 32-bit architectures, it would be set to 2, thus doubling the number of available flags to 64. Should the number of clone flags needed grow, that could be expanded as required, though doing so in a backward-compatible manner is not really possible.

Another option is to split the flags into two parameters, keeping the current flags parameter as it is, and adding a new clone_info parameter that contains new flags along with the pid_set:

    struct clone_info {
        int num_clone_high_words;
        int *flags_high;
        struct pid_set pid_set;
    };

    int clone_extended(int flags_low, void *child_stack, void *unused,
            int *parent_tid, int *child_tid, struct clone_info *clone_info);

There are pros and cons to each approach, as Bhattiprolu points out. The first requires a copy_from_user() for the flags in all cases (though 64-bit architectures might be able to avoid that for now), while the second requires the awkward splitting of the flags, but avoids the copy_from_user() for calls that don't use the new flags or pid_sets.

It is hard to imagine that copying a bit of data from user space will measurably impact a system call that is creating a process, though, so some derivative of the first option would seem to be the better choice. It's also a bit hard to see the need for more than 64 clone() flags, but if that is truly desired, something with a path for compatibility is needed.

There has been no objection to the implementation of clone_with_pids(), but there have been few comments overall. Pavel Machek wondered about the need for setting the PID of anything but the innermost namespace, but Serge E. Hallyn noted that nested namespaces require that ability: "we might be restarting an app using a nested pid namespace, in which case restart would specify pids for 2 (or more) of the innermost containers".

Machek also thought there should be a documentation file that described the new system call, and Bhattiprolu agreed, but is waiting to see what kind of consensus on either clone_with_pids() or clone_extended() (and which of the two interfaces for the latter) would emerge. So far, no one has commented on that particular aspect.

This is version 4 of the patchset, and the history shows that earlier comments have been addressed. It is still at the RFC stage, or, as Bhattiprolu puts it: "Its mostly an exploratory patch seeking feedback on the interface". That feedback has yet to emerge, however, and one might wonder whether Torvalds will still object to the whole approach. It would seem, though, that there are too many important applications for checkpoint and restart—including process migration and the ability to upgrade kernels underneath long-running processes—for some kind of solution not to make its way into the kernel eventually.

Comments (8 posted)

Interrupt mitigation in the block layer

By Jonathan Corbet
August 10, 2009
Network device drivers have been using the increasingly misnamed NAPI ("new API") interface for some time now. NAPI allows a network driver to turn off interrupts from an interface and go into a polling mode. Polling is often seen as a bad thing, but it's really only a problem when poll attempts turn up no useful work to do. With a busy network interface, there will always be new packets to process; "polling," in this situation, really means "going off to deal with the accumulated work." When there is always work to do, interrupts informing the system of that fact are really just added noise. Your editor likes to compare the situation to email notifications; anybody who gets a reasonable volume of email is quite likely to turn such notifications off. They are distracting, and there is probably always email waiting whenever one gets around to checking.

NAPI is well suited to network drivers, since high packet rates can lead to high interrupt rates, but it has not spread to other parts of the kernel, where interrupt rates are lower. That situation could change in 2.6.32, though, if Jens Axboe follows through with his plan to merge the new blk-iopoll infrastructure into the mainline. In short, blk-iopoll is NAPI for block devices; indeed, some of the core code was borrowed from the NAPI implementation.

Converting a block driver to blk-iopoll is straightforward. Each interrupting device needs to have a struct blk_iopoll structure defined for it, presumably in the structure which describes the device within the driver. This structure should be initialized with:

    #include <linux/blk-iopoll.h>

    typedef int (blk_iopoll_fn)(struct blk_iopoll *, int);

    void blk_iopoll_init(struct blk_iopoll *iop, int weight, blk_iopoll_fn *poll_fn);

The weight value describes the relative importance of the device; a higher weight results in more requests being processed in each polling cycle. As with NAPI, there is no definitive guidance as to what weight should be; in Jens's initial patch, it is set to 32. The poll_fn() will be called when the block subsystem decides that it's time to poll for completed requests.

I/O polling for a device is controlled with:

    void blk_iopoll_enable(struct blk_iopoll *iop);
    void blk_iopoll_disable(struct blk_iopoll *iop);

A call to blk_iopoll_enable() must be made by the driver before any polling of the device will happen. Enabling polling allows that polling to occur, but does not cause it to happen. There is no point in polling a device which is not doing any work, so the block layer will not actually poll a given device until the driver informs it that there may be a reason to do so.

That normally happens when the device is actually interrupting. The driver can, in its interrupt handler, switch over to polling mode through a three-step process. The first is to check the global variable blk_iopoll_enabled; if it is zero, block I/O polling cannot be used. Assuming polling is enabled, the driver should prepare the blk_iopoll structure with:

    int blk_iopoll_sched_prep(struct blk_iopoll *iop);

In the first version of the patch, a return value of zero means that the preparation "failed," either because polling is disabled or because the device is already in polling mode. In future versions, the sense of the return value is likely to be inverted to the more standard "zero means success" mode. If blk_iopoll_sched_prep() succeeds, the driver can then call:

    void blk_iopoll_sched(struct blk_iopoll *iop);

At this point, polling mode has been entered; the driver need only disable interrupts from its device and return. The "disable interrupts" step should, of course, be done at the device itself; masking the IRQ line would be an antisocial act in a world where those lines are shared.

Later on, the block layer will call the poll_fn() which was provided to blk_iopoll_init(). The prototype for this function is:

        typedef int (blk_iopoll_fn)(struct blk_iopoll *iop, int budget);

The polling function is called (in software interrupt context) with iop being the related blk_iopoll structure, and budget being the maximum number of requests that the poll function should process. In normal usage, the driver's device-specific structure can be obtained from iop with container_of(). The budget value is just the weight that was specified back at initialization time.

The return value should be the number of requests actually processed. If the device consumes less than the given budget, it should turn off further polling with:

    void blk_iopoll_complete(struct blk_iopoll *iopoll);

Interrupts from the device should be re-enabled, since further polling will not happen. Note that the block layer assumes that a driver will not call blk_iopoll_complete() if it has consumed its full budget. If it's necessary to return to interrupt mode despite having exhausted the budget, the driver should either (1) use blk_iopoll_disable(), or (2) lie about the number of requests processed when returning from the polling function.

One might well wonder about the motivation behind all of this work. Block device interrupt handling has not traditionally been a performance bottleneck. The problem is the rapid improvement in solid-state storage devices. It is expected that, before too long, these devices will be operating in the range of 100,000 I/O operations per second - far beyond anything that rotating storage can do. When dealing with that many I/O operations, the kernel must take care to minimize the per-operation overhead in any way possible. As others have observed, the block layer needs to become more like the network layer, with the per-request cost squeezed to a bare minimum. The blk-iopoll code is a step in that direction.

How big a step? Jens has posted some preliminary numbers showing significant reductions in system time on a random-read disk benchmark. More testing will certainly be required; in particular, some developers are concerned about the possibility of increasing I/O latency. But the initial numbers suggest that this work has improved the efficiency of the block subsystem under load.

Comments (5 posted)

Patches and updates

Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds