
Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.34-rc3, but -rc4 can be expected at almost any time. Quite a few patches have gone in since -rc3; most are fixes, but there's also a 4200-file cleanup from Tejun Heo and a new driver for Chelsio T4-based Ethernet adapters.

Stable updates: Greg Kroah-Hartman has announced the release of four separate stable kernels. These are fairly sizable updates, weighing in at 45, 89, 116, and 156 patches respectively (at review time, anyway; a few patches may have been dropped). As usual, all users are strongly encouraged to upgrade. In addition, it sounds like stable updates for the 2.6.31 series are nearing their end, so users of that kernel should move to .32 or .33.

Comments (none posted)

Quotes of the week

4208 files changed, 3717 insertions(+), 717 deletions(-)
-- Tejun Heo casts a wide net for -rc4

I have two machines that show very different performance numbers. After digging a little I found out that the first machine has, in /proc/cpuinfo:

model name : Intel(R) Celeron(R) M processor 1.00GHz

while the other has:

model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz

and that seems to be the main difference. Now the problem is that /proc/cpuinfo is read only. Would it be possible to make /proc/cpuinfo writable so that I could do:

echo -n "model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz" > /proc/cpuinfo

in the first machine and get a performance similar to the second machine?

-- Paulo Marques

It's probably buggy as hell, I don't dare try to actually boot the crap I write.
-- Linus Torvalds

Comments (4 posted)

A "live mode" for perf

By Jake Edge
April 7, 2010

The perf tracing tool has evolved quickly. When last we looked, Tom Zanussi had added Python and Perl scripting to perf. Next up would seem to be perf "live mode", where perf no longer requires two steps: record the data, then analyze. Live mode will allow perf trace record and perf trace report to operate via a pipe, which allows instantaneous, as well as continuously updating (a la top), output.

So that no existing perf users need to change their scripts, Zanussi only added the new capabilities when perf recognizes that its record output is going to stdout or report input is coming from stdin. In that case, perf handles the data through a pipe, and uses special synthesized events to provide header information. This will also allow perf to operate over the network by piping its record output to netcat, and then reading it via netcat on another system and piping it into report.

All of the scripts that are installed in the standard perf location (i.e. those which are listed in perf trace -l) are automatically able to be run in live mode:

  $ perf trace syscall-counts
will run both ends of the syscall-counts script with a pipe in between, a more usable shorthand for:
  $ perf trace record syscall-counts -o - | perf trace report syscall-counts -i -
which itself is shorthand for:
  perf record -c 1 -f -a -M -R -e raw_syscalls:sys_enter -o - | \
  perf trace -i - -s ~/libexec/perf-core/scripts/python/

Zanussi also included several sample top-style scripts that can be used to monitor read/write or system call activity updated every three seconds. It looks to be a very useful addition to perf, which is rapidly becoming the "swiss army knife" of kernel monitoring.

Comments (4 posted)

The NO_BOOTMEM patches

By Jonathan Corbet
April 7, 2010
Every kernel development cycle seems to involve one set of patches which turn out to be more trouble than had been expected. With 2.6.34, that award should probably go to the patches found under the somewhat confusing CONFIG_NO_BOOTMEM option.

"Bootmem" is a simple, low-level memory allocator used by the kernel during the early parts of the bootstrap process. One might think that the kernel does not need yet another allocator, but the memory management code used during normal operation requires that much of the kernel already be functional before it can be called. Getting to that point involves a chain of increasingly complicated memory allocation mechanisms; on the x86 architecture, those begin with the "early_res" mechanism, which takes over from the BIOS "e820" facility. Once things get a little farther, the architecture-independent bootmem allocator takes over, followed, eventually, by the full buddy allocator.

Yinghai Lu came to the conclusion that things could be simplified considerably if the bootmem stage were taken out of the picture. The result was a series of patches which extends the use of the early_res mechanism for long enough to bootstrap the buddy allocator. These changes were merged for 2.6.34, but the old bootmem-based code was left behind. The CONFIG_NO_BOOTMEM option controls which allocator is used, with the default being to short out bootmem.

This is a significant change to the crucial and tricky early bootstrap code, so few people were surprised when some regressions were reported against 2.6.34-rc1. When the reports continued to arrive after -rc3, though, the level of irritation began to grow, to the point that Linus started talking about reverting the whole thing. Nobody seemed to dislike the objectives of the patches, but system-killer regressions after -rc3, along with the twisted mess of #ifdefs created by the patch and the fact that it was on by default led to some grumpiness.

Normally, new features are expected to be configured out by default; to the greatest extent possible, a new kernel should behave like its predecessors when the default options are taken. In this case, the default led to significant changes and problems. The purpose of this option was twofold: to allow the new code to be configured out when it proved to be problematic, and to ensure that it was well tested in the mean time. Certainly it was successful on both fronts, even if some of the testers proved to be not entirely willing.

As of this writing, it would appear that the worst problems have been fixed; talk of removing the no-bootmem code has subsided. Eventually, perhaps, all architectures will make similar changes and the bootmem code can be removed entirely. Meanwhile, Yinghai has a new set of changes on the horizon for 2.6.35: replacing the early_res code with the "logical memory block" allocator currently used by some other architectures. That change looks even more disruptive than the bootmem elimination was.

Comments (3 posted)

Kernel development news

Memory management for virtualization

By Jonathan Corbet
April 7, 2010
For some time now, your editor has asserted that, at the kernel level, the virtualization problem is mostly solved. Much of the remaining work is in the performance area. That said, making virtualized systems perform well is not a small or trivial problem. One of the most interesting aspects of this problem is in the interaction between virtualized guests and host memory management. A couple of patch sets under discussion illustrate where the work in this area is being done.

The transparent huge pages patch set was discussed here back in October. This patch seeks to change how huge pages are used by Linux applications. Most current huge page users must be set up explicitly to use huge pages, which, in turn, must be set aside by the system administrator ahead of time; see the recent series by Mel Gorman for more information on how this is done. The "some assembly required" nature of huge pages limits their use in many situations.

The transparent huge page patch, instead, works to provide huge pages to applications without those applications even being aware that such pages exist. When large pages are available, applications may have their scattered pages joined together into huge pages automatically; those pages can also be split back apart when the need arises. When the system operates in this mode, huge pages can be used in many more situations without the need for application or administrator awareness. This feature turns out to be especially beneficial when running virtualized guests; huge pages map well to how guests tend to see and use their address spaces.

The transparent huge page patches have been working their way toward acceptance, though it should be noted that some developers still have complaints about this work. Andrew Morton recently pointed out a different problem with this patch set:

It appears that these patches have only been sent to linux-mm. Linus doesn't read linux-mm and has never seen them. I do think we should get things squared away with him regarding the overall intent and implementation approach before trying to go further... [T]his is a *large* patchset, and it plays in an area where Linus is known to have, err, opinions.

It didn't take long for Linus to join the conversation directly; after a couple of digressions into areas not directly related to the benefits of the transparent huge pages patch, he realized that this work was motivated by the needs of virtualization. At that point, he lost interest:

So I thought it was a more interesting load than it was. The virtualization "TLB miss is expensive" load I can't find it in myself to care about. "Get a better CPU" is my answer to that one.

He went on to compare the transparent huge page work to high memory, which, in turn, he called "a failure". The right solution in both cases, he says, is to get a better CPU.

It should be pointed out that high memory was a spectacularly successful failure, extending the useful life of 32-bit systems for some years. It still shows up in surprising places - your editor's phone is running a high-memory-enabled kernel. So calling high memory a failure is something like calling the floppy driver a failure; it may see little use now, but there was a time when we were glad we had it.

Perhaps, someday, advances in processor architecture will make transparent huge pages unnecessary as well. But, while the alternative to high memory (64-bit processors) has been in view for a long time, it's not at all clear what sort of processor advance might make transparent huge pages irrelevant. So, should this code get into the kernel, it may well become one of those failures which is heavily used for many years.

A related topic under discussion was the recently-posted VMware balloon driver patch. A balloon driver has an interesting task; its job is to "inflate" within a guest system, taking up memory and making it unavailable for processes running within the guest. The pages absorbed by the balloon can then be released back to the host system which, presumably, has a more pressing need for them elsewhere. Letting "air" out of the balloon makes memory available to the guest once again.

The purpose of this driver, clearly, is to allow the host to dynamically balance the memory needs of its guest systems. It's a bit of a blunt instrument, but it's the best we have. Andrew Morton, however, questioned the need for a separate memory control mechanism. The kernel already has a function, called shrink_all_memory(), which can be used to force the release of memory. This function is currently used for hibernation, but Andrew suspects that it could be adapted to the needs of virtualization as well.

Whether that is really true remains to be seen; it seems that the bulk of the complexity lies not with the freeing of memory but in the communication between the guest and the hypervisor. Beyond that, the longer-term solution is likely to be something more sophisticated than simply applying memory pressure and watching the guest squirm until it releases enough pages. As Dan Magenheimer put it:

Historically, all OS's had a (relatively) fixed amount of memory and, since it was fixed in size, there was no sense wasting any of it. In a virtualized world, OS's should be trained to be much more flexible as one virtual machine's "waste" could/should be another virtual machine's "want".

His answer to this problem is the transcendent memory patch, which allows the operating system to designate memory which is available for the taking should the need arise, but which can contain useful data in the mean time.

This is clearly an area that needs further work. The whole point of virtualization is to isolate guests from each other, but a more cooperative approach to memory requires that these guests, somehow, be aware of the level of contention for resources like memory and respond accordingly. Like high memory and transparent huge pages, balloon drivers may eventually be consigned to the pile of failed technologies. Until something better comes along, though, we'll still need them.

Comments (13 posted)

Receive flow steering

By Jake Edge
April 7, 2010

Today's increasing bandwidth and faster networking hardware have made it difficult for a single CPU to keep up. Multiple cores and packages have helped matters on the transmit side, but the receive side is trickier. Tom Herbert's receive packet steering (RPS) patches, which we looked at back in November, provide a way to steer packets to particular CPUs based on a hash of the packet's protocol data. Those patches were applied to the network subsystem tree and are bound for 2.6.35, but now Herbert is back with an enhancement to RPS that will attempt to steer packets to the CPU on which the receiving application is running: receive flow steering (RFS).

RFS uses the RPS hash table to store the CPU of an application when it calls recvmsg() or sendmsg(). Instead of picking an arbitrary CPU based on the hash and a CPU mask optionally set by an administrator, as RPS does, RFS tries to use the CPU where the receiving application is running. Based on the hash calculated on the incoming packet, RFS can look up the "proper" CPU and assign the packet there.

The RPS CPU masks, which can be set via sysfs for each device (and queue for devices with multiple queues), represent the allowable CPUs to assign for a packet. But dynamically changing those values introduces the possibility of out-of-order packets. For RPS, with largely static CPU masks, it was not necessarily a big problem. For RFS, however, multiple threads trying to read from the same socket, while potentially bouncing around to different CPUs, would cause the CPU value in the hash table to change frequently, thus increasing the likelihood of out-of-order packets.

For RFS, that was considered to be a "non-starter", Herbert said, so a different approach was required. To eliminate the out-of-order packets, two types of hash tables are created, both indexed by the hash calculated from the packet information. The global rps_sock_flow_table is populated by the recvmsg() or sendmsg() call with the CPU number where the application is running (this is called the "desired" CPU). Each device queue then gets a rps_dev_flow_table which contains the most recent CPU used to handle packets for that connection (which is called the "current" CPU). In addition, the value of the tail queue counter for the current CPU's backlog queue is stored in the rps_dev_flow_table entry.

The two CPU values are compared when deciding which CPU to process the packet on (which is done in get_rps_cpu()). If the current CPU (as determined from the rps_dev_flow_table hash table) is unset (presumably for the first packet) or that CPU is offline, the desired CPU (from rps_sock_flow_table) is used. If the two CPU values are the same, obviously, that CPU is used. But if they are both valid CPU numbers, but different, the backlog tail queue counter is consulted.

Backlog queues have a queue head counter that gets incremented when packets are removed from the queue. Using that and the queue length, a queue tail counter value can be calculated. That is what gets stored in rps_dev_flow_table. When the kernel makes its decision about which CPU to assign the packet to, it needs to consider both the current (really last used by the kernel) CPU and the desired (last used by an application for sending or receiving) CPU.

The kernel compares the current CPU's queue tail counter (as stored in the hash table) with that CPU's queue head counter. If the tail counter is less than or equal to the head counter, that means that all packets that were put on the queue by this connection have been processed. That in turn means that switching to the desired CPU will not result in out-of-order packets.

Herbert's current patch is for TCP, but RFS should be "usable for other flow oriented protocols". The benefit is that it can achieve better CPU locality for the processing of the packet, both by the kernel, and the application itself. Depending on various factors—cache hierarchy and application are given as examples—it can and does increase the packets per second that can be processed as well as lowering the latency before a packet gets processed. But, interestingly, "on simple benchmarks, we don't necessarily see improvement and sometimes see degradation".

For more complex benchmarks, the performance increase looks to be significant. Herbert gave numbers for a netperf run where the transactions per second went from 104K without either RFS or RPS, to 290K for the best RPS configuration, and to 303K with RFS and RPS. A different test, with 100 threads handling an RPC-like request/response with some user-space work being done, was even more dramatic. That test showed 103K, 174K, and 223K respectively, but also showed a marked decrease in the latency for both RPS and RPS + RFS.

These patches are coming from Google, which has been known to process a few packets using the Linux kernel. If RFS is being used on production systems at Google, that would seem to bode well for its reliability and performance beyond just benchmarks. The patches were posted April 2, and seemed to be generally well-received, so it's a little early to tell when they might make it into the mainline. But it seems rather likely that we will see them in either 2.6.35 or 36.

Comments (6 posted)

The padata parallel execution mechanism

By Jonathan Corbet
April 6, 2010
One day, Andrew Morton was happily reading linux-kernel when he encountered a patch fixing a minor problem with the "padata" code. Andrew, it seems, had never heard of padata, which was merged during the 2.6.34 merge window. So he asked: "OK, on behalf of thousands I ask: what the heck is kernel/padata.c?" On behalf of those same thousands, your editor set out to learn what this new bit of core kernel code does and how to use it.

In short: padata is a mechanism by which the kernel can farm work out to be done in parallel on multiple CPUs while retaining the ordering of tasks. It was developed for use with the IPsec code, which needs to be able to perform encryption and decryption on large numbers of packets without reordering those packets. The crypto developers made a point of writing padata in a sufficiently general fashion that it could be put to other uses as well, but that requires knowing that the API is there and how to use it. Unfortunately, they made a bit less of a point of updating the documentation directory.

The first step in using padata is to set up a padata_instance structure for overall control of how tasks are to be run:

    #include <linux/padata.h>

    struct padata_instance *padata_alloc(const struct cpumask *cpumask,
				         struct workqueue_struct *wq);

The cpumask describes which processors will be used to execute work submitted to this instance. The workqueue wq is where the work will actually be done; it should be a multithreaded queue, naturally.

There are functions for enabling and disabling the instance:

    void padata_start(struct padata_instance *pinst);
    void padata_stop(struct padata_instance *pinst);

These functions literally do nothing beyond setting or clearing the "padata_start() was called" flag; if that flag is not set, other functions will refuse to work. There must be some perceived value in this functionality, but the only current padata user (crypto/pcrypt.c) does not make use of it. So padata_start() looks like one of those exercises in pointless bureaucracy that we all have to cope with sometimes.

The list of CPUs to be used can be adjusted with these functions:

    int padata_set_cpumask(struct padata_instance *pinst,
			   cpumask_var_t cpumask);
    int padata_add_cpu(struct padata_instance *pinst, int cpu);
    int padata_remove_cpu(struct padata_instance *pinst, int cpu);

Changing the CPU mask has the look of an expensive operation, though, so it probably should not be done with great frequency.

Actually submitting work to the padata instance requires the creation of a padata_priv structure:

    struct padata_priv {
        /* Other stuff here... */
	void                    (*parallel)(struct padata_priv *padata);
	void                    (*serial)(struct padata_priv *padata);
    };

This structure will almost certainly be embedded within some larger structure specific to the work to be done. Most of its fields are private to padata, but the structure should be zeroed at initialization time, and the parallel() and serial() functions should be provided. Those functions will be called in the process of getting the work done as we will see momentarily.

The submission of work is done with:

    int padata_do_parallel(struct padata_instance *pinst,
		           struct padata_priv *padata, int cb_cpu);

The pinst and padata structures must be set up as described above; cb_cpu specifies which CPU will be used for the final callback when the work is done; it must be in the current instance's CPU mask. The return value from padata_do_parallel() is a little strange; zero is an error return indicating that the caller forgot the padata_start() formalities. -EBUSY means that somebody, somewhere else is messing with the instance's CPU mask, while -EINVAL is a complaint about cb_cpu not being in that CPU mask. If all goes well, this function will return -EINPROGRESS, indicating that the work is in progress.

Each task submitted to padata_do_parallel() will, in turn, be passed to exactly one call to the above-mentioned parallel() function, on one CPU, so true parallelism is achieved by submitting multiple tasks. The workqueue is used to actually make these calls, so parallel() runs in process context and is allowed to sleep. The parallel() function gets the padata_priv structure pointer as its lone parameter; information about the actual work to be done is probably obtained by using container_of() to find the enclosing structure.

Note that parallel() has no return value; the padata subsystem assumes that parallel() will take responsibility for the task from this point. The work need not be completed during this call, but, if parallel() leaves work outstanding, it should be prepared to be called again with a new job before the previous one completes. When a task does complete, parallel() (or whatever function actually finishes the job) should inform padata of the fact with a call to:

    void padata_do_serial(struct padata_priv *padata);

At some point in the future, padata_do_serial() will trigger a call to the serial() function in the padata_priv structure. That call will happen on the CPU requested in the initial call to padata_do_parallel(); it, too, is done through the workqueue, but with local software interrupts disabled. Note that this call may be deferred for a while since the padata code takes pains to ensure that tasks are completed in the order in which they were submitted.

The one remaining function in the padata API should be called to clean up when a padata instance is no longer needed:

    void padata_free(struct padata_instance *pinst);

This function will busy-wait while any remaining tasks are completed, so it might be best not to call it while there is work outstanding. Shutting down the workqueue, if necessary, should be done separately.

The API as described above is what can be found in the 2.6.34-rc3 kernel. As was seen back at the beginning of this article, padata is just coming into more general awareness, and some developers are asking questions about the API. So changes are possible - but, then, that is true of any internal kernel interface.

Comments (2 posted)


Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds