Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.28-rc5, released on November 15. It contains the usual pile of fixes; see the long-format changelog for the details.

The current stable 2.6 kernel is 2.6.27.6, released on November 13. It includes a fair number of fixes, one of which has a CVE number attached. As of this writing, 46 patches are under review for inclusion in 2.6.27.7, which will likely be released soon.

Kernel development news

Quotes of the week

That GLOBAL_EXTERN thing should be held on the ground whilst farm animals poop on its head, but my attempts to remove it have thus far fallen on deaf inboxes.
-- Andrew Morton

Your patch is still adding bells and whistles to a useless turd. In fact this patch is worse. Without this patch the turd can be disabled and left out, with your patch everyone now has to compile in said turd pile.
-- Alan Cox joining the scatological mood

Photos from the 2008 Kernel Summit

The Linux Foundation has posted a set of photos from the 2008 Kernel Summit. If these pictures are to be believed, the Summit involved a lot of time spent consuming alcoholic beverages. But it was a more serious event than that, honest.

kerneloops.org records its 100,000th oops

Arjan van de Ven reports that kerneloops.org has recorded oops #100,000, just shy of the site's first birthday. The site gathers the output of kernel oops messages, the crash signatures produced by the kernel, with the intent of finding the most common ones so that the underlying bugs can be found and fixed. "Other than the top 2 items, which have patches, we've done a pretty good job of fixing the high occurrence bugs (excluding the binary drivers which we obviously cannot fix)."

UKUUG: Arnd Bergmann on interconnecting with PCIe

By Jake Edge
November 19, 2008

PCI Express (PCIe) is not normally thought of as a way to connect computers; it is, rather, a bus for attaching peripherals. But there are advantages to using it as an interconnect. Kernel hacker Arnd Bergmann gave a presentation at the recent UKUUG Linux 2008 conference on the work he has been doing at IBM on using PCIe in just that way. He outlined the current state of Linux support as well as some plans for the future.

The availability of PCIe endpoints on much of the hardware in use today is one major advantage. By using PCIe instead of other interconnects, such as InfiniBand, the same throughput can be achieved with lower latency and power consumption. Bergmann noted that avoiding a separate InfiniBand chip saves 10-30 watts per node, which adds up rather quickly on a 30,000-node supercomputer: somewhere between 300 and 900 kilowatts for the machine as a whole.

There are some downsides to PCIe as well. There is no security model, for example, so a root process on one machine can crash other connected machines. There is also a single point of failure: if the PCIe root port goes down, it takes the network with it or, as Bergmann put it: "if anything goes wrong, the whole system goes down". PCIe also lacks a standard high-level interface for Linux, and there is no generic code shared between the various drivers—at least so far.

As an example of a system that uses PCIe, Bergmann described the "Roadrunner" supercomputer that is currently the fastest in existence. It is a cluster of hybrid nodes, called "Triblades", each of which has one Opteron blade along with two Cell blades. The nodes are connected with InfiniBand, but PCIe is used to communicate between the processors within each node by using the Opteron root port and PCIe endpoints on the Cells.

There is other hardware that uses PCIe in this way, including the Fixstars GigaAccel 180 accelerator board and an embedded PowerPC 440/460 system-on-a-chip (SoC) board, both of which use the same Axon PCIe device. Bergmann also talked about PCIe switches and non-transparent bridges, which perform the same kinds of functions as networking switches and bridges. Bridges are called "non-transparent" because they contain I/O remapping tables—sometimes full IOMMUs—through which each of the two root ports connected via the bridge can address the other side. These bridges may also have DMA engines to facilitate data transfer without host processor involvement.

Bergmann then moved on to the software side of things, looking at the drivers available—and planned—to support connection via PCIe. The first driver was written by Mercury Computers in 2006 for a Cell accelerator board and is now "abandonware". It has many deficiencies and would take a lot of work to get it into shape for the mainline.

Another choice is the driver used by the Roadrunner Triblade and the GigaAccel device, which is vaguely modeled on InfiniBand. Its interface uses custom ioctl() commands that implement just eight operations, as opposed to hundreds for InfiniBand. Even so, it is "enormous for a Linux device driver", weighing in at 13,000 lines of code.

The Triblade driver is not as portable as it could be, as it is very specific to the Opteron and Cell architectures. On the Cell side, it is implemented as an Open Firmware driver, but the Opteron side is a PCIe driver. There is a lot of virtual ethernet code mixed in as well. Overall, it is not seen as the best way forward to support these kinds of devices in Linux.

Another approach was taken by a group of students sponsored by IBM who developed a virtual ethernet prototype to talk to an IBM BladeCenter from a workstation by way of a non-transparent bridge. Each side could access memory on the other by using ioremap() on one side and dma_map_single() on the other. By implementing a virtio driver, they did not have to write an ethernet driver, as the virtio abstraction provided that functionality. The driver was a bit slow, as it didn't use DMA, but it is a start down the road that Bergmann thinks should be taken.
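
For the curious, a minimal sketch of that scheme might look like the code below; this is not the students' actual code, and the BAR number and function names are illustrative only.

    #include <linux/pci.h>
    #include <linux/io.h>
    #include <linux/dma-mapping.h>

    /* Workstation side: map the bridge's BAR so that memory on the far
       side shows up as MMIO space.  BAR 2 is a hypothetical choice. */
    static void __iomem *map_remote_window(struct pci_dev *pdev)
    {
        return ioremap(pci_resource_start(pdev, 2),
                       pci_resource_len(pdev, 2));
    }

    /* Blade side: hand the peer a bus address for a local buffer. */
    static dma_addr_t export_local_buffer(struct device *dev,
                                          void *buf, size_t len)
    {
        return dma_map_single(dev, buf, len, DMA_BIDIRECTIONAL);
    }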

He went on to describe a "conceptual driver" for PCIe endpoints that is based on the students' work but adds on things like DMA as well as additional virtio drivers. Adding a virtio block device would allow embedded devices to use hard disks over PCIe or, by implementing a Plan 9 filesystem (9pfs) virtio driver, individual files could be used directly over PCIe. All of this depends on using the virtio abstraction.

Virtio is seen as a useful layer in the driver because it is a standard abstraction for "doing something when you aren't limited by hardware". Networking, block device, and filesystem "hosts" are all implemented atop virtio drivers, which makes them available fairly easily. One problem area, though, is the runtime configuration piece. The problem there is "not in coming up with something that works, but something that will also work in the future".
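
A sketch of why virtio helps here: a transport only has to describe the device and supply configuration operations, after which the existing virtio_net driver binds to the device and provides the network interface for free. Everything below is illustrative rather than Bergmann's code; pcie_virtio_config_ops stands in for real PCIe-backed config accessors.

    #include <linux/slab.h>
    #include <linux/virtio.h>
    #include <linux/virtio_config.h>

    /* Stand-in: these ops would read/write config space across PCIe. */
    static struct virtio_config_ops pcie_virtio_config_ops;

    static int pcie_announce_net_device(struct device *parent)
    {
        struct virtio_device *vdev;

        vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
        if (!vdev)
            return -ENOMEM;

        vdev->dev.parent = parent;
        vdev->id.device = 1;    /* VIRTIO_ID_NET: virtio_net binds to it */
        vdev->config = &pcie_virtio_config_ops;

        /* the virtio core matches this device against its drivers */
        return register_virtio_device(vdev);
    }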

Replacing the ioctl() interface with the InfiniBand verbs (ibverb) interface is planned. The ibverb interface may not be the best choice in an abstract sense, but it exists and supports OpenMPI, so the new driver should implement it as well.

Two types of virtqueue implementations are envisioned: one based on memory-mapped I/O (MMIO) and one based on DMA. The MMIO virtqueue would be the most basic implementation, with the receiver doing a local read of data that the remote side has written. Read access on PCIe is much slower than write access because a read must flush all outstanding writes, then wait for the data to arrive. Data and signaling information would live in separate areas, so that ordering guarantees could be relaxed on the data area for better performance while strict ordering is kept for the signaling area.
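
A rough illustration of that split, with hypothetical data and signaling windows: bulk data goes through the relaxed region, a write barrier enforces completion, and only the doorbell write is strictly ordered.

    #include <linux/types.h>
    #include <linux/io.h>

    static void mmio_send(void __iomem *data_area, void __iomem *signal_area,
                          const void *buf, size_t len, u32 seq)
    {
        memcpy_toio(data_area, buf, len);  /* relaxed-ordering data window */
        wmb();                             /* data must land before the flag */
        iowrite32(seq, signal_area);       /* strictly ordered doorbell */
    }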

The DMA-engine virtqueue implementation would be highly hardware-specific in order to accommodate the performance characteristics and other limitations of the underlying engine. In some cases, for example, it is not worth setting up a DMA operation for transfers of less than 2KB, so copying via MMIO should be used instead. DMA would be used for transferring payload data, but signaling would still be handled via MMIO. Bergmann noted that the kernel's DMA abstraction may not provide all that is needed, so enhancements to that interface may be required as well.
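
In code, the size cutoff might look something like this sketch, where struct channel, dma_copy(), and signal_peer() are hypothetical stand-ins for the hardware-specific pieces:

    /* all names here are hypothetical stand-ins */
    struct channel {
        void __iomem *data_area;    /* MMIO window into the peer */
        /* ... DMA engine handle, signaling window, etc. ... */
    };

    #define DMA_COPY_THRESHOLD 2048 /* below this, setup costs exceed the win */

    static void send_payload(struct channel *ch, const void *buf, size_t len)
    {
        if (len < DMA_COPY_THRESHOLD)
            memcpy_toio(ch->data_area, buf, len);  /* cheap MMIO copy */
        else
            dma_copy(ch, buf, len);   /* hypothetical DMA-engine helper */

        signal_peer(ch);              /* signaling stays on MMIO either way */
    }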

Bergmann did not provide any kind of time frame in which this work might make its way into the kernel as it is a work in progress. There is much still to be done, but his presentation laid out a roadmap of where he thinks it is headed.

In a post-talk email exchange, Bergmann pointed to his triblade-2.6.27 branch for those interested in looking at the current state of affairs, while noting that it "is only mildly related to what I think we should be doing". He also mentioned a patch by Ira Snyder that implements virtual ethernet over PCI, which "is more likely to go into the kernel in the near future". Bergmann and Snyder have agreed to join forces down the road to add more functionality along the lines outlined in the talk.

Tbench troubles II

By Jonathan Corbet
November 19, 2008

LWN has previously covered concerns over slowly deteriorating performance by current Linux systems on the network- and scheduler-heavy tbench benchmark. Tbench runs have been getting worse since roughly 2.6.22. At the end of the last episode, attention had been directed toward the CFS scheduler as the presumptive culprit. That article concluded with the suggestion that, now that attention had been focused on the scheduler's role in the tbench performance regression, fixes would be relatively quick in coming. One month later, it would appear that those fixes have indeed come, and that developers looking for better tbench results will need to cast their gaze beyond the scheduler.

The discussion resumed after a routine weekly posting of the post-2.6.26 regression list; one entry in that list is the tbench performance issue. Ingo Molnar responded to that posting with a pointer to an extensive set of benchmark runs done by Mike Galbraith. The conclusion Ingo draws from all those runs is that the CFS scheduler is now faster than the old O(1) scheduler, and that "all scheduler components of this regression have been eliminated." Beyond that:

In fact his numbers show that scheduler speedups since 2.6.22 have offset and hidden most other sources of tbench regression. (i.e. the scheduler portion got 5% faster, hence it was able to offset a slowdown of 5% in other areas of the kernel that tbench triggers)

This improvement is not something that just happened; it is the result of a focused effort on the part of the scheduler developers. Quite a few changes have been merged; they all seem like small tweaks, but, together, they add up to substantial improvements in scheduler performance. One change fixes a spot where the scheduler code disabled interrupts needlessly. Some others (here and here) adjust the scheduler's "wakeup buddy" mechanism, a feature which ties processes together in the scheduler's view. As an example, consider a process which wakes up a second process, then runs out of its allocated time on the CPU. The wakeup buddy mechanism will cause the scheduler to bias its selection toward the just-awakened process, on the theory that said process will be consuming cache-warm data created by the waking process. By allowing cooperating processes like this to run slightly ahead of what a strictly fair scheduling algorithm would provide, the scheduler gets better performance out of the system as a whole.

The recent changes add a "backward buddy" concept. If there is no recently-awakened process to switch to, the scheduler will, instead, bias its selection toward the process which was preempted to let the outgoing process run. Chances are relatively good that the preempted process might (1) be cooperating with the outgoing process or (2) have some data still in cache - or both. So running that process next is likely to yield better performance overall.
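
A greatly simplified sketch conveys the combined buddy idea; pick_leftmost() and too_unfair() are stand-ins here, not the actual CFS helper functions.

    /* "next" is the just-awakened buddy; "last" is the task that was
       preempted so that the outgoing task could run. */
    static struct sched_entity *pick_next(struct cfs_rq *cfs_rq)
    {
        struct sched_entity *se = pick_leftmost(cfs_rq); /* strict fairness */

        if (cfs_rq->next && !too_unfair(cfs_rq->next, se))
            se = cfs_rq->next;   /* wakeup buddy: cache-warm data */
        else if (cfs_rq->last && !too_unfair(cfs_rq->last, se))
            se = cfs_rq->last;   /* backward buddy: the preempted task */

        return se;
    }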

A number of other small changes have been merged, to the point that the scheduler developers think that the tbench regressions are no longer their problem. Networking maintainer David Miller has disagreed with this assessment, though, claiming that performance problems still exist in the scheduler. Ingo responded in a couple of ways, starting with the posting of some profiling results which show very little scheduler overhead. Interestingly, it turns out that the networking developers get different results from their profiling runs than the scheduler developers do. And that, in turn, is a result of the different hardware that they are using for their work. Ingo has a bleeding-edge Intel processor to play with; the networking folks have processors which are not quite so new. David Miller tends to run on SPARC processors, which may be adding unique problems of their own.

The other thing Ingo did was, for all practical purposes, to profile the entire kernel code path involved in a tbench run, then to disassemble the executable and examine the profile results on a per-instruction basis. The postings that resulted (example) point out a number of potential problem spots, most of which are in the networking code. Some of those have already been fixed, while others are being disputed. It is, in the end, a large amount of raw data which is likely to inspire discussion for a while.

To an outsider, this whole affair can have the look of an ongoing finger-pointing exercise. And, perhaps, that's what it is. But it is highly technical finger-pointing which has increased the understanding of how the kernel responds to a specific type of stress, while also demonstrating the limits of some of our measurement tools and the performance differences exhibited by various types of hardware. The end result will be a faster, more tightly-tuned kernel - and better tbench numbers too.

UKUUG: The right way to port Linux

By Jake Edge
November 19, 2008

Arnd Bergmann pulled double duty at the recent UKUUG Linux 2008 conference by giving a talk on each day of the event. His talk on Saturday, entitled "Porting Linux to a new architecture, the right way", looked at various problems with recent architecture ports along with a project he has been working on to simplify that process. By creating a generic template for architectures, some of the mistakes of the past can be avoided.

This is one of Bergmann's pet projects, one that "I like to do for fun, when I am hacking on the kernel, but not for IBM". The project and talk were inspired by a few new architectures that were merged—or were submitted for merging—in the last few years. In particular, the Blackfin and MicroBlaze architectures provided the inspiration; the latter has still not been merged, perhaps due to Bergmann's comments. He is hoping to help that situation improve.

The biggest problem with architecture ports tends to be code duplication because people start by copying all of the files from an existing architecture. In addition, "most people who don't know what they are doing copy from x86, which in my opinion is a big mistake". According to Bergmann, architecture porters seem to "first copy the header files and then change the whitespace", which makes it difficult to immediately spot duplicated code.

He points to termbits.h as an example of an include file that is needlessly duplicated across architectures; the code is the same in most cases. He also notes that there is "incorrect code duplication", pointing to new architectures that implement the sys_ipc() system call, resulting in "brand new architectures supporting a broken interface for x86 UNIX from the 80s". That call is a de-multiplexer for System V IPC calls that has the comment—dutifully duplicated into other architectures—"This is really horribly ugly".
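
A trimmed illustration of that offending interface; the real function has roughly a dozen cases, plus assorted pointer casts and argument shuffling.

    #include <linux/ipc.h>
    #include <linux/syscalls.h>

    /* one entry point demultiplexing all of System V IPC */
    asmlinkage long sys_ipc(uint call, int first, int second,
                            int third, void __user *ptr, long fifth)
    {
        switch (call) {
        case SEMGET:
            return sys_semget(first, second, third);
        case MSGGET:
            return sys_msgget((key_t)first, second);
        case SHMGET:
            return sys_shmget(first, second, third);
        /* ... SEMOP, MSGSND, MSGRCV, SHMAT, and friends ... */
        default:
            return -ENOSYS;
        }
    }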

Then there are problems with "code duplication by clueless people", which include a sembuf.h implementation that puts the padding in the wrong place because of 64-bit versus 32-bit confusion. In addition, because code is duplicated in multiple locations, bug fixes made for one architecture do not propagate to all of the places that need them. As an example, he noted a bug fix made by Sparc maintainer David Miller in the x86 tree that didn't make it into the Sparc tree. Finally, there are obsolete ABIs being needlessly propagated into new architecture ports: system calls that are implemented in terms of newer calls are still present in new ports, even though they could be handled entirely in libc.
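
The padding confusion disappears once a single, word-size-aware header is shared by everybody. A sketch in the spirit of such a generic header (an illustration, not a verbatim copy of Bergmann's patch) might look like:

    /* On 64-bit systems, time_t already fills a full word; on 32-bit
       systems, explicit padding keeps the layout identical, so no port
       can put the padding in the wrong place. */
    struct semid64_ds {
        struct ipc64_perm sem_perm;  /* permissions */
        __kernel_time_t   sem_otime; /* last semop time */
    #if __BITS_PER_LONG != 64
        unsigned long     __unused1; /* padding: 32-bit only */
    #endif
        __kernel_time_t   sem_ctime; /* last change time */
    #if __BITS_PER_LONG != 64
        unsigned long     __unused2;
    #endif
        unsigned long     sem_nsems; /* number of semaphores in array */
        unsigned long     __unused3;
        unsigned long     __unused4;
    };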

The "obvious" solution is to create a generic architecture implementation that can be used as a starting point for new ports. Bergmann has been working on that, resulting in a 3000 line patch that "should make it very easy for people to port to new architectures". To start with, it defines a canonical ABI that is a list of all of the system calls that need to be implemented for a new architecture. It puts all of the required include files into the asm-generic directory that new ports can just include—or copy if they need to modify them.

Unfortunately, of course, things are not quite that simple; there are a number of problem areas. There are "lots of things you simply cannot do in a generic way". Most of these are fairly hardware-specific: MMU support, atomic operations, interrupts, task switching, byte order, signal contexts, hardware probing, and the like.

Bergmann decided to go ahead by defining some of these problems away in his example architecture. So there is no SMP or MMU support, with the asm-generic/atomic.h and asm-generic/mmu_context.h include files modified appropriately. Many of the architecture-specific functions have been stubbed out in arch/example/kernel/dummy.c so that the template architecture can be compiled.
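
The stubs are about as minimal as they sound; here is an illustrative reconstruction (not the actual file) of a couple of them:

    /* arch/example/kernel/dummy.c -- illustrative reconstruction */
    #include <linux/kernel.h>
    #include <linux/ptrace.h>

    /* every architecture must provide these; the template just stubs them */
    void setup_arch(char **cmdline_p)
    {
        panic("example architecture cannot boot yet");
    }

    void show_regs(struct pt_regs *regs)
    {
        /* nothing to show until the port defines a real pt_regs */
    }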

The example architecture uses an Open Firmware device tree to describe the hardware that is available at boot time. Open Firmware "is a bit like what you have with the new Intel EFI firmware, but it's a lot nicer". A flattened device tree data structure is passed to the kernel by the bootloader, so Bergmann will be able to move on to the next step: making it boot.

As one might guess, there is still more work to be done. There are eight header files that are needed from the asm-example directory, but Bergmann hopes to reduce that number somewhat. He notes that there are other architecture-specific areas that need work; for example, every single architecture has its own assembly-language implementation of the TCP checksum, which may not be optimal.
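
A shared C fallback need not be complicated. The following sketch is a plain RFC 1071 one's-complement sum, simplified here; a production version must also deal with alignment and the byte order of a trailing odd byte.

    static unsigned int do_csum(const unsigned char *buff, int len)
    {
        unsigned long sum = 0;

        while (len > 1) {             /* sum 16-bit words */
            sum += *(const unsigned short *)buff;
            buff += 2;
            len -= 2;
        }
        if (len)                      /* trailing odd byte */
            sum += *buff;

        while (sum >> 16)             /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);

        return sum;
    }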

Bergmann pointed attendees to the ukuug2008 branch of his kernel.org playground git tree (git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git) to see the current state of his example architecture. It looks to be a nice addition to the kernel, one which should result in better architecture ports down the road.

Page editor: Jake Edge