Kernel development

Brief items

Kernel release status

The current development kernel is 3.4-rc6, released on May 6. "Another week, another -rc - and I think we're getting close to final 3.4. So please do test."

Stable updates: the 3.0.31 and 3.3.5 updates were released on May 7 with the usual pile of important fixes.

The 3.2.17 update, with 167 fixes, is in the review process as of this writing; it can be expected on or after May 11.

Comments (2 posted)

Quotes of the week

So [KERN_CONT] is like a defibrillator: it is good to *have* one, but it's really bad to have to *use* one.

—Linus Torvalds

I really love fairy tales, just not in the context of kernel code.

—Thomas Gleixner

Quick! Everyone say something extreme for this week's LWN Quote of the Week!

—Jon Masters (warning: disappointing results)

Comments (none posted)

Nichols, Jacobson: Controlling Queue Delay

Kathleen Nichols and Van Jacobson have published a paper describing a new network queue management algorithm that, it is hoped, will play a significant role in the solution to the bufferbloat problem. "CoDel (Controlled Delay Management) has three major innovations that distinguish it from prior AQMs. First, CoDel’s algorithm is not based on queue size, queue-size averages, queue-size thresholds, rate measurements, link utilization, drop rate or queue occupancy time. Starting from Van Jacobson’s 2006 insight, we used the local minimum queue as a more accurate and robust measure of standing queue. Then we observed that it is sufficient to keep a single-state variable of how long the minimum has been above or below the target value for standing queue delay rather than keeping a window of values to compute the minimum. Finally, rather than measuring queue size in bytes or packets, we used the packet-sojourn time through the queue. Use of the actual delay experienced by each packet is independent of link rate, gives superior performance to use of buffer size, and is directly related to the user-visible performance."

For more information, see this blog post from Jim Gettys. "A preliminary Linux implementation of CoDel written by Eric Dumazet and Dave Täht is now being tested on Ethernet over a wide range of speeds up to 10gigE, and is showing very promising results similar to the simulation results in Kathie and Van’s article. CoDel has been run on a CeroWrt home router as well, showing its performance."

Comments (13 posted)

Kernel development news

The CoDel queue management algorithm

By Jonathan Corbet
May 9, 2012

"Bufferbloat" can be thought of as the buffering of too many packets in flight between two network end points, resulting in excessive delays and confusion of TCP's flow control algorithms. It may seem like a simple problem, but the simple solution—make buffers smaller—turns out not to work. A true solution to bufferbloat requires a deeper understanding of what is going on, combined with improved software across the net. A new paper from Kathleen Nichols and Van Jacobson provides some of that understanding and an algorithm for making things better—an algorithm that has been implemented first in Linux.

Your editor had a classic bufferbloat experience at a conference hotel last year. An attempt to copy a photograph to the LWN server (using scp) would consistently fail with a "response timeout" error. There was so much buffering in the path that scp was able to "send" the entire image before any of it had been received at the other end. The scp utility would then wait for a response from the remote end; that response would never come in time because most of the image had not, contrary to what scp thought, actually been transmitted. The solution was to use the -l option to slow down transmission to a rate closer to what the link could actually manage. With scp transmitting slower, it was able to come up with a more reasonable idea for when the data should be received by the remote end.

And that, of course, is the key to avoiding bufferbloat issues in general. A system transmitting packets onto the net should not be sending them more quickly than the slowest link on the path to the destination can handle them. TCP implementations are actually designed to figure out what the transmission rate should be and stick to it, but massive buffering defeats the algorithms used to determine that rate. One way around this problem is to force users to come up with a suitable rate manually, but that is not the sort of network experience most users want to have. It would be far better to find a solution that Just Works.

Part of that solution, according to Nichols and Jacobson, is a new algorithm called CoDel (for "controlled delay"). Before describing that algorithm, though, they make it clear that just making buffers smaller is not a real solution to the problem. Network buffers serve an important function: they absorb traffic spikes and equalize packet rates into and out of a system. A long packet queue is not necessarily a problem, especially during the startup phase of a network connection, but long queues as a steady state just add delays without improving throughput at all. The point of CoDel is to allow queues to grow when needed, but to try to keep the steady state at a reasonable level.

Various automated queue management algorithms have been tried over the years; they have tended to suffer from complexity and a need for manual configuration. Having to tweak parameters by hand was never a great solution even in ideal situations, but it fails completely in situations where the network load or link delay time can vary widely over time. Such situations are the norm on the contemporary Internet; as a result, there has been little use of automated queue management even in the face of obvious problems.

One of the key insights in the design of CoDel is that there is only one parameter that really matters: how long it takes a packet to make its way through the queue and be sent on toward its destination. And, in particular, CoDel is interested in the minimum delay time over a time interval of interest. If that minimum is too high, it indicates a standing backlog of packets in the queue that is never being cleared, and that, in turn, indicates that too much buffering is going on. So CoDel works by adding a timestamp to each packet as it is received and queued. When the packet reaches the head of the queue, the time spent in the queue is calculated; it is a simple calculation of a single value, with no locking required, so it will be fast.

Less time spent in queues is always better, but that time cannot always be zero. Built into CoDel is a maximum acceptable queue time, called target; if a packet's time in the queue exceeds this value, then the queue is deemed to be too long. But an overly-long queue is not, in itself, a problem, as long as the queue empties out again. CoDel defines a period (called interval) during which the time spent by packets in the queue should fall below target at least once; if that does not happen, CoDel will start dropping packets. Dropped packets are, of course, a signal to the sender that it needs to slow down, so, by dropping them, CoDel should cause a reduction in the rate of incoming packets, allowing the queue to drain. If the queue time remains above target, CoDel will drop progressively more packets. And that should be all it takes to keep queue lengths at reasonable values on a CoDel-managed node.

The target and interval parameters may seem out of place in an algorithm that is advertised as having no knobs in need of tweaking. What the authors have found, though, is that a target of 5ms and an interval of 100ms work well in just about any setting. The use of time values (rather than packet or byte counts) makes the algorithm function independently of the speed of the links it is managing, so there is no real need to adjust them. Of course, as they note, these are early results based mostly on simulations; what is needed now is experience using a functioning implementation on the real Internet.

That experience may not be long in coming, at least for some kinds of links; there is now a CoDel patch for Linux available thanks to Dave Täht and Eric Dumazet. This code is likely to find its way into the mainline fairly quickly; it will also be available in the CeroWrt router distribution. As the early CoDel implementation starts to see some real use, some shortcomings will doubtless be encountered and it may well lose some of its current simplicity. But it has every appearance of being an important component in the solution to the bufferbloat problem.

Of course, it's not the only component; the problem is more complex than that. There is still a need to look at buffer sizes throughout the stack; in many places, there is simply too much buffering in places where it can do no good. Wireless networking adds some interesting challenges of its own, with its quickly varying link speeds and complexities added by packet aggregation. There is also the little problem of getting updated software distributed across the net. So a full solution is still somewhat distant, but the understanding of the problem is clearly growing and some interesting approaches are beginning to appear.

Comments (43 posted)

Statistics from the 3.4 development cycle

By Jonathan Corbet
May 8, 2012

With the release of the 3.4-rc6 prepatch, Linus let it be known that he thought the final 3.4 release was probably not too far away. That can only mean one thing: it's time to look at the statistics for this development cycle. 3.4 was an active cycle, with an interesting surprise or two.

As of this writing, Linus has merged just over 10,700 changes for 3.4; those changes were contributed from 1,259 developers. The total growth of the kernel source this time around is 215,000 lines. The developers most active in this cycle were:

Most active 3.4 developers

By changesets

Mark Brown 284 2.7%

Russell King 211 2.0%

Johannes Berg 147 1.4%

Al Viro 136 1.3%

Axel Lin 133 1.2%

Johan Hedberg 122 1.1%

Guenter Roeck 121 1.1%

Masanari Iida 109 1.0%

Stanislav Kinsbursky 97 0.9%

Trond Myklebust 85 0.8%

Jiri Slaby 82 0.8%

Ben Hutchings 82 0.8%

Greg Kroah-Hartman 78 0.7%

Takashi Iwai 78 0.7%

Dan Carpenter 78 0.7%

Stephen Warren 76 0.7%

Stanislaw Gruszka 76 0.7%

Alex Deucher 73 0.7%

By changed lines

Joe Perches 56571 8.1%

Dan Magenheimer 24077 3.4%

Stephen Rothwell 17354 2.5%

Greg Kroah-Hartman 15015 2.1%

Mark Brown 12266 1.8%

Jiri Olsa 11842 1.7%

Mark A. Allyn 10976 1.6%

Stephen Warren 10386 1.5%

Arun Murthy 9347 1.3%

Ingo Molnar 8779 1.3%

Alex Deucher 8770 1.3%

David Howells 8034 1.2%

Guenter Roeck 7634 1.1%

Chris Kelly 7023 1.0%

Johannes Berg 6657 1.0%

Ben Hutchings 6650 1.0%

Al Viro 6628 0.9%

Russell King 6610 0.9%

Mark Brown finds himself at the top of the list of changeset contributors for the second cycle in a row; as usual, he has done a great deal of work with sound drivers and related subsystems. Russell King is the chief ARM maintainer; he has also taken an active role in the refactoring and cleanup of the ARM architecture code. Johannes Berg continues to do a lot of work with the mac80211 layer and the iwlwifi driver, Al Viro has been improving the VFS API and fixing issues throughout the kernel, and Axel Lin has done a lot of cleanup work in the ALSA and regulator subsystems and beyond.

Joe Perches leads the "lines changed" column with coding-style fixes, pr_*() conversions, and related work. Dan Magenheimer added the "ramster" memory sharing mechanism to the staging tree. Linux-next maintainer Stephen Rothwell made it into the "lines changed" column with the removal of a lot of old PowerPC code. Greg Kroah-Hartman works all over the tree, but the bulk of his changed lines were to be found in the staging tree.

Some 195 companies contributed changes during the 3.4 development cycle. The top contributors this time around were:

Most active 3.4 employers

By changesets

(None) 1156 10.8%

Intel 1138 10.6%

Red Hat 960 9.0%

(Unknown) 688 6.4%

Texas Instruments 428 4.0%

IBM 381 3.6%

Novell 372 3.5%

(Consultant) 298 2.8%

Wolfson Microelectronics 286 2.7%

Samsung 234 2.2%

Google 222 2.1%

Oracle 188 1.8%

Freescale 175 1.6%

Qualcomm 161 1.5%

Linaro 143 1.3%

Broadcom 140 1.3%

NetApp 133 1.2%

MiTAC 133 1.2%

AMD 132 1.2%

By lines changed

(None) 108509 15.5%

Intel 67464 9.7%

Red Hat 65966 9.4%

(Unknown) 50900 7.3%

IBM 36800 5.3%

Oracle 26617 3.8%

Texas Instruments 25687 3.7%

Samsung 24966 3.6%

NVidia 20604 2.9%

Linux Foundation 16917 2.4%

ST Ericsson 15792 2.3%

Novell 15185 2.2%

Wolfson Microelectronics 14039 2.0%

(Consultant) 13495 1.9%

AMD 10151 1.5%

Freescale 10102 1.4%

Linaro 9360 1.3%

Google 9070 1.3%

Qualcomm 8972 1.3%

A longstanding invariant in the above table has been Red Hat as the top corporate contributor; in 3.4, however, Red Hat has been pushed down one position by Intel. Red Hat's contributions are down somewhat; 960 changesets in 3.4 compared to 1,290 in 3.3. But the more significant change is the burst of activity from Intel. This work is mostly centered around support for Intel's own hardware, as one would expect, but also extends to things like support for the x32 ABI. Meanwhile, Texas Instruments continues the growth in participation seen over the last few years, as do a number of other mobile and embedded companies. Once upon a time, it was said that Linux development was dominated by "big iron" enterprise-oriented companies; those companies have not gone away, but they are clearly not the only driving force behind Linux kernel development at this point. On the other hand, the participation by volunteers is at the lowest level seen in many cycles, continuing a longstanding trend.

A brief focus on ARM

Recent development cycles have seen a lot of work in the ARM subtree, and 3.4 is no exception; 1,100 changesets touched code in arch/arm this time around. Those changes were contributed by 178 developers representing 51 companies. Among those companies, the most active were:

Most active 3.4 employers (ARM subtree)

By changesets

(Consultant) 149 13.5%

Texas Instruments 121 11.0%

(None) 103 9.4%

Samsung 91 8.3%

Linaro 80 7.3%

NVidia 54 4.9%

ARM 52 4.7%

(Unknown) 48 4.4%

Calxeda 46 4.2%

Freescale 40 3.6%

Atmel 37 3.4%

Atomide 30 2.7%

OpenSource AB 24 2.2%

Google 23 2.1%

ST Ericsson 23 2.1%

By lines changed

Samsung 8162 16.8%

(None) 5967 12.3%

NVidia 4929 10.2%

(Consultant) 4755 9.8%

Linaro 3550 7.3%

Texas Instruments 3118 6.4%

ARM 2659 5.5%

Calxeda 2408 5.0%

Atmel 2080 4.3%

(Unknown) 1862 3.8%

Vista-Silicon S.L. 1121 2.3%

Freescale 1117 2.3%

Atomide 1005 2.1%

Google 737 1.5%

PHILOSYS Software 659 1.4%

ARM is clearly an active area for consultants, who contributed over 13% of the changes this time around. Otherwise, there are few surprises to be seen in this area; the companies working in the mobile area are the biggest contributors to the ARM tree, while those focused on other types of systems have little presence here.

There is one other way to look at ARM development. Much of the work on ARM is done through the Linaro consortium. Many developers contributing code from a linaro.com address are "on loan" from other companies; the above table, to the extent possible, credits those changes to the "real" employer that paid for the work. If, instead, all changes from a Linaro address are credited to Linaro, the results change: Linaro, with 11.9% of all the changes in arch/arm, becomes the top employer, though it still accounts for fewer changes than independent consultants do. Linaro clearly has become an important part of the ARM development community.

In summary, it has been another busy and productive development cycle in the kernel community. Despite the usual hiccups, things are stabilizing and chances are good that 3.4-rc7 will be the last prepatch, meaning that this cycle will be a relatively short one. There is little rest for kernel developers, though; the 3.5 cycle with its frantic merge window will start shortly thereafter. Stay tuned to LWN, as always, for ongoing coverage of development in this large and energetic community.

Comments (1 posted)

Supporting multi-platform ARM kernels

By Jonathan Corbet
May 9, 2012

The diversity of the ARM architecture is one of its great strengths: manufacturers have been able to create a wide range of interesting system-on-chip devices around the common ARM processor core. But this diversity, combined with a general lack of hardware discoverability, makes ARM systems hard to support in the kernel. As things stand now, a special kernel must be built for any specific ARM system. With most other architectures, it is possible to support most or all systems with a single binary kernel (or maybe two for 32-bit and 64-bit configurations). In the ARM realm, there is no single binary kernel that can run everywhere. Work is being done to improve that situation, but some interesting decisions will have to be made on the way.

On an x86 system, the kernel is, for the most part, able to boot and ask the hardware to describe itself; kernels can thus configure themselves for the specific system on which they are run. In the ARM world, the hardware usually has no such capability, so the kernel must be told which devices are present and where they can be found. Traditionally, this configuration has been done in "board files," which have a number of tasks:

Define any system-specific functions and setup code.
Create a description of the available peripherals, usually through the definition of a number of platform devices.
Create a special machine description structure that includes a magic number defined for that particular system. That number must be passed to the kernel by the bootloader; the kernel uses it to find the machine description for the specific system being booted.

There are currently hundreds of board files in the ARM architecture subtree, and some unknown number of them shipped in devices but never contributed upstream. Within a given platform type (a specific system-on-chip line from a vendor), it is often possible to build multiple board files into a single kernel, with the actual machine type being specified at boot time. But combining board files across platform types is not generally possible.

One of the main goals of the current flurry of work in the ARM subtree is to make multi-platform kernels possible. An important step in that direction is the elimination of board files as much as possible; they are being replaced with device trees. In the end, a board file is largely a static data structure describing the topology of the system; that data structure can just as easily be put into a text file passed into the kernel by the boot loader. By moving the hardware configuration information out of the kernel itself, the ARM developers make the kernel more easily applicable to a wider variety of hardware. There are a lot of other things to be done before we have true multi-platform support—work toward properly abstracting interrupts and clocks continues, for example—but device tree support is an important piece of the puzzle.

Arnd Bergmann recently asked a question to the kernel development community: does it make sense to support legacy board files in multi-platform kernels? Or would it be better to limit support to systems that use device trees for hardware enumeration? Arnd was pretty clear on what his own position was:

My feeling is that we should just mandate DT booting for multiplatform kernels, because it significantly reduces the combinatorial space at compile time, avoids a lot of legacy board files that we cannot test anyway, reduces the total kernel size and gives an incentive for people to move forward to DT with their existing boards.

There was a surprising amount of opposition to this idea. Some developers seemed to interpret Arnd's message as a call to drop support for systems that lack device tree support, but that is not the point at all. Current single-platform builds will continue to work as they always have; nobody is trying to take that away. The point, instead, is to make life easier for developers trying to make multi-platform builds work; multi-platform ARM kernels have never worked in the past, so excluding some systems will not deprive their users of anything they already had.

Some others saw it as an arbitrary restriction without any real technical basis. There is nothing standing in the way of including non-device-tree systems in a multi-platform kernel except the extra complexity and bloat that they bring. But complexity and bloat are technical problems, especially when the problem being solved is difficult enough as it is. It was also pointed out that there are some older platforms that have not seen any real maintenance in recent times, but which are still useful for users.

In the end, it will come down to what the users of multi-platform ARM kernels want. It was not immediately clear to everybody that there are users for such kernels: ARM kernels are usually targeted to specific devices, so adding support for other systems gives no benefit at all. Thus, embedded systems manufacturers are likely to be uninterested in multi-platform support. Distributors are another story, though; they would like to support a wide range of systems without having to build large numbers of kernels. As Debian developer Wookey put it:

We are keen on multiplatform kernels because building a great pile of different ones is a massive pain (and not just for arm because it holds up security updates), and if we could still cover all that lot with one kernel, or indeed any number less than 7 that would be great.

In response, Arnd amended his proposal to allow board files for subarchitectures that don't look likely to support device trees anytime soon. At that point, the discussion wound down without any sort of formal conclusion. The topic will likely be discussed at the upcoming Linaro Connect event and, probably, afterward as well. There are a number of other issues to be dealt with before multi-platform ARM kernels are a reality; that gives some time for this particular decision to be considered with all the relevant needs in mind.

Comments (6 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.4-rc6 ?

Greg KH Linux 3.3.5 ?

Greg KH Linux 3.0.31 ?

Architecture-specific

Jarkko Sakkinen x86, realmode: new infrastructure for realmode code ?

Core kernel code

Thomas Gleixner init_task: Use a generic init_task variant ?

Colin Cross coupled cpuidle state support ?

Michael S. Tsirkin bitops: add _local bitops ?

Development tools

Hui Zhu KGTP (Linux Kernel debugger and tracer) lite patch for review ?

Yan, Zheng perf: Intel uncore pmu counting support ?

Device drivers

Johan Hovold mfd: add LM3533 lighting-power chip driver ?

Venkatraman S [FS, MM, block, MMC]: eMMC High Priority Interrupt Feature ?

Huang Ying PCIe, Add PCIe runtime D3cold support ?

Tony Lindgren [PATCH] pinctrl: Add generic pinctrl-simple driver that supports omap2+ padconf ?

Tom Lin Input:Elan I2C touchpad driver for HID over I2C ?

Sylwester Nawrocki V4L: camera control enhancements ?

Samuel Iglesias Gonsalvez Staging: IndustryPack bus for the Linux Kernel ?

Chanwoo Choi Extcon: add MAX8997 extcon driver ?

Documentation

Michael Kerrisk (man-pages) man-pages-3.40 is released ?

Christoph Hellwig XFS status update for March 2012 ?

Christoph Hellwig XFS status update for April 2012 ?

Filesystems and block I/O

Jeff Layton vfs: add the ability to retry on ESTALE to several syscalls ?

Richard Weinberger UBI: UBIVIS (aka checkpointing) support ?

Networking

Ian Campbell skb paged fragment destructors ?

Eric Dumazet codel: Controlled Delay AQM ?

Virtualization and containers

Stanislav Kinsbursky Lockd: grace period containerization ?

Miscellaneous

NeilBrown ANNOUNCE: mdadm 3.2.4 - A tool for managing Soft RAID under Linux ?

Page editor: Jonathan Corbet
Next page: Distributions>>