Kernel development
Brief items
Kernel release status
The current 2.6 release is 2.6.13, announced by Linus on August 28. Only a small number of relatively important fixes went in since -rc7. For those just tuning in, 2.6.13 includes inotify, support for the Xtensa architecture, kexec and kdump, execute-in-place support, a configuration-time selectable clock interrupt frequency (the default for i386 changes to 250 Hz), a much-improved CFQ I/O scheduler with I/O priority support, the voluntary preemption patches, the removal of the devfs configuration option (though the code remains in place for the moment), and more. The long-format changelog contains the details for the patches merged since 2.6.13-rc7.
The floodgates have opened for 2.6.14; Linus's git repository includes a large InfiniBand update (with a shared receive queue implementation), a PHY abstraction layer for ethernet drivers, a serial ATA update, four-level page table support for the ppc64 architecture, some sk_buff structure shrinking patches, a big netfilter update (including a netlink interface to a number of netfilter internals and a user-space packet logging capability), a new linked list primitive, a DCCP implementation (see below), and more.
The current -mm release remains 2.6.13-rc6-mm2; there have been no -mm releases over the last week.
The current stable 2.6 kernel is 2.6.12.6, released on August 29. This one will be the last in the 2.6.12.x series, now that 2.6.13 is out; it contains a small number of important fixes.
Kernel development news
Chelsio responds to the TOE article
Last week's Kernel Page included an article about the TCP offload engine patch proposed by Chelsio Communications. That article reflected the criticisms of the TOE approach which have been heard on the development lists. In response, Chelsio's Wael Noureddine has sent us a letter defending TCP offload engines. That letter appears in this week's Letters to the Editor page. It merits a mention here, however, since it provides a different view of the situation than was seen on this page last week. Readers who do not normally get to the back page may want to have a look this time around.
Linux gets DCCP
For many years, the bulk of networking over IP has made use of just two protocols: transmission control protocol (TCP) and user datagram protocol (UDP). TCP offers a reliable, stream-oriented connection which works well for a large variety of higher-level network protocols. UDP, instead, makes a best effort to move individual packets from one host to another, but makes no promises regarding reliability or ordering. Most higher-level protocols are built upon TCP, but there are applications which are better served by UDP. These include:
- Protocols involving brief exchanges which will be slowed unacceptably by TCP's connection overhead. A classic example is the domain name system, which can often achieve a name lookup with a single packet in each direction.
- Protocols where timely delivery is more important than reliability. These include internet telephony, streaming media, and certain kinds of online games. If the network drops a packet, TCP will stall the data flow until the sending side gets a successful retransmission through. But a telephony application would rather keep the data flowing and just do without the missing packet.
The second type of application listed above is an increasingly problematic user of UDP. Streaming applications are a growing portion of the total traffic on the net, and they can be the cause of significant congestion. Unlike TCP, however, UDP has no concept of congestion control. In the absence of any sort of connection information, there is no way to control how any given application responds to network congestion. Early versions of TCP, lacking congestion control, brought about the virtual collapse of the early Internet; some fear that the growth of UDP-based traffic could lead to similar problems in the near future.
This concern has led to the creation of the datagram congestion control protocol (DCCP), which is described by this draft RFC. Like UDP, DCCP is a datagram protocol. It differs from UDP, however, in that it includes a congestion control mechanism. Eventually, it is hoped that users of high-bandwidth, datagram-oriented protocols will move over to DCCP as a way of getting better network utilization while being fair to the net as a whole. Further down the road, after DCCP has proved itself, it would not be surprising to see backbone network routers beginning to discriminate against high bandwidth UDP users.
DCCP is a connection-oriented protocol, requiring a three-packet handshake before data can be transferred. For this reason, it is unlikely to take over from UDP in some areas, such as for DNS lookups. (There is a provision in the protocol for sending data with the connection initiation packet, but implementations are not required to accept that data). The higher-bandwidth applications tend to use longer-lived connections, however, so they should not even notice the connection setup overhead.
Actually, DCCP uses a concept known as "half connections." A DCCP half connection is a one-way, unreliable data pipe; most applications will create two half connections to send data in both directions. The two half connections can be tied together to the point that, as with TCP, a data packet traveling in one direction can carry an acknowledgment for data received from the other. In other respects, however, the two half connections are distinctly separate from each other.
One way in which this separation can be seen is with congestion control. TCP hides congestion control from user space entirely; it is handled by the protocol, with the system administrator having some say over which algorithms are used. DCCP, on the other hand, recognizes that different protocols will have different needs, and allows each half connection to negotiate its own congestion control regime. There are currently two "congestion control ID profiles" (CCIDs) defined:
- CCID 2 uses an algorithm much like that used with TCP. A congestion window is used which can vary rapidly depending on net conditions; this algorithm will be quick to take advantage of available bandwidth, and equally quick to slow things down when congestion is detected. (See this LWN article for more information on how TCP congestion control works).
- CCID 3, called "TCP-friendly rate control" or TFRC, aims to avoid quick changes in bandwidth use while remaining fair to other network users. To this end, TFRC will respond more slowly to network events (such as dropped packets) but will, over time, converge to a bandwidth utilization similar to what TCP would choose.
It is anticipated that applications which send steady streams of packets (telephony and streaming media, for example) would elect to use TFRC congestion control. For this sort of application, keeping the data flowing is more important than using every bit of bandwidth which is available at the moment. A control connection for an online game, instead, may be best served by getting packets through as quickly as possible; applications using this sort of connection may opt for the traditional TCP congestion control mechanism.
DCCP has a number of other features aimed at minimization of overhead, resistance to denial of service attacks, and more. For the most part, however, it can be seen as a form of UDP with explicit connections and congestion control. Porting UDP applications to DCCP should not be particularly challenging - once platforms with DCCP support have been deployed on the net.
To that end, one of the first things which was merged for 2.6.14 was a DCCP implementation for Linux. This work was done by Arnaldo Carvalho de Melo, Ian McDonald, and others. It is a significant bunch of code; beyond the DCCP implementation itself, Arnaldo has done a lot of work to generalize parts of the Linux network stack. Much of the code which was once useful only for TCP or UDP can now also be shared with DCCP.
For now, only CCID 3 (TFRC) has been implemented. A CCID 2 implementation, taking advantage of the TCP congestion control code, will follow. Even before that happens, however, the 2.6.14 kernel will contain the first widely deployed DCCP implementation on the net. As such, it will likely help to find some of the remaining glitches in the protocol and shape its future evolution. When DCCP hits the mainstream, one can be reasonably sure that the Linux implementation will be second to none.
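To get a feel for what a ported application might look like at the socket level, here is a minimal client sketch built on the standard socket API. It is not taken from the merged code; the SOCK_DCCP, IPPROTO_DCCP, SOL_DCCP, and DCCP_SOCKOPT_SERVICE constants (and their fallback values) are your editor's assumptions based on the posted patches, and the service code of 42 is purely hypothetical. CCID negotiation happens inside the kernel, per half connection, as described above.

    /*
     * Rough DCCP client sketch; the constants below are assumptions
     * based on the posted patches and may not match what 2.6.14 ships.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef SOCK_DCCP
    #define SOCK_DCCP             6    /* DCCP socket type */
    #endif
    #ifndef IPPROTO_DCCP
    #define IPPROTO_DCCP          33   /* IANA protocol number for DCCP */
    #endif
    #ifndef SOL_DCCP
    #define SOL_DCCP              269
    #endif
    #ifndef DCCP_SOCKOPT_SERVICE
    #define DCCP_SOCKOPT_SERVICE  2
    #endif

    int main(int argc, char **argv)
    {
            struct sockaddr_in addr;
            uint32_t service = htonl(42);   /* hypothetical service code */
            int fd;

            if (argc != 3) {
                    fprintf(stderr, "usage: %s <server-ip> <port>\n", argv[0]);
                    return 1;
            }

            /* Create the socket much as one would for TCP or UDP. */
            fd = socket(AF_INET, SOCK_DCCP, IPPROTO_DCCP);
            if (fd < 0) {
                    perror("socket");
                    return 1;
            }

            /* DCCP connections carry a service code identifying the application. */
            if (setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_SERVICE,
                           &service, sizeof(service)) < 0)
                    perror("DCCP_SOCKOPT_SERVICE");

            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(atoi(argv[2]));
            inet_pton(AF_INET, argv[1], &addr.sin_addr);

            /* The three-packet handshake happens here, as with TCP's connect(). */
            if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
                    perror("connect");
                    close(fd);
                    return 1;
            }

            /* From here on, send() moves individual datagrams, not a byte stream. */
            send(fd, "hello", 5, 0);
            close(fd);
            return 0;
    }

Beyond the new constants and the service code, the result looks just like a UDP or TCP application, which is why porting should not be a major undertaking.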
The state of the dynamic tick patch
The configurable timer interrupt frequency patch, part of the 2.6.13 kernel, led to a certain amount of controversy over the optimal default value. That default is 250 Hz, but there were arguments in favor of both increasing and decreasing that value. There was no consensus on what the default should really be, but there is a certain amount of agreement that the real solution is to merge the dynamic tick patch. By varying the timer interrupt frequency in response to the actual system workload, the dynamic tick approach should be able to satisfy most users.
Now that patches are being merged for 2.6.14, the obvious question came up: will dynamic tick be one of them? The answer, it seems, is almost certainly "no." This patch, despite being around in one form or another for years, is still not quite ready.
One issue, apparently, is that systems running with dynamic tick tend to boot slowly, and nobody has yet figured out why. The problem can be masked by simply waiting until the system has booted before turning on dynamic tick, but that solution appeals to nobody. Until this behavior is understood, there will almost certainly be opposition to the merging of this patch.
Another problem with the current patch is that it does not work particularly well on SMP systems. It requires that all CPUs go idle before the timer interrupt frequency can be reduced. But an SMP system may well have individual CPUs with no work to do while others are busy; such a situation could come up fairly often. Srivatsa Vaddagiri is working on a patch for SMP systems, but it is still a work in progress and has not received widespread testing.
The end result is that dynamic tick is unlikely to come together in time to get into 2.6.14; the window for merging of patches of this magnitude is supposed to close within a week or so. So this patch will be for 2.6.15 at the earliest. If the revised development process works as planned, 2.6.15 should not be all that far away. Hopefully.
Improving shared memory performance
When a process forks, the kernel must copy that process's memory space for the new child. Linux has long avoided copying the memory itself; anything which cannot be shared is simply marked "copy on write" and left in place until one process or the other does something to force a particular page to be copied. The kernel does copy the process's page tables, however. If the parent process has a large address space, that copy can take a long time.
Recently, Ray Fucillo noted that the amount of time required to create a new process increased notably with the size of any shared memory segments that process was using. After some discussion, Nick Piggin came up with a quick fix: don't bother copying page tables in cases where the kernel will be able to reconstruct them at page fault time anyway. This small patch takes away the fork() penalty for large shared mappings. In many cases, it will make fork() more efficient in general; if the child process never uses those parts of its address space (if it simply uses exec() to run another program, say), the setup and teardown overhead can be avoided altogether. On the other hand, if the child process does use those mappings, a higher cost will be paid overall. Rebuilding page tables one-by-one in response to faults is more expensive than simply copying them in bulk at fork() time. The consensus seems to be that the tradeoff is worthwhile, however, and this patch has been merged for 2.6.14. If any serious performance regressions result, they will hopefully be found before 2.6.14 is released.
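The slowdown Ray reported is easy to observe from user space. The following sketch is not part of the kernel discussion; the 256MB size and the use of an anonymous MAP_SHARED mapping are arbitrary choices for illustration. It times fork() before and after creating and touching a large shared mapping; on kernels without Nick's patch the second fork() should be measurably slower, while with the deferred copying in place the gap should mostly disappear.

    /*
     * Time fork() with and without a large shared mapping in the
     * address space.  Build with: gcc -o forktime forktime.c -lrt
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define MAP_SIZE (256UL * 1024 * 1024)   /* 256MB shared segment */

    /* Return the time taken by one fork()/exit cycle, in microseconds. */
    static double time_fork(void)
    {
            struct timespec start, end;
            pid_t pid;

            clock_gettime(CLOCK_MONOTONIC, &start);
            pid = fork();
            if (pid == 0)
                    _exit(0);               /* child does nothing */
            clock_gettime(CLOCK_MONOTONIC, &end);
            waitpid(pid, NULL, 0);

            return (end.tv_sec - start.tv_sec) * 1e6 +
                   (end.tv_nsec - start.tv_nsec) / 1e3;
    }

    int main(void)
    {
            char *map;

            printf("fork() with no large mapping: %.0f us\n", time_fork());

            /* Create and touch a large MAP_SHARED region so its page tables exist. */
            map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (map == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(map, 1, MAP_SIZE);

            printf("fork() with %lu MB shared mapping: %.0f us\n",
                   MAP_SIZE >> 20, time_fork());

            munmap(map, MAP_SIZE);
            return 0;
    }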
One might well ask, however: why bother copying page tables for shared mappings at all? Since the mappings are shared, the associated page tables might as well be too. Sharing page tables would cut down on fork() overhead, save the memory used to store multiple copies of the tables, improve translation buffer performance, and reduce the number of page faults handled by the kernel. To this end, Dave McCracken has posted a new shared page table patch. This patch is simpler than previous versions in that it does not attempt to perform copy-on-write sharing of private mappings; instead, it restricts itself to mappings which are, themselves, shared. Since most processes have a few of these (consider shared libraries, for example), even the smaller patch can achieve a fair amount of sharing.
For the most part, sharing of page tables is straightforward; the kernel need only avoid copying them and point a new process's page directories to the shared tables. The one problem which does come up is reference counting. When each process has its own page tables, it is easy to know when those tables are no longer used. When a page table can be used by more than one process, however, the kernel needs a way to keep track of how many users each table has. The shared page table patch addresses this by using the _mapcount field in the page structure describing the page table page itself.
[Yes, page tables can already be shared by threads which share an entire address space. In that case, however, the kernel can track usage by looking at references to the full address space, rather than to individual portions of it.]
Not everybody is convinced that shared page tables are a good idea. The added complexity may not be justified by the resulting performance gains. Dave claims a 3% improvement on an unnamed "industry standard database benchmark," which is significant. There is also a fundamental conflict between shared page tables and address space randomization. For page tables to be shared, the corresponding mappings must be at the same virtual address in every process, but randomization explicitly breaks that assumption. Dave apparently has ideas for making the patch work in the presence of randomization (if the alignment of the mappings works out), but, for now, the two features are incompatible.
It has also been asked: do shared page tables still yield a performance benefit when Nick's deferred page table copying patch is taken into account? The answer would appear to be "yes." The deferred copying patch is entirely aimed at shortening the process creation time. Shared page tables should also help in that regard, but, unlike the copying patch (which may hurt ongoing performance slightly until the page tables are populated), shared page tables speed things up throughout the life of the process. So there may well be room in the kernel for both patches.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet
