Kernel development
Brief items
Kernel release status
The current 2.6 release is 2.6.13, announced by Linus on August 28. Only a small number of relatively important fixes went in since -rc7. For those just tuning in, 2.6.13 includes inotify, support for the Xtensa architecture, kexec and kdump, execute-in-place support, a configuration-time selectable clock interrupt frequency (the default for i386 changes to 250 Hz), a much-improved CFQ I/O scheduler with I/O priority support, the voluntary preemption patches, the removal of the devfs configuration option (though the code remains in place for the moment), and more. The long-format changelog contains the details for the patches merged since 2.6.13-rc7.
The floodgates have opened for 2.6.14; Linus's git repository includes a large InfiniBand update (with a shared receive queue implementation), a PHY abstraction layer for ethernet drivers, a serial ATA update, four-level page table support for the ppc64 architecture, some sk_buff structure shrinking patches, a big netfilter update (including a netlink interface to a number of netfilter internals and a user-space packet logging capability), a new linked list primitive, a DCCP implementation (see below), and more.
The current -mm release remains 2.6.13-rc6-mm2; there have been no -mm releases over the last week.
The current stable 2.6 kernel is 2.6.12.6, released on August 29. This one will be the last in the 2.6.12.x series, now that 2.6.13 is out; it contains a small number of important fixes.
Kernel development news
Chelsio responds to the TOE article
Last week's Kernel Page included an article about the TCP offload engine patch proposed by Chelsio Communications. That article reflected the criticisms of the TOE approach which have been heard on the development lists. In response, Chelsio's Wael Noureddine has sent us a letter defending TCP offload engines. That letter appears in this week's Letters to the Editor page. It merits a mention here, however, since it provides a different view of the situation than was seen on this page last week. Readers who do not normally get to the back page may want to have a look this time around.
Linux gets DCCP
For many years, the bulk of networking over IP has made use of just two protocols: transmission control protocol (TCP) and user datagram protocol (UDP). TCP offers a reliable, stream-oriented connection which works well for a large variety of higher-level network protocols. UDP, instead, makes a best effort to move individual packets from one host to another, but makes no promises regarding reliability or ordering. Most higher-level protocols are built upon TCP, but there are applications which are better served by UDP. These include:
- Protocols involving brief exchanges which will be slowed unacceptably by TCP's connection overhead. A classic example is the domain name system, which can often achieve a name lookup with a single packet in each direction.
- Protocols where timely delivery is more important than reliability. These include internet telephony, streaming media, and certain kinds of online games. If the network drops a packet, TCP will stall the data flow until the sending side gets a successful retransmission through. But a telephony application would rather keep the data flowing and just do without the missing packet.
The second type of application listed above is an increasingly problematic user of UDP. Streaming applications are a growing portion of the total traffic on the net, and they can be the cause of significant congestion. Unlike TCP, however, UDP has no concept of congestion control. In the absence of any sort of connection information, there is no way to control how any given application responds to network congestion. Early versions of TCP, lacking congestion control, brought about the virtual collapse of the early Internet; some fear that the growth of UDP-based traffic could lead to similar problems in the near future.
This concern has led to the creation of the datagram congestion control protocol (DCCP), which is described by this draft RFC. Like UDP, DCCP is a datagram protocol. It differs from UDP, however, in that it includes a congestion control mechanism. Eventually, it is hoped that users of high-bandwidth, datagram-oriented protocols will move over to DCCP as a way of getting better network utilization while being fair to the net as a whole. Further down the road, after DCCP has proved itself, it would not be surprising to see backbone network routers beginning to discriminate against high bandwidth UDP users.
DCCP is a connection-oriented protocol, requiring a three-packet handshake before data can be transferred. For this reason, it is unlikely to take over from UDP in some areas, such as for DNS lookups. (There is a provision in the protocol for sending data with the connection initiation packet, but implementations are not required to accept that data). The higher-bandwidth applications tend to use longer-lived connections, however, so they should not even notice the connection setup overhead.
Actually, DCCP uses a concept known as "half connections." A DCCP half connection is a one-way, unreliable data pipe; most applications will create two half connections to send data in both directions. The two half connections can be tied together to the point that, as with TCP, a data packet traveling in one direction can carry an acknowledgment for data received from the other. In other respects, however, the two half connections are distinctly separate from each other.
One way in which this separation can be seen is with congestion control. TCP hides congestion control from user space entirely; it is handled by the protocol, with the system administrator having some say over which algorithms are used. DCCP, on the other hand, recognizes that different protocols will have different needs, and allows each half connection to negotiate its own congestion control regime. There are currently two "congestion control ID profiles" (CCIDs) defined:
- CCID 2 uses an algorithm much like that used with TCP. A congestion window is used which can vary rapidly depending on net conditions; this algorithm will be quick to take advantage of available bandwidth, and equally quick to slow things down when congestion is detected. (See this LWN article for more information on how TCP congestion control works).
- CCID 3, called "TCP-friendly rate control" or TFRC, aims to avoid quick changes in bandwidth use while remaining fair to other network users. To this end, TFRC will respond more slowly to network events (such as dropped packets) but will, over time, converge to a bandwidth utilization similar to what TCP would choose.
It is anticipated that applications which send steady streams of packets (telephony and streaming media, for example) would elect to use TFRC congestion control. For this sort of application, keeping the data flowing is more important than using every bit of bandwidth which is available at the moment. A control connection for an online game, instead, may be best served by getting packets through as quickly as possible; applications using this sort of connection may opt for the traditional TCP congestion control mechanism.
DCCP has a number of other features aimed at minimization of overhead, resistance to denial of service attacks, and more. For the most part, however, it can be seen as a form of UDP with explicit connections and congestion control. Porting UDP applications to DCCP should not be particularly challenging - once platforms with DCCP support have been deployed on the net.
To that end, one of the first things which was merged for 2.6.14 was a DCCP implementation for Linux. This work was done by Arnaldo Carvalho de Melo, Ian McDonald, and others. It is a significant bunch of code; beyond the DCCP implementation itself, Arnaldo has done a lot of work to generalize parts of the Linux network stack. Much of the code which was once useful only for TCP or UDP can now also be shared with DCCP.
For now, only CCID 3 (TFRC) has been implemented. A CCID 2 implementation, taking advantage of the TCP congestion control code, will follow. Even before that happens, however, the 2.6.14 kernel will contain the first widely deployed DCCP implementation on the net. As such, it will likely help to find some of the remaining glitches in the protocol and shape its future evolution. When DCCP hits the mainstream, one can be reasonably sure that the Linux implementation will be second to none.
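To get a feel for what a ported application might look like at the socket level, here is a minimal client sketch built on the standard socket API. It is not taken from the merged code; the SOCK_DCCP, IPPROTO_DCCP, SOL_DCCP, and DCCP_SOCKOPT_SERVICE constants (and their fallback values) are your editor's assumptions based on the posted patches, and the service code of 42 is purely hypothetical. CCID negotiation happens inside the kernel, per half connection, as described above.

    /*
     * Rough DCCP client sketch; the constants below are assumptions
     * based on the posted patches and may not match what 2.6.14 ships.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef SOCK_DCCP
    #define SOCK_DCCP             6    /* DCCP socket type */
    #endif
    #ifndef IPPROTO_DCCP
    #define IPPROTO_DCCP          33   /* IANA protocol number for DCCP */
    #endif
    #ifndef SOL_DCCP
    #define SOL_DCCP              269
    #endif
    #ifndef DCCP_SOCKOPT_SERVICE
    #define DCCP_SOCKOPT_SERVICE  2
    #endif

    int main(int argc, char **argv)
    {
            struct sockaddr_in addr;
            uint32_t service = htonl(42);   /* hypothetical service code */
            int fd;

            if (argc != 3) {
                    fprintf(stderr, "usage: %s <server-ip> <port>\n", argv[0]);
                    return 1;
            }

            /* Create the socket much as one would for TCP or UDP. */
            fd = socket(AF_INET, SOCK_DCCP, IPPROTO_DCCP);
            if (fd < 0) {
                    perror("socket");
                    return 1;
            }

            /* DCCP connections carry a service code identifying the application. */
            if (setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_SERVICE,
                           &service, sizeof(service)) < 0)
                    perror("DCCP_SOCKOPT_SERVICE");

            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(atoi(argv[2]));
            inet_pton(AF_INET, argv[1], &addr.sin_addr);

            /* The three-packet handshake happens here, as with TCP's connect(). */
            if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
                    perror("connect");
                    close(fd);
                    return 1;
            }

            /* From here on, send() moves individual datagrams, not a byte stream. */
            send(fd, "hello", 5, 0);
            close(fd);
            return 0;
    }

Beyond the new constants and the service code, the result looks just like a UDP or TCP application, which is why porting should not be a major undertaking.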
The state of the dynamic tick patch
The configurable timer interrupt frequency patch, part of the 2.6.13 kernel, led to a certain amount of controversy over the optimal default value. That default is 250 Hz, but there were arguments in favor of both increasing and decreasing that value. There was no consensus on what the default should really be, but there is a certain amount of agreement that the real solution is to merge the dynamic tick patch. By varying the timer interrupt frequency in response to the actual system workload, the dynamic tick approach should be able to satisfy most users.
Now that patches are being merged for 2.6.14, the obvious question came up: will dynamic tick be one of them? The answer, it seems, is almost certainly "no." This patch, despite being around in one form or another for years, is still not quite ready.
One issue, apparently, is that systems running with dynamic tick tend to boot slowly, and nobody has yet figured out why. The problem can be masked by simply waiting until the system has booted before turning on dynamic tick, but that solution appeals to nobody. Until this behavior is understood, there will almost certainly be opposition to the merging of this patch.
Another problem with the current patch is that it does not work particularly well on SMP systems. It requires that all CPUs go idle before the timer interrupt frequency can be reduced. But an SMP system may well have individual CPUs with no work to do while others are busy; such a situation could come up fairly often. Srivatsa Vaddagiri is working on a patch for SMP systems, but it is still a work in progress and has not received widespread testing.
The end result is that dynamic tick is unlikely to come together in time to get into 2.6.14; the window for merging of patches of this magnitude is supposed to close within a week or so. So this patch will be for 2.6.15 at the earliest. If the revised development process works as planned, 2.6.15 should not be all that far away. Hopefully.
Improving shared memory performance
When a process forks, the kernel must copy that process's memory space for the new child. Linux has long avoided copying the memory itself; anything which cannot be shared is simply marked "copy on write" and left in place until one process or the other does something to force a particular page to be copied. The kernel does copy the process's page tables, however. If the parent process has a large address space, that copy can take a long time.
Recently, Ray Fucillo noted that the amount of time required to create a new process increased notably with the size of any shared memory segments that process was using. After some discussion, Nick Piggin came up with a quick fix: don't bother copying page tables in cases where the kernel will be able to reconstruct them at page fault time anyway. This small patch takes away the fork() penalty for large shared mappings. In many cases, it will make fork() more efficient in general; if the child process never uses those parts of its address space (if it simply uses exec() to run another program, say), the setup and teardown overhead can be avoided altogether. On the other hand, if the child process does use those mappings, a higher cost will be paid overall. Rebuilding page tables one-by-one in response to faults is more expensive than simply copying them in bulk at fork() time. The consensus seems to be that the tradeoff is worthwhile, however, and this patch has been merged for 2.6.14. If any serious performance regressions result, they will hopefully be found before 2.6.14 is released.
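The slowdown Ray reported is easy to observe from user space. The following sketch is not part of the kernel discussion; the 256MB size and the use of an anonymous MAP_SHARED mapping are arbitrary choices for illustration. It times fork() before and after creating and touching a large shared mapping; on kernels without Nick's patch the second fork() should be measurably slower, while with the deferred copying in place the gap should mostly disappear.

    /*
     * Time fork() with and without a large shared mapping in the
     * address space.  Build with: gcc -o forktime forktime.c -lrt
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define MAP_SIZE (256UL * 1024 * 1024)   /* 256MB shared segment */

    /* Return the time taken by one fork()/exit cycle, in microseconds. */
    static double time_fork(void)
    {
            struct timespec start, end;
            pid_t pid;

            clock_gettime(CLOCK_MONOTONIC, &start);
            pid = fork();
            if (pid == 0)
                    _exit(0);               /* child does nothing */
            clock_gettime(CLOCK_MONOTONIC, &end);
            waitpid(pid, NULL, 0);

            return (end.tv_sec - start.tv_sec) * 1e6 +
                   (end.tv_nsec - start.tv_nsec) / 1e3;
    }

    int main(void)
    {
            char *map;

            printf("fork() with no large mapping: %.0f us\n", time_fork());

            /* Create and touch a large MAP_SHARED region so its page tables exist. */
            map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (map == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(map, 1, MAP_SIZE);

            printf("fork() with %lu MB shared mapping: %.0f us\n",
                   MAP_SIZE >> 20, time_fork());

            munmap(map, MAP_SIZE);
            return 0;
    }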
One might well ask, however: why bother copying page tables for shared mappings at all? Since the mappings are shared, the associated page tables might as well be too. Sharing page tables would cut down on fork() overhead, save the memory used to store multiple copies of the tables, improve translation buffer performance, and reduce the number of page faults handled by the kernel. To this end, Dave McCracken has posted a new shared page table patch. This patch is simpler than previous versions in that it does not attempt to perform copy-on-write sharing of private mappings; instead, it restricts itself to mappings which are, themselves, shared. Since most processes have a few of these (consider shared libraries, for example), even the smaller patch can achieve a fair amount of sharing.
For the most part, sharing of page tables is straightforward; the kernel need only avoid copying them and point a new process's page directories to the shared tables. The one problem which does come up is reference counting. When each process has its own page tables, it is easy to know when those tables are no longer used. When a page table can be used by more than one process, however, the kernel needs a way to keep track of how many users each table has. The shared page table patch addresses this by using the _mapcount field in the page structure describing the page table page itself.
[Yes, page tables can already be shared by threads which share an entire address space. In that case, however, the kernel can track usage by looking at references to the full address space, rather than to individual portions of it.]
Not everybody is convinced that shared page tables are a good idea. The added complexity may not be justified by the resulting performance gains. Dave claims a 3% improvement on an unnamed "industry standard database benchmark," which is significant. There is also a fundamental conflict between shared page tables and address space randomization. For page tables to be shared, the corresponding mappings must be at the same virtual address in every process, but randomization explicitly breaks that assumption. Dave apparently has ideas for making the patch work in the presence of randomization (if the alignment of the mappings works out), but, for now, the two features are incompatible.
It has also been asked: do shared page tables still yield a performance benefit when Nick's deferred page table copying patch is taken into account? The answer would appear to be "yes." The deferred copying patch is entirely aimed at shortening the process creation time. Shared page tables should also help in that regard, but, unlike the copying patch (which may hurt ongoing performance slightly until the page tables are populated), shared page tables speed things up throughout the life of the process. So there may well be room in the kernel for both patches.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet
