The current 2.6 release is 2.6.13, released
by Linus on August 28. Only a
small number of relatively important fixes went in since -rc7. For those
just tuning in, 2.6.13 includes inotify,
support for the Xtensa architecture, kexec and
kdump support, a configuration-time selectable clock interrupt frequency
(the default for i386 changes to 250 Hz), a much-improved CFQ I/O scheduler
with I/O priority support, the voluntary preemption patches, the removal of
the devfs configuration option (though the code remains in place for the
moment) and more. The changelog contains the details
for the patches merged since 2.6.13-rc7.
The floodgates have opened for 2.6.14; Linus's git repository includes a
large InfiniBand update (with a shared receive queue implementation), a PHY
abstraction layer for ethernet drivers, a serial ATA update, four-level
page table support for the ppc64 architecture, some sk_buff
structure shrinking patches, a big netfilter update (including a netlink
interface to a number of netfilter internals and a user-space packet
logging capability), a new linked list primitive, a DCCP implementation
(see below), and more.
The current -mm release remains 2.6.13-rc6-mm2; there have been
no -mm releases over the last week.
The current stable 2.6 kernel is 2.6.12.6, released on August 29.
This one will be the last in the 2.6.12.x series, now that 2.6.13 is out;
it contains a small number of important fixes.
Kernel development news
Last week's Kernel Page included an article about the TCP offload
engine (TOE) patch proposed by Chelsio Communications. That article
reflected the criticisms of the TOE approach which have been heard on the
development lists. In response, Chelsio's Wael Noureddine has sent us a letter
on offload engines. That letter appears in this week's Letters to the Editor
page. It merits a mention here, however, since it provides a different
view of the situation than was seen on this page last week. Readers who do
not normally get to the back page may want to have a look this time around.
For many years, the bulk of networking over IP has made use of just two
protocols: transmission control protocol (TCP) and user datagram protocol
(UDP). TCP offers a reliable, stream-oriented connection which works well
for a large variety of higher-level network protocols. UDP, instead, makes
a best effort to move individual packets from one host to another, but
makes no promises regarding reliability or ordering. Most higher-level
protocols are built upon TCP, but there are applications which are better
served by UDP. These include:
- Protocols involving brief exchanges which will be slowed unacceptably
by TCP's connection overhead. A classic example is the domain name
system, which can often achieve a name lookup with a single packet in
each direction.
- Protocols where timely delivery is more important than reliability.
These include internet telephony, streaming media, and certain kinds
of online games. If the network drops a packet, TCP will stall the
data flow until the sending side gets a successful retransmission
through. But a telephony application would rather keep the data
flowing and just do without the missing packet.
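The first case above, a brief exchange with no connection setup, can be sketched on the loopback interface. This is only an illustration: the "server" is a hypothetical stand-in for a name server, the `answer:` prefix is invented, and both ends live in one process for simplicity.

```python
import socket

def udp_round_trip(query: bytes, timeout: float = 2.0) -> bytes:
    """Send one datagram to a loopback 'server' and return its one-datagram reply."""
    # Hypothetical stand-in for a name server: a UDP socket on an ephemeral port.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        server.bind(("127.0.0.1", 0))
        server.settimeout(timeout)
        client.settimeout(timeout)

        # No handshake: the very first packet sent already carries the query.
        client.sendto(query, server.getsockname())

        # The kernel queues the datagram, so the same thread can read it back.
        request, client_addr = server.recvfrom(512)

        # The reply is likewise a single datagram (made-up payload format).
        server.sendto(b"answer:" + request, client_addr)

        reply, _ = client.recvfrom(512)
        return reply
    finally:
        client.close()
        server.close()
```

The whole exchange costs two packets. A TCP version of the same lookup would spend more than that on connection setup alone, before any query data moved.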
The second type of application listed above is an increasingly problematic
user of UDP. Streaming applications are a growing portion of the total
traffic on the net, and they can be the cause of significant congestion.
Unlike TCP, however, UDP has no concept of congestion control. In the
absence of any sort of connection information, there is no way to control
how any given application responds to network congestion. Early versions
of TCP, lacking congestion control, brought about the virtual collapse of
the early Internet; some fear that the growth of UDP-based traffic could
lead to similar problems in the near future.
This concern has led to the creation of the datagram congestion control
protocol (DCCP), which is described by this
draft RFC. Like UDP, DCCP is a datagram protocol. It differs from
UDP, however, in that it includes a congestion control mechanism.
Eventually, it is hoped that users of high-bandwidth, datagram-oriented
protocols will move over to DCCP as a way of getting better network
utilization while being fair to the net as a whole. Further down the road,
after DCCP has proved itself, it would not be surprising to see backbone
network routers beginning to discriminate against high-bandwidth UDP
traffic.
DCCP is a connection-oriented protocol, requiring a three-packet handshake
before data can be transferred. For this reason, it is unlikely to take
over from UDP in some areas, such as for DNS lookups. (There is a
provision in the protocol for sending data with the connection initiation
packet, but implementations are not required to accept that data.) Streaming
applications tend to use longer-lived connections, however, so they should
not even notice the connection setup overhead.
Actually, DCCP uses a concept known as "half connections." A DCCP half
connection is a one-way, unreliable data pipe; most applications will create two half
connections to send data in both directions. The two half connections can
be tied together to the point that, as with TCP, a data packet traveling in
one direction can carry an acknowledgment for data received from the
other. In other respects, however, the two half connections are distinctly
separate from each other.
One way in which this separation can be seen is with congestion control.
TCP hides congestion control from user space entirely; it is handled by the
protocol, with the system administrator having some say over which
algorithms are used. DCCP, on the other hand, recognizes that different
protocols will have different needs, and allows each half connection to
negotiate its own congestion control regime. There are currently two
"congestion control ID profiles" (CCIDs) defined:
- CCID 2 uses an algorithm much like that used with TCP. A congestion
window is used which can vary rapidly depending on net conditions;
this algorithm will be quick to take advantage of available bandwidth,
and equally quick to slow things down when congestion is detected.
(See this LWN article
for more information on how TCP congestion control works.)
- CCID 3, called "TCP-friendly rate control" or TFRC, aims to avoid
quick changes in bandwidth use while remaining fair to other network
users. To this end, TFRC will respond more slowly to network events
(such as dropped packets) but will, over time, converge to a bandwidth
utilization similar to what TCP would choose.
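The difference between the two profiles can be sketched roughly as follows. The first function is a bare additive-increase, multiplicative-decrease window of the sort TCP uses, with slow start and everything else left out. The second is the TCP throughput equation from the TFRC specification (RFC 3448), which a TFRC sender uses to pick its rate. The function names, and the choice of four round-trip times for the retransmission timeout (the RFC's suggested simplification), are assumptions made for this illustration; none of this is taken from the Linux code.

```python
import math

def aimd_window(loss_rounds, rounds, cwnd=1.0):
    """Trace a TCP-like congestion window, one value per round trip.

    The window grows by one packet per round trip and is halved in any
    round where a loss is seen. Slow start is deliberately omitted.
    """
    trace = []
    for r in range(rounds):
        if r in loss_rounds:
            cwnd = max(1.0, cwnd / 2.0)   # multiplicative decrease
        else:
            cwnd += 1.0                   # additive increase
        trace.append(cwnd)
    return trace

def tfrc_rate(s, rtt, p, b=1):
    """Allowed sending rate in bytes/second from the TCP throughput equation.

    s:   packet size in bytes
    rtt: round-trip time in seconds
    p:   loss event rate, 0 < p <= 1
    b:   packets acknowledged by one ACK
    """
    t_rto = 4.0 * rtt  # assumption: the simplification suggested by RFC 3448
    denom = (rtt * math.sqrt(2.0 * b * p / 3.0)
             + t_rto * 3.0 * math.sqrt(3.0 * b * p / 8.0)
             * p * (1.0 + 32.0 * p * p))
    return s / denom
```

A single loss halves the AIMD window on the spot. In TFRC, the loss rate p fed to the equation is an average over recent history, so one dropped packet moves the computed rate only a little; that averaging is where the smoother behavior comes from.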
It is anticipated that applications which send steady streams of packets
(telephony and streaming media, for example) would elect to use TFRC
congestion control. For this sort of application, keeping the data flowing
is more important than using every bit of bandwidth which is available at
the moment. A control connection for an online game, instead, may be best
served by getting packets through as quickly as possible; applications
using this sort of connection may opt for the traditional TCP congestion
control.
DCCP has a number of other features aimed at minimization of overhead,
resistance to denial of service attacks, and more. For the most part,
however, it can be seen as a form of UDP with explicit connections and
congestion control. Porting UDP applications to DCCP should not be
particularly challenging - once platforms with DCCP support have been
deployed on the net.
To that end, one of the first things which was merged for 2.6.14 was
a DCCP implementation for Linux. This work was done by Arnaldo Carvalho de
Melo, Ian McDonald, and others. It is a significant bunch of code; beyond
the DCCP implementation itself, Arnaldo has done a lot of work to
generalize parts of the Linux network stack. Much of the code which was
once useful only for TCP or UDP can now also be shared with DCCP.
For now, only CCID 3 (TFRC) has been implemented. A CCID 2
implementation, taking advantage of the TCP congestion control code, will
follow. Even before that, however, the 2.6.14 kernel will be the first
widely deployed DCCP implementation on the net. As such, it will likely
help to find some of the remaining glitches in the protocol and shape its
future evolution. When DCCP hits the mainstream, one can be reasonably
sure that the Linux implementation will be second to none.
The configurable timer interrupt frequency patch, part of the 2.6.13
kernel, led to a certain amount of controversy over the optimal default
value. That default is 250 Hz, but there are arguments in favor of both
increasing and decreasing that value. There was no consensus on what the
default should really be, but there is
a certain amount of agreement
that the real solution is to merge the dynamic
tick patch. By varying the timer interrupt frequency in response to
the actual system workload, the dynamic tick approach should be able to
satisfy most users.
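As a rough illustration of why this helps, consider one second on a mostly idle system. The model below is a toy and has nothing to do with the actual patch: time is divided into tick periods, and a dynamic-tick system is assumed to take a timer interrupt only in periods where the CPU is busy or a timer is due to expire. The function names and the example workload are invented.

```python
HZ = 250  # the 2.6.13 default for i386

def fixed_tick_interrupts(total_ticks):
    """With a fixed tick, every tick period raises one timer interrupt."""
    return total_ticks

def dynamic_tick_interrupts(total_ticks, busy_ticks, timer_ticks):
    """Count interrupts when ticks are skipped during idle periods.

    busy_ticks:  tick numbers during which the CPU is running something
    timer_ticks: tick numbers at which a pending timer expires
    An interrupt is assumed to be needed only in those periods.
    """
    needed = set(busy_ticks) | set(timer_ticks)
    return sum(1 for tick in range(total_ticks) if tick in needed)

def example():
    """One second: busy for the first 50 ticks, one timer due at tick 200."""
    busy = range(0, 50)
    timers = [200]
    return fixed_tick_interrupts(HZ), dynamic_tick_interrupts(HZ, busy, timers)
```

In this made-up workload the fixed tick costs 250 interrupts and the dynamic tick 51. A fully loaded system would see no difference at all.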
Now that patches are being merged for 2.6.14, the obvious question came up:
will dynamic tick be one of them? The answer, it seems, is almost
certainly "no." This patch, despite being around in one form or another
for years, is still not quite ready.
One issue, apparently, is that systems running with dynamic tick tend to
boot slowly, and nobody has yet figured out why. The problem can be masked
by simply waiting until the system has booted before turning on dynamic
tick, but that solution appeals to nobody. Until this behavior is
understood, there will almost certainly be opposition to the merging of
dynamic tick.
Another problem with the current patch is that it does not work
particularly well on SMP systems. It requires that all CPUs go idle
before the timer interrupt frequency can be reduced. But an SMP system may
well have individual CPUs with no work to do while others are busy; such a
situation could come up fairly often. Srivatsa Vaddagiri is working on a patch for SMP systems, but it is still a
work in progress and has not received widespread testing.
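The cost of the all-CPUs-idle requirement can be seen in a small model. Here `idle[cpu][tick]` is a hypothetical record of whether a given CPU had nothing to do during a given tick period. The first function counts the timer interrupts which can be skipped under the rule described above; the second counts what an imagined per-CPU policy could skip. That second policy is an assumption made for comparison, not a description of the patch in progress.

```python
def skipped_all_cpus_idle(idle):
    """Ticks skipped when the frequency drops only if every CPU is idle.

    idle is a list with one entry per CPU; each entry is a list of
    booleans, one per tick period. A skipped period saves one interrupt
    on each CPU.
    """
    ncpus = len(idle)
    all_idle_ticks = sum(1 for per_tick in zip(*idle) if all(per_tick))
    return ncpus * all_idle_ticks

def skipped_per_cpu(idle):
    """Ticks skipped if each CPU could stop its own tick independently."""
    return sum(sum(1 for flag in cpu if flag) for cpu in idle)
```

With one CPU idle for four ticks and a second CPU busy for the first two of them, the global rule skips four interrupts while the per-CPU rule would skip six. The gap widens as the workload becomes more lopsided.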
The end result is that dynamic tick is unlikely to come together in time to
get into 2.6.14; the window for merging of patches of this magnitude is
supposed to close within a week or so. So this patch will be for 2.6.15 at
the earliest. If the revised development process works as planned, 2.6.15
should not be all that far away. Hopefully.
When a process forks, the kernel must copy that process's memory space for
the new child. Linux has long avoided copying the memory itself; anything
which cannot be shared is simply marked "copy on write" and left in place
until one process or the other does something to force a particular page to
be copied. The kernel does
copy the process's page tables,
however. If the parent process has a large address space, that copy can
take a long time.
Recently, Ray Fucillo noted that the amount
of time required to create a new process increased notably with the size of
any shared memory segments that process was using. After some discussion,
Nick Piggin came up with a quick fix: don't
bother copying page tables in cases where the kernel will be able to
reconstruct them at page fault time anyway. This small patch takes away
the fork() penalty for large shared mappings. In many cases, it
will make fork() more efficient in general; if the child process
never uses those parts of its address space (if it simply uses
exec() to run another program, say), the setup and teardown
overhead can be avoided altogether. On the other hand, if the child
process does use those mappings, a higher cost will be paid
overall. Rebuilding page tables one-by-one in response to faults is more
expensive than simply copying them in bulk at fork() time. The
consensus seems to be that the tradeoff is worthwhile, however, and this
patch has been merged for 2.6.14. If any serious performance regressions
result, they will hopefully be found before 2.6.14 is released.
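The tradeoff can be put into rough numbers. The cost figures below are arbitrary units chosen for the sake of the example, not measurements; the only assumption that matters is that rebuilding one page table entry in a fault handler costs more than copying one entry in bulk.

```python
def eager_cost(mapped_pages, copy_cost=1.0):
    """Cost of copying every page table entry at fork() time."""
    return mapped_pages * copy_cost

def deferred_cost(touched_pages, fault_cost=5.0):
    """Cost of skipping the copy and rebuilding entries on page faults.

    Only the pages the child actually touches are ever paid for.
    """
    return touched_pages * fault_cost

def break_even_fraction(copy_cost=1.0, fault_cost=5.0):
    """Fraction of the mapping the child must touch before deferral loses."""
    return copy_cost / fault_cost
```

A child which calls exec() immediately touches nothing, so the deferred cost is zero no matter how large the mapping is. A child which touches the entire mapping pays five times the eager cost under these invented numbers, and the two approaches break even when a fifth of the mapping is used.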
One might well ask, however: why bother copying page tables for shared
mappings at all? Since the mappings are shared, the associated page tables
might as well be too. Sharing page tables would cut down on
fork() overhead, save the memory used to store multiple copies of
the tables, improve translation buffer performance, and reduce the number
of page faults handled by the kernel. To this end, Dave
McCracken has posted a new shared
page table patch. This patch is simpler than previous versions in that
it does not attempt
to perform copy-on-write sharing of private mappings; instead, it restricts
itself to mappings which are, themselves, shared. Since most processes
have a few of these (consider shared libraries, for example), even the
smaller patch can achieve a fair amount of sharing.
For the most part, sharing of page tables is straightforward; the kernel
need only avoid copying them and point a new process's page directories to
the shared tables. The one problem which does come up is reference
counting. When each process has its own page tables, it is easy to know
when those tables are no longer used. When a page table can be used by
more than one process, however, the kernel needs a way to keep track of how
many users each table has. The shared page table patch addresses this by
using the _mapcount field in the page structure
describing the page table page itself.
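The reference counting problem can be modeled in a few lines. Everything here is a simplified stand-in: the class and function names are invented, `mapcount` is a plain user count starting at zero (the kernel's field has its own conventions, which are ignored here), and a page directory is reduced to a dictionary.

```python
class PageTablePage:
    """Stand-in for the page structure describing one page table page."""
    def __init__(self):
        self.mapcount = 0    # number of address spaces using this table
        self.freed = False

class Process:
    """Stand-in for an address space; maps directory slots to tables."""
    def __init__(self):
        self.page_dir = {}

def attach(proc, slot, table):
    """Point one directory slot at a (possibly shared) page table."""
    proc.page_dir[slot] = table
    table.mapcount += 1

def detach(proc, slot):
    """Drop one reference; free the table when the last user is gone."""
    table = proc.page_dir.pop(slot)
    table.mapcount -= 1
    if table.mapcount == 0:
        table.freed = True

def fork_shared(parent, slots):
    """Create a child which shares, rather than copies, the given tables."""
    child = Process()
    for slot in slots:
        attach(child, slot, parent.page_dir[slot])
    return child
```

After a fork, the parent can unmap its side without the table going away; only when the child also lets go does the count reach zero and the table get freed.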
[Yes, page tables can already be shared by threads which share an entire
address space. In that case, however, the kernel can track usage by
looking at references to the full address space, rather than to individual
portions of it.]
Not everybody is convinced that shared page tables are a good idea. The
added complexity may not be justified by the resulting performance gains.
Dave claims a 3% improvement on an unnamed "industry standard database
benchmark," which is significant. There is also a fundamental conflict between
shared page tables and address space
randomization. For page tables to be shared, the corresponding
mappings must be at the same virtual address in every process, but
randomization explicitly breaks that assumption. Dave apparently has ideas
for making the patch work in the presence of randomization (if the
alignment of the mappings works out), but, for now, the two features are
incompatible.
It has also been asked: do shared page tables still yield a performance
benefit when Nick's deferred page table copying patch is taken into
account? The answer would appear to be "yes." The deferred copying patch
is entirely aimed at shortening the process creation time. Shared page
tables should also help in that regard, but, unlike the copying patch
(which may hurt ongoing performance slightly until the page tables are
populated), shared page tables speed things up throughout the life of the
process. So there may well be room in the kernel for both patches.
Page editor: Jonathan Corbet