Kernel release status
The current 2.6 prepatch remains 2.6.23-rc3; the 2.6.23-rc4 release
is somewhat overdue as of this writing. There has been a steady flow of
fixes into the mainline git repository over the last week.
The current -mm tree is 2.6.23-rc3-mm1. Recent changes
to -mm include the long-awaited ath5k wireless driver (mac80211-based
support for Atheros 5xxx wireless cards), a long list of x86_64 patches, a
new PID namespaces patch, and the PIE executable randomization patch from
ExecShield.
The current stable 2.6 kernel is 2.6.22.4, released on August 20.
This one contains a single patch - a security fix for the signal
vulnerability which allows, under some circumstances, an arbitrary
signal to be sent to a setuid process. (Note that the 2.6.22.5 review process has already begun; that update is due
on or after the 23rd.)
For older kernels: 2.6.20.16 was released on
August 16 with a couple dozen fixes. The next 2.6.20 stable update is
currently in review; it is a large patch set with quite a few fixes.
Kernel development news
Quotes of the week
Why can DOS delete an infinite number of files and rm can't?
Because rm was written using the "vi" editor and it causes brain
damage and that's why after 20 years rm hasn't caught up with del.
-- Marc Perkel has solutions for all our problems
Asserting that critics should patch the holes in your handwaving is
unlikely to impress anybody; arrogance is not in short supply
around here and yours is not even original.
-- Al Viro responds (thanks to Dag Bakke).
Distributed storage
By Jonathan Corbet
August 21, 2007
Evgeniy Polyakov is not an easily discouraged developer. He has been the
source of a great deal of interesting kernel code - including a network
channels implementation, an asynchronous crypto framework, the kevent
subsystem, the "network tree" memory management layer, and the netlink
connector code. Of all of those patches, only the netlink connector has
made it into the mainline kernel - and that was back in 2005. Undeterred,
Evgeniy has come forward
with another significant patch set for consideration. His ambitions are no
lower this time around: he would like to replace much of the functionality offered by the
device mapper, iSCSI, and network block device (NBD) layers.
He calls the new subsystem
distributed storage, or DST for
short. The goal is to allow the creation of high-performance storage
networks in a reliable and easy manner.
At the lowest level, the DST code implements a simple network protocol
which allows block devices to be exported across a network. The number of
operations supported is small: block read and write operations and a "how
big is your disk?" information request are about it. But it is intended to
be fast, non-blocking, and able to function without copying the data on the
way through. The zero-copy nature of the code allows it to perform I/O
operations with no memory allocations at all - though the underlying
network subsystem might do some allocations of its own.
There is no data
integrity checking built into the DST networking layer; it relies on the
networking code to handle that aspect of things.
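To make the size of that protocol concrete, here is a minimal sketch of what a request header for the three operations might look like. The names (dst_cmd, dst_request, DST_HDR_BYTES) and the 16-byte layout are assumptions made for illustration; they are not taken from Evgeniy's patch, whose actual wire format is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command set: the article names only block reads,
 * block writes, and a size query. */
enum dst_cmd {
    DST_CMD_READ  = 1,
    DST_CMD_WRITE = 2,
    DST_CMD_SIZE  = 3   /* "how big is your disk?" */
};

/* Hypothetical request header; not the real DST wire format. */
struct dst_request {
    uint32_t cmd;       /* one of enum dst_cmd */
    uint32_t length;    /* payload length in bytes */
    uint64_t sector;    /* starting sector of the transfer */
};

#define DST_HDR_BYTES 16

static void put_be32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (24 - 8 * i));
}

static void put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (56 - 8 * i));
}

static uint32_t get_be32(const uint8_t *p)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v = (v << 8) | p[i];
    return v;
}

static uint64_t get_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Encode field by field in network byte order, so the result does not
 * depend on host endianness or structure padding. */
static void dst_request_encode(const struct dst_request *req,
                               uint8_t out[DST_HDR_BYTES])
{
    assert(req != NULL && out != NULL);
    put_be32(out, req->cmd);
    put_be32(out + 4, req->length);
    put_be64(out + 8, req->sector);
}

/* Decode a header; returns 0 on success, -1 for an unknown command. */
static int dst_request_decode(const uint8_t in[DST_HDR_BYTES],
                              struct dst_request *req)
{
    assert(in != NULL && req != NULL);
    req->cmd = get_be32(in);
    req->length = get_be32(in + 4);
    req->sector = get_be64(in + 8);
    if (req->cmd < DST_CMD_READ || req->cmd > DST_CMD_SIZE)
        return -1;
    return 0;
}
```

Note that nothing in this header carries a checksum or any credentials, which matches the description: integrity is left to the network stack, and access control is simply absent.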
There is also no real security support at all. If a block device is
exported for use in DST, it is exported to anybody who can reach the host.
The addition of explicit export lists could certainly be done in the
future, but, for now, hosts exporting drives via DST are probably best not
exposed to anything beyond an immediate local network.
The upper layer of the DST code enables the creation of local disks. A
simple ioctl() call would create a local disk from a remote drive,
essentially reproducing the functionality offered by NBD. Evgeniy claims
better performance than NBD, though, with non-blocking processing, no
user-space threads, and a lack of busy-wait loops. There is also a simple
failure recovery mechanism which will reconnect to remote hosts which go
away temporarily.
Beyond that, though, the DST code can be used to join multiple devices -
both local and remote - into larger arrays. There are currently two
algorithms available: linear and mirrored. In a linear array, each device
is added to the end of what looks like a much larger block device. The
mirroring algorithm replicates data across each device to provide redundancy
and generally faster read performance. There is infrastructure in place
for tracking which blocks must be updated on each component of a mirrored
array, so if one device drops out for a while it can be quickly brought up
to date on its return. Interestingly, that information is not stored on
each component; this is presented as a feature, in that one part of a
mirrored array can be removed and mounted independently as a sort of
snapshot. Block information also does not appear, in this iteration, to be
stored persistently anywhere, meaning that a crash of the DST server could
make recovery of an inconsistent mirrored array difficult or impossible.
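The two algorithms can be illustrated with a small user-space sketch. It shows how a linear array might map a sector to a component device, and how an in-memory bitmap could track which blocks a returning mirror component needs. All of the names here (linear_array, linear_map, mirror_dirty and friends) and the 64-block limit are hypothetical; the real DST data structures are likely to look rather different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Linear array: component devices are concatenated end to end. */
struct linear_array {
    const uint64_t *sizes;  /* size of each component, in sectors */
    size_t count;
};

/* Map a sector of the combined device to (component, local sector).
 * Returns 0 on success, -1 if the sector lies beyond the array. */
static int linear_map(const struct linear_array *a, uint64_t sector,
                      size_t *dev, uint64_t *local)
{
    for (size_t i = 0; i < a->count; i++) {
        if (sector < a->sizes[i]) {
            *dev = i;
            *local = sector;
            return 0;
        }
        sector -= a->sizes[i];  /* skip past this component */
    }
    return -1;
}

/* Mirror resync tracking: one bit per block, held only in memory.
 * As in the article, nothing here survives a crash. */
#define MIRROR_BLOCKS 64

struct mirror_dirty {
    uint64_t bits;
};

/* Record that a block was written while a component was absent. */
static void mirror_mark_dirty(struct mirror_dirty *d, unsigned block)
{
    assert(block < MIRROR_BLOCKS);
    d->bits |= UINT64_C(1) << block;
}

/* Bring a returning component up to date: visit each dirty block,
 * clear its bit, and return the number of blocks handled. */
static unsigned mirror_resync(struct mirror_dirty *d)
{
    unsigned copied = 0;

    for (unsigned b = 0; b < MIRROR_BLOCKS; b++) {
        if (d->bits & (UINT64_C(1) << b)) {
            /* a real implementation would copy block b here */
            d->bits &= ~(UINT64_C(1) << b);
            copied++;
        }
    }
    return copied;
}
```

With components of 100, 50, and 200 sectors, sector 120 of the combined device lands on the second component at local sector 20. The resync pass touches only the blocks that were marked, which is what makes a quick catch-up possible.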
Storage arrays created with DST can, in turn, be exported for use in other
arrays. So a series of drives located on a fast local network can be
combined in a sort of tree structure into one large, redundant array of
disks. There is no support for the creation of higher-level RAID arrays at
this time. Support for more algorithms is on the "to do" list, though
Evgeniy has said that the Reed-Solomon codes used for traditional RAID are
not fast enough for distributed arrays. He suggests that WEAVER
codes might be used instead.
At this level, DST looks much like the device mapper and MD layers already
supported by Linux. Evgeniy claims that the DST code is better in that it
does all processing in a non-blocking manner, works with more network
protocols, has simple automatic configuration, does not copy data, and can
perform operations
with no memory allocations. The zero-allocation feature is important in
situations where deadlocks are a worry - and they are often a worry when
remote storage is in use. Making the entire DST stack safe against
memory-allocation deadlocks would require some support in the network layer
as well - but, predictably, Evgeniy has some
ideas for how that can be done.
This patch set is clearly in a very early state; quite a bit of work would
be required before it would be ready for production use with data that
somebody actually cares about. Like all of Evgeniy's patches, DST
contains a number of interesting ideas. If the remaining little details
can be taken care of, the DST code could eventually reach a point where it
is seen as a useful addition to the Linux storage subsystem.
Who maintains this file?
By Jonathan Corbet
August 21, 2007
Kernel developers are generally encouraged to split patches into small
pieces before posting them to the mailing lists. Making each change
self-contained and easy to understand helps reviewers do their job and is
thus a good thing. That said, anybody who doubted that one can get too
much of a good
thing surely learned the truth when Joe Perches submitted
this patch set made up of almost 550
patches, all to the same file. It is fair to say that this deluge of
patches was not universally welcomed.
Packaging aside, the ultimate goal of Joe's patch was not particularly
controversial: he would like to make it possible to easily find out who is
the maintainer of a specific file in the kernel tree. So, for each entry
in the MAINTAINERS file, he added one or more lines with patterns
describing which files belong to that entry. With that information in
place, his get_maintainer.pl script can quickly identify who is
responsible for any file in the tree. No more digging through
MAINTAINERS or trying to extract email addresses from copyright
notices in the source.
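The lookup itself is simple pattern matching. The sketch below is a C illustration of the idea, not Joe's Perl script: the table entries, addresses, and the find_maintainer() helper are invented for the example, and the pattern syntax used in the actual patches may differ. It relies only on the POSIX fnmatch() function.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* One hypothetical MAINTAINERS entry: a contact plus a file pattern. */
struct maint_entry {
    const char *maintainer;
    const char *pattern;
};

/* Invented entries for illustration only. */
static const struct maint_entry maint_table[] = {
    { "Net Maintainer <net@example.org>", "drivers/net/*" },
    { "FS Maintainer <fs@example.org>",   "fs/ext3/*" },
    { "Doc Maintainer <doc@example.org>", "Documentation/*" },
};

/* Return the maintainer of the first entry whose pattern matches
 * the given path, or NULL if no entry claims the file. */
static const char *find_maintainer(const char *path)
{
    size_t n = sizeof(maint_table) / sizeof(maint_table[0]);

    assert(path != NULL);
    for (size_t i = 0; i < n; i++) {
        /* With flags == 0, '*' also matches '/', so a directory
         * pattern covers everything beneath that directory. */
        if (fnmatch(maint_table[i].pattern, path, 0) == 0)
            return maint_table[i].maintainer;
    }
    return NULL;
}
```

Here find_maintainer("drivers/net/e1000/e1000_main.c") returns the first entry's contact, while a path matching no pattern returns NULL. That second case is exactly the stale-or-missing-entry situation which worries the critics of a central file.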
It's an appealing idea, but nobody seems to be entirely clear on how to
implement it. Keeping this information in a central file has a number of
obvious disadvantages. It would clearly go out of date quickly, for
example. The MAINTAINERS file tends to get stale as it is; the
chances of it being patched for every new or renamed file seem quite small.
If developers, contrary to expectations, do keep this file up to date, one
can expect large numbers of conflicts as all the resulting patches try to
touch the same file.
The patch conflict problem could be mitigated by splitting up the
MAINTAINERS file into per-directory versions, much like what was
done with the kernel configuration file in the past. There are now over
400 Kconfig files in the mainline tree; some developers have
expressed dismay at the idea of similar numbers of MAINTAINERS
files being scattered around the tree. And, in any case, per-directory
files aren't much more
likely to be updated than the single, central file.
So along came another idea: why not just put the maintainer information
into the source files? The result would be nicely split documentation
which gets put in front of the relevant developers every time they edit the
file. The record for maintenance of documentation in the code is far from
perfect, but it is much better than the record for completely out-of-line
documentation.
One question which comes up when this approach is considered is whether the
resulting information should go into the binary kernel image or not. It
would be easy to define a new tag like:
    MODULE_MAINTAINER("Your name here");
The provided information could then go into a special section in the kernel
image where special tools could find it. Doing things this way would make
it possible for people who don't have a kernel tree handy to look up a
maintainer. On the other hand, it would bloat the kernel image and fix
information in a binary, widely-distributed form where it could persist
long after it goes out of date. So ex-maintainers could continue receiving
mail for years after they have changed all of the relevant documentation.
An alternative would be to just put the maintainer information at the top
of the file as a comment. Then it would only be in the source, and would,
presumably, be relatively easy to keep up to date. At least, until, say, a
mailing list for a major subsystem moves and all of the associated source
files have to be changed. For example, Adrian Bunk noted that the move of the netdev
mailing list to vger would have forced patches to about 1300 files.
Yet another approach is to find a way to store the information in the git
repository. Git already maintains quite a bit of metadata about source
files; to some it seems natural to add maintainer information as well. So
far, the git developers have not shown a lot of appetite for adding this
sort of feature. But Linus did point out
that one could already use git to a similar effect with a simple command:
Do a script like this:
    #!/bin/sh
    git log --since=6.months.ago -- "$@" |
        grep -i '^    [-a-z]*by:.*@' |
        sort | uniq -c |
        sort -r -n | head
and it gives you a rather good picture of who is involved with a
particular subdirectory or file.
The advantage of doing things this way is that the resulting output gives a
current
picture of who has actually been working on a file - a picture which
requires no explicit maintenance at all. That list of people is probably a
much better group to send copies of patches to than whoever might be listed
in a maintainers file; they are the ones who know about what is happening
in that part of the tree now.
No real resolution has been reached on this topic. It may be that Linus's
approach will be the one taken by default; it already works without the need
to merge any patches at all. The question may well stay around for a
while, though. Approximately 2,000 developers put patches into the
mainline over the course of one year; keeping track of which of those
developers is the best to notify of changes to a particular file is never
going to be easy.
Network transmit batching
By Jonathan Corbet
August 22, 2007
At the core of most network drivers is the
hard_start_xmit()
method, which is called once for every packet which is to be transmitted.
This method will normally acquire locks and insert the packet into the
adapter's transmit queue. As a rule, outgoing packets do not accumulate in
the kernel; they are handed to the driver, one at a time, when they are
ready to go. There are times, though, when packets cannot be handed off
immediately. If, for example, the hardware transmit queue is currently
full, the networking subsystem will have to hold on to the packet until
things clear out. Once the driver is able to accept packets for the device
again, the one-at-a-time behavior will resume.
The networking developers are always looking for ways to squeeze a little
more performance from their code. Krishna Kumar took a look at the
behavior described above and wondered: why not pass the list of accumulated
packets to the driver in a single call? Batching of transmission operations in
this way has the potential to minimize the cost of locking and device
preparation overhead, making packet transmission as a whole more
efficient. To explore this idea, Krishna has posted a few versions of the
SKB batching patch set.
Implementing SKB batching requires a couple of driver API changes - but
they are small and only required for batching-aware drivers. The first
step is to set the NETIF_F_BATCH_SKBS bit in the features
field of the net_device structure. That flag tells the network
stack that the driver can handle batched transmissions.
The prototype for hard_start_xmit() is:
    int (*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);
That prototype does not change, but a driver which has indicated that
batching is acceptable for dev may find its
hard_start_xmit() method called with skb set to
NULL. The NULL value is an indication that there is a
batch of packets to transmit; that batch will be found enqueued on the
(new) list found at dev->skb_blist. So the (much simplified)
form of a batching-aware driver's hard_start_xmit() function will
look something like:
    driver_specific_locking_and_setup();
    if (skb)
        ret = send_a_packet(internal_dev, skb);
    else {
        while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
            ret = send_a_packet(internal_dev, skb);
            if (ret)
                break;
        }
    }
    driver_specific_cleanup();
The reality of the situation can be a bit more complicated, especially if
the driver implements optimizations like suppressing completion interrupts
until the last packet of the batch has been sent. But the core of the
change is as described here - not a whole lot to it.
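A toy user-space model shows where the savings would come from. The toy_skb and toy_dev types, the helper functions, and the lock_count counter below are stand-ins invented for this sketch; none of it is kernel code. It only mimics the calling convention described above, with a NULL packet meaning "send the queued batch."

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct sk_buff. */
struct toy_skb {
    struct toy_skb *next;
    int id;
};

/* Stand-in for struct net_device plus driver state. */
struct toy_dev {
    struct toy_skb *blist_head;  /* plays the role of dev->skb_blist */
    struct toy_skb *blist_tail;
    unsigned lock_count;         /* times locking and setup were done */
    unsigned sent;               /* packets handed to the "hardware" */
};

/* Append a packet to the device's batch list. */
static void toy_enqueue(struct toy_dev *dev, struct toy_skb *skb)
{
    skb->next = NULL;
    if (dev->blist_tail)
        dev->blist_tail->next = skb;
    else
        dev->blist_head = skb;
    dev->blist_tail = skb;
}

/* Remove and return the first packet on the batch list, or NULL. */
static struct toy_skb *toy_dequeue(struct toy_dev *dev)
{
    struct toy_skb *skb = dev->blist_head;

    if (skb) {
        dev->blist_head = skb->next;
        if (!dev->blist_head)
            dev->blist_tail = NULL;
        skb->next = NULL;
    }
    return skb;
}

/* A NULL skb means "transmit everything on the batch list". */
static int toy_hard_start_xmit(struct toy_skb *skb, struct toy_dev *dev)
{
    assert(dev != NULL);
    dev->lock_count++;  /* locking and setup happen once per call */
    if (skb) {
        dev->sent++;
    } else {
        while ((skb = toy_dequeue(dev)) != NULL)
            dev->sent++;
    }
    return 0;
}
```

Sending four packets one at a time bumps lock_count four times; queueing the same four packets and making a single call with a NULL packet bumps it once. Whether that difference matters on real hardware is what the benchmarks will have to show.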
As of this writing, the networking developers are still trying to determine
what the performance effects of this patch are. There is particular
interest in seeing how batching compares with TCP segmentation offloading,
which is also, at its core, a transmission batching mechanism. The proof
is very much in the benchmarks for a patch like this; if the results are
good enough, the patch will likely be merged.
Page editor: Jonathan Corbet