Brief items
The current stable 2.6 kernel is 2.6.17.8,
released on August 6.
There is a fairly long list of important fixes this time around, but none
with CVE (vulnerability) numbers attached.
The current 2.6 prepatch is 2.6.18-rc4, announced by Linus on
August 6. "The diffstat (and the appended shortlog)
tells the story: a lot of small fixes in various areas, mostly
drivers. Input layer, infiniband, usb, net, sound, v4l. Some cpufreq and
architecture updates. Also some audit rule improvements from Al &
Amy." The changes also include a new event notification mechanism
within the networking code and a function (netdev_alloc_skb()) for
allocating packet buffers in a NUMA-aware fashion. See the
long-format changelog for the details.
The current -mm tree is 2.6.18-rc3-mm2. Recent changes
to -mm include the return of the CacheFS subsystem, full compact
flash support in the libata code, a big x86-64 update, a number of memory
management tweaks, vectored asynchronous I/O support, and a "comprehensive
system accounting" patch.
Kernel development news
Davej's laws of kernel hacking #1: If the number of iterations a
patch goes through to get it right is greater than the number of
lines in the diff, it probably isn't worth it.
-- Dave Jones
When Linus announced the 2.6.18-rc4 release, he tossed in one extra bit of
news:
Anyway, I'll be effectively offline for most of the following three
weeks (vacations and a funeral), and while I hope to be able to
update my tree every once in a while, I also asked Greg KH to
maintain a git tree for any worthwhile fixes.
He then promptly fled the scene without actually putting -rc4 up on
kernel.org - an omission which Greg fixed some hours later. While kernel
development will continue as always, we are likely to see rather fewer -rc
releases over the next few weeks, and almost certainly no 2.6.18 final
release.
Andrew Morton, meanwhile, used the 2.6.18-rc3-mm1 announcement to
pass on a little news of his own:
fwiw, I recently took a position with Google.
He evidently made this change to find a working environment which better
suits his habits; from the kernel development point of view, no real
changes are expected.
Finally, Greg Kroah-Hartman has announced a
transition in 2.6.16 support:
This is just a notice to everyone that Adrian [Bunk] is going to now be
taking over the 2.6.16-stable kernel branch, for him to maintain
for as long as he wants to.
He will still be following the same -stable rules that are
documented in the Documentation/stable_kernel_rules.txt file, but
just doing this for the 2.6.16 kernel tree for a much longer time
than the current stable team is willing to do (we have moved on to
the 2.6.17 kernel now.)
Adrian had announced his intention to maintain this kernel for the long
haul early in the 2.6.16 cycle. It will be interesting to see how this
goes; fitting important patches into 2.6.16 will get harder as the mainline
gets more distant. The long-term success of this project may depend on
whether distributors make use of this kernel - and, as a result, help to
maintain it.
The Grand Unified Flow Cache is one of those items which shows up as a
bullet in networking summit presentations; the networking folks appear to
know what it means, but they have been somewhat remiss in documenting the
idea for the rest of us. This concept has returned in the context of the
network channels discussion, and enough hints have been dropped to let your
editor - who is not afraid to extrapolate a long way from minimal data -
get a sense for what the term means. Should it be implemented, the GUFC
could bring significant changes to the entire networking stack.
The net channel concept requires that the kernel be able to quickly
identify the destination of each packet and drop it into the proper
channel. Even better would be to have a smart network adapter perform that
classification as the packet arrives, taking the kernel out of that part of
the loop altogether. One way of performing this classification would be to form a
tuple from each packet and use that tuple as a lookup key in some sort of
fast data structure. When a packet's tuple is found in this structure (the
flow cache), its fate has been determined and it can be quickly shunted off
to where it needs to be.
This tuple, as described by Rusty Russell,
would be made up of seven parameters:
- The source IP address
- The destination IP address
- A bit indicating whether the source is local
- A bit indicating whether the destination is local
- The IP protocol number
- The source port
- The destination port
These numbers, all together, are sufficient to identify the connection to
which any packet belongs. A quick lookup on an incoming packet should,
thus, yield a useful destination (such as a network channel) for that
packet with no further processing.
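As a rough illustration of the lookup described above, here is a minimal user-space sketch in C. Nothing in it comes from a posted patch: the structure layout, the names (flow_tuple, flow_cache_add(), flow_cache_lookup()), the fixed-size open-addressed table, and the integer "channel" standing in for a packet's destination are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical seven-field flow tuple; the field names are illustrative. */
struct flow_tuple {
    uint32_t saddr;      /* source IP address */
    uint32_t daddr;      /* destination IP address */
    uint8_t  src_local;  /* 1 if the source is local */
    uint8_t  dst_local;  /* 1 if the destination is local */
    uint8_t  protocol;   /* IP protocol number */
    uint16_t sport;      /* source port */
    uint16_t dport;      /* destination port */
};

#define FLOW_BUCKETS 256 /* must be a power of two */

struct flow_entry {
    struct flow_tuple key;
    int channel;         /* stand-in for "where this packet goes" */
    int in_use;
};

struct flow_cache {
    struct flow_entry slot[FLOW_BUCKETS];
};

static int tuple_equal(const struct flow_tuple *a, const struct flow_tuple *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->src_local == b->src_local && a->dst_local == b->dst_local &&
           a->protocol == b->protocol &&
           a->sport == b->sport && a->dport == b->dport;
}

/* FNV-1a-style mixing over the fields, so struct padding is never hashed. */
static uint32_t tuple_hash(const struct flow_tuple *t)
{
    uint32_t words[4] = {
        t->saddr,
        t->daddr,
        ((uint32_t)t->sport << 16) | t->dport,
        ((uint32_t)t->protocol << 8) | (uint32_t)(t->src_local << 1) | t->dst_local
    };
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < 4; i++) {
        h ^= words[i];
        h *= 16777619u;
    }
    return h;
}

/* Insert or update an entry; returns 0 on success, -1 if the table is full. */
static int flow_cache_add(struct flow_cache *fc, const struct flow_tuple *t,
                          int channel)
{
    assert(fc && t);
    uint32_t start = tuple_hash(t) & (FLOW_BUCKETS - 1);

    for (uint32_t i = 0; i < FLOW_BUCKETS; i++) {
        struct flow_entry *e = &fc->slot[(start + i) & (FLOW_BUCKETS - 1)];

        if (!e->in_use || tuple_equal(&e->key, t)) {
            e->key = *t;
            e->channel = channel;
            e->in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Returns the channel for a tuple, or -1 on a miss (take the slow path). */
static int flow_cache_lookup(const struct flow_cache *fc,
                             const struct flow_tuple *t)
{
    assert(fc && t);
    uint32_t start = tuple_hash(t) & (FLOW_BUCKETS - 1);

    for (uint32_t i = 0; i < FLOW_BUCKETS; i++) {
        const struct flow_entry *e = &fc->slot[(start + i) & (FLOW_BUCKETS - 1)];

        if (!e->in_use)
            return -1;
        if (tuple_equal(&e->key, t))
            return e->channel;
    }
    return -1;
}
```

A miss here simply means that the packet has to take the ordinary path through the stack. A real implementation would also need entry removal, locking, and some answer to the question of how hardware might do the same lookup; none of that is attempted in this sketch.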
Features like netfilter mess up this pretty picture, however. Within the
kernel, netfilter is set up such that every packet is fed to the
appropriate chain(s). As soon as every packet has to go through a common
set of hooks, the advantage of the GUFC is lost. Rusty's description of
the problem is this:
The mistake (?) with netfilter was that we are completely general:
you will see all packets, do what you want. If, instead, we had
forced all rules to be of form "show me all packets matching this
tuple" we would be in a [position to] combine it in a single lookup
with routing etc.
So, the way around this problem would be to change the netfilter API to
work better with a grand unified flow cache. Rules could be written
in terms of the above tuples (with wild cards allowed), and only packets
which match the tuples need pass through the (slow) netfilter path. That
would allow packets which are not of interest to the filtering code to
bypass the whole mechanism - and the decision could be made in a single
lookup.
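What a tuple-based rule with wild cards might look like can be sketched in a few lines of C. This is only an illustration of the idea; the filter_rule structure, the M_* mask bits, and needs_filtering() are invented for the example and do not correspond to any existing or proposed netfilter interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical seven-field flow tuple; the field names are illustrative. */
struct flow_tuple {
    uint32_t saddr, daddr;          /* source and destination IP addresses */
    uint8_t  src_local, dst_local;  /* "is local" bits */
    uint8_t  protocol;              /* IP protocol number */
    uint16_t sport, dport;          /* source and destination ports */
};

/* One bit per tuple field; a cleared bit means "wild card". */
enum {
    M_SADDR     = 1 << 0,
    M_DADDR     = 1 << 1,
    M_SRC_LOCAL = 1 << 2,
    M_DST_LOCAL = 1 << 3,
    M_PROTO     = 1 << 4,
    M_SPORT     = 1 << 5,
    M_DPORT     = 1 << 6
};

/* Hypothetical rule: "show me all packets matching this tuple". */
struct filter_rule {
    struct flow_tuple match;
    unsigned int fields;            /* bitmask of M_* values to compare */
};

static int rule_matches(const struct filter_rule *r, const struct flow_tuple *t)
{
    const struct flow_tuple *m = &r->match;
    unsigned int f = r->fields;

    if ((f & M_SADDR) && m->saddr != t->saddr) return 0;
    if ((f & M_DADDR) && m->daddr != t->daddr) return 0;
    if ((f & M_SRC_LOCAL) && m->src_local != t->src_local) return 0;
    if ((f & M_DST_LOCAL) && m->dst_local != t->dst_local) return 0;
    if ((f & M_PROTO) && m->protocol != t->protocol) return 0;
    if ((f & M_SPORT) && m->sport != t->sport) return 0;
    if ((f & M_DPORT) && m->dport != t->dport) return 0;
    return 1;
}

/*
 * Returns 1 if any rule wants to see this flow (slow path), or 0 if the
 * flow can bypass the filtering code altogether.
 */
static int needs_filtering(const struct filter_rule *rules, size_t n,
                           const struct flow_tuple *t)
{
    assert(t && (rules || n == 0));
    for (size_t i = 0; i < n; i++)
        if (rule_matches(&rules[i], t))
            return 1;
    return 0;
}
```

The linear scan over the rules is the expensive part, which is why the answer would be computed once per flow and remembered in the cache rather than being recomputed for every packet.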
Often, however, a packet filtering decision can be made on the basis of the
tuple itself - once a packet matches the tuple, there is no real need to
evaluate it against the rule separately. So, for example, once the
connection tracking code has allowed a new connection to be established,
and a tuple describing that connection has been added to the cache, further
filtering for that connection should not be required. If netfilter and the
flow cache worked together effectively, the per-packet overhead could be
avoided in many cases.
One way this might work would be to have a set of callbacks invoked for
each tuple which is added to the flow cache. A module like netfilter could
examine the tuple relative to the current rule set and let the kernel know
if it needs to see packets matching that tuple or not. Then, packets could
be directed to the appropriate filters without the need for wildcard
matching in the tuple cache.
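A callback scheme of this sort might be sketched as follows. Again, this is a guess at the shape of such an interface, not a description of real code: flow_register(), flow_added(), and the example port_filter_interest() subscriber are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical seven-field flow tuple; the field names are illustrative. */
struct flow_tuple {
    uint32_t saddr, daddr;          /* source and destination IP addresses */
    uint8_t  src_local, dst_local;  /* "is local" bits */
    uint8_t  protocol;              /* IP protocol number */
    uint16_t sport, dport;          /* source and destination ports */
};

/* A subscriber returns nonzero if it needs to see packets for this tuple. */
typedef int (*flow_interest_fn)(const struct flow_tuple *t, void *ctx);

#define MAX_SUBSCRIBERS 8

struct flow_subscriber {
    flow_interest_fn fn;
    void *ctx;
};

struct flow_registry {
    struct flow_subscriber subs[MAX_SUBSCRIBERS];
    size_t count;
};

/* Register a subscriber; returns its index, or -1 if the registry is full. */
static int flow_register(struct flow_registry *reg, flow_interest_fn fn,
                         void *ctx)
{
    assert(reg && fn);
    if (reg->count == MAX_SUBSCRIBERS)
        return -1;
    reg->subs[reg->count].fn = fn;
    reg->subs[reg->count].ctx = ctx;
    return (int)reg->count++;
}

/*
 * Called once when a tuple is added to the flow cache.  The returned
 * bitmask (bit i set for subscriber i) would be stored with the cache
 * entry, so later packets go only to the subscribers that asked for them.
 */
static unsigned int flow_added(const struct flow_registry *reg,
                               const struct flow_tuple *t)
{
    assert(reg && t);
    unsigned int mask = 0;

    for (size_t i = 0; i < reg->count; i++)
        if (reg->subs[i].fn(t, reg->subs[i].ctx))
            mask |= 1u << i;
    return mask;
}

/* Example subscriber standing in for a packet filter with a single rule. */
static int port_filter_interest(const struct flow_tuple *t, void *ctx)
{
    const uint16_t *port = ctx;

    return t->dport == *port;
}
```

Each subscriber evaluates its own rules only once, when the flow first appears. The sketch does not address what happens when the rule set changes afterward and cached answers become stale.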
There is a small cost to all of this:
Of course, it means rewriting all the userspace tools,
documentation, and creating a complete new infrastructure for
connection tracking and NAT, but if that's what's required, then so
be it.
Rusty has never let this sort of obstacle stop him before, so all of this
might just happen.
But probably not anytime soon. There's a long list of questions which need
to be answered before a serious implementation attempt is made. Whether
it would truly perform as well as people hope is one of them; these schemes
can get quite a bit slower once all of the real-world details are factored
in. Rule updates could be a challenge; an administrator who has just
changed packet filtering rules is unlikely to wait patiently while the new
rules slowly work their way into the cache. Finding a way to get the
hardware to help in the classification process will not be entirely
straightforward. And so on. But it would seem that there are a number of
interesting ideas in this area. That is bound to lead to
good stuff sooner or later.
Paravirtualization is the act of running a guest operating system, under
control of a host system, where the guest has been ported to a virtual
architecture which is almost like the hardware it is actually running
on. This technique allows full guest systems to be run in a relatively
efficient manner. The highest-profile free paravirtualization
implementation remains Xen; on the proprietary side, VMWare has been active
for a long time. Both of these efforts would like to see (at least some
of) their code in the mainline kernel. The kernel developers, however, are
uninterested in merging a large collection of hooks specific to any one
solution.
One attempt to solve this problem, proposed by VMWare, is the VMI interface. VMI works by
isolating any operations which may require hypervisor intervention into a
special set of function calls. The implementation of those functions is
not built into the kernel; instead, the kernel, at boot time, loads a
"hypervisor ROM" which provides the needed functions. The binary interface
between the kernel and this loadable segment is set in stone, meaning that
kernels built for today's implementations should work equally well on
tomorrow's replacement. This design also allows the same binary kernel image to run
under a variety of hypervisors, or, with the right ROM, in native mode on
the bare hardware.
The fixed ABI and the ability to load "binary blobs" into the kernel do not
sit well with all kernel developers, however. It looks like another way to
put proprietary code into the kernel, which is something most kernel
hackers would rather support less of. Plus, as Rusty Russell put it:
We're not good at maintaining ABIs. We're going to be especially
bad at maintaining an ABI when the 99% of us running native will
never notice the breakage.
For this and other reasons, VMI has
not had a smooth path into the kernel so far. That has not stopped VMWare
hacker Zachary Amsden from pushing for a binary blob
interface recently on linux-kernel, however.
There have been rumblings for a while concerning an alternative hypervisor
interface (called "paravirt_ops") under development. An early implementation of
paravirt_ops was posted on August 7, making the shape of this interface
clearer. In the end, paravirt_ops is yet another structure filled
with function pointers, like many other operations structures used in the
kernel. In this case, the operations are the various machine-specific
functions that tend to require a discussion with the hypervisor. They
include things like disabling interrupts, changing processor control
registers, changing memory mappings, etc.
As an example, one of the members of paravirt_ops is:
    void (fastcall *irq_disable)(void);
The patch also defines a little function for use by the kernel:
    static inline void raw_local_irq_disable(void)
    {
        paravirt_ops.irq_disable();
    }
As long as the kernel always uses this function to disable interrupts, it
will use whatever implementation has been provided by the hypervisor which
fills in paravirt_ops.
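The indirection can be tried out in user space with a simplified stand-in. In the sketch below, only irq_disable and raw_local_irq_disable() come from the patch as described here; the irq_enable member, the native_ and hv_ functions, the simulated interrupt flag, and use_hypervisor_backend() are assumptions added for illustration. The kernel's fastcall annotation is omitted since it has no meaning outside the kernel build.

```c
#include <assert.h>

/* Simplified stand-in for the real structure, which has many more members. */
struct paravirt_ops_sketch {
    void (*irq_disable)(void);
    void (*irq_enable)(void);   /* assumed counterpart, for illustration */
};

static int irqs_enabled = 1;    /* simulated processor interrupt flag */
static int hypercalls;          /* counts simulated trips to a hypervisor */

/* Native backend: on real hardware these would be single instructions. */
static void native_irq_disable(void) { irqs_enabled = 0; }
static void native_irq_enable(void)  { irqs_enabled = 1; }

/* Hypothetical hypervisor backend: same effect, but via a "hypercall". */
static void hv_irq_disable(void) { hypercalls++; irqs_enabled = 0; }
static void hv_irq_enable(void)  { hypercalls++; irqs_enabled = 1; }

/* The table starts out with the native operations filled in. */
static struct paravirt_ops_sketch paravirt_ops = {
    .irq_disable = native_irq_disable,
    .irq_enable  = native_irq_enable,
};

/* The rest of the "kernel" only ever calls these wrappers. */
static inline void raw_local_irq_disable(void)
{
    assert(paravirt_ops.irq_disable);
    paravirt_ops.irq_disable();
}

static inline void raw_local_irq_enable(void)
{
    assert(paravirt_ops.irq_enable);
    paravirt_ops.irq_enable();
}

/* What a hypervisor-specific setup routine might do at boot. */
static void use_hypervisor_backend(void)
{
    paravirt_ops.irq_disable = hv_irq_disable;
    paravirt_ops.irq_enable  = hv_irq_enable;
}
```

Callers of raw_local_irq_disable() never change; only the contents of the table do. The cost is one indirect function call where there used to be a single inline instruction.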
The patch includes a set of operations for native (non-virtualized) systems
which causes the kernel to behave as it did before - or which will bring
this about, once the remaining bugs are fixed. That kernel may be a little
slower, however, since many operations which were performed by in-line
assembly code are now, instead, done through an indirect function call. To
mitigate the worst performance impacts, the paravirt_ops patch set includes
a self-patching mechanism to fix up some of the function calls - the
interrupt-related ones, in particular.
This interface may look a lot like VMI; both interfaces allow the
replacement of important low-level operations with hypervisor-specific
versions. The difference is that paravirt_ops is an inherently
source-based interface, with no binary interface guarantees. It is assumed
that this interface will change over time, as most other internal kernel
interfaces do. In fact, since this is a relatively new area for kernel
support, chances are that paravirt_ops will be more than usually volatile
for some time. There is
also, currently, no provision for loading the operations at run time, so
kernels must be built to work with a specific hypervisor.
On the surface, paravirt_ops thus looks like a competitor to VMI - a choice
of open, mutable kernel interfaces against binary blobs and a fixed ABI.
As it happens, however, there is a diverse set of developers working on
paravirt_ops, including representatives from Xen and, yes, VMWare. Some of
the VMI code has found its way into the initial paravirt_ops posting. All
of the large players appear to be behind this development - a fact which
will greatly ease its path into the kernel.
So why are the VMWare developers still pushing for a binary interface? It
would appear that they are considering the creation of a glue layer
connecting paravirt_ops with the VMI binary interface. This design leaves
the VMI people solely responsible for maintaining their ABI while freeing
the kernel developers to mess with paravirt_ops at will. Some of the
relevant developers feel more at ease with the VMI interface when it is
connected this way, though there is some residual discomfort about the
possibility of linking non-GPL binary hypervisor modules into the kernel.
The paravirt_ops developers would like to get their code into the 2.6.19
kernel. That schedule looks ambitious, given that the merge window is due
to open in a few weeks and that, as of this writing, paravirt_ops has not
yet done any time in the -mm kernel. It is, however, an option which
should disappear entirely when configured out, so inclusion in 2.6.19 might
not be entirely out of the question.
Recently, a set of patches was posted for inclusion in the mainline kernel.
These patches make use of
the (undocumented) "SMAPI" BIOS found in Thinkpad laptops to provide
support for a number of useful Thinkpad features. It looks like it could
be the sort of code that would be welcomed; improving hardware support is
generally considered to be a good thing to do.
There is just one little problem. The code was signed off as:
Signed-off-by: Shem Multinymous <multinymous@gmail.com>
Various developers quickly pointed out that there was little useful
information here, and that code signed off by an obvious pseudonym would be
difficult to trust enough to merge into the kernel. "Mr. Multinymous"
argued the case for inclusion with statements like:
I hereby declare that this patch was developed solely based on
public specifications, observation of hardware behavior by
trial&e[r]ror, and specifications made available to me in clean-room
settings and with no attached obligations. So this patch is as pure
as the mainline hdaps driver it fixes (and probably purer than many
other drivers), and not a single line of it is a derivative work of
$OTHER_OS code.
The author of the code remains unwilling to reveal him or herself,
however, with the result that others have refused to consider the code for
inclusion. The standoff might have been broken by Pavel Machek, who has
offered to sign off the code. Whether that is good enough will be decided
by Linus, presumably, sometime after he returns from his travels.
In the post-SCO world, it does not take a great deal of paranoia or
imagination to suppose that somebody could attempt to sabotage the kernel
project through the deliberate injection of illicit code. If the true
nature of the code were revealed after it had been widely shipped, the
result could be a great deal of trouble for kernel developers, Linux
distributors, and possibly even users. So it is a good thing for the
kernel developers to hold the line and not accept code from anonymous
posters. The SCO episode has shown the world just how clean the kernel
code base is; we would like to keep it that way.
That said, it is hard to avoid the disquieting feeling that, had this code
been posted under a more normal-sounding name, it would not have been
subjected to such scrutiny. Code does show up from unknown names from all
parts of the world, and nobody has the resources or the desire to verify
that those names belong to real people who have a legitimate right to
contribute that code. People contributing code which
demonstrates deep knowledge of undocumented hardware will often be asked
just how they came by that knowledge. Verifying the answer can be
difficult, however. Our defenses are thin, but it is
hard to see how they could be improved without killing the process
entirely.
Page editor: Jonathan Corbet