
Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel is 2.6.17.8, released on August 6. There is a fairly long list of important fixes this time around, but none with CVE (vulnerability) numbers attached.

The current 2.6 prepatch is 2.6.18-rc4, announced by Linus on August 6. "The diffstat (and the appended shortlog) tells the story: a lot of small fixes in various areas, mostly drivers. Input layer, infiniband, usb, net, sound, vlb. Some cpufreq and architecture updates. Also some audit rule improvements from Al & Amy." The changes also include a new event notification mechanism within the networking code and a function (netdev_alloc_skb()) for allocating packet buffers in a NUMA-aware fashion. See the long-format changelog for the details.

The current -mm tree is 2.6.18-rc3-mm2. Recent changes to -mm include the return of the CacheFS subsystem, full compact flash support in the libata code, a big x86-64 update, a number of memory management tweaks, vectored asynchronous I/O support, and a "comprehensive system accounting" patch.

Comments (none posted)

Kernel development news

Quote of the week

Davej's laws of kernel hacking #1: If the number of iterations a patch goes through to get it right is greater than the number of lines in the diff, it probably isn't worth it.

-- Dave Jones

Comments (3 posted)

Some movements in the kernel community

When Linus announced the 2.6.18-rc4 release, he tossed in one extra bit of news:

Anyway, I'll be effectively offline for most of the following three weeks (vacations and a funeral), and while I hope to be able to update my tree every once in a while, I also asked Greg KH to maintain a git tree for any worthwhile fixes.

He then promptly fled the scene without actually putting -rc4 up on kernel.org - an omission which Greg fixed some hours later. While kernel development will continue as always, we are likely to see rather fewer -rc releases over the next few weeks, and almost certainly no 2.6.18 final release.

Andrew Morton, meanwhile, used the 2.6.18-rc3-mm1 announcement to pass on a little news of his own:

fwiw, I recently took a position with Google.

He evidently made this change to find a working environment which better suits his habits; from the kernel development point of view, no real changes are expected.

Finally, Greg Kroah-Hartman has announced a transition in 2.6.16 support:

This is just a notice to everyone that Adrian [Bunk] is going to now be taking over the 2.6.16-stable kernel branch, for him to maintain for as long as he wants to.

He will still be following the same -stable rules that are documented in the Documentation/stable_kernel_rules.txt file, but just doing this for the 2.6.16 kernel tree for a much longer time than the current stable team is willing to do (we have moved on to the 2.6.17 kernel now.)

Adrian had announced his intention to maintain this kernel for the long haul early in the 2.6.16 cycle. It will be interesting to see how this goes; fitting important patches into 2.6.16 will get harder as the mainline gets more distant. The long-term success of this project may depend on whether distributors make use of this kernel - and, as a result, help to maintain it.

Comments (1 posted)

The Grand Unified Flow Cache

The Grand Unified Flow Cache is one of those items which shows up as a bullet in networking summit presentations; the networking folks appear to know what it means, but they have been somewhat remiss in documenting the idea for the rest of us. This concept has returned in the context of the network channels discussion, and enough hints have been dropped to let your editor - who is not afraid to extrapolate a long way from minimal data - get a sense for what the term means. Should it be implemented, the GUFC could bring significant changes to the entire networking stack.

The net channel concept requires that the kernel be able to quickly identify the destination of each packet and drop it into the proper channel. Even better would be to have a smart network adapter perform that classification as the packet arrives, taking the kernel out of that part of the loop altogether. One way of performing this classification would be to form a tuple from each packet and use that tuple as a lookup key in some sort of fast data structure. When a packet's tuple is found in this structure (the flow cache), its fate has been determined and it can be quickly shunted off to where it needs to be.

This tuple, as described by Rusty Russell, would be made up of seven parameters:

  • The source IP address
  • The destination IP address
  • A bit indicating whether the source is local
  • A bit indicating whether the destination is local
  • The IP protocol number
  • The source port
  • The destination port

These numbers, all together, are sufficient to identify the connection to which any packet belongs. A quick lookup on an incoming packet should, thus, yield a useful destination (such as a network channel) for that packet with no further processing.
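The tuple described above can be sketched in C. This is purely illustrative - the field names, the packing, and the toy hash below are invented for this article, not taken from any posted patch; a real flow cache would use a stronger hash (something like the Jenkins hash already used in the networking code) and a proper hash table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the seven-element flow tuple; names invented */
struct flow_tuple {
    uint32_t saddr;        /* source IP address */
    uint32_t daddr;        /* destination IP address */
    uint8_t  src_local:1;  /* source is local? */
    uint8_t  dst_local:1;  /* destination is local? */
    uint8_t  protocol;     /* IP protocol number */
    uint16_t sport;        /* source port */
    uint16_t dport;        /* destination port */
};

/* A toy hash suitable only for demonstration; it maps the tuple to an
 * index which could be used to find the flow's cache entry (and thus
 * its channel) in a single lookup. */
static uint32_t flow_hash(const struct flow_tuple *t)
{
    uint32_t h = t->saddr ^ (t->daddr << 1);
    h ^= ((uint32_t)t->sport << 16) | t->dport;
    h ^= t->protocol | (t->src_local << 8) | (t->dst_local << 9);
    return h;
}
```

The point is simply that every field needed for the classification decision is present in the packet headers (plus two locally-known bits), so the lookup can happen before any protocol processing is done.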

Features like netfilter mess up this pretty picture, however. Within the kernel, netfilter is set up such that every packet is fed to the appropriate chain(s). As soon as every packet has to go through a common set of hooks, the advantage of the GUFC is lost. Rusty's description of the problem is this:

The mistake (?) with netfilter was that we are completely general: you will see all packets, do what you want. If, instead, we had forced all rules to be of form "show me all packets matching this tuple" we would be in a [position to] combine it in a single lookup with routing etc.

So, the way around this problem would be to change the netfilter API to work better with a grand unified flow cache. Rules could be written in terms of the above tuples (with wild cards allowed), and only packets which match the tuples need pass through the (slow) netfilter path. That would allow packets which are not of interest to the filtering code to bypass the whole mechanism - and the decision could be made in a single lookup.

Often, however, a packet filtering decision can be made on the basis of the tuple itself - once a packet matches the tuple, there is no real need to evaluate it against the rule separately. So, for example, once the connection tracking code has allowed a new connection to be established, and a tuple describing that connection has been added to the cache, further filtering for that connection should not be required. If netfilter and the flow cache worked together effectively, the per-packet overhead could be avoided in many cases.

One way this might work would be to have a set of callbacks invoked for each tuple which is added to the flow cache. A module like netfilter could examine the tuple relative to the current rule set and let the kernel know if it needs to see packets matching that tuple or not. Then, packets could be directed to the appropriate filters without the need for wildcard matching in the tuple cache.

There is a small cost to all of this:

Of course, it means rewriting all the userspace tools, documentation, and creating a complete new infrastructure for connection tracking and NAT, but if that's what's required, then so be it.

Rusty has never let this sort of obstacle stop him before, so all of this might just happen.

But probably not anytime soon. There's a long list of questions which need to be answered before a serious implementation attempt is made. Whether it would truly perform as well as people hope is one of them; these schemes can get quite a bit slower once all of the real-world details are factored in. Rule updates could be a challenge; an administrator who has just changed packet filtering rules is unlikely to wait patiently while the new rules slowly work their way into the cache. Finding a way to get the hardware to help in the classification process will not be entirely straightforward. And so on. But it would seem that there are a number of interesting ideas in this area. That is bound to lead to good stuff sooner or later.

Comments (4 posted)

Connecting Linux to hypervisors

Paravirtualization is the act of running a guest operating system, under control of a host system, where the guest has been ported to a virtual architecture which is almost like the hardware it is actually running on. This technique allows full guest systems to be run in a relatively efficient manner. The highest-profile free paravirtualization implementation remains Xen; on the proprietary side, VMWare has been active for a long time. Both of these efforts would like to see (at least some of) their code in the mainline kernel. The kernel developers, however, are uninterested in merging a large collection of hooks specific to any one solution.

One attempt to solve this problem, proposed by VMWare, is the VMI interface. VMI works by isolating any operations which may require hypervisor intervention into a special set of function calls. The implementation of those functions is not built into the kernel; instead, the kernel, at boot time, loads a "hypervisor ROM" which provides the needed functions. The binary interface between the kernel and this loadable segment is set in stone, meaning that kernels built for today's implementations should work equally well on tomorrow's replacement. This design also allows the same binary kernel image to run under a variety of hypervisors, or, with the right ROM, in native mode on the bare hardware.

The fixed ABI and the ability to load "binary blobs" into the kernel do not sit well with all kernel developers, however. It looks like another way to put proprietary code into the kernel - something most kernel hackers would rather not encourage. Plus, as Rusty Russell put it:

We're not good at maintaining ABIs. We're going to be especially bad at maintaining an ABI when the 99% of us running native will never notice the breakage.

For this and other reasons, VMI has not had a smooth path into the kernel so far. That has not stopped VMWare hacker Zachary Amsden from pushing for a binary blob interface recently on linux-kernel, however.

There have been rumblings for a while concerning an alternative hypervisor interface (called "paravirt_ops") under development. An early implementation of paravirt_ops was posted on August 7, making the shape of this interface clearer. In the end, paravirt_ops is yet another structure filled with function pointers, like many other operations structures used in the kernel. In this case, the operations are the various machine-specific functions that tend to require a discussion with the hypervisor. They include things like disabling interrupts, changing processor control registers, changing memory mappings, etc.

As an example, one of the members of paravirt_ops is:

    void (fastcall *irq_disable)(void);

The patch also defines a little function for use by the kernel:

    static inline void raw_local_irq_disable(void)
    {
    	paravirt_ops.irq_disable();
    }

As long as the kernel always uses this function to disable interrupts, it will use whatever implementation has been provided by the hypervisor which fills in paravirt_ops.

The patch includes a set of operations for native (non-virtualized) systems which causes the kernel to behave as it did before - or which will bring this about, once the remaining bugs are fixed. That kernel may be a little slower, however, since many operations which were once performed by inline assembly code are now done through an indirect function call instead. To mitigate the worst performance impacts, the paravirt_ops patch set includes a self-patching mechanism which fixes up some of the function calls - the interrupt-related ones, in particular.
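The pattern can be demonstrated with a small user-space analogue. This is a sketch only: the real paravirt_ops structure in the posted patch has many more members, and the native implementations are inline assembly (cli/sti on x86) rather than the stand-in stubs used here; a flag variable takes the place of the processor's interrupt state.

```c
#include <assert.h>

/* Two representative members of the operations structure; the real
 * structure covers control registers, memory mappings, and more. */
struct paravirt_ops {
    void (*irq_disable)(void);
    void (*irq_enable)(void);
};

static int irqs_enabled = 1;   /* stand-in for processor state */

/* Native implementations; on real hardware these would be the usual
 * inline assembly ("cli" / "sti" on x86). */
static void native_irq_disable(void) { irqs_enabled = 0; }
static void native_irq_enable(void)  { irqs_enabled = 1; }

/* The default is the native set; a hypervisor port simply points these
 * members at its own hypercall wrappers at boot time. */
static struct paravirt_ops paravirt_ops = {
    .irq_disable = native_irq_disable,
    .irq_enable  = native_irq_enable,
};

/* Callers never change, whichever implementation is installed. */
static void raw_local_irq_disable(void)
{
    paravirt_ops.irq_disable();
}
```

The indirection is exactly what the self-patching mechanism then tries to optimize away for the native case.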

This interface may look a lot like VMI; both interfaces allow the replacement of important low-level operations with hypervisor-specific versions. The difference is that paravirt_ops is an inherently source-based interface, with no binary interface guarantees. It is assumed that this interface will change over time, as most other internal kernel interfaces do. In fact, since this is a relatively new area for kernel support, chances are that paravirt_ops will be more than usually volatile for some time. There is also, currently, no provision for loading the operations at run time, so kernels must be built to work with a specific hypervisor.

On the surface, paravirt_ops thus looks like a competitor to VMI - a choice of open, mutable kernel interfaces against binary blobs and a fixed ABI. As it happens, however, there is a diverse set of developers working on paravirt_ops, including representatives from Xen and, yes, VMWare. Some of the VMI code has found its way into the initial paravirt_ops posting. All of the large players appear to be behind this development - a fact which will greatly ease its path into the kernel.

So why are the VMWare developers still pushing for a binary interface? It would appear that they are considering the creation of a glue layer connecting paravirt_ops with the VMI binary interface. This design leaves the VMI people solely responsible for maintaining their ABI while freeing the kernel developers to mess with paravirt_ops at will. Some of the relevant developers feel more at ease with the VMI interface when it is connected this way, though there is some residual discomfort about the possibility of linking non-GPL binary hypervisor modules into the kernel.

The paravirt_ops developers would like to get their code into the 2.6.19 kernel. That schedule looks ambitious, given that the merge window is due to open in a few weeks and that, as of this writing, paravirt_ops has not yet done any time in the -mm kernel. It is, however, an option which should disappear entirely when configured out, so inclusion in 2.6.19 might not be entirely out of the question.

Comments (3 posted)

Code of uncertain origin

Recently, a set of patches was posted for inclusion in the mainline kernel. These patches make use of the (undocumented) "SMAPI" BIOS found in Thinkpad laptops to provide support for a number of useful Thinkpad features. It looks like it could be the sort of code that would be welcomed; improving hardware support is generally considered to be a good thing to do.

There is just one little problem. The code was signed off as:

    Signed-off-by: Shem Multinymous <multinymous@gmail.com>

Various developers quickly pointed out that there was little useful information here, and that code signed off by an obvious pseudonym would be difficult to trust enough to merge into the kernel. "Mr. Multinymous" argued the case for inclusion with statements like:

I hereby declare that this patch was developed solely based on public specifications, observation of hardware behavior by trial&e[r]ror, and specifications made available to me in clean-room settings and with no attached obligations. So this patch is as pure as the mainline hdaps driver it fixes (and probably purer than many other drivers), and not a single line of it is a derivative work of $OTHER_OS code.

The author of the code remains unwilling to reveal his or her identity, however, with the result that others have refused to consider the code for inclusion. The standoff might have been broken by Pavel Machek, who has offered to sign off the code. Whether that is good enough will be decided by Linus, presumably, sometime after he returns from his travels.

In the post-SCO world, it does not take a great deal of paranoia or imagination to suppose that somebody could attempt to sabotage the kernel project through the deliberate injection of illicit code. If the true nature of the code were revealed after it had been widely shipped, the result could be a great deal of trouble for kernel developers, Linux distributors, and possibly even users. So it is a good thing for the kernel developers to hold the line and not accept code from anonymous posters. The SCO episode has shown the world just how clean the kernel code base is; we would like to keep it that way.

That said, it is hard to avoid the disquieting feeling that, had this code been posted under a more normal-sounding name, it would not have been subjected to such scrutiny. Code does show up from unknown names from all parts of the world, and nobody has the resources or the desire to verify that those names belong to real people who have a legitimate right to contribute that code. For this reason, people contributing code which demonstrates deep knowledge of undocumented hardware will often be asked just how they came by that knowledge. Verifying the answer can be difficult, however. Our defenses are thin, but it is hard to see how they could be improved without killing the process entirely.

Comments (18 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux v2.6.18-rc4
Andrew Morton 2.6.18-rc3-mm2
Andrew Morton 2.6.18-rc3-mm1
Greg KH Linux 2.6.17.8

Architecture-specific

Core kernel code

Device drivers

Documentation

Michael Kerrisk man-pages-2.37 is released

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds