Brief items
released on May 22 with a single fix for a remote denial of service problem in the netfilter SNMP NAT code. 188.8.131.52 was released on May 20 with a rather larger set of fixes.
The current 2.6 prepatch remains 2.6.17-rc4. Fixes continue to accumulate in the mainline git repository, however, and it looks like the -rc5 release could happen sometime soon.
The current -mm tree is 2.6.17-rc4-mm3. Recent changes to -mm include the big serial ATA patch set, an S/390 hypervisor filesystem, the Secmark packet filtering code, a new set of page migration patches, a new framework for hardware random number generator support, the file_operations read/write consolidation patch (since dropped until some problems are fixed), and the UTS namespace patches (see below). The next -mm release will also include the genirq patch set (see below).
Kernel development news
Read the full article for a detailed description of what secmark does and how to use it.
a new version of the UTS namespaces patch. This code, a small part of the "lightweight virtualization" or "containers" concept, allows various bits of system naming information (the stuff which can be seen with uname, essentially) to differ between sets of processes on the same system. It may not seem like a big thing, but, as a piece of container technology which has received the approval of several projects working in this area, it gives a hint of how the larger problem might be solved.
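For readers who want to see exactly which naming information is involved, the following is a minimal sketch that prints the fields returned by the standard uname() system call. The helper name print_uts_info() is made up for this illustration and is not part of the patch; the point is only that, with UTS namespaces, two sets of processes on the same system could see different values here.

```c
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Hypothetical helper: print the naming fields that uname(2) reports
 * for the calling process. Returns 0 on success, -1 if uname() fails.
 */
int print_uts_info(FILE *out)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;

    fprintf(out, "sysname:  %s\n", u.sysname);
    fprintf(out, "nodename: %s\n", u.nodename);
    fprintf(out, "release:  %s\n", u.release);
    fprintf(out, "version:  %s\n", u.version);
    fprintf(out, "machine:  %s\n", u.machine);
    return 0;
}
```

Calling print_uts_info(stdout) from any program shows the values as seen by that process.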
Andrew Morton responded with a note praising the way the work has been done, but asking a fundamental question:
All of which begs the question "now what?".
The worry is that the kernel developers could merge a large amount of non-trivial code, make a number of internal kernel interfaces more complicated, and still not have an end result that is useful to the containers community. The fact that the developers working in this area were able to agree on a patch for UTS namespaces is encouraging, but it is not a guarantee that consensus will be reached on the more complicated changes. The possibility of an intractable disagreement derailing the whole process partway through is a real one.
On the other hand, keeping all of the container code out of the kernel until it is reasonably complete has its own costs. Some of the container changes look to be relatively large and intrusive. Maintaining them all out of the tree would not be a great deal of fun. Neither would merging the whole mess at some future point when enough developers can agree that they are "done."
There are a number of features needed by the projects concerned with virtualization and containers. They include:
The ability to virtualize the view of the filesystem through namespaces is also required, but Linux has had that capability for some years now. Some of the more advanced container capabilities - live checkpointing and process migration, for example - will require yet another set of deep kernel hooks.
Most container concepts need most of the items from the list above to be able to provide useful isolation. So, somehow, a path must be found to get those features into the kernel without running into a blocking disagreement partway through - assuming that container support is considered desirable in general, of course.
Andrey Savochkin came up with a proposal which could be a good step forward: implement the network namespaces feature first. It is one of the most complex features, and it must be implemented in a way which doesn't upset the highly refined sensibilities of the networking subsystem developers. Some fairly tricky side problems - such as virtualizing access to /proc and sysfs - will have to be solved in the process. All told, it may be the hardest part of the problem, and it may be the place where an extended disagreement is most likely to show up.
Often, developers like to take on the easier parts of a problem first, then apply any lessons learned to the harder parts. In this case, however, starting with the hardest part may make some sense. If no universally acceptable solution can be found, the idea of generalized container support in the kernel can be dropped before too much other code has been merged. If, instead, the developers involved are able to implement something which pleases (or, at least, does not mortally offend) everybody, they should be able to get over any other roadblocks which may show up later on. In that case, the various pieces of the puzzle could be merged with confidence as they become ready.
An attempt to change the situation can be seen in the genirq patch set by Thomas Gleixner and Ingo Molnar. These patches attempt to take lessons learned about optimal interrupt handling on all architectures, mix in the quirks found in the fifty (yes, fifty) ARM subarchitectures, and create a new IRQ subsystem which is truly generic, and more powerful as well. It is a big patch set which reworks a great deal of crucially important low-level code. Expect some interesting discussion before any eventual mainline merge.
After some cleanup work, the patch gets serious with the creation of a new irq_chip structure. This structure is based on the old hw_interrupt_type structure, but it includes a rather longer list of low-level operations. The things which the kernel can now request of a specific interrupt controller include:
Many of these methods existed previously, but the mask(), mask_ack(), unmask(), set_type(), and set_wake() functions are new. With this set of functions, kernel code can manage interrupt controller chips in a fine-grained manner.
Moving up a level, the existing irq_desc structure, which holds all of the kernel's information about any specific interrupt, now has a pointer to an associated irq_chip structure. It also has a new method, handle_irq(), pointing to the function which actually handles this interrupt. That, perhaps, is the most fundamental change from the existing system, which uses a single handler function (__do_IRQ()) for all interrupts. It is a recognition of the fact that not all interrupts are equal, so there is little to gain by trying to deal with them all in a single, big function.
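The per-descriptor handler idea can be sketched in user space as a function pointer stored in each descriptor. This is a toy model only: every toy_ name below is invented for the illustration, and the real irq_desc structure carries far more state than is shown here.

```c
#include <stdio.h>

#define TOY_NR_IRQS 4

struct toy_irq_desc;

/* Signature of a per-interrupt entry point (hypothetical). */
typedef void (*toy_flow_handler_t)(unsigned int irq,
                                   struct toy_irq_desc *desc);

/* Toy stand-in for an interrupt descriptor. */
struct toy_irq_desc {
    toy_flow_handler_t handle_irq;  /* chosen per interrupt */
    unsigned int count;             /* times this interrupt was handled */
};

static struct toy_irq_desc toy_irq_table[TOY_NR_IRQS];

/* One possible handler; others could be installed for other interrupts. */
static void toy_handle_simple(unsigned int irq, struct toy_irq_desc *desc)
{
    desc->count++;
    printf("irq %u handled, count=%u\n", irq, desc->count);
}

/*
 * Dispatch through the descriptor instead of one big common function.
 * Returns 0 if a handler ran, -1 if the number is out of range or no
 * handler has been installed.
 */
int toy_dispatch(unsigned int irq)
{
    struct toy_irq_desc *desc;

    if (irq >= TOY_NR_IRQS)
        return -1;

    desc = &toy_irq_table[irq];
    if (!desc->handle_irq)
        return -1;

    desc->handle_irq(irq, desc);
    return 0;
}
```

Installing toy_handle_simple in toy_irq_table[1].handle_irq and calling toy_dispatch(1) runs that handler alone; no other interrupt's logic is on the code path.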
The biggest difference between interrupts is what is called the "flow type" - a combination of how the interrupt is signaled and how the system processes it. The genirq patches define these flow types:
The current IRQ code attempts to handle all of the above cases in a single, large routine. The new code, instead, creates a number of flow-specific handler functions, then sets the appropriate one as the handle_irq() method in the interrupt descriptor. The result is code which can be optimized for specific needs, and shorter code paths in the interrupt system as a whole. If a particular hardware platform has quirks which are not addressed by the current handlers, creating a new one is a relatively straightforward task.
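To make the idea of flow-specific handlers concrete, here is a user-space sketch contrasting a level-style and an edge-style handler built on a small set of chip operations. The demo_ names are hypothetical, and the exact sequence of chip calls in each handler is an assumption made for illustration (mask and acknowledge, run the driver, unmask for level; acknowledge, run the driver for edge), not a copy of the genirq code.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a chip's low-level operations. */
struct demo_chip {
    void (*ack)(unsigned int irq);
    void (*mask_ack)(unsigned int irq);
    void (*unmask)(unsigned int irq);
};

/* Hypothetical stand-in for an interrupt descriptor. */
struct demo_desc {
    struct demo_chip *chip;
    void (*handle_irq)(unsigned int irq, struct demo_desc *desc);
    void (*action)(unsigned int irq);   /* the driver's own handler */
};

/* Records the order of operations so the flow can be inspected. */
static char demo_log[128];

static void demo_note(const char *step)
{
    strncat(demo_log, step, sizeof(demo_log) - strlen(demo_log) - 1);
}

static void demo_ack(unsigned int irq)      { (void)irq; demo_note("ack "); }
static void demo_mask_ack(unsigned int irq) { (void)irq; demo_note("mask_ack "); }
static void demo_unmask(unsigned int irq)   { (void)irq; demo_note("unmask "); }
static void demo_driver(unsigned int irq)   { (void)irq; demo_note("driver "); }

static struct demo_chip demo_pic = {
    .ack = demo_ack,
    .mask_ack = demo_mask_ack,
    .unmask = demo_unmask,
};

/* Assumed level flow: keep the line masked while the driver runs. */
void demo_handle_level(unsigned int irq, struct demo_desc *desc)
{
    desc->chip->mask_ack(irq);
    desc->action(irq);
    desc->chip->unmask(irq);
}

/* Assumed edge flow: acknowledge, then run the driver unmasked. */
void demo_handle_edge(unsigned int irq, struct demo_desc *desc)
{
    desc->chip->ack(irq);
    desc->action(irq);
}
```

Each handler stays short because it only has to deal with one flow; supporting a quirky platform would mean writing one more small function of this shape rather than adding branches to a shared routine.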
At the kernel API level, the changes are relatively small; changes to drivers are not generally required. There are a few new capabilities, however. One is that there are some new flags which can be passed to request_irq():
This addition to the API actually happened in 2.6.16, but only the ARM architecture had any support for it at all. With the genirq patches, all architectures support these flags, and the appropriate flow handler will be selected internally. When interrupts are shared, however, all users must agree on how the triggering will be handled.
It is also possible to change the flow type of an IRQ directly with:
int set_irq_type(unsigned int irq, unsigned int type);
Here, type should be one of IRQ_TYPE_EDGE_RISING, IRQ_TYPE_EDGE_FALLING, IRQ_TYPE_EDGE_BOTH, IRQ_TYPE_LEVEL_HIGH, IRQ_TYPE_LEVEL_LOW, IRQ_TYPE_SIMPLE, or IRQ_TYPE_PERCPU. Calling this function has the same effect as specifying the trigger type with request_irq(), but it offers a wider range of possibilities. It also does not check for compatibility with any other users of a shared interrupt, so a certain potential for confusion exists.
Some devices can generate interrupts which should wake up the system from a suspended state. Wake-on-LAN behavior in network adaptors is one example; allowing the keyboard to wake the system is another. Kernel code can enable or disable this behavior in the interrupt controller with:
int set_irq_wake(unsigned int irq, unsigned int on);
An error code will be returned if the chip-level controller does not implement this operation.
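The "error if the chip does not implement it" behavior follows naturally from treating set_wake() as an optional method. The sketch below models that in user space; the wake_ names are invented here, and the specific error value (-ENXIO) is an assumption, since the text says only that an error code is returned.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical chip with an optional wake method. */
struct wake_chip {
    int (*set_wake)(unsigned int irq, unsigned int on);  /* may be NULL */
};

/*
 * Toy counterpart of set_irq_wake(): forward the request to the chip
 * if it can handle it, otherwise report an error.
 */
int wake_set_irq_wake(struct wake_chip *chip, unsigned int irq,
                      unsigned int on)
{
    if (chip == NULL || chip->set_wake == NULL)
        return -ENXIO;  /* assumed error value */
    return chip->set_wake(irq, on);
}

/* Bitmask of interrupts currently allowed to wake the toy system. */
static unsigned int wake_enabled_mask;

/* A chip-level implementation for a controller with 32 lines. */
static int wake_capable_set(unsigned int irq, unsigned int on)
{
    if (irq >= 32)
        return -EINVAL;
    if (on)
        wake_enabled_mask |= 1u << irq;
    else
        wake_enabled_mask &= ~(1u << irq);
    return 0;
}
```

A chip initialized with set_wake pointing at wake_capable_set accepts the request; one that leaves the pointer NULL causes wake_set_irq_wake() to fail without touching any state.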
There has been a relatively small amount of discussion so far; the biggest objection seems to be a claim that the separate flow handlers are an unnecessarily complex addition. The decision on whether genirq is merged very likely depends on whether the ARM maintainers are willing to drop their architecture-specific IRQ implementation and move to the new, generic version. Without that, the genirq code, which contains a lot of work aimed specifically at ARM's needs, will not truly be a generic solution. In the meantime, genirq has found its way into the -mm tree.
A recent patch by Ted Ts'o expands the taint concept in an interesting way. It adds a new file (/proc/sys/kernel/tainted); should user space write to that file, the kernel will be marked tainted with the new "U" flag. The idea, says Ted, is to flag "when userspace is potentially doing something naughty that might compromise the kernel." It took a few more questions before the real truth of the matter came out:
The idea of using SELinux on a system where Java code is free to mess around with physical memory does involve a fair amount of cognitive dissonance. But The Customer Is Always Right, so Ted is making this work. Not entirely willingly, though:
Nobody has stepped forward to say that the kernel should not be tainted in such a situation. Instead, one might almost be able to merge a patch causing the kernel to emit scary horror-movie sounds as well.
There appears to be general agreement that this patch makes sense; certainly there are plenty of situations where user-space actions might affect the stability of the system. There was one request for a log message to be stored with the user-space taint flag so that the reason for its presence would be more clear later on. A concern was also raised that some distributions were using the "U" flag for other reasons (to flag the presence of "unsupported" modules), though it is not clear that this is actually happening. Collisions over the use of taint flags could indeed create confusion, so Dave Jones has suggested that any taint flags used in out-of-tree code should at least be documented with a comment in the mainline kernel. Whether any such flags exist remains to be seen, however.
Page editor: Jonathan Corbet
Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds