Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel was released on May 22 with a single fix for a remote denial of service problem in the netfilter SNMP NAT code. The previous stable update was released on May 20 with a rather larger set of fixes.

The current 2.6 prepatch remains 2.6.17-rc4. Fixes continue to accumulate in the mainline git repository, however, and it looks like the -rc5 release could happen sometime soon.

The current -mm tree is 2.6.17-rc4-mm3. Recent changes to -mm include the big serial ATA patch set, an S/390 hypervisor filesystem, the Secmark packet filtering code, a new set of page migration patches, a new framework for hardware random number generator support, the file_operations read/write consolidation patch (since dropped until some problems are fixed), and the UTS namespace patches (see below). The next -mm release will also include the genirq patch set (see below).

Kernel development news

Quote of the week

Guys, a kernel developer who cannot understand that user space is important should just drop their pretentions of being a kernel developer, and go play with some toy system like Hurd instead. There you can say "user space doesn't matter".

-- Linus Torvalds

The Linux Device Driver Kit

Greg Kroah-Hartman has decided that it's time to put an end to people sneering that Linux lacks a proper device driver development kit. So, he has created the first Linux DDK. It includes a fresh kernel, a full copy of LDD3, and copies of all the in-tree kernel documentation. A CD image can be downloaded from

Secmark explained

James Morris's secmark patches have been circulating for a few weeks now. Secmark is a new mechanism for filtering network packets through SELinux. Your editor had pondered writing an article about secmark, but that turns out to be unnecessary; James did it first.

The idea is to separate labeling and enforcement. Specifically: use iptables to select and label packets, then use SELinux to enforce security policy using these packet labels. This utilizes the expressiveness of iptables rulesets, as well as the flexibility of any of its many matches and targets, and powerful components such as connection tracking. At the same time, enforcement of security policy remains the responsibility of the SELinux AVC, and access control rules can be meaningfully analyzed as part of overall SELinux policy analysis.

Read the full article for a detailed description of what secmark does and how to use it.

Virtualization: now what?

Serge Hallyn recently posted a new version of the UTS namespaces patch. This code, a small part of the "lightweight virtualization" or "containers" concept, allows various bits of system naming information (the stuff which can be seen with uname, essentially) to differ between sets of processes on the same system. It may not seem like a big thing, but, as a piece of container technology which has received the approval of several projects working in this area, it gives a hint of how the larger problem might be solved.
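For readers who want to see the information in question, here is a minimal user-space sketch which prints the fields returned by the standard uname() system call. The function name show_uts_fields() is made up for this example, and the text above does not say exactly which of these fields the patch allows to differ between containers, so this should be read only as a look at the data involved.

```c
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Print the system naming information returned by uname(2).
 * Returns 0 on success, -1 if the call fails.
 * (show_uts_fields() is a hypothetical helper name for this example.)
 */
int show_uts_fields(void)
{
    struct utsname u;

    if (uname(&u) != 0) {
        perror("uname");
        return -1;
    }

    printf("sysname:  %s\n", u.sysname);
    printf("nodename: %s\n", u.nodename);
    printf("release:  %s\n", u.release);
    printf("version:  %s\n", u.version);
    printf("machine:  %s\n", u.machine);
    return 0;
}
```

Calling show_uts_fields() from two processes on an unmodified system prints the same values in both; with per-container UTS namespaces, processes in different containers could presumably see different answers.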

Andrew Morton responded with a note praising the way the work has been done, but asking a fundamental question:

Generally, I think that the whole approach of virtualising the OS so it can run multiple independent instances of userspace is a good one. It's an extension and a strengthening of things which Linux is already doing and it pushes further along a path we've been taking for many years. If done right, it's even possible that each of these featurettes could improve the kernel in its own right - better layering, separation, etc. [...]

All of which begs the question "now what?".

The worry is that the kernel developers could merge a large amount of non-trivial code, make a number of internal kernel interfaces more complicated, and still not have an end result that is useful to the containers community. The fact that the developers working in this area were able to agree on a patch for UTS namespaces is encouraging, but it is not a guarantee that consensus will be reached on the more complicated changes. The possibility of an intractable disagreement derailing the whole process partway through is a real one.

On the other hand, keeping all of the container code out of the kernel until it is reasonably complete has its own costs. Some of the container changes look to be relatively large and intrusive. Maintaining them all out of the tree would not be a great deal of fun. Neither would merging the whole mess at some future point when enough developers can agree that they are "done."

There are a number of features needed by the projects concerned with virtualization and containers. They include:

  • The UTS namespace patch mentioned above.

  • PID virtualization, isolating each group of processes on the system from each other, and allowing process IDs to be reused between containers.

  • Namespaces for SYSV interprocess communication primitives (semaphores, shared memory, and message queues).

  • Time virtualization, so that each container can have its own idea of what time it is.

  • Virtualization of user and group ID values.

  • Network namespaces, intended to give each container a specific set of network interfaces to which it has access. When used in conjunction with IP aliases, this feature can set up a separate IP address for each container and keep containers from accessing each others' traffic.

The ability to virtualize the view of the filesystem through namespaces is also required, but Linux has had that capability for some years now. Some of the more advanced container capabilities - live checkpointing and process migration, for example - will require yet another set of deep kernel hooks.

Most container concepts need most of the items from the list above to be able to provide useful isolation. So, somehow, a path must be found to get those features into the kernel without running into a blocking disagreement partway through - assuming that container support is considered desirable in general, of course.

Andrey Savochkin came up with a proposal which could be a good step forward: implement the network namespaces feature first. It is one of the most complex features, and it must be implemented in a way which doesn't upset the highly refined sensibilities of the networking subsystem developers. Some fairly tricky side problems - such as virtualizing access to /proc and sysfs - will have to be solved in the process. All told, it may be the hardest part of the problem, and it may be the place where an extended disagreement is most likely to show up.

Often, developers like to take on the easier parts of a problem first, then apply any lessons learned to the harder parts. In this case, however, starting with the hardest part may make some sense. If no universally acceptable solution can be found, the idea of generalized container support in the kernel can be dropped before too much other code has been merged. If, instead, the developers involved are able to implement something which pleases (or, at least, does not mortally offend) everybody, they should be able to get over any other roadblocks which may show up later on. In that case, the various pieces of the puzzle could be merged with confidence as they become ready.

A new generic IRQ layer

The Linux kernel has a generic layer for the handling of hardware interrupts, hidden behind a standard API. There's only one problem: not all architectures use this layer. In particular, ARM is a holdout. It seems that interrupt handling in the ARM world is a complicated, subarchitecture-specific business which does not fit into the current "generic" code at all, so ARM sticks with its own code - even though there is a fair amount of overlap with code found in the generic subsystem. But, even for the architectures which are able to use it, the current IRQ subsystem has shortcomings which are becoming increasingly apparent.

An attempt to change the situation can be seen in the genirq patch set by Thomas Gleixner and Ingo Molnar. These patches attempt to take lessons learned about optimal interrupt handling on all architectures, mix in the quirks found in the fifty (yes, fifty) ARM subarchitectures, and create a new IRQ subsystem which is truly generic, and more powerful as well. It is a big patch set which reworks a great deal of crucially important low-level code. Expect some interesting discussion before any eventual mainline merge.

After some cleanup work, the patch gets serious with the creation of a new irq_chip structure. This structure is based on the old hw_interrupt_type structure, but it includes a rather longer list of low-level operations. The operations which the kernel can now request from a specific interrupt controller include:

  • startup(): enable the interrupt and generally get the controller ready to handle it.
  • shutdown(): completely shut down the interrupt.
  • enable(): enable the interrupt.
  • disable(): disable the interrupt.
  • ack(): inform the controller that the CPU has begun processing the interrupt.
  • end(): inform the controller that interrupt processing is done.
  • mask(): mask a specific interrupt, blocking its delivery.
  • mask_ack(): a combination of mask() and ack() which can be optimized on some platforms.
  • unmask(): unmask an interrupt.
  • set_affinity(): bind an interrupt to a specific CPU.
  • retrigger(): re-create and re-deliver an interrupt.
  • set_type(): set the flow type (described below) of the interrupt.
  • set_wake(): enable or disable wake-on-interrupt behavior.

Many of these methods existed previously, but the mask(), mask_ack(), unmask(), set_type(), and set_wake() functions are new. With this set of functions, kernel code can manage interrupt controller chips in a fine-grained manner.
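The idea of a per-controller table of operations can be illustrated with a small user-space model. Everything below is an assumption made for the sake of the example: the structure model_irq_chip, its method signatures, and the toy controllers are not the kernel's definitions. The sketch only shows how a caller might prefer an optimized mask_ack() when a controller provides one and fall back to separate mask() and ack() calls otherwise.

```c
#include <stddef.h>

/* Simplified user-space model; NOT the kernel's irq_chip definition. */
struct model_irq_chip {
    const char *name;
    void (*ack)(unsigned int irq);
    void (*mask)(unsigned int irq);
    void (*mask_ack)(unsigned int irq);   /* optional combined operation */
    void (*unmask)(unsigned int irq);
};

/* Mask and acknowledge, using the combined method when the chip has one. */
void model_mask_and_ack(const struct model_irq_chip *chip, unsigned int irq)
{
    if (chip->mask_ack) {
        chip->mask_ack(irq);
    } else {
        chip->mask(irq);
        chip->ack(irq);
    }
}

/* Toy controllers which record each operation as a letter in a trace. */
static char chip_trace[32];
static size_t chip_trace_len;

static void chip_note(char c)
{
    if (chip_trace_len + 1 < sizeof(chip_trace)) {
        chip_trace[chip_trace_len++] = c;
        chip_trace[chip_trace_len] = '\0';
    }
}

static void toy_ack(unsigned int irq)      { (void)irq; chip_note('a'); }
static void toy_mask(unsigned int irq)     { (void)irq; chip_note('m'); }
static void toy_mask_ack(unsigned int irq) { (void)irq; chip_note('M'); }
static void toy_unmask(unsigned int irq)   { (void)irq; chip_note('u'); }

/* A controller with only the separate operations... */
const struct model_irq_chip toy_plain_chip = {
    .name = "plain", .ack = toy_ack, .mask = toy_mask, .unmask = toy_unmask,
};

/* ...and one which also offers the combined operation. */
const struct model_irq_chip toy_combined_chip = {
    .name = "combined", .ack = toy_ack, .mask = toy_mask,
    .mask_ack = toy_mask_ack, .unmask = toy_unmask,
};

const char *model_chip_trace(void) { return chip_trace; }
void model_chip_trace_reset(void)  { chip_trace_len = 0; chip_trace[0] = '\0'; }
```

With the plain controller, model_mask_and_ack() leaves "ma" in the trace; with the combined controller it leaves a single "M".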

Moving up a level, the existing irq_desc structure, which holds all of the kernel's information about any specific interrupt, now has a pointer to an associated irq_chip structure. It also has a new method, handle_irq(), pointing to the function which actually handles this interrupt. That, perhaps, is the most fundamental change from the existing system, which uses a single handler function (__do_IRQ()) for all interrupts. It is a recognition of the fact that not all interrupts are equal, so there is little to gain by trying to deal with them all in a single, big function.

The biggest difference between interrupts is what is called the "flow type" - a combination of how the interrupt is signaled and how the system processes it. The genirq patches define these flow types:

  • Level-triggered interrupts are active as long as the device asserts its IRQ line. These interrupts must be masked while being processed, and can only be unmasked after the device has stopped asserting the interrupt.

  • Edge-triggered interrupts are signaled by a change in the interrupt line - from low voltage to high, from high to low, or both. These interrupts do not necessarily have to be masked while being processed, but, if they are not masked, more interrupts can arrive before the first has been handled. So the kernel must track "pending" interrupts, and the interrupt handler must loop until all interrupts have been dealt with.

  • "Simple" interrupts do not require any special control, and can be processed directly.

  • Per-CPU interrupts are bound to a single CPU. They are much like simple interrupts, but even simpler: since the handler will only run on one CPU, there is no need for locking.

The current IRQ code attempts to handle all of the above cases in a single, large routine. The new code, instead, creates a number of flow-specific handler functions, then sets the appropriate one as the handle_irq() method in the interrupt descriptor. The result is code which can be optimized for specific needs, and shorter code paths in the interrupt system as a whole. If a particular hardware platform has quirks which are not addressed by the current handlers, creating a new one is a relatively straightforward task.
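A rough user-space sketch may help show the shape of this design. The names used here (flow_desc, flow_handle_level(), and so on) are invented for the example and are not taken from the patch set; the pending counter is also a simplification of whatever bookkeeping the real code does. The point is only that dispatch goes through a per-interrupt function pointer, and each flow handler contains just the logic its flow type needs.

```c
/* Simplified user-space model of per-interrupt flow handlers. */
struct flow_desc;
typedef void (*flow_handler_t)(struct flow_desc *desc);

struct flow_desc {
    flow_handler_t handle_irq; /* flow handler chosen for this interrupt */
    int masked;                /* nonzero while the line is masked */
    int pending;               /* edge events recorded while the handler was busy */
    int serviced;              /* how many times the device handler has run */
};

/* Stand-in for calling the driver's interrupt handler. */
static void flow_run_action(struct flow_desc *desc)
{
    desc->serviced++;
}

/* Level-triggered: mask the line, handle the interrupt, then unmask. */
void flow_handle_level(struct flow_desc *desc)
{
    desc->masked = 1;
    flow_run_action(desc);
    desc->masked = 0;
}

/* Edge-triggered: not masked, so loop until no pending events remain. */
void flow_handle_edge(struct flow_desc *desc)
{
    flow_run_action(desc);          /* the event which caused this call */
    while (desc->pending > 0) {     /* events which arrived in the meantime */
        desc->pending--;
        flow_run_action(desc);
    }
}

/* Simple (and, in this model, per-CPU): just run the handler. */
void flow_handle_simple(struct flow_desc *desc)
{
    flow_run_action(desc);
}

/* The generic entry point does nothing but call the chosen handler. */
void flow_dispatch(struct flow_desc *desc)
{
    desc->handle_irq(desc);
}
```

In this model, a descriptor set up with flow_handle_edge and two pending events will have its device handler run three times by a single flow_dispatch() call, while a level-triggered descriptor runs it once with the line masked for the duration.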

At the kernel API level, the changes are relatively small; changes to drivers are not generally required. There are a few new capabilities, however. One is that there are some new flags which can be passed to request_irq():

  • SA_TRIGGER_LOW and SA_TRIGGER_HIGH: treat the interrupt source as being level-triggered, with interrupts signaled at the low or high level, respectively.

  • SA_TRIGGER_FALLING and SA_TRIGGER_RISING: treat the interrupt as being edge-triggered.

This addition to the API actually happened in 2.6.16, but only the ARM architecture had any support for it at all. With the genirq patches, all architectures support these flags, and the appropriate flow handler will be selected internally. When interrupts are shared, however, all users must agree on how the triggering will be handled.
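The sharing rule can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: the MODEL_TRIGGER_* values, the shared_line structure, and the choice of -EBUSY as the error are all illustrative, and none of them is taken from the patches. It shows one plausible way to enforce the requirement that every user of a shared line asks for the same triggering.

```c
#include <errno.h>

/* Illustrative flag values for this model only; not the kernel's SA_* values. */
#define MODEL_TRIGGER_RISING   0x1u
#define MODEL_TRIGGER_FALLING  0x2u
#define MODEL_TRIGGER_HIGH     0x4u
#define MODEL_TRIGGER_LOW      0x8u
#define MODEL_TRIGGER_MASK     0xfu

/* State kept for one (possibly shared) interrupt line in this model. */
struct shared_line {
    unsigned int users;    /* number of handlers currently registered */
    unsigned int trigger;  /* trigger flags agreed on by those handlers */
};

/*
 * Register another user of the line. The first user decides the trigger
 * type; later users must ask for exactly the same one.
 * Returns 0 on success or a negative errno value (assumed: -EBUSY).
 */
int model_request_line(struct shared_line *line, unsigned int flags)
{
    unsigned int trig = flags & MODEL_TRIGGER_MASK;

    if (line->users > 0 && trig != line->trigger)
        return -EBUSY;

    line->trigger = trig;
    line->users++;
    return 0;
}
```

A second caller asking for MODEL_TRIGGER_RISING on a line already registered as MODEL_TRIGGER_LOW is refused, while a second MODEL_TRIGGER_LOW user is accepted.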

It is also possible to change the flow type of an IRQ directly with:

    int set_irq_type(unsigned int irq, unsigned int type);

Here, type should be one of IRQ_TYPE_EDGE_RISING, IRQ_TYPE_EDGE_FALLING, IRQ_TYPE_EDGE_BOTH, IRQ_TYPE_LEVEL_HIGH, IRQ_TYPE_LEVEL_LOW, IRQ_TYPE_SIMPLE, or IRQ_TYPE_PERCPU. Calling this function has the same effect as specifying the trigger type with request_irq(), but it offers a wider range of possibilities. It also does not check for compatibility with any other users of a shared interrupt, so a certain potential for confusion exists.

Some devices can generate interrupts which should wake up the system from a suspended state. Wake-on-LAN behavior in network adaptors is one example; allowing the keyboard to wake the system is another. Kernel code can enable or disable this behavior in the interrupt controller with:

    int set_irq_wake(unsigned int irq, unsigned int on);

An error code will be returned if the chip-level controller does not implement this operation.
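How such a call might reach the controller can be sketched in user space as well. The wake_chip structure, the toy controllers, and the -ENXIO error code below are assumptions made for the example; the text above says only that some error code is returned when the controller lacks the operation.

```c
#include <errno.h>
#include <stddef.h>

/* Model of a controller whose set_wake() operation is optional. */
struct wake_chip {
    int (*set_wake)(unsigned int irq, unsigned int on);  /* may be NULL */
};

/*
 * Model of a set_irq_wake()-style call: hand the request to the
 * controller, or fail if the controller cannot do it.
 * The -ENXIO value is an assumption for this sketch.
 */
int model_set_irq_wake(const struct wake_chip *chip,
                       unsigned int irq, unsigned int on)
{
    if (chip == NULL || chip->set_wake == NULL)
        return -ENXIO;
    return chip->set_wake(irq, on);
}

/* Toy controller: remember which of 32 lines may wake the system. */
static unsigned int toy_wake_mask;

static int toy_set_wake(unsigned int irq, unsigned int on)
{
    if (irq >= 32)
        return -EINVAL;
    if (on)
        toy_wake_mask |= 1u << irq;
    else
        toy_wake_mask &= ~(1u << irq);
    return 0;
}

const struct wake_chip toy_wake_capable   = { .set_wake = toy_set_wake };
const struct wake_chip toy_wake_incapable = { .set_wake = NULL };

unsigned int model_wake_mask(void) { return toy_wake_mask; }
```

Enabling wakeup on line 3 of the capable controller sets bit 3 of the mask; the same request against the incapable controller returns a negative error code and changes nothing.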

There has been a relatively small amount of discussion so far; the biggest objection seems to be a claim that the separate flow handlers are an unnecessarily complex addition. The decision on whether genirq is merged very likely depends on whether the ARM maintainers are willing to drop their architecture-specific IRQ implementation and move to the new, generic version. Without that, the genirq code, which contains a lot of work aimed specifically at ARM's needs, will not truly be a generic solution. In the meantime, genirq has found its way into the -mm tree.

Tainting from user space

The kernel has long used "tainting" as a way of noting that something has happened which may affect the stability of the system. Should a kernel oops occur, the resulting kernel trace includes information on the kernel's taint status. This information can then be used by developers to ask hard questions about what was really going on. The taint flag was originally added to flag the use of binary-only kernel modules, but its use has grown since then. Events which will taint a current kernel include the forced removal of a module, loading a module without proper (or matching) version information, or running an SMP kernel with processors not designed for SMP operation. Machine check exceptions and certain kinds of memory management errors will also result in a tainted kernel.

A recent patch by Ted Ts'o expands the taint concept in an interesting way. It adds a new file (/proc/sys/kernel/tainted); should user space write to that file, the kernel will be marked tainted with the new "U" flag. The idea, says Ted, is to flag "when userspace is potentially doing something naughty that might compromise the kernel." It took a few more questions before the real truth of the matter came out:

The problem is that the Real-Time Specification for Java (RTSJ) **requires** that the JVM provide class functions which provide direct access to physical memory; all physical memory. In fact, the RTSJ compliance test explicitly checks for this; it requires that you give the compliance test the address of a few hundred megs of physical memory for the test. The absolutely hilarious bit about all of this is that the same customer who wants RTSJ compliance because of federal procurement regulations is also interested in using SELinux.

The idea of using SELinux on a system where Java code is free to mess around with physical memory does involve a fair amount of cognitive dissonance. But The Customer Is Always Right, so Ted is making this work. Not entirely willingly, though:

In fact, I was so unhappy about being forced by the RTSJ specification to do this insane thing that I wanted to make sure that if it were ever used, it would set a TAINT flag to warn people that just about anything unsane could have happened, and the system's stability was at the mercy of the competence of Java application programmers.

Nobody has stepped forward to say that the kernel should not be tainted in such a situation. Instead, one might almost be able to merge a patch causing the kernel to emit scary horror-movie sounds as well.

There appears to be general agreement that this patch makes sense; certainly there are plenty of situations where user-space actions might affect the stability of the system. There was one request for a log message to be stored with the user-space taint flag so that the reason for its presence would be more clear later on. A concern was also raised that some distributions were using the "U" flag for other reasons (to flag the presence of "unsupported" modules), though it is not clear that this is actually happening. Collisions over the use of taint flags could indeed create confusion, so Dave Jones has suggested that any taint flags used in out-of-tree code should at least be documented with a comment in the mainline kernel. Whether any such flags exist remains to be seen, however.
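For the curious, a user-space program could check for the user-space taint flag along the following lines. This is a hedged sketch: it assumes that /proc/sys/kernel/tainted can be read as a decimal bitmask and that the "U" flag occupies bit 6, neither of which is stated in the discussion above. The helper names are invented for the example.

```c
#include <stdio.h>

/* Assumed bit position of the user-space ("U") taint flag. */
#define ASSUMED_TAINT_USER_BIT 6

/* Return 1 if the assumed "U" bit is set in a taint mask, 0 otherwise. */
int taint_has_user_flag(unsigned long mask)
{
    return (int)((mask >> ASSUMED_TAINT_USER_BIT) & 1UL);
}

/*
 * Read a taint mask from the given file (normally
 * /proc/sys/kernel/tainted). Returns 0 and stores the value in *mask
 * on success, or -1 if the file cannot be opened or parsed.
 */
int read_taint_mask(const char *path, unsigned long *mask)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%lu", mask) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A caller would pass "/proc/sys/kernel/tainted" to read_taint_mask() and hand the result to taint_has_user_flag(); on a system without that file, the read simply fails with -1.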

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds