Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.8-rc2; no new prepatches have been released over the last week.

Linus has been committing patches to his BitKeeper repository, however; they include lots more code annotations and fixups, various networking fixes, some NX improvements for old binary support, and a number of other fixes. Things appear to be stabilizing for the real 2.6.8 release.

The current prepatch from Andrew Morton is 2.6.8-rc2-mm1. Recent additions to -mm include the pmdisk/swsusp merge (covered here last week), some performance counter tweaks (enables inheritance of settings across fork() and exec()), some scheduling domains cleanup work, various latency reductions, and a large number of other fixes.

The current 2.4 prepatch is 2.4.27-rc3 and has been since July 3.

Kernel development news

Another look at the new development model

Discussions held at OLS, on the mailing lists, and elsewhere have made it clear that a certain degree of confusion still exists regarding the new kernel development process and what has really changed. In an attempt to clear things up, we'll take one more look at what was decided at this year's kernel summit.

The old process, in use since the 1.0 kernel release, worked with two major forks. The even-numbered fork was the "stable" series, managed in a way which (most of the time) attempted to keep the number of changes to a minimum. The odd-numbered fork, instead, was the development series, where anything goes. The idea was that most users would use the stable kernels, and that those kernels could be expected to be as bug-free as possible.

This mechanism has been made to work, but it has a number of problems which have been noticed over the years. These include:

  • The stable and development trees diverge from each other quickly, especially since big API changes have tended to be saved for early in the development series. This divergence makes it hard to port code between the two trees. As a result, backporting new features into the stable series is hard, and forward-porting fixes is also a challenge. 2.6.0 came out with a number of bugs which had long been fixed in 2.4.

  • The stable tree, after a short while, lacks fixes, features, and improvements which have been added to the development tree. That code may well have proved itself stable in the development series, but it often does not make it into a stable kernel for years. The kernels that people are told to use can run far behind the state of the art.

  • The stable kernels are often very heavily patched by the distributors. These patches include necessary fixes, backports of development kernel features, and more. As a result, stock distribution kernels diverge significantly from the mainline, and from each other. Distributor kernels sometimes are shipped with early implementations of features which evolve significantly before appearing in an official stable kernel, leading to compatibility problems for users.

The focus on keeping changes out of the stable kernel tree is now seen as being a bit misdirected. Well-tested patches can be safely merged, most of the time. Blocking patches, instead, creates an immense "patch pressure" which leads to divergent kernels and a major destabilizing flood whenever the door is opened a little.

So how have things changed? The "new" process is really just an acknowledgment of how things have been done since the 2.6.0 release - or, perhaps, a little before. It looks like this:

  • New patches which appear to be nearing prime-time readiness are added to Andrew Morton's -mm tree. This addition can be done by Andrew himself, or by way of a growing number of BitKeeper repositories which are automatically merged into -mm.

  • Each patch lives in -mm and is tested, commented on, refined, etc. Eventually, if the patch proves to be both useful and stable, it is forwarded on to Linus for merging into the mainline. If, instead, it causes problems or does not bring significant benefit, the patch will eventually be dropped from -mm.

The -mm tree has proved to be a truly novel addition to the development process. Each patch in this tree continues to be tracked as an independent contribution; it can be changed or removed at any time. The ability to drop patches is the real change; patches merged into the mainline lose their identity and become difficult to revert. The -mm tree provides a sort of proving ground which the kernel process has never quite had before. Alan Cox's -ac trees were similar, but they were less experimental than -mm (distributors often merged -ac almost directly into their stock kernels), and -mm does a much better job of tracking each patch independently.

In essence, -mm has become the new kernel development tree. The old process created a hard fork and was not designed to merge changes back into the "old" stable tree. -mm is much more dynamic; it exists as a set of patches to the mainline, and any individual patch can move over to the mainline at any time. New features get the testing they need, then graduate to the mainline when they are ready. New developments move into the stable kernel quickly, the development kernel benefits from all fixes made to the stable branch, and the whole process moves in a much faster and smoother way.

More than one observer in Ottawa made this ironic observation: it would appear that Andrew Morton is now in charge of the development kernel, while Linus manages the stable kernel. That is not quite how things were expected to turn out, but it seems to be working. Consider some of the changes which have been merged since 2.6.0:

  • 4K kernel stacks
  • NX page protection and ia32e architecture support
  • The NUMA API
  • Laptop mode
  • The lightweight auditing framework
  • The CFQ disk I/O scheduler
  • Netpoll
  • Cryptoloop, snapshot, and mirroring in the device mapper
  • Scheduling domains
  • The object-based reverse mapping VM

Some of these changes are truly significant, and things have not stopped there: new patches are going into the kernel at a rate of about 10MB/month. Yet 2.6.7 was, arguably, the most stable 2.6 kernel yet. It contains many of the latest features, has few performance problems, and the number of bug reports has been quite small. The new process is yielding some good results.

Naturally, there are some issues to resolve. One of those is the deprecation of features, which used to be tied to the timing of the old process. The new plan, it seems, is to give users a one-year notice, including a printk() warning in the kernel. The first features to be removed by this path are likely to be devfs and cryptoloop. There is also the question of changes which are simply too disruptive to merge anytime soon. Page clustering, if it is merged, could be one of those. When such a feature comes along, we may yet see the creation of a 2.7 tree to host it. Even then, however, 2.7 will track 2.6 as closely as possible, and it may go away when the feature which drove its existence becomes ready to go into the mainline.

This change to the development process is significant. It is not particularly new, however. The actual change happened the better part of a year ago; it was simply hidden in plain sight. All that has really happened in Ottawa is that the developers have acknowledged that the process is working well. One can easily argue, in fact, that the kernel development process has never functioned better than it does now. So, rather than break such a successful model, the developers are going to let it run.

Voluntary preemption and hardware interrupts

Ingo Molnar's voluntary preemption work, described here two weeks ago, has continued to progress. Indeed, since Mr. Molnar did not attend the kernel summit or OLS, this could well have been the fastest-moving kernel project over the last week. The 2.6.8-rc1-L2 version of the patch, released on July 27, claims a maximum 100μs latency - almost good enough, says Ingo, for hard real-time work. One of the methods used may raise some eyebrows, however.

The core of the voluntary preemption patch stays true to its original intent: it adds scheduling points in places where the kernel may hold onto the CPU for overly long periods. As kernel testers report problems, Ingo has been going in and breaking up the offending bits of code. Ingo has also added a new knob to control the maximum number of sectors the block I/O subsystem will try to transfer at once; if block operations get too big, the IDE completion routines can take too long to perform their cleanup.
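
The idea of a scheduling point can be illustrated outside the kernel. What follows is only a user-space sketch, not code from Ingo's patch: the function names, the batch size, and the yield_count counter are all invented for illustration, and the POSIX sched_yield() call stands in for a kernel scheduling point in the spirit of cond_resched().

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Counts how often the loop offered to give up the CPU (illustration only). */
unsigned long yield_count;

/* Hypothetical stand-in for a kernel scheduling point. */
static void scheduling_point(void)
{
    yield_count++;
    sched_yield();  /* POSIX: let another runnable thread use the CPU */
}

/* Sum n values, inserting a scheduling point after every `batch` elements.
 * A batch of 0 disables the scheduling points entirely. */
unsigned long sum_with_scheduling_points(const unsigned int *v, size_t n,
                                         size_t batch)
{
    unsigned long total = 0;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        total += v[i];
        if (batch != 0 && (i + 1) % batch == 0)
            scheduling_point();
    }
    return total;
}
```

A smaller batch bounds the time spent between scheduling points more tightly, at the cost of more frequent checks; that is the same tradeoff faced when breaking up a long-running kernel loop.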

That change pointed at a larger problem, however: some interrupt handlers can, despite conventions to the contrary, occupy the processor for a long time. While an interrupt handler is running, high-priority processes cannot run. Ingo decided to address this problem head on: he has moved hardware interrupt handling into process context.

To do this, he had to change the core kernel interrupt dispatcher. That code now checks to see if the interrupt comes from the timer; if so, it is handled immediately, in the traditional manner. Otherwise, the IRQ number is added to a per-CPU list of pending hardware interrupts, and control returns to the scheduler without having actually serviced the interrupt. Calling the real interrupt handler falls to the ksoftirqd process (which has been renamed irqd). Once it is scheduled, it simply iterates through the list of pending interrupts and calls all of the registered handlers for each. Due to the change in context, the pt_regs structure is no longer available to the handler, but, since almost no interrupt handlers ever use that argument, that will not be a problem.
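
The control flow described above can be modeled in a few lines of user-space C. This is a rough sketch under several assumptions, not the patch itself: NR_IRQS, TIMER_IRQ, and every function name are hypothetical; a single array replaces the per-CPU pending list; and each IRQ has one handler here, where the real code calls all registered handlers.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_IRQS   16   /* assumed table size for this sketch */
#define TIMER_IRQ 0    /* assumed timer IRQ number */

typedef void (*irq_handler_fn)(int irq);

static irq_handler_fn handlers[NR_IRQS];
static bool pending[NR_IRQS];  /* stands in for the per-CPU pending list */
int calls[NR_IRQS];            /* per-IRQ call counts, for the example handler */

/* Example handler: just records that it was called. */
void count_handler(int irq)
{
    calls[irq]++;
}

void register_handler(int irq, irq_handler_fn fn)
{
    assert(irq >= 0 && irq < NR_IRQS);
    handlers[irq] = fn;
}

/* "Interrupt context": the timer is serviced now, everything else is queued. */
void dispatch_irq(int irq)
{
    assert(irq >= 0 && irq < NR_IRQS);
    if (irq == TIMER_IRQ) {
        if (handlers[irq])
            handlers[irq](irq);
        return;
    }
    pending[irq] = true;
}

/* Body of the irqd-like thread: run queued handlers in process context.
 * Returns the number of interrupts serviced on this pass. */
int irqd_run_pending(void)
{
    int serviced = 0;

    for (int irq = 0; irq < NR_IRQS; irq++) {
        if (!pending[irq])
            continue;
        pending[irq] = false;
        if (handlers[irq])
            handlers[irq](irq);
        serviced++;
        /* Between interrupts is where preemption could be allowed. */
    }
    return serviced;
}
```

Note that the handler takes only the IRQ number; as in the patch, there is no pt_regs argument to pass along once the handler runs outside of real interrupt context.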

The irqd process runs at a high priority, but a high-priority, real-time process can still preempt it. While it is dispatching an interrupt to its handler(s), irqd goes into a simulated interrupt mode and cannot be preempted. It drops out of that mode between interrupts, however, making scheduling possible. It is also possible for an interrupt handler to enable preemption at a given point with a call to cond_resched_hardirq() (or one of its variants). Without such a call, hardware interrupt handlers will not be put to sleep.

There are no such calls in drivers in Ingo's current patch - at least, not directly. Ingo has also added a new version of end_that_request_first() (the function used to indicate partial completion of a block I/O request) which allows preemption. The IDE completion handler calls the new version, which makes it preemptable - even though it is an interrupt handler.

Ingo claims some very good results from this work. The software latencies are now all very small. It would be interesting to see whether the redirecting of hardware interrupts has any effect on interrupt response latency, however. It remains to be seen whether a change of this magnitude will be accepted - especially since (involuntary) kernel preemption is supposed to be the real solution to latency problems. Building trust in involuntary preemption is an ongoing process, while the voluntary approach appears to have solved the problem now. In the end, that is likely to count for something.

(Coincidentally, Scott Wood has posted a different patch moving interrupt handlers into process context. His patch creates a separate thread for each interrupt, which allows the priority of each interrupt handler to be set independently. There is also an SA_NOTHREAD flag to request_irq() which allows a driver to request the old, IRQ-context mode. Ingo has said that he is likely to adopt the multi-thread approach for his patch as well.)
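
The per-interrupt approach can be sketched in the same user-space style. Everything here is an assumption made for illustration: no real threads are created (a priority field simply decides which pending handler runs next), SKETCH_NOTHREAD is a made-up flag modeled on SA_NOTHREAD, and request_irq_sketch() does not have the signature of the kernel's request_irq().

```c
#include <assert.h>
#include <stdbool.h>

#define NR_IRQS         8    /* assumed table size */
#define MAX_RECORDED    32   /* capacity of the example run log */
#define SKETCH_NOTHREAD 0x1  /* modeled on SA_NOTHREAD; the value is made up */

typedef void (*irq_handler_fn)(int irq);

struct irq_desc_sketch {
    irq_handler_fn handler;
    int priority;    /* higher values run first */
    unsigned flags;
    bool pending;
};

static struct irq_desc_sketch irqs[NR_IRQS];
int order[MAX_RECORDED];  /* IRQ numbers in the order their handlers ran */
int n_run;

/* Example handler: logs the order in which handlers are invoked. */
void record_handler(int irq)
{
    if (n_run < MAX_RECORDED)
        order[n_run++] = irq;
}

void request_irq_sketch(int irq, irq_handler_fn fn, int priority,
                        unsigned flags)
{
    assert(irq >= 0 && irq < NR_IRQS);
    irqs[irq].handler = fn;
    irqs[irq].priority = priority;
    irqs[irq].flags = flags;
    irqs[irq].pending = false;
}

/* An interrupt arrives: run it at once if the driver opted out of
 * threading, otherwise mark it for its (simulated) thread. */
void raise_irq(int irq)
{
    assert(irq >= 0 && irq < NR_IRQS);
    if (!irqs[irq].handler)
        return;
    if (irqs[irq].flags & SKETCH_NOTHREAD)
        irqs[irq].handler(irq);
    else
        irqs[irq].pending = true;
}

/* Run the highest-priority pending handler; return its IRQ, or -1 if idle. */
int run_next_irq_thread(void)
{
    int best = -1;

    for (int i = 0; i < NR_IRQS; i++)
        if (irqs[i].pending &&
            (best < 0 || irqs[i].priority > irqs[best].priority))
            best = i;
    if (best >= 0) {
        irqs[best].pending = false;
        irqs[best].handler(best);
    }
    return best;
}
```

With a single shared thread, a slow handler delays every interrupt queued behind it; giving each interrupt its own priority lets the more urgent handler go first regardless of arrival order.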

A kernel events layer

As Linux desktop implementations become more sophisticated, they increasingly need to know about what is going on in the kernel. The desktop code would like to be able to respond properly to events like "disc inserted," "disk full," "processor overheating," "printer on fire," and so on. So far, much of this functionality has been implemented by polling devices and /proc files and looking for changes. That solution is, to say the least, inelegant.

As a way of improving things, Robert Love has posted a patch (since updated) adding a kernel event notification layer. This patch, initially by Arjan van de Ven, uses the netlink mechanism to broadcast events out to interested user-space processes. The intent is for the events to be further redistributed using D-BUS, but other uses are possible.

Within the kernel, events are created with a call to send_kevent():

    int send_kevent(enum kevent type, 
                    const char *object,
                    const char *signal,
                    const char *fmt, ...);

The type argument gives the broad class of the event; current options are KEVENT_GENERAL, KEVENT_STORAGE, KEVENT_POWER, KEVENT_FS, and KEVENT_HOTPLUG. The object is a unique string giving the source of the event; it is derived from the location of the source file in the kernel tree. The signal says what is actually happening, and the rest of the arguments are a printk()-style format string and arguments giving further information. The patch only adds one set of calls, for noting CPU temperature transitions; they look like:

    send_kevent(KEVENT_POWER,
                "/org/kernel/arch/kernel/cpu/temperature", "high",
                "Cpu: %d\n", cpu);

The patch as a whole is not particularly controversial, but there are some concerns about the "object" namespace. Some developers would like to see the mechanism more closely tied into the device model, so that the object as represented here is related to an object in the sysfs hierarchy. Some have asked whether this mechanism should replace the current hotplug interface; that is not the intent, however. There has also been a call for some wrappers to ensure that, for example, device drivers all generate the same sort of event for the same kind of situation.
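
To make the shape of the send_kevent() arguments concrete, here is a user-space sketch which assembles them into a single text message. It is an illustration only: the message layout ("type object signal payload") is invented, since the format actually sent over netlink is not described here, and format_kevent(), enum kevent_sketch, the order of its values, and the type_names table are all hypothetical.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Event classes named in the patch; the order of values is an assumption. */
enum kevent_sketch {
    KEVENT_GENERAL, KEVENT_STORAGE, KEVENT_POWER, KEVENT_FS, KEVENT_HOTPLUG
};

static const char *const type_names[] = {
    "general", "storage", "power", "fs", "hotplug"
};

/* Build "<type> <object> <signal> <payload>" in buf (an invented layout).
 * Returns the message length, or -1 if the type is unknown or the
 * message does not fit in len bytes. */
int format_kevent(char *buf, size_t len, enum kevent_sketch type,
                  const char *object, const char *signal,
                  const char *fmt, ...)
{
    va_list args;
    int head, body;

    assert(buf && object && signal && fmt);
    if ((unsigned)type >= sizeof(type_names) / sizeof(type_names[0]))
        return -1;

    head = snprintf(buf, len, "%s %s %s ", type_names[type], object, signal);
    if (head < 0 || (size_t)head >= len)
        return -1;

    /* The remaining arguments are a printk()-style format and its values. */
    va_start(args, fmt);
    body = vsnprintf(buf + head, len - (size_t)head, fmt, args);
    va_end(args);
    if (body < 0 || (size_t)body >= len - (size_t)head)
        return -1;

    return head + body;
}
```

A call such as format_kevent(buf, sizeof(buf), KEVENT_POWER, "/org/kernel/arch/kernel/cpu/temperature", "high", "Cpu: %d\n", cpu) mirrors the temperature example; a wrapper of the kind some developers have asked for would fix the object and signal strings so that every driver reports a given situation in the same way.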

This is all detail work; chances are that the event code will find its way into the mainline in some form. Then there is the little issue of making the desktop actually respond to these events in a useful way. But that, of course, is just a user-space problem.

Page editor: Forrest Cook

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds