Brief items
The current 2.6 prepatch remains 2.6.8-rc2; no new prepatches have
been released over the last week.
Linus has been committing patches to his BitKeeper repository, however;
they include lots more code annotations and fixups, various networking
fixes, some NX improvements for old binary support, and a number of other
fixes. Things appear to be stabilizing for the real 2.6.8 release.
The current prepatch from Andrew Morton is 2.6.8-rc2-mm1. Recent additions to -mm include
the pmdisk/swsusp merge (covered here last
week), some performance counter tweaks (enabling inheritance of settings
across fork() and exec()), some scheduling domains
cleanup work, various latency reductions, and a large number of other fixes.
The current 2.4 prepatch is 2.4.27-rc3 and has been since
July 3.
Kernel development news
Discussions held at OLS, on the mailing lists, and elsewhere have made it
clear that a certain degree of confusion still exists regarding the new
kernel development process and what has really changed. In an attempt to
clear things up, we'll take one more look at what was decided at this
year's kernel summit.
The old process, in use since the 1.0 kernel release, worked with two major
forks. The even-numbered fork was the "stable" series, managed in a way
which (most of the time) attempted to keep the number of changes to a
minimum. The odd-numbered fork, instead, was the development series, where
anything goes. The idea was that most users would use the stable kernels,
and that those kernels could be expected to be as bug-free as possible.
This mechanism has been made to work, but it has a number of problems which
have been noticed over the years. These include:
- The stable and development trees diverge from each other quickly,
especially since big API changes have tended to be saved for early in
the development series. This divergence makes it hard to port code
between the two trees. As a result, backporting new features into the
stable series is hard, and forward-porting fixes is also a challenge.
2.6.0 came out with a number of bugs which had long been fixed in 2.4.
- The stable tree, after a short while, lacks fixes, features, and
improvements which have been added to the development tree. That code
may well have proved itself stable in the development series, but it
often does not make it into a stable kernel for years. The kernels
that people are told to use can run far behind the state of the art.
- The stable kernels are often very heavily patched by the
distributors. These patches include necessary fixes, backports of
development kernel features, and more. As a result, stock
distribution kernels diverge significantly from the mainline, and from
each other. Distributor kernels sometimes are shipped with early
implementations of features which evolve significantly before
appearing in an official stable kernel, leading to compatibility
problems for users.
The focus on keeping changes out of the stable kernel tree is now seen as
being a bit misdirected. Well-tested patches can be safely merged, most of
the time. Blocking patches, instead, creates an immense "patch pressure"
which leads to divergent kernels and a major destabilizing flood whenever
the door is opened a little.
So how have things changed? The "new" process is really just an
acknowledgment of how things have been done since the 2.6.0 release - or,
perhaps, a little before. It looks like this:
- New patches which appear to be nearing prime-time readiness are
added to Andrew Morton's -mm tree. This addition can be done by
Andrew himself, or by way of a growing number of BitKeeper
repositories which are automatically merged into -mm.
- Each patch lives in -mm and is tested, commented on, refined, etc.
Eventually, if the patch proves to be both useful and stable, it is
forwarded on to Linus for merging into the mainline. If, instead, it
causes problems or does not bring significant benefit, the patch will
eventually be dropped from -mm.
The -mm tree has proved to be a truly novel addition to the development
process. Each patch
in this tree continues to be tracked as an independent contribution; it can
be changed or removed at any time. The ability to drop patches is the real
change; patches merged into the mainline lose their identity and become
difficult to revert. The -mm tree provides a sort of proving ground which
the kernel process has never quite had before. Alan Cox's -ac trees were
similar, but (1) they were less experimental than -mm (distributors
often merged -ac almost directly into their stock kernels), and
(2) -mm does a much better job of tracking each patch independently.
In essence, -mm has become the new kernel development tree. The old
process created a hard fork and was not designed to merge changes back into
the "old" stable tree. -mm is much more dynamic; it exists as a set of
patches to the mainline, and any individual patch can move over to the
mainline at any time. New features get the testing they need, then
graduate to the mainline when they are ready. New developments move into
the stable kernel quickly, the development kernel benefits from all fixes
made to the stable branch, and the whole process moves in a much faster and
smoother way.
More than one observer in Ottawa made this ironic observation: it would
appear that Andrew Morton is now in charge of the development kernel, while
Linus manages the stable kernel. That is not quite how things were
expected to turn out, but it seems to be working. Consider some of the
changes which have been merged since 2.6.0:
- 4K kernel stacks
- NX page protection and ia32e architecture support
- The NUMA API
- Laptop mode
- The lightweight auditing framework
- The CFQ disk I/O scheduler
- Netpoll
- Cryptoloop, snapshot, and mirroring in the device mapper
- Scheduling domains
- The object-based reverse mapping VM
Some of these changes are truly significant, and things have not stopped
there: new patches are going into the kernel at a rate of about 10MB/month. Yet
2.6.7 was, arguably, the most stable 2.6 kernel yet. It contains many of
the latest features, has few performance problems, and the number of bug
reports has been quite small. The new process is yielding some good
results.
Naturally, there are some issues to resolve. One of those is the
deprecation of features, which used to be tied to the timing of the old
process. The new plan, it seems, is to give users a one-year notice,
including a printk() warning in the kernel. The first features to
be removed by this path are likely to be devfs and cryptoloop. There is
also the question of changes which are simply too disruptive to merge
anytime soon. Page clustering, if it is merged, could be one of those.
When such a feature comes along, we may yet see the creation of a 2.7 tree
to host it. Even then, however, 2.7 will track 2.6 as closely as possible,
and it may go away when the feature which drove its existence becomes ready
to go into the mainline.
This change to the development process is significant. It is not
particularly new, however. The actual change happened the better part of a year
ago; it was simply hidden in plain sight. All that has really happened in
Ottawa is that the developers have acknowledged that the process is working
well. One can easily argue, in fact, that the kernel development process
has never functioned better than it does now. So, rather than break such a
successful model, the developers are going to let it run.
Ingo Molnar's voluntary preemption work, described here
two weeks ago, has continued to progress.
Indeed, since Mr. Molnar did not attend the kernel summit or OLS, this
could well have been the fastest-moving kernel project over the last week.
The
2.6.8-rc1-L2 version of the patch,
released on July 27, claims a maximum 100μs latency - almost good
enough, says Ingo, for hard real-time work. One of the methods used may
raise some eyebrows, however.
The core of the voluntary preemption patch stays true to its original
intent: it adds scheduling points in places where the kernel may hold onto
the CPU for overly long periods. As kernel testers report problems, Ingo
has been going in and breaking up the offending bits of code. Ingo has
also added a new knob to control the maximum number of sectors the block
I/O subsystem will try to transfer at once; if block operations get too
big, the IDE completion routines can take too long to perform their
cleanup.
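As a rough illustration of what a scheduling point looks like, here is a user-space sketch in C. It is not code from Ingo's patch: need_resched_flag, scheduling_point(), and sum_pages() are hypothetical names, and sched_yield() merely stands in for the kernel's rescheduling call. The idea is that a long-running loop checks, at a safe spot in each iteration, whether something more important wants the CPU.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's "reschedule needed" indication. */
volatile int need_resched_flag;

/* Counts how often the loop actually gave up the CPU (for illustration). */
unsigned long yields_taken;

/* A voluntary scheduling point: yield only if somebody asked for the CPU. */
void scheduling_point(void)
{
    if (need_resched_flag) {
        need_resched_flag = 0;
        yields_taken++;
        sched_yield();          /* user-space stand-in for rescheduling */
    }
}

/* A long-running operation, broken up with one scheduling point per page. */
unsigned long sum_pages(const unsigned char *buf, size_t npages,
                        size_t page_size)
{
    unsigned long sum = 0;

    assert(buf != NULL);
    for (size_t p = 0; p < npages; p++) {
        for (size_t i = 0; i < page_size; i++)
            sum += buf[p * page_size + i];
        scheduling_point();     /* safe spot: no partial state is held here */
    }
    return sum;
}
```

Checking once per page rather than once per byte keeps the overhead small while still bounding how long the loop can run without offering to yield.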
That change pointed at a larger problem, however: some interrupt handlers
can, despite conventions to the contrary, occupy the processor for a long
time. While an interrupt handler is running, high-priority processes
cannot run. Ingo decided to address this problem head on: he has moved
hardware interrupt handling into process context.
To do this, he had to change the core kernel interrupt dispatcher. That
code now checks to see if the interrupt comes from the timer; if so, it is
handled immediately, in the traditional manner. Otherwise, the IRQ number
is added to a per-CPU list of pending hardware interrupts, and control
returns to the scheduler without having actually serviced the interrupt.
Calling the real interrupt handler falls to the ksoftirqd process
(which has been renamed irqd). Once it is scheduled, it simply
iterates through the list of pending interrupts and calls all of the
registered handlers for each. Due to the change in context, the
pt_regs structure is no longer available to the handler, but,
since almost no interrupt handlers ever use that argument, that will not be
a problem.
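The dispatch logic described above can be modeled in a few lines of user-space C. This is only a sketch under simplifying assumptions: the names (dispatch_irq(), irqd_run_pending(), pending_mask, TIMER_IRQ) are invented for illustration, a bitmask stands in for the per-CPU pending list, and each IRQ has a single handler, whereas the real kernel allows several handlers to share a line.

```c
#include <assert.h>

#define NR_IRQS   16
#define TIMER_IRQ 0     /* assumption: IRQ 0 is the timer */

typedef void (*irq_handler_t)(int irq);

irq_handler_t handlers[NR_IRQS];    /* one handler per IRQ (simplified) */
unsigned int pending_mask;          /* stands in for the per-CPU pending list */
int irq_calls[NR_IRQS];             /* bookkeeping for the sample handler */

/* Sample handler: just counts how many times it was invoked. */
void count_irq(int irq)
{
    irq_calls[irq]++;
}

void register_handler(int irq, irq_handler_t fn)
{
    assert(irq >= 0 && irq < NR_IRQS);
    handlers[irq] = fn;
}

/* Low-level entry point: only the timer is serviced immediately. */
void dispatch_irq(int irq)
{
    assert(irq >= 0 && irq < NR_IRQS);
    if (irq == TIMER_IRQ) {
        if (handlers[irq])
            handlers[irq](irq);
        return;
    }
    pending_mask |= 1u << irq;      /* defer everything else to irqd */
}

/* Body of the irqd thread: run deferred handlers, return how many ran. */
int irqd_run_pending(void)
{
    int serviced = 0;

    for (int irq = 0; irq < NR_IRQS; irq++) {
        if (!(pending_mask & (1u << irq)))
            continue;
        pending_mask &= ~(1u << irq);
        if (handlers[irq])
            handlers[irq](irq);     /* note: no pt_regs is passed along */
        serviced++;
    }
    return serviced;
}
```

In this model, a call to dispatch_irq() for a non-timer interrupt does nothing but set a bit; the handler runs only when irqd_run_pending() is next called, which is the point at which the scheduler gets its say.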
The irqd process runs at a high priority, but a high-priority,
real-time process can still preempt it. While it is dispatching an
interrupt to its handler(s), irqd goes into a simulated interrupt
mode and cannot be preempted. It drops out of that mode between
interrupts, however, making scheduling possible. It is also possible for
an interrupt handler to enable preemption at a given point with a call to
cond_resched_hardirq() (or one of its variants). Without such a
call, hardware interrupt handlers will not be put to sleep.
There are no such calls in drivers in Ingo's current patch - at least, not
directly. Ingo has also added a new version of
end_that_request_first() (the function used to indicate partial
completion of a block I/O request) which allows preemption. The IDE
completion handler calls the new version, which makes it preemptable - even
though it is an interrupt handler.
Ingo claims some very good results from this work. The software latencies
are now all very small. It would be interesting to see whether the
redirecting of hardware interrupts has any effect on interrupt response
latency, however. It remains to be seen whether a change of this magnitude
will be accepted - especially since (involuntary) kernel preemption is
supposed to be the real solution to latency problems. Building trust in
involuntary preemption is an ongoing process, while the voluntary approach
appears to have solved the problem now. In the end, that is likely to
count for something.
(Coincidentally, Scott Wood has posted a
different patch moving interrupt handlers into process context. His
patch creates a separate thread for each interrupt, which allows the
priority of each interrupt handler to be set independently. There is also
an SA_NOTHREAD flag to request_irq() which allows a driver
to request the old, IRQ-context mode. Ingo has said that he is likely to
adopt the multi-thread approach for his patch as well.)
As Linux desktop implementations become more sophisticated, they
increasingly need to know about what is going on in the kernel. The
desktop code would like to be able to respond properly to events like
"disc inserted," "disk full," "processor overheating," "printer on fire,"
and so on. So far, much of this functionality has been implemented by
polling devices and
/proc files and looking for changes. That
solution is, to say the least, inelegant.
As a way of improving things, Robert Love has posted a patch (since updated) adding a kernel event notification mechanism.
This patch, initially by Arjan van de Ven, uses the netlink mechanism to
broadcast events out to interested user-space processes. The intent is for
the events to be further redistributed using D-BUS, but other uses are possible.
Within the kernel, events are created with a call to
send_kevent():
    int send_kevent(enum kevent type,
                    const char *object,
                    const char *signal,
                    const char *fmt, ...);
The type argument gives the broad class of the event; current
options are KEVENT_GENERAL, KEVENT_STORAGE,
KEVENT_POWER, KEVENT_FS, and KEVENT_HOTPLUG.
The object is a unique string giving the source of the event; it
is derived from the location of the source file in the kernel tree. The
signal says what is actually happening, and the rest of the
arguments are a printk()-style format string and arguments giving
further information. The patch only adds one set of calls, for noting CPU
temperature transitions; they look like:
    send_kevent(KEVENT_GENERAL,
                "/org/kernel/arch/kernel/cpu/temperature", "high",
                "Cpu: %d\n", cpu);
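To make the calling convention concrete, here is a user-space mock of the interface. Only the prototype and the temperature call come from the patch as described; the enum ordering, the last_event buffer, the space-separated message layout, and report_cpu_hot() are assumptions made for this sketch, and the real code hands the message to netlink rather than to a buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Event classes named in the patch; the numeric ordering is an assumption. */
enum kevent {
    KEVENT_GENERAL,
    KEVENT_STORAGE,
    KEVENT_POWER,
    KEVENT_FS,
    KEVENT_HOTPLUG
};

/* Mock transport: the most recent event, where netlink would normally be. */
char last_event[256];

/* Mock of send_kevent(): returns 0 on success, -1 if the event was cut off. */
int send_kevent(enum kevent type, const char *object,
                const char *signal, const char *fmt, ...)
{
    char payload[128];
    va_list args;
    int n;

    assert(object != NULL && signal != NULL && fmt != NULL);
    memset(last_event, 0, sizeof(last_event));

    /* Expand the printk()-style format string and its arguments. */
    va_start(args, fmt);
    vsnprintf(payload, sizeof(payload), fmt, args);
    va_end(args);

    /* Hypothetical layout: "<type> <object> <signal> <payload>". */
    n = snprintf(last_event, sizeof(last_event), "%d %s %s %s",
                 (int)type, object, signal, payload);
    if (n < 0 || (size_t)n >= sizeof(last_event))
        return -1;
    return 0;
}

/* The one call site added by the patch, wrapped in a helper for the demo. */
int report_cpu_hot(int cpu)
{
    return send_kevent(KEVENT_GENERAL,
                       "/org/kernel/arch/kernel/cpu/temperature", "high",
                       "Cpu: %d\n", cpu);
}
```

A wrapper along the lines of report_cpu_hot() is also roughly what the requests for helper functions amount to: one place that fixes the object and signal strings, so that every caller reports the same situation in the same way.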
The patch as a whole is not particularly controversial, but there are some
concerns about the "object" namespace. Some developers would like to see
the mechanism more closely tied into the device model, so that the object
as represented here is related to an object in the sysfs hierarchy. Some
have asked whether this mechanism should replace the current hotplug
interface; that is not the intent, however. There has also been a call for
some wrappers to ensure that, for example, device drivers all generate the
same sort of event for the same kind of situation.
This is all detail work; chances are that the event code will find its way
into the mainline in some form. Then there is the little issue of making
the desktop actually respond to these events in a useful way. But that, of
course, is just a user-space problem.
Page editor: Forrest Cook