Brief items
The current development kernel is 2.6.0-test2, which was
released by Linus on July 27.
It contains a lot of fixes, of course, including a bunch of forward-ported
2.4 patches, numerous architecture updates, some IDE fixes, an option to
remove I/O schedulers from the kernel entirely, and a new local_t
type for CPU-local data. See the long-format changelog for the details.
As of this writing, there are no patches beyond -test2 in Linus's
BitKeeper repository.
The current stable kernel is 2.4.21. Marcelo has been busy with the
prepatches, however; 2.4.22-pre8 was
released on July 24, and 2.4.22-pre9 on
July 29. Both patches limit themselves to fixes.
Kernel development news
While the performance improvements in the 2.6 kernel have impressed and
pleased many users, there has been a constant level of complaining about
the scheduler. In particular, many users are unhappy with the interactive
feel of the 2.6 kernel; reports of jerky response and skipping audio are
common. Some have gone as far as to compare the 2.6-test scheduler with
the 2.4 virtual memory subsystem at the same point in the development
cycle. There is some concern that scheduling could be one of the (few)
embarrassments in the upcoming 2.6.0 release. Those worries are probably
overdone, but the point is that 2.6.0-test scheduling still needs some
work.
The good news is that said work is being done. Con Kolivas has been
posting a set of interactivity patches for about a month now. Those
wanting to try them out can find them in a recent -mm kernel;
patches against the Linus kernels can be found on Con's web
site. His most recent patch is 011int.
Con's work follows a familiar theme: improve interactive performance by
giving a priority boost to interactive processes. All you have to do is
figure out which processes are the interactive ones. Of course, that is
rather easier said than done. Con's interactivity patches have been
through several iterations in an attempt to find the best way to identify
interactive processes and the proper amount of bonus to give them. This
patch series may be converging on a result; several testers have reported
good results.
The core idea (already part of the 2.6.0-test scheduler) is that an
interactive process is one which sleeps much of
the time. With Con's patches, any process which sleeps for at least
one clock tick gets a priority bonus when it wakes up; processes which have
done enough sleeping recently to be explicitly marked as "interactive" get
a bigger bonus than others. Processes
which run without sleeping for their entire time slice lose a point. In
this way, CPU-hog processes sink to the lowest priorities, while a process
reading from a terminal (or an audio stream) will get quick access to the
processor. Additionally, processes which have maxed out their sleep bonus
and are seen as truly interactive can hang out in the run queue for a while
even after their time slice expires.
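The heuristic described above can be illustrated with a toy model. This is only a sketch and not the scheduler's code: the names (Task, MAX_BONUS, INTERACTIVE_THRESHOLD) and the sizes of the bonuses are assumptions chosen for illustration.

```python
from dataclasses import dataclass

MAX_BONUS = 10               # hypothetical cap on the sleep bonus
INTERACTIVE_THRESHOLD = 7    # hypothetical level at which a task counts as "interactive"


@dataclass
class Task:
    name: str
    bonus: int = 0           # a higher bonus means a better dynamic priority

    @property
    def interactive(self) -> bool:
        return self.bonus >= INTERACTIVE_THRESHOLD

    def woke_after_sleep(self, ticks_slept: int) -> None:
        """Reward a task that slept for at least one clock tick."""
        if ticks_slept < 1:
            return
        # Tasks already marked interactive get a bigger boost (assumed sizes).
        step = 2 if self.interactive else 1
        self.bonus = min(MAX_BONUS, self.bonus + step)

    def used_full_timeslice(self) -> None:
        """Penalize a task that ran through its whole slice without sleeping."""
        self.bonus = max(0, self.bonus - 1)


def simulate():
    """Run eight rounds with one sleepy task and one CPU hog."""
    editor = Task("editor")
    hog = Task("cpu-hog", bonus=5)
    for _ in range(8):
        editor.woke_after_sleep(ticks_slept=3)
        hog.used_full_timeslice()
    return editor, hog
```

In this model the sleepy task climbs past the threshold and starts earning the larger bonus, while the CPU hog sinks to zero and stays there.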
Life is not always so simple, however. Early versions of this patch tended
to make life hard for newborn processes; it took them quite a while to
build up enough of an interactivity bonus to be able to respond quickly on
a loaded system. So things had to be tweaked to let new processes find
their natural level quickly. There is also the issue of processes that
sleep for a long time, then wake up to do some serious cranking. So
processes that sleep for longer than one second lose their interactivity
bonus, and end up at an "idle" level just below the interactive level.
Work has also been required to balance priorities properly when an
interactive process forks.
More recently, Ingo Molnar has also started looking at the interactivity
problem; his sched-2.6.0-test1-G6 patch
takes a different approach. Ingo starts by changing the scheduler to use
nanosecond resolution in its timekeeping; his claim is that, by working
with high-resolution time, audio skipping problems can be fixed. Then, the
patch splits up time slices so that processes running at the same priority
switch off with each other much more often, ensuring that none of them have
to wait too long before getting some processor time. Finally, Ingo's patch
extends the sleep bonus to include time that the process sits in the run
queue, but does not actually get into the processor.
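The effect of splitting time slices can be seen in a small simulation. The sketch below is an assumption-laden illustration, not Ingo's code: the function name, the millisecond values, and the simple round-robin queue are all invented for the example. It measures how long each equal-priority task waits before it first gets the processor.

```python
from collections import deque


def first_run_delays(tasks, timeslice_ms, granularity_ms):
    """Round-robin equal-priority tasks, rotating every granularity_ms.

    Returns a dict mapping each task to the time it waited before
    first getting the CPU.
    """
    remaining = {t: timeslice_ms for t in tasks}
    queue = deque(tasks)
    clock = 0
    delays = {}
    while queue:
        task = queue.popleft()
        delays.setdefault(task, clock)       # record only the first wait
        run = min(granularity_ms, remaining[task])
        clock += run
        remaining[task] -= run
        if remaining[task] > 0:
            queue.append(task)               # slice not used up: back of the line
    return delays
```

With three tasks and 100ms slices run to completion, the third task waits 200ms for its first turn; rotating every 25ms cuts that wait to 50ms, even though each task still receives the same total slice.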
The two sets of patches are mostly orthogonal to each other; while it
remains hard to apply both Con's and Ingo's patches to a single system, the
two really are addressing different issues. Recent versions of Con's
patches, however, also include some of Ingo's work (almost everything
except the nanosecond resolution). In the end, code from both patches is
likely to find its way into the kernel.
As a postscript, it's worth taking a look at this post from Daniel
Phillips, where he states that the wrong approach is being taken for the
audio skipping problem.
Audio playback, says Daniel, is not an interactive task - it is a realtime
task. What is really needed for the audio case is a bounded-latency soft
realtime scheduler, not an endless series of interactive scheduler tweaks.
The topic of module unload races - where the kernel can end up calling into
a module which has been removed - comes back occasionally. Much work has
been done in 2.5 to reduce and eliminate these races. Part of that effort was
moving module reference counting outside of the modules themselves. The
result was a safer scheme, but one which imposes new requirements on kernel
code which calls into modules. In some kernel subsystems (networking, for example), the
maintainers have decided that there is no need to worry about reference
counting for modules; they simply ignore it.
Enter Rusty Russell. Since the reference counts are seen to be a pain, and
some code isn't using them at all, why not simply get rid of them? He has
submitted a patch which does exactly that.
Of course, the issue of how to safely remove modules remains. Without
reference counts, how does the kernel know when it can actually get rid of
a particular module? With Rusty's patch, a different approach is taken:
modules are never actually removed. If an administrator invokes
rmmod, the module's cleanup function will be called and all kernel
knowledge of the module will go away - but the module code itself will
remain in the kernel. The patch thus sacrifices some system memory on
every unload as a way of avoiding unload races.
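The tradeoff is easy to model. The following toy sketch assumes a hypothetical ModuleTable class (nothing here corresponds to real kernel interfaces): unloading runs the cleanup function and drops the kernel's knowledge of the module, but the memory holding its code is never returned.

```python
class ModuleTable:
    """Toy model of module loading where unloaded code is never freed."""

    def __init__(self):
        self.visible = {}     # modules the kernel still knows about
        self.resident = []    # sizes (KB) of all code still held in memory

    def insmod(self, name, size_kb, cleanup):
        self.visible[name] = cleanup
        self.resident.append(size_kb)

    def rmmod(self, name):
        cleanup = self.visible.pop(name)   # the kernel forgets the module
        cleanup()                          # and runs its cleanup function
        # Deliberately nothing is removed from self.resident: a stale call
        # into the module would still find code there, at the cost of memory.

    def resident_kb(self):
        return sum(self.resident)
```

Loading and unloading a 40KB module three times leaves no visible modules but 120KB of resident code, which is the leak that worries people writing installers.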
Some developers liked this patch, others didn't. For a kernel hacker who
is debugging a module, a little lost memory for each load/unload cycle is
probably not a big problem; the system will likely be rebooted soon
anyway. The patch does present a bigger problem for Linux installers,
however; many of these do hardware detection by loading almost every module
available and seeing which ones actually find something. On a "small"
system (one with, say, 64MB of memory), it is possible that some
distribution installers would simply run out of memory and die.
Rusty proposed adding a special rmmod option which would clean up
memory left behind by deleted modules (while also marking the kernel
tainted). For now, however, all of this has been made irrelevant by Linus,
who decreed: "First off - we're not changing fundamental module stuff any
more." This statement drew an amused response from Rusty ("OK. Who are you
and what have you done with the real Linus?"), but the general sigh of
relief from most kernel hackers could be heard worldwide.
It seems that Linus is truly holding the line and keeping out potentially
disruptive changes this time around.
The final part of the 2.3 development series featured a strong campaign to
get the ReiserFS filesystem merged. That campaign was successful; ReiserFS
was added in 2.4.1. Now it appears that history may repeat itself with the
2.6 kernel. Hans Reiser has posted a note
asking that the soon-to-be-posted Reiser4 patch be merged into 2.6.0-test.
Reiser4 is not an updated version of ReiserFS; it is an entirely new
filesystem. According to the posted
benchmarks, Reiser4 outperforms ReiserFS and ext3 on several fronts.
According to Hans, the performance of Reiser4 is now good enough to justify
including it in 2.6-test.
The interest in Reiser4 is not limited to performance, however. Reiser4
is presented as a fully atomic filesystem: every operation either executes
fully or not at all. It thus offers the same sort of crash resistance
found in journaling filesystems, but with a couple of differences. One is
that, it is claimed, the "wandering log" technique used in Reiser4 offers
greater speed since, unlike other journaling schemes, it does not need to
write data twice. The other is that the "fully atomic" nature of the
filesystem can extend beyond individual operations. Reiser4, in other
words, can provide actual transactions.
A typical journaling filesystem works by writing all of the blocks to be
changed in a given operation to a special journal file, followed by a
"commit record." Once the operation is committed, the blocks can be copied
from the journal to their real destination on the disk. If the system dies
before the commit record is written, the operation is simply discarded and
the filesystem is unchanged. If, instead, a fully committed operation is
found in the journal, it can be replayed. With a scheme like this, an
operation may be lost in a crash, but the filesystem itself will not be
corrupted.
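The commit-record scheme just described can be sketched in a few lines. This is a simplified model under stated assumptions: the journal is a Python list, the disk is a dict of block numbers, and the record layout and function names are invented for the example rather than taken from any real filesystem.

```python
COMMIT = "commit"


def journal_write(journal, txn_id, blocks, crash_before_commit=False):
    """Append an operation's changed blocks to the journal, then its commit record."""
    for block_no, data in blocks.items():
        journal.append(("block", txn_id, block_no, data))
    if not crash_before_commit:
        journal.append((COMMIT, txn_id, None, None))


def replay(journal, disk):
    """After a crash, copy blocks to their real location for committed operations only."""
    committed = {rec[1] for rec in journal if rec[0] == COMMIT}
    for kind, txn_id, block_no, data in journal:
        if kind == "block" and txn_id in committed:
            disk[block_no] = data
        # Blocks from uncommitted operations are simply ignored.
    return disk
```

An operation whose commit record never reached the journal leaves the disk untouched on replay; the operation is lost, but nothing is half-written. Note that every changed block is written twice here, once to the journal and once to its final home.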
The Reiser4 wandering log technique works a little differently. It does
not overwrite blocks in the filesystem; instead, blocks to be changed are
relocated and the data is written in the new spot. The block pointers in
the filesystem are changed in an (also relocated) directory block. This
process continues up the filesystem tree until, with a single write
pointing to the new root block, the whole operation is committed. The
elimination of the need to write data separately to a journal file can
increase performance, but this technique also has the potential to fragment
files across the disk, hurting read performance. For that reason, Reiser4
allows for plugin modules which can look at operations and opt for a more
normal journaling scheme when it makes sense. There will also be a
"repacker" program which will go through occasionally and rearrange disks
for better read performance.
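The relocate-and-repoint idea can be shown with a toy tree. The sketch below is not Reiser4's on-disk format; the Volume class, its dict-based blocks, and the method names are assumptions made for illustration. Interior blocks map names to block numbers, leaves hold data, and nothing is ever overwritten in place.

```python
class Volume:
    """Toy copy-on-write block tree committed by a single root-pointer update."""

    def __init__(self):
        self.blocks = {}       # block number -> contents, never rewritten
        self.next_free = 0
        self.root = None       # the one pointer whose update commits a change

    def alloc(self, contents):
        number = self.next_free
        self.next_free += 1
        self.blocks[number] = contents
        return number

    def read(self, path, root=None):
        node = self.blocks[self.root if root is None else root]
        for name in path:
            node = self.blocks[node[name]]
        return node

    def prepare_update(self, path, data):
        """Write relocated copies along path and return the new root's number.

        The old tree stays fully intact until commit() is called.
        """
        return self._relocate(self.root, list(path), data)

    def _relocate(self, block_no, path, data):
        if not path:
            return self.alloc(data)              # new data goes to a new spot
        node = dict(self.blocks[block_no])       # copy; leave the original alone
        node[path[0]] = self._relocate(node[path[0]], path[1:], data)
        return self.alloc(node)                  # relocated parent with new pointer

    def commit(self, new_root):
        self.root = new_root                     # the single committing write


def build_example():
    vol = Volume()
    leaf = vol.alloc("v1")
    directory = vol.alloc({"file": leaf})
    vol.root = vol.alloc({"dir": directory})
    return vol
```

Until commit() runs, readers following the old root still see the old data, so a crash at any earlier point leaves the previous tree in place. The data itself is written only once, but successive updates scatter blocks across the volume, which is where the fragmentation concern comes from.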
The ability to perform multi-operation, multi-file transactions is what
will make Reiser4 truly unique, however. A transactional capability will
allow applications to perform complicated operations without the need to
resort to tricks with fsync() and file renaming, and without the
need to use a separate database manager. Of course, there are a few
residual issues, like the fact that the standard Unix system calls make no
provision for starting, committing, and rolling back transactions. So a
new system call interface will be required. The Reiser4 developers are
working on this interface, but have not yet posted it for wide review.
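For comparison, the "tricks with fsync() and file renaming" that applications use today typically look something like the sketch below. The function name atomic_replace is made up for this example; the calls it relies on (tempfile.mkstemp, os.fsync, os.replace) are standard Python. It handles only a single file, which is exactly the limitation that multi-file transactions would remove.

```python
import os
import tempfile


def atomic_replace(path, data: bytes):
    """Replace path's contents so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # so create it in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())      # force the new data to disk first
        os.replace(tmp_path, path)      # then swap names in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)         # do not leave a stray temporary file
        raise
```

A stricter version would also fsync the containing directory after the rename; that step is omitted here for brevity.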
Linus has not committed himself with regard to merging Reiser4 into 2.6.
It's worth noting that, when ReiserFS was merged, it had been stable and
widely used for some time. That is not the case for Reiser4, which is still in
an early stage. Chances are that Reiser4 will have a harder time
getting into the kernel than ReiserFS did. (For more information on
Reiser4, see this document on transactions, and this one on
wandering logs, dancing trees, and other journaling topics.)
Page editor: Jonathan Corbet