The current 2.6 release is 2.6.2-rc2, which Linus announced
on January 25. Changes since -rc1 include a number of architecture updates, an
IrDA update, and various fixes. See the changelog for the details.
The latest patch set from Andrew Morton is 2.6.2-rc2-mm1. Changes in recent -mm kernels
include compilation fixes for gcc 3.5, more scheduler tweaks, a new
"_relaxed" API for unordered I/O memory accesses, some code for finding
dangerous sleep_on() calls (see below), x86_64 kgdb support, and
many other fixes.
The current 2.4 kernel is 2.4.24; Marcelo released 2.4.25-pre7, which includes a set of architecture and
filesystem updates, on January 23. Marcelo also notes that 2.4
development will not freeze before 2.4.27; there is already a set of
important patches that will need to go into 2.4.26.
Kernel development news
is a project to make
a kernel which can run cooperatively in kernel mode with other operating systems. The
goal, in particular, is to run Linux as an application under Windows XP.
That goal has now been achieved for "some common hardware configurations."
we looked at
implementing device drivers in user space. Drivers are not the only kernel
functionality which can be moved across the divide, however; it is also
possible to implement filesystems with user-space code. Linux has a long
tradition of user-space filesystems, actually; NFS was implemented that way
for quite some time. Even so, user-space filesystems are not widely used,
for a number of obvious reasons (performance, security, ...). But there
are situations where a user-space filesystem can be a nice thing to have.
For those situations, there is a project called FUSE. Its associated SourceForge page is not
particularly enlightening; one really has to look at the project's code to
understand what FUSE has to offer.
Since the second FUSE 1.1 release candidate has just been announced, this seems like a good time for such
a look.
FUSE is a three-part system. The first of those parts is a kernel module
which hooks into the VFS code and looks like a filesystem module. It also
implements a special-purpose device which can be opened by a user-space
process. It then spends its time accepting filesystem requests,
translating them into its own protocol, and sending them out via the device
interface. Responses to requests come back from user space via the FUSE
device, and are translated back into the form expected by the kernel.
In user space, FUSE implements a library which manages communications with
the kernel module. It accepts filesystem requests from the FUSE device and
translates them into a set of function calls which look similar (but not
identical) to the kernel's VFS interface. These functions have names like
open(), read(), write(), and rename().
Finally, there is a user-supplied component which actually implements the
filesystem of interest. It fills a fuse_operations structure with
pointers to its functions which implement the required operations in
whatever way makes sense. This interface is not well documented, but the example filesystem provided with FUSE
(which implements a simple sort of loopback filesystem) is reasonably easy
to follow.
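The general shape of such a user-supplied component can be sketched in a few lines of C. What follows is not the FUSE interface: demo_fs_operations, its two members, and dispatch_read() are hypothetical stand-ins invented for illustration, and the real fuse_operations structure has different members and signatures. The sketch only shows the idea of filling a table with function pointers and letting a library-like dispatcher call through it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a FUSE-style operations table. The real
 * fuse_operations structure has different members and signatures. */
struct demo_fs_operations {
    int (*open)(const char *path);
    int (*read)(const char *path, char *buf, size_t size, long offset);
};

/* The entire "filesystem": one read-only file called /hello. */
static const char demo_contents[] = "hello from user space\n";

static int demo_open(const char *path)
{
    return strcmp(path, "/hello") == 0 ? 0 : -1;
}

static int demo_read(const char *path, char *buf, size_t size, long offset)
{
    size_t len = sizeof(demo_contents) - 1;

    assert(buf != NULL);
    if (strcmp(path, "/hello") != 0 || offset < 0)
        return -1;
    if ((size_t)offset >= len)
        return 0;                       /* reading past the end: EOF */
    if (size > len - (size_t)offset)
        size = len - (size_t)offset;    /* clamp to what is left */
    memcpy(buf, demo_contents + offset, size);
    return (int)size;
}

/* The user-supplied component fills in the table... */
static const struct demo_fs_operations demo_ops = {
    .open = demo_open,
    .read = demo_read,
};

/* ...and a library-like dispatcher calls through it. A missing
 * operation or an unknown path is reported as an error (-1). */
static int dispatch_read(const struct demo_fs_operations *ops,
                         const char *path, char *buf, size_t size,
                         long offset)
{
    if (ops->open == NULL || ops->read == NULL)
        return -1;
    if (ops->open(path) != 0)
        return -1;
    return ops->read(path, buf, size, offset);
}
```

A caller would pass &demo_ops, a path, and a buffer to dispatch_read(); in the real system, the library would make the equivalent calls in response to requests arriving from the kernel module.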
An old filesystem module (AVFS) uses FUSE to make filesystems out of tar
and zip files, but one could imagine any number of other possibilities. It
would not be that hard to make filesystems which mirror a web site (in
read-only mode, at least), provide access to an object database, or provide
a file-per-user view of the password file, for example. FUSE could be an
ideal platform for experimenters who want to take the "everything is a
file" idea to its limit.
One of the many goals for the 2.5 development series was the removal of the
sleep_on() function (and its variants). The purpose of
sleep_on() is to cause a process to block until some condition
comes true; unfortunately, it is almost impossible to use safely.
Almost every call to sleep_on() sits inside a while loop which tests
the condition and goes to sleep for as long as it remains false.
The problem is that the situation can change between the test (in the
while loop) and when the process actually goes to sleep. If the
wakeup event happens between the two, the process will miss it and may
sleep forevermore. Given that 2.6 was intended to be a more responsive
kernel than its predecessors, this behavior is considered undesirable. The
only way to avoid it, however, is to hold the Big Kernel Lock (BKL) in the code
which calls sleep_on() - and the code which performs the wakeup.
Since elimination of the BKL was also on the to-do list, there is little
enthusiasm for fixing sleep_on() race conditions that way.
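The race can be replayed deterministically with a toy, single-threaded model in user-space C. None of the names below are kernel interfaces; toy_waiter and its helpers are hypothetical, and the "wakeup" is just a function call placed at a chosen point in the sequence. The sketch assumes, as described above, that a wakeup delivered while nobody is asleep is simply lost.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not kernel API. */
struct toy_waiter {
    bool condition;   /* what the sleeper is waiting for */
    bool asleep;      /* true once the sleeper has blocked */
};

/* Waker: make the condition true and wake the sleeper if it is
 * asleep. If nobody is asleep yet, the wakeup has no effect. */
static void toy_wake_up(struct toy_waiter *w)
{
    w->condition = true;
    w->asleep = false;
}

/* Sleeper, step 1: the test in the while loop. */
static bool toy_test(const struct toy_waiter *w)
{
    return w->condition;
}

/* Sleeper, step 2: the sleep_on() call itself. */
static void toy_sleep_on(struct toy_waiter *w)
{
    w->asleep = true;
}

/* Replay one interleaving. The wakeup event fires exactly once:
 * either in the window between test and sleep, or after the sleeper
 * has blocked. Returns true if the sleeper is left asleep forever. */
static bool toy_racy_sleeper_gets_stuck(bool wakeup_in_window)
{
    struct toy_waiter w = { false, false };
    bool must_sleep = !toy_test(&w);    /* while (!condition) ... */

    assert(must_sleep);                 /* condition starts out false */
    if (wakeup_in_window)
        toy_wake_up(&w);                /* event arrives too early */
    if (must_sleep)
        toy_sleep_on(&w);               /* ... sleep_on() */
    if (!wakeup_in_window)
        toy_wake_up(&w);                /* event arrives after sleeping */
    return w.asleep;
}
```

Calling toy_racy_sleeper_gets_stuck(true) leaves the sleeper blocked even though the condition is true, while toy_racy_sleeper_gets_stuck(false) does not; the only difference is where the wakeup lands.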
The 2.4 kernel provided a couple of safer ways to sleep: the
wait_event() macro or a full "manual sleep" calling
schedule() directly (though the latter can be hard to do
correctly). In 2.5, the prepare_to_wait() function was added as
an easier (and better performing) way of doing manual sleeps. Even so, the
2.6.2-rc2 kernel still has over 400 calls to the various forms of
sleep_on(). Clearly, the goal of getting rid of that function was
not met.
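What makes the safer approaches safe is ordering: the process puts itself on the wait queue and marks itself as sleeping before it tests the condition, so a wakeup which arrives after the test still finds it. The following toy model in user-space C sketches that ordering. The toy_task structure and its helpers are hypothetical and are not the kernel's implementation; the model assumes that a wakeup makes a queued task runnable and that the scheduler will not block a runnable task.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not kernel API. */
struct toy_task {
    bool condition;   /* what the sleeper is waiting for */
    bool on_queue;    /* registered on the wait queue */
    bool runnable;    /* false means "wants to sleep" */
    bool blocked;     /* true once actually descheduled */
};

/* Step 1: join the wait queue and mark the task as sleeping,
 * before looking at the condition. */
static void toy_prepare(struct toy_task *t)
{
    t->on_queue = true;
    t->runnable = false;
}

/* Waker: set the condition and make any queued task runnable. */
static void toy_wake_up(struct toy_task *t)
{
    t->condition = true;
    if (t->on_queue) {
        t->runnable = true;
        t->blocked = false;
    }
}

/* Step 3: give up the processor, but only if still marked sleeping. */
static void toy_schedule(struct toy_task *t)
{
    if (!t->runnable)
        t->blocked = true;
}

/* Step 4: leave the wait queue. */
static void toy_finish(struct toy_task *t)
{
    assert(t->on_queue);
    t->on_queue = false;
    t->runnable = true;
}

/* Replay one interleaving; the wakeup fires exactly once. Returns
 * true if the sleeper is left blocked forever. */
static bool toy_safe_sleeper_gets_stuck(bool wakeup_in_window)
{
    struct toy_task t = { false, false, true, false };
    bool stuck;

    toy_prepare(&t);
    if (!t.condition) {                 /* step 2: test the condition */
        if (wakeup_in_window)
            toy_wake_up(&t);            /* lands between test and sleep */
        toy_schedule(&t);
    }
    if (!wakeup_in_window)
        toy_wake_up(&t);                /* lands after the task blocked */
    stuck = t.blocked;
    toy_finish(&t);
    return stuck;
}
```

In this model the sleeper is never left blocked, wherever the wakeup lands: a wakeup in the window makes the task runnable again, so the subsequent call to toy_schedule() returns without blocking.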
At this point, many people will have concluded that the effort to remove
sleep_on() has been put on hold until 2.7 opens up. It seems,
however, that most users of sleep_on() may yet get fixed in 2.6.
In response to some discussion on the topic, Al Viro stated:
We need to remove racy uses anyway - that can't wait for 2.7. And
I really wonder if there will be anything left after that - right
now only reiserfs uses look like something that might be not
He also noted that any use of sleep_on() within device drivers is
Andrew Morton took the next step in 2.6.2-rc1-mm2; that kernel includes a patch
which dumps out a bunch of debugging information whenever
sleep_on() is called without the BKL held. That code has already
turned up a few bad calls which have been duly reported to the kernel
list. Fixes for those calls have been somewhat slower in coming. They
will likely arrive, however, and as Al speculated, by the time all the bad
calls are fixed there may not be a whole lot left. sleep_on()
will undoubtedly exist when the 2.7.0 kernel is released, but there may be
very few callers of it by then.
Increasingly, the kernel uses reference counts to know when data structures
are no longer needed and can be reclaimed. This reference counting tends
to be managed by the kobject type, though
other mechanisms are used as well. When properly used, this mechanism
Interesting issues can come up, however, when reference-counted objects are
maintained by code in loadable modules. In many situations, the module
cannot be unloaded until all objects it has created have seen their
reference counts go to zero and have been returned to the system.
Otherwise, the system can be left with objects containing invalid references
to module code which no longer exists. Bad things usually result from that.
Alan Stern recently ran into this sort of situation; his module registers
various structures with the device model, and must be sure not to allow
itself to be unloaded until those structures have been released. To that
end, he wrote a patch adding two functions
(including platform_device_unregister_wait()) which unregister those
structures and explicitly wait until they have been released. This patch
did not get very far, however; it was quickly pointed out that, with this
code, it is relatively easy to deadlock the kernel. If the process trying
to remove the module also has an open file descriptor to one of that
module's sysfs entries, everything comes to a halt. The suggested solution,
instead, is to simply not allow the module to be unloaded if it still has
unreclaimed objects outstanding.
That approach is taken in some other contexts. The cdev structure
used to represent char devices uses a kobject for its reference count. The
cdev_get() function does more than just increment the count in the
kobject, however; it also increments the reference count for the module
which drives that device. If any cdev structure owned by a module
has references, the module, too, will have a non-zero reference count and
will not be unloadable.
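The idea of pinning the owning module along with the object can be sketched as a small user-space model in C. The toy_module and toy_cdev structures and their helpers are hypothetical; this is not the kernel's cdev code, only an illustration of a "get" operation which bumps two counts at once and an unload operation which refuses to proceed while references remain.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not kernel API. */
struct toy_module {
    int refcount;     /* references pinning the module in memory */
    bool loaded;
};

struct toy_cdev {
    int refcount;                /* references to this object */
    struct toy_module *owner;    /* module whose code drives it */
};

/* Take a reference to the device and, with it, one to the module. */
static void toy_cdev_get(struct toy_cdev *c)
{
    c->owner->refcount++;
    c->refcount++;
}

/* Drop both references again. */
static void toy_cdev_put(struct toy_cdev *c)
{
    assert(c->refcount > 0 && c->owner->refcount > 0);
    c->refcount--;
    c->owner->refcount--;
}

/* Unloading is refused, rather than waited for, while the module
 * is still pinned. Returns true if the module was unloaded. */
static bool toy_module_unload(struct toy_module *m)
{
    if (m->refcount > 0)
        return false;
    m->loaded = false;
    return true;
}
```

As long as somebody holds a reference obtained through toy_cdev_get(), toy_module_unload() returns false; once the last reference is dropped with toy_cdev_put(), the unload succeeds. Nothing ever blocks, so the deadlock described above cannot arise.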
Another approach has been taken in the network subsystem. The
net_device structure represents a network device; its rules say
that it must be allocated dynamically, with alloc_netdev(). When
the network driver is done with the structure, it calls
free_netdev() to get rid of it. The net_device structure
has its own reference count, but it is not tied to the underlying module's
reference count. Instead, the networking system guarantees that, once
free_netdev() has been called, it will not call into the module
again for that device. The release function for the net_device
structure, which returns its memory to the system, lives in the networking
code, rather than in any loadable module. As a result, the module can be
removed even while some of its net_device structures continue to
exist, and all will be well. Those structures have been detached from the
module which created them, and will be freed by core kernel code.
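A rough user-space model of this arrangement, again in C, might look like the following. The toy_netdev structure and the toy_ functions are hypothetical and greatly simplified; they are not the networking code. The point of the sketch is that the function which finally frees the object belongs to the "core" and not to the driver, so the object can outlive the driver's interest in it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model only: these names are hypothetical, not kernel API. */
static int toy_live_netdevs;   /* objects not yet returned to the system */

struct toy_netdev {
    int refcount;
    bool detached;    /* true once the driver has let go of it */
};

/* Core code: allocate the structure on behalf of a driver. */
static struct toy_netdev *toy_alloc_netdev(void)
{
    struct toy_netdev *dev = calloc(1, sizeof(*dev));

    if (dev == NULL)
        return NULL;
    dev->refcount = 1;          /* the driver's own reference */
    toy_live_netdevs++;
    return dev;
}

/* Core code: somebody else (a sysfs user, say) takes a reference. */
static void toy_dev_hold(struct toy_netdev *dev)
{
    dev->refcount++;
}

/* Core code: drop a reference. The release step lives here, not in
 * any module, so it remains valid after the driver has gone away. */
static void toy_dev_put(struct toy_netdev *dev)
{
    assert(dev->refcount > 0);
    if (--dev->refcount == 0) {
        toy_live_netdevs--;
        free(dev);
    }
}

/* Called by the driver when it is done. From here on the core will
 * not call back into the driver for this device. */
static void toy_free_netdev(struct toy_netdev *dev)
{
    dev->detached = true;
    toy_dev_put(dev);
}
```

If another reference is taken with toy_dev_hold() before the driver calls toy_free_netdev(), the structure stays alive (and marked as detached) until the final toy_dev_put(); the driver module could be removed at any point in between without leaving a dangling pointer to its code.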
The real lesson from all this, perhaps, is that the kernel developers are
still figuring out the implications of the lifetime rules of the objects
they create. The addition of sysfs in 2.5 has tended to force this issue;
sysfs exposes a great many internal kernel objects to user space, which can
keep references to those objects for an indeterminate period of time.
Making everything work safely in this environment has proved to be a
challenge at times.
And module unloading, of course, will always be a challenge. There will
likely always be issues involved with removing code from a live kernel. As Linus put it:
The proper thing to do (and what we _have_ done) is to say
"unloading of modules is not supported". It's a debugging feature,
and you literally shouldn't do it unless you are actively
developing that module.
Experience shows that many users are not happy with a kernel which cannot
unload modules, however. So the kernel developers are likely to be
wrestling with these issues for some time yet.
Page editor: Jonathan Corbet