Brief items
The current 2.6 release remains 2.6.13; the first 2.6.14 prepatch
has not yet been released. According to
Andrew
Morton's September 5 kernel status report, there is quite a bit of
stuff yet to be merged, so we may not see 2.6.14-rc1 for a few more days.
That prepatch is already taking shape in Linus's git repository, however.
Merged patches include
wireless extensions v19, relayfs,
the ipw2100 and ipw2200 wireless network drivers,
the hostap driver (which allows a suitably equipped system to function as a
wireless access point), a number of swap file improvements, a new set of
sparse memory support patches (preparing the kernel for memory hotplug), a
number of kernel build system improvements, a klist API change (see below),
a large InfiniBand update (with a shared receive queue implementation), a
PHY abstraction layer for ethernet drivers, a serial ATA update, four-level
page table support for the ppc64 architecture, some sk_buff structure
shrinking patches, a big netfilter update (including a netlink interface to a
number of netfilter internals and a user-space packet logging capability),
a new linked list primitive, a DCCP implementation (see last week's Kernel Page), and
more.
The current -mm release is 2.6.13-mm1. Recent changes to
-mm include a big TTY layer buffering rewrite, an IBM accelerometer driver,
and a number of architecture updates.
Kernel development news
The 2.6.6 kernel contained, among many other things, a patch implementing
single-page (4K) kernel stacks on the x86 architecture. Cutting the kernel
stack size in half reduces the kernel's per-process overhead and eliminates
a major consumer of multi-page allocations. So running with the smaller
stack size is good for kernel performance and robustness. The only problem
has been certain code paths in the kernel which require more stack space
than that. Overrunning the kernel stack will corrupt kernel memory and
lead to unfortunate behavior in a hurry.
Over time, however, most of these problems have been taken care of, to the
point that Adrian Bunk recently asked: is
it time to eliminate the 8K stack option entirely for x86? Some
distributors (e.g. Fedora) have been shipping kernels with 4K stacks for
some time without ill effect. What problems might result, Adrian asked, if
4K stacks became the only option for everyone?
It turns out that there are a few problems still. For example, the reiser4
filesystem
still cannot work with 4K stacks. There is, however, a patch
in the works which should take care of that particular problem.
A more complicated issue comes up in certain complex storage
configurations. If a system administrator builds a fancy set of RAID
volumes involving the device mapper, network filesystems, etc., the path
between the decision to write a block and the actual issuance of I/O can
get quite long. This situation can lead to stack overflows at strange and
unpredictable times.
What happens here is that a filesystem will decide to write a block, which
ends up creating a call to the relevant block driver's
make_request() function (or the block subsystem's generic version
of it). For stacked block devices, such as a RAID volume, that I/O request
will be transformed into a new request for a different device, resulting in
a new, recursive make_request() call. Once a few layers have been
accumulated, the call path gets deep, and the stack eventually runs out.
Neil Brown has posted a patch to resolve
this problem by serializing recursive make_request() calls. With
this patch, the kernel keeps an explicit stack of bio structures
needing submission, and only processes one at a time in any given task.
This patch will truncate the deep call paths, and should resolve the
problem.
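The technique can be sketched roughly as follows; the structure and names
used here (bio_list and friends) are illustrative assumptions rather than
details taken from the actual patch. The essential idea is that a bio
resubmitted by a stacked driver is queued on a per-task list and drained
iteratively by the outermost call, instead of being handled recursively:

    /* Sketch only: names and details are illustrative, not from the patch. */
    void generic_make_request(struct bio *bio)
    {
        struct bio_list pending;

        if (current->bio_list) {
            /*
             * A stacked driver (MD, device mapper, ...) is resubmitting
             * from within an active make_request() call in this task;
             * queue the bio instead of recursing.
             */
            bio_list_add(current->bio_list, bio);
            return;
        }

        /*
         * Outermost call: submit this bio, then drain whatever the
         * stacked drivers queued while it was being processed.
         */
        bio_list_init(&pending);
        current->bio_list = &pending;
        do {
            __generic_make_request(bio);  /* calls the driver's make_request() */
            bio = bio_list_pop(&pending);
        } while (bio);
        current->bio_list = NULL;
    }

Since each resubmission is queued rather than processed immediately, the
call depth no longer grows with the number of layers in the storage
configuration.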
That leaves one other problem outstanding: NDISwrapper. This code is a
glue layer which allows Windows network drivers to be loaded into a Linux
kernel; it is used by people who have network cards which are not otherwise
supported by Linux. NDIS drivers, it seems, require larger stacks. Since
they are closed-source drivers written for an operating system which makes
larger stacks available, there is little chance of fixing them. So a few
options have been discussed:
- Ignoring the problem. Since NDISwrapper is a means for loading
proprietary drivers into the kernel - and Windows drivers at that -
many kernel developers will happily refuse to support it at all. The
fact is, however, that disallowing 8K stacks would break (formerly)
working systems for many users, and there are kernel developers who do
not want to do that.
- Hack NDISwrapper to maintain its own special stack, and to switch to
that stack before calling into the Windows driver. This solution
seems possible, but it is a nontrivial bit of hacking to make it work
right.
- Move NDISwrapper into user space with some sort of mechanism for
interrupt delivery and such. These mechanisms exist, so this solution
should be entirely possible.
No consensus solution seems to have emerged as of this writing. There is
time, anyway; removing the 8K stack option is not a particularly urgent
task, and certainly will not be considered for 2.6.14.
Andrew Morton has stated that the OCFS2 cluster filesystem is likely to be
merged for 2.6.14. OCFS2 is not the only such filesystem under
development, however, and the developers behind the GFS2 filesystem are
wondering when it, too, might be merged - into -mm,
at least. Much work has been done on GFS to address
concerns which have
been raised previously; the developers think that it is getting close to
ready for wider exposure. The resulting discussion raised a couple of
interesting questions about the kernel development process.
The first one was asked by Andrew Morton:
"why?". Given that OCFS2 is going in, does the kernel really need another
clustered filesystem? What, in particular, does GFS bring that OCFS2
lacks? The answers took two forms: (1) Linux has traditionally hosted
a large variety of filesystems, and (2) since cluster filesystems are
relatively new, users should be able to try both and see which one
works better for them. David Teigland also posted a list of GFS features.
GFS will probably win this argument; there is a clear user community, and
filesystems tend not to have any impact on the rest of the kernel. But,
still, some developers are starting to wonder; consider, for example, this message from Suparna Bhattacharya:
And herein lies the issue where I tend to agree with Andrew on --
its really nice to have multiple filesystems innovating freely in
their niches and eventually proving themselves in practice, without
being bogged down by legacy etc. But at the same time, is there
enough thought and discussion about where the
fragmentation/diversification is really warranted, vs improving
what is already there, or say incorporating the best of one into
another, maybe over a period of time?
The other issue which came up was the creation of a user-space API for the
distributed lock manager (DLM) used by GFS. If nothing else, the two
cluster filesystems should have a common API so that applications can be
written for either one. One option for this API might be "dlmfs", a
virtual filesystem used with OCFS2. The dlmfs approach allows normal
filesystem operations to be used for lock management tasks; even shell
scripts can perform locking. Concerns with dlmfs include relatively slow
performance and a certain unease with
aspects of the interface:
Actually I think it's rather sick. Taking O_NONBLOCK and making it
a lock-manager trylock because they're
kinda-sorta-similar-sounding? Spare me. O_NONBLOCK means "open
this file in nonblocking mode", not "attempt to acquire a clustered
filesystem lock". Not even close.
(Andrew Morton).
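For reference, the usage pattern which draws this criticism looks something
like the code below. The mount point, lock name, and the exact mapping of
open() modes to lock modes are assumptions for illustration; the essential
point is that acquiring a lock is an open(), a trylock is an open() with
O_NONBLOCK, and releasing the lock is a close():

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical lock file inside a dlmfs mount. */
        int fd = open("/dlm/mydomain/mylock", O_WRONLY | O_NONBLOCK);

        if (fd < 0) {
            perror("trylock failed");   /* the lock is held elsewhere */
            return 1;
        }

        /* ... critical section, with the lock held ... */

        close(fd);                      /* dropping the descriptor drops the lock */
        return 0;
    }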
It is not clear that better alternatives exist, however.
One could implement it all with a big set of ioctl() calls, but
nobody really wants to do that. Another approach would be to create a new
set of system calls specifically for lock management. Some have argued in
favor of system calls, but others, such as Alan Cox, are strongly opposed:
Every so often someone decides that a deeply un-unix interface with
new syscalls is a good idea. Every time history proves them totally
bonkers. There are cases for new system calls but this doesn't
seem one of them.
Alan lists a number of reasons why a file descriptor-based approach makes
sense for this sort of operation - they mostly come down to well-understood
semantics and the fact that many things just work.
This is clearly a discussion which could go on for some time. Daniel
Phillips points out that this is not
necessarily a problem. There are currently no user-space users of any DLM
API beyond a few filesystem management tools, so there is no great hurry to
merge any API. The cluster filesystems could go in without any user-space
DLM interface at all while the developers figure out what that interface
should be. And, says Daniel, perhaps there should not be one at all.
Despite the perceived elegance of having a single lock manager on the
system, having user space rely upon its own, user-space DLM is a workable
solution which could simplify the kernel side of things.
The klist type implements a linked list with built-in locking; it was
described here
last March.
The 2.6.14 kernel will contain a couple of API changes affecting klists.
The first is a simple change for a couple of klist
functions, which now have the following prototypes:
void klist_add_head(struct klist_node *node, struct klist *list);
void klist_add_tail(struct klist_node *node, struct klist *list);
The change is that the order of the two parameters has been switched. This
change makes the klist functions use the same ordering as the older
list_head functions, hopefully leading to a lower level of
programmer confusion.
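Concretely, code which, under 2.6.13, passed the list as the first argument
must now be written the other way around (my_list and my_node being
hypothetical variables):

    /* 2.6.13 and earlier: list first, node second */
    klist_add_tail(&my_list, &my_node);

    /* 2.6.14: node first, list second, matching list_add_tail() */
    klist_add_tail(&my_node, &my_list);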
The more complicated change has to do with reference counting. The klist
list iteration functions can hold references to objects on the list, but
the higher level code (which actually creates the objects) does not know
about those references. Somehow, the klist code must be able to tell the
next layer up about references it holds during list iteration. To that
end, klist_init() has picked up a couple of new parameters:
void klist_init(struct klist *list,
                void (*get)(struct klist_node *node),
                void (*put)(struct klist_node *node));
The get() and put() functions are a bit of glue code
which allows the klist code to take and release references. All code using
klists must now provide these functions at initialization time.
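As an illustration, the glue for a hypothetical reference-counted structure
might look something like this. All of the names below are invented for the
example; the callbacks simply map a klist_node back to its containing object
and adjust that object's reference count:

    #include <linux/kernel.h>
    #include <linux/klist.h>
    #include <linux/slab.h>
    #include <asm/atomic.h>

    /* A hypothetical reference-counted object kept on a klist. */
    struct my_device {
        atomic_t          refcount;
        struct klist_node knode;
        /* ... */
    };

    static void my_device_get(struct klist_node *node)
    {
        struct my_device *dev = container_of(node, struct my_device, knode);

        atomic_inc(&dev->refcount);
    }

    static void my_device_put(struct klist_node *node)
    {
        struct my_device *dev = container_of(node, struct my_device, knode);

        if (atomic_dec_and_test(&dev->refcount))
            kfree(dev);
    }

    static struct klist my_devices;

    static void my_devices_setup(void)
    {
        /* The reference-counting callbacks must now be supplied here. */
        klist_init(&my_devices, my_device_get, my_device_put);
    }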