Kernel development

Brief items

Kernel release status

The current 2.6 release remains 2.6.13; the first 2.6.14 prepatch has not yet been released. According to Andrew Morton's September 5 kernel status report, there is quite a bit of stuff yet to be merged, so we may not see 2.6.14-rc1 for a few more days.

That prepatch is already taking shape in Linus's git repository, however. Merged patches include wireless extensions v19, relayfs, the ipw2100 and ipw2200 wireless network drivers, the hostap driver (which allows a suitably equipped system to function as a wireless access point), a number of swap file improvements, a new set of sparse memory support patches (preparing the kernel for memory hotplug), a number of kernel build system improvements, a klist API change (see below), a large InfiniBand update (with a shared receive queue implementation), a PHY abstraction layer for ethernet drivers, a serial ATA update, four-level page table support for the ppc64 architecture, some sk_buff structure shrinking patches, a big netfilter update (including netlink interface to a number of netfilter internals and a user-space packet logging capability), a new linked list primitive, a DCCP implementation (see last week's Kernel Page), and more.

The current -mm release is 2.6.13-mm1. Recent changes to -mm include a big TTY layer buffering rewrite, an IBM accelerometer driver, and a number of architecture updates.

Kernel development news

4K stacks for everyone?

The 2.6.6 kernel contained, among many other things, a patch implementing single-page (4K) kernel stacks on the x86 architecture. Cutting the kernel stack size in half reduces the kernel's per-process overhead and eliminates a major consumer of multi-page allocations. So running with the smaller stack size is good for kernel performance and robustness. The only problem has been certain code paths in the kernel which require more stack space than that. Overrunning the kernel stack will corrupt kernel memory and lead to unfortunate behavior in a hurry.

Over time, however, most of these problems have been taken care of, to the point that Adrian Bunk recently asked: is it time to eliminate the 8K stack option entirely for x86? Some distributors (e.g. Fedora) have been shipping kernels with 4K stacks for some time without ill effect. What problems might result, Adrian asked, if 4K stacks became the only option for everyone?

It turns out that there are a few problems still. For example, the reiser4 filesystem still cannot work with 4K stacks. There is, however, a patch in the works which should take care of that particular problem.

A more complicated issue comes up in certain complex storage configurations. If a system administrator builds a fancy set of RAID volumes involving the device mapper, network filesystems, etc., the path between the decision to write a block and the actual issuance of I/O can get quite long. This situation can lead to stack overflows at strange and unpredictable times.

What happens here is that a filesystem will decide to write a block, which ends up creating a call to the relevant block driver's make_request() function (or the block subsystem's generic version of it). For stacked block devices, such as a RAID volume, that I/O request will be transformed into a new request for a different device, resulting in a new, recursive make_request() call. Once a few layers have been accumulated, the call path gets deep, and the stack eventually runs out. Neil Brown has posted a patch to resolve this problem by serializing recursive make_request() calls. With this patch, the kernel keeps an explicit stack of bio structures needing submission, and only processes one at a time in any given task. This patch will truncate the deep call paths, and should resolve the problem.
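
Roughly, the idea looks like the sketch below. This is not Neil Brown's actual patch; the bio_list type and its helpers, the __submit_one_bio() function, and the per-task current->bio_list pointer are all assumptions made for illustration.

    /*
     * Hedged sketch of serialized bio submission: recursive calls just
     * queue their bio, and only the outermost call submits work.
     */
    void generic_make_request(struct bio *bio)
    {
        struct bio_list pending;

        if (current->bio_list) {
            /*
             * A make_request() call is already active in this task;
             * queue the bio instead of recursing into another
             * stacked driver's make_request() function.
             */
            bio_list_add(current->bio_list, bio);
            return;
        }

        /*
         * Outermost call: submit queued bios one at a time until none
         * remain, keeping the stack depth constant regardless of how
         * many block layers are stacked on top of each other.
         */
        bio_list_init(&pending);
        current->bio_list = &pending;
        do {
            __submit_one_bio(bio);      /* may queue further bios */
            bio = bio_list_pop(&pending);
        } while (bio);
        current->bio_list = NULL;
    }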

That leaves one other problem outstanding: NDISwrapper. This code is a glue layer which allows Windows network drivers to be loaded into a Linux kernel; it is used by people who have network cards which are not otherwise supported by Linux. NDIS drivers, it seems, require larger stacks. Since they are closed-source drivers written for an operating system which makes larger stacks available, there is little chance of fixing them. So a few options have been discussed:

  • Ignoring the problem. Since NDISwrapper is a means for loading proprietary drivers into the kernel - and Windows drivers at that - many kernel developers will happily refuse to support it at all. The fact is, however, that disallowing 8K stacks would break (formerly) working systems for many users, and there are kernel developers who do not want to do that.

  • Hacking NDISwrapper to maintain its own, special stack, and to switch to that stack before calling into the Windows driver. This solution seems possible, but making it work correctly would require a nontrivial bit of hacking.

  • Moving NDISwrapper into user space, with some sort of mechanism for interrupt delivery and such. These mechanisms exist, so this solution should be entirely possible.

No consensus solution seems to have emerged as of this writing. There is time, anyway; removing the 8K stack option is not a particularly urgent task, and certainly will not be considered for 2.6.14.

Merging GFS2

Andrew Morton has stated that the OCFS2 cluster filesystem is likely to be merged for 2.6.14. OCFS2 is not the only such filesystem under development, however, and the developers behind the GFS2 filesystem are wondering when it, too, might be merged - into -mm, at least. Much work has been done on GFS to address concerns which have been raised previously; the developers think that it is getting close to ready for wider exposure. The resulting discussion raised a couple of interesting questions about the kernel development process.

The first one was asked by Andrew Morton: "why?". Given that OCFS2 is going in, does the kernel really need another clustered filesystem? What, in particular, does GFS bring that OCFS2 lacks? The answers took two forms: (1) Linux has traditionally hosted a large variety of filesystems, and (2) since cluster filesystems are relatively new, users should be able to try both and see which one works better for them. David Teigland also posted a list of GFS features.

GFS will probably win this argument; there is a clear user community, and filesystems tend not to have any impact on the rest of the kernel. But, still, some developers are starting to wonder; consider, for example, this message from Suparna Bhattacharya:

And herein lies the issue where I tend to agree with Andrew on -- its really nice to have multiple filesystems innovating freely in their niches and eventually proving themselves in practice, without being bogged down by legacy etc. But at the same time, is there enough thought and discussion about where the fragmentation/diversification is really warranted, vs improving what is already there, or say incorporating the best of one into another, maybe over a period of time?

The other issue which came up was the creation of a user-space API for the distributed lock manager (DLM) used by GFS. If nothing else, the two cluster filesystems should have a common API so that applications can be written for either one. One option for this API might be "dlmfs", a virtual filesystem used with OCFS2. The dlmfs approach allows normal filesystem operations to be used for lock management tasks; even shell scripts can perform locking. Concerns with dlmfs include relatively slow performance and a certain unease with aspects of the interface:

Actually I think it's rather sick. Taking O_NONBLOCK and making it a lock-manager trylock because they're kinda-sorta-similar-sounding? Spare me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to acquire a clustered filesystem lock". Not even close.

(Andrew Morton).
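
For the curious, the usage being criticized looks something like the following from user space. This is a hedged sketch; the mount point and lock name are invented for the example, and the details of lock modes are glossed over.

    #include <fcntl.h>
    #include <unistd.h>

    int grab_lock(void)
    {
        /*
         * Opening a file within a dlmfs lock domain acquires a lock on
         * it; O_NONBLOCK is the "trylock" usage Andrew Morton objects
         * to above.  The "/dlm/mydomain/mylock" path is made up.
         */
        int fd = open("/dlm/mydomain/mylock",
                      O_RDWR | O_CREAT | O_NONBLOCK, 0600);
        if (fd < 0)
            return -1;          /* somebody else holds the lock */

        /* ... do the protected work ... */

        close(fd);              /* closing the descriptor drops the lock */
        return 0;
    }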

It is not clear that better alternatives exist, however. One could implement it all with a big set of ioctl() calls, but nobody really wants to do that. Another approach would be to create a new set of system calls specifically for lock management. Some have argued in favor of system calls, but others, such as Alan Cox, are strongly opposed:

Every so often someone decides that a deeply un-unix interface with new syscalls is a good idea. Every time history proves them totally bonkers. There are cases for new system calls but this doesn't seem one of them.

Alan lists a number of reasons why a file descriptor-based approach makes sense for this sort of operation - they mostly come down to well-understood semantics and the fact that many things just work.

This is clearly a discussion which could go on for some time. Daniel Phillips points out that this is not necessarily a problem. There are currently no user-space users of any DLM API beyond a few filesystem management tools, so there is no great hurry to merge any API. The cluster filesystems could go in without any user-space DLM interface at all while the developers figure out what that interface should be. And, says Daniel, perhaps there should not be one at all. Despite the perceived elegance of having a single lock manager on the system, having user space rely upon its own, user-space DLM is a workable solution which could simplify the kernel side of things.

A pair of klist API changes

The klist type implements a linked list with built-in locking; it was described here last March. The 2.6.14 kernel will contain a couple of API changes affecting klists.

The first is a simple change for a couple of klist functions, which now have the following prototypes:

    void klist_add_head(struct klist_node *node, struct klist *list);
    void klist_add_tail(struct klist_node *node, struct klist *list);

The change is that the order of the two parameters has been switched. This change makes the klist functions use the same ordering as the older list_head functions, hopefully leading to a lower level of programmer confusion.
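
As a hedged illustration (my_klist and dev are invented names), a call which previously named the list first is now written node-first, matching list_add_tail():

    /* 2.6.13 and earlier:  klist_add_tail(&my_klist, &dev->knode);   (list, node) */
    /* 2.6.14 and later:    klist_add_tail(&dev->knode, &my_klist);   (node, list) */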

The more complicated change has to do with reference counting. The klist list iteration functions can hold references to objects on the list, but the higher level code (which actually creates the objects) does not know about those references. Somehow, the klist code must be able to tell the next layer up about references it holds during list iteration. To that end, klist_init() has picked up a couple of new parameters:

    void klist_init(struct klist *list,
                    void (*get)(struct klist_node *node),
                    void (*put)(struct klist_node *node));

The get() and put() functions are a bit of glue code which allows the klist code to take and release references. All code using klists must now provide these functions at initialization time.
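
A hedged example of what a klist user might now look like follows; everything here other than the klist and kref calls (the my_dev structure, its release function, and so on) is invented for illustration.

    #include <linux/klist.h>
    #include <linux/kref.h>
    #include <linux/slab.h>

    struct my_dev {
        struct kref       ref;
        struct klist_node knode;
    };

    static void my_dev_release(struct kref *ref)
    {
        kfree(container_of(ref, struct my_dev, ref));
    }

    /*
     * Glue handed to klist_init(): turn the references held by the
     * list iteration code into reference counts on the containing
     * object.
     */
    static void my_dev_get(struct klist_node *n)
    {
        kref_get(&container_of(n, struct my_dev, knode)->ref);
    }

    static void my_dev_put(struct klist_node *n)
    {
        kref_put(&container_of(n, struct my_dev, knode)->ref, my_dev_release);
    }

    static struct klist my_devs;

    static void my_setup(struct my_dev *dev)
    {
        klist_init(&my_devs, my_dev_get, my_dev_put);
        kref_init(&dev->ref);
        klist_add_tail(&dev->knode, &my_devs);
    }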
