Kernel development [LWN.net]

Kernel release status

The current 2.6 prepatch is 2.6.16-rc5, released on February 26. Says Linus: "There's not much to say about this: people have been pretty good, and it's just a random collection of fixes in various random areas." Details can be found in the long-format changelog.

The mainline git repository contains, as of this writing, several dozen fixes merged since -rc5 was released.

The current -mm tree is 2.6.16-rc5-mm1. Recent changes to -mm include a relayfs API change, a new set of notifier patches, a big rework of the /proc code, and the return of the swap prefetching patch.

Quote of the week

It's not funny anymore. The current rate at which new GPL violations get reported and/or discovered, especially from the appliance/embedded market is really alarming.

For example, I haven't yet seen a single linux-based NAS product that was even remotely license compliant when first analyzing it. And I'm not only talking about the SoHo NAS boxes with one or two hard disk drives, but even about enterprise storage systems.

-- Harald Welte

A bowtie on the OSDL board

Last month, Greg Kroah-Hartman announced that OSDL had accepted a set of recommendations aimed at improving its relations with the kernel development community. One of those recommendations was naming a kernel developer to the OSDL board of directors. OSDL has now followed through by announcing that SCSI subsystem maintainer James Bottomley will be joining the board.

ABI stability documentation

Last week's Kernel Page looked at the stability of the user-space interface, especially regarding areas like sysfs, which are not always regarded as being part of the kernel ABI. This week, Greg Kroah-Hartman has made an attempt to make the issue more evident through a set of ABI stability documents. Included in his patch is a proposal for a different way of looking at ABI stability issues.

Linus has, in the recent past, taken a hard line on changes to user-space interfaces:

If you cannot maintain a stable kernel interface, then you damn well should not send your patches in for inclusion in the standard kernel. Keep your own "HAL-unstable" kernel and ask people to test it there.

It really is that easy. Once a system call or other kernel interface goes into the standard kernel, it stays that way. It doesn't get switched around to break user space.

Greg has, instead, taken the approach that not all kernel interfaces should be seen as stable from the outset. So he has proposed five different classifications for ABI stability:

Stable. Interfaces classified as stable will not break "for at least two years," and probably quite a bit longer. The Linux system call interface is classified in this way.
Testing. A "testing" interface is one which has been through most of the development process. It is not expected to change, but, that notwithstanding, the possibility of an incompatible change before the interface becomes "stable" does exist. This is the time for user-space programs to begin to make real use of the interface, but user-space developers need to pay attention to what is happening on the kernel side. The sysfs files under /sys/class have been designated as having a "testing" level of stability by Greg's documentation.
Unstable. This classification is for relatively new interfaces which are expected to change as problems in the initial implementation become clear. Sysfs files under /sys/devices are classified as "unstable."
Private. This class describes interfaces which are intended to be hidden behind a user-space library and which should not be used directly by applications. The ALSA sound system is an example of a "private" interface.
Obsolete. This classification marks interfaces which are destined to be removed, and which should not be used at all. Few long-time observers will be surprised to see that Greg marked devfs as being obsolete.

Linus doesn't like the unstable and private classifications, calling them "excuses for bad habits." But it is true that inclusion in the mainline can stress an interface in surprising ways, leading to a need for changes. Interface design is hard, even if you don't have to get everything right the first time. So it may make some sense to allow unstable interfaces into the kernel for a short while - as long as they are clearly documented as such. Thus far, there has been no way to warn developers that a certain interface, perhaps, shouldn't be relied upon quite yet.

The notion of private interfaces looks harder to justify. There has been some talk of shipping user-space libraries for private interfaces with the kernel, just to help ensure that the whole package provides a stable application interface for any release. That seems like a fairly unlikely change, however, at least for big interfaces like ALSA.

Changes will likely be made (this scheme might be classified "unstable" at this point), but it seems probable that it will, in some form, be adopted. That can only be a good thing for people interested in a stable user-space interface; once the expectations have been reasonably well documented, it will be easier to live up to them.

Some patches of interest

There are a few patches in circulation which merit a quick look.

What if you could improve kernel performance by 10% without writing any code? Arjan van de Ven has posted a patch which, he says, does just that - at least, for some specific benchmarks. This patch uses an obscure gcc option which causes the compiler to put every function into its own ELF section. Then, the linker is instructed to arrange those functions into a specific order in the final executable.
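As a rough illustration of the mechanism (this is not Arjan's patch), the sketch below assumes that the option in question is gcc's -ffunction-sections, which emits each function into a section named .text.<function>. The file name, the function names, and the linker-script fragment in the comment are all hypothetical.

```c
/*
 * Build sketch (assumed commands, not taken from the patch):
 *   gcc -O2 -ffunction-sections -c reorder.c
 *   readelf -S reorder.o   (lists .text.handle_request and .text.report_failure)
 *
 * A linker script could then place the hot section first, for example:
 *   .text : { *(.text.handle_request) *(.text.*) *(.text) }
 */
#include <stddef.h>

/* Hypothetical hot-path function: runs on every request. */
int handle_request(const int *buf, size_t len)
{
    int sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Hypothetical cold-path function: only reached when something fails. */
int report_failure(int code)
{
    return code < 0 ? code : -code;   /* normalize to a negative error */
}
```

With each function in its own section, the ordering decision moves out of the source code and into the link step, which is what makes it possible to reorder the kernel "without writing any code."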

A typical, current x86-64 kernel (the architecture Arjan has been working with) occupies on the order of 4MB of memory. The kernel uses large pages to hold its text, but a kernel of that size will still require at least two translation lookaside buffer (TLB) entries to cover its entire code body. But some kernel functions are used more heavily than others; much of the code in the kernel - error handling, for example - never gets run at all if you are lucky. So, if all of the regularly-used functions are moved to the beginning of the kernel image, the kernel should be able to operate with a single TLB entry for its text - most of the time. TLB entries are important: if an address is found in the TLB, the processor can avoid looking it up in the page tables, speeding access significantly. They are also scarce. So allowing the kernel to operate within a single TLB entry makes a big difference.
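The arithmetic behind the "two TLB entries" figure can be sketched as follows, assuming the 2MB large-page size used on x86-64. The helper name is hypothetical, and the calculation assumes that the image starts on a page boundary.

```c
#include <stdint.h>

/*
 * Minimum number of TLB entries needed to map text_bytes of code using
 * pages of page_bytes each, assuming the image is page-aligned.
 */
uint64_t tlb_entries_needed(uint64_t text_bytes, uint64_t page_bytes)
{
    return (text_bytes + page_bytes - 1) / page_bytes;   /* round up */
}
```

For a 4MB text area and 2MB pages this yields two entries; if the frequently-used functions fit within the first 2MB, the working set needs only one.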

There are some details to work out yet. Optimizing TLB use will require that the kernel be loaded at a TLB-aligned address, which is not currently done on many architectures. There is another part of Arjan's patch which, using another gcc option, can move blocks marked with unlikely() into a separate section. Since this option can expand the code, require long-distance jumps within functions, and make stack backtraces hard to read, it is not yet clear whether it makes sense or not. Then, there is the issue of ordering the functions properly. That task will require looking at a lot of kernel profiles to be sure that some workloads won't be optimized at the expense of others. But, once these issues are taken care of, a reorganized and faster kernel will likely result.
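For readers who have not run into unlikely(): the kernel defines it in terms of gcc's __builtin_expect(). The user-space sketch below uses a hypothetical function to show how the hint marks an error branch as cold. Whether the compiler actually moves that block into a separate section depends on the options in use.

```c
#include <stddef.h>

/* Same shape as the kernel's macros: tell gcc the expected truth value. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: copies n bytes, rejecting NULL pointers. */
int copy_bytes(char *dst, const char *src, size_t n)
{
    if (unlikely(dst == NULL || src == NULL))
        return -1;              /* cold error path */

    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];        /* hot path */
    return 0;
}
```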

On another front: it is generally easy to see, on a Linux system, what resources a given process is using. What's harder to find out is what the process is not using because the resources are not available. As a way of giving more visibility to that side of the equation, Shailabh Nagar has been working on a set of task delay accounting patches. This facility is intended for use with large-scale load management applications, but the information may be useful in other contexts as well.

This patch adds a new structure (struct task_delay_info) which is attached to the task structure. It contains a lock, a couple of timestamp variables, and sets of delay counters. Whenever a process goes into a delayed state (meaning, currently, waiting on a run queue, performing synchronous block I/O, or waiting for a page fault), the time is noted. At the end of the delay, when the process can run again, the system notes how much time has passed and updates a counter in the task_delay_info structure. Thus, over time, one can get a picture of how much time the process has spent waiting for things when it would rather have been executing.
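The bookkeeping described above can be sketched in user space as follows. This is a simplified stand-in, not the structure from the patch: the type, field, and function names are hypothetical, the lock is left out, and timestamps are passed in by the caller rather than read from a clock.

```c
#include <stdint.h>

/* The kinds of delay the patch currently tracks. */
enum delay_kind { DELAY_RUNQUEUE, DELAY_BLKIO, DELAY_PAGEFAULT, DELAY_KINDS };

/* Hypothetical stand-in for struct task_delay_info. */
struct delay_info {
    uint64_t start_ns;                /* when the current delay began */
    uint64_t total_ns[DELAY_KINDS];   /* accumulated delay per kind */
    uint32_t count[DELAY_KINDS];      /* number of delays per kind */
};

/* Note the time at which the task stops being able to run. */
void delay_start(struct delay_info *d, uint64_t now_ns)
{
    d->start_ns = now_ns;
}

/* On wakeup, charge the elapsed time to the matching counter. */
void delay_end(struct delay_info *d, enum delay_kind kind, uint64_t now_ns)
{
    d->total_ns[kind] += now_ns - d->start_ns;
    d->count[kind]++;
}
```

Keeping a total and a count per delay type lets a monitoring tool compute both the overall time lost and the average length of each stall.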

Perhaps the most complicated part of the patch set is the netlink interface used to report delay statistics back to user space. This interface has been carefully written to be as generic as possible on the theory that it may eventually be used for other sorts of process-related reporting as well. There has been a request that some of this information, at least, also be made available through /proc, so that it could be easily displayed by tools like top.

Finally, those who worked with kernel modules in 2.4 and prior kernels will remember the MODULE_PARM() macro, used to define load-time parameters. This macro has been deprecated since 2004, but there are still a few hundred uses of MODULE_PARM() spread across several dozen files in the 2.6.16-rc kernels. These old uses came to attention recently when gcc started optimizing them out. Given the choice between making the old macro work with current gcc and simply getting rid of it, Rusty Russell chose to get rid of it. This patch has not yet been merged anywhere, but it seems uncontroversial. If there are any out-of-tree modules still using MODULE_PARM(), updating them soon might be a good idea.

The ipw3945 project

While there are a number of hopeful developments around the support of wireless network cards in Linux, that support remains one of the larger roadblocks for many users. It is thus always a welcome thing when a major manufacturer announces Linux support - and the beginnings of a working driver - for their products. So when Intel recently announced a project to support its 3945ABG wireless adapters, there was a certain amount of celebration. There was also some criticism, however, which highlights an ongoing issue with wireless support under Linux.

The ipw3945 project currently has a developer release of the driver, with a stable version expected within a few weeks. This release supports all of the basic features one would expect, with some additional features (quality of service, for example) "not officially supported." It should, in other words, be enough to allow use of the device.

It would seem that there is little to complain about here. But there is this little paragraph from the announcement:

In order to meet the requirements of all geographies into which our adapters ship (over 100 countries) we have placed the regulatory enforcement logic into a user space daemon that we provide as a binary under the same license agreement as the microcode. We provide that binary pre-compiled as both a 32-bit and 64-bit application. The daemon utilizes a sysfs interface exposed by the driver in order to communicate with the hardware and configure the required regulatory parameters.

The requirement for a binary-only blob brought out some concerns from developers who think that the regulatory-agency requirement has been overblown, and that it is not actually necessary to lock down the code in this way. Others disagree, noting that regulations in many parts of the world are quite strict with regard to allowing any user modification of hardware which can transmit. It is probably true that, in order to be able to offer this product in many parts of the world, Intel must lock down much of this logic in binary-only code.

Given that, however, Intel has chosen an interesting way to go about it. The closed code is not part of the driver itself; it is a daemon which runs entirely in user space. The driver itself is fully free software. So there is no non-free code going into the kernel, which is surely a step in the right direction.

The regulatory daemon controls the hardware by way of a special file exported through sysfs. The driver then interprets those commands - which enable or disable specific channels, set maximum power values, and so on - and programs the hardware accordingly. A quick look at the (15,000-line) driver source is sufficient to find the code which actually controls the transmitter's parameters.

So, in other words, this arrangement has not actually locked down much of anything. The daemon comes with the usual "thou shalt not reverse engineer" provisions, but there are people in parts of the world who can safely ignore that requirement. It would seem that little work beyond running the daemon under strace would be required. It might also be possible to write a replacement just by studying the driver code, without looking at the Intel-supplied daemon at all. One way or another, it seems likely that a free replacement for the regulatory daemon will come along, sooner or (not much) later.

Patches and updates

Linus Torvalds: Linux v2.6.16-rc5
Andrew Morton: 2.6.16-rc5-mm1
Andrew Morton: 2.6.16-rc4-mm2
Con Kolivas: No idle HZ aka dynticks i386 for 2.6.16-rc5
Ashok Raj: ACPI based physical CPU hotplug support for x86_64
Arjan van de Ven: Patch to reorder functions in the vmlinux to a defined order
Arjan van de Ven: Reordering of functions, try 2
Mike Galbraith: Task Throttling V14
Shailabh Nagar: Per-task delay accounting
Peter Williams: PlugSched-6.3.1 for 2.6.16-rc5
Yoav Etsion: RFC: klogger: kernel tracing and logging tool
Junio C Hamano: GIT 1.2.3
Petr Baudis: Cogito-0.17
Tejun Heo: quilt2git v0.2
Ingo Molnar: softlockup-timer-driven.patch
Jean Tourrilhes: WE-20 for kernel 2.6.16
James Ketrenos: Intel PRO/Wireless 3945ABG Network Connection
Alessandro Zummo: RTC subsystem
James Smart: fc transport: Add support for events
Michael Buesch: Generic hardware RNG support
Greg KH: Add kernel<->userspace ABI stability documentation
Steven Whitehouse: GFS2 Filesystem [0/16]
Anton Altaparmakov: NTFS: Release 2.1.26
Suparna Bhattacharya: DIO simplification and AIO-DIO stability
Badari Pulavarty: [PATCH 0/4] map multiple blocks in get_block() and mpage_readpages()
David Howells: Permit NFS superblock sharing [try #2]
Eric W. Biederman: proc cleanup.
Nick Piggin: mm: single pcp lists
Christoph Lameter: Slab: Node rotor for freeing alien caches and remote per cpu pages.
Paul Jackson: cpuset memory spread slab cache filesys
David Gibson: hugepage: Strict page reservation for hugepage inodes
David McCullough: ocf-linux-20060301 - Asynchronous Crypto support for linux
Miklos Szeredi: mountlo 0.5 - Loopback mounting in userspace
Kay Sievers: udev 086 release
