
Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.5; there have been no 2.6.6 prepatches yet. Linus's BitKeeper repository is overflowing with patches for 2.6.6, however, including much of the material from 2.6.5-mc4, the last "merge candidate" tree from Andrew Morton. A great deal of new stuff is going into 2.6.6; see the separate article below for more information.

The current -mm tree is 2.6.5-mm5; recent additions to -mm include more CPU scheduler work, some of Hugh Dickins's "prepare for object-based reverse mapping" patches (see below), a new memory binding API for NUMA systems, and lots of fixes.

The current 2.4 kernel is 2.4.26, which was released on April 14. Among other things, this release includes the fix for the iso9660 filesystem buffer overflow vulnerability. Overall, changes in 2.4.26 include the "forcedeth" nVidia Ethernet driver, a big bonding network driver rework, a lot of XFS work, various architecture updates (including Intel "IA32e" support), TCP Westwood support, an ACPI update, and lots of fixes.

Users of x86_64 systems may want to note that, as of 2.4.26, no more development will be done for that architecture in 2.4.


Kernel development news

Linus merges up a storm

While Linus took a week off, Andrew Morton maintained a "merge candidate" tree full of patches which were to be added to the mainline on Linus's return. Linus is back; he has been quiet on linux-kernel, but his BitKeeper repository shows that he has been busy: over 700 patches have been merged in the first half of this week. Quite a few of these are significant; there will be a lot of changes in the 2.6.6 kernel. Here's a quick list of some of the more important additions.

  • The usual pile of architecture updates, including x86_64, PPC, ARM, ia64, m68k-noMMU, S/390, and others.

  • POSIX message queue support.

  • Changes to the ext2 and ext3 filesystems which provide significant speedups for the fsync() and fdatasync() calls. Various other performance improvements have been added to those filesystems as well.

  • The addition of the fcntl() method to the file_operations structure (see the March 24 Kernel Page).

  • The "laptop mode" patch. This patch has evolved somewhat since we last looked at it, but the basic idea remains the same: avoid spinning up the disk whenever possible, but, when you do have to perform disk activity, do everything you can.

  • 4KB kernel stacks for the i386 architecture. This patch reduces the kernel's per-process overhead, which is useful for people trying to run thousands of threads. It also removes one of the few places where the kernel needs to allocate multiple, physically-contiguous pages. In 2.6.6, there is a configuration option allowing the continued use of 8KB stacks, though the plan is to eventually remove this option. The configured stack size is stored in modules, so it will not be possible to load a module which was built for the wrong size stack.

  • Non-executable stack support for several architectures. This is not the full "Exec shield" patch from Ingo Molnar, though parts of that patch appear here.

  • A big reiserfs update, including data=ordered support, space preallocation, laptop mode support, and more.

  • IPv6 support in SELinux.

  • The lightweight auditing framework.

  • A mechanism which allows block drivers to respond to queries about the congestion state of their queues. This is useful for higher-level drivers (e.g. the device mapper) which have a complicated queue state.

  • The per-device unplugging patch, which makes some significant changes to the block layer but yields significant performance improvements. This patch has evolved a lot since it was originally posted, mostly to deal with complexities in the device mapper, RAID, and swapping code.

  • The "completely fair queueing" (CFQ) I/O scheduler (covered here last November). This scheduler tries to evenly divide disk bandwidth among all processes on the system. The CFQ scheduler can be chosen with a configuration option, or by booting with the elevator=cfq option.

  • Some software suspend fixes, including support for systems with high memory.

  • The external module support patch (described in a separate article below). The behavior of "make clean" has also been reworked to do a more thorough job while, simultaneously, leaving behind enough information to allow the building of external modules.

  • A new configuration option allowing the building of kernels without sysfs support. Be sure to read the help text before disabling sysfs, however; without sysfs the kernel needs more explicit help in finding its root partition.

  • Various libata (serial ATA) improvements and fixes.

  • A long list of NFS cleanups and improvements.

  • Some cosmetic fixes, such as running devfs and the floppy driver through lindent.

  • Some significant page cache and virtual memory changes, which we will get to in the next article.
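For readers who have not used the fsync() and fdatasync() calls named in the ext2/ext3 item above, here is a minimal userspace sketch of the difference between them. The function name append_durably() and its data_only parameter are hypothetical, chosen for illustration; the sketch says nothing about how the filesystems achieve their speedups.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Append one record to a file and wait until it reaches stable storage.
 * With data_only set, fdatasync() is used, which may skip flushing
 * metadata (such as timestamps) that is not needed to read the data back;
 * otherwise fsync() flushes data and metadata alike.
 * Returns 0 on success, -1 on any error.
 */
int append_durably(const char *path, const char *record, int data_only)
{
    assert(path != NULL && record != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(record);
    ssize_t written = write(fd, record, len);
    int rc = (written == (ssize_t)len) ? 0 : -1;

    /* Only ask for a flush if the write itself went through. */
    if (rc == 0)
        rc = data_only ? fdatasync(fd) : fsync(fd);

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

An application that logs transactions might call append_durably(path, line, 1) for each record; how much time each call spends waiting depends on the filesystem, which is where the improvements listed above come in.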

Overall, one might be forgiven for thinking that 2.6.6 looks much like a development kernel release. In fact, most of the more intrusive patches listed above have been around and tested for some time now; they have just finally made their escape from the -mm tree. With the exception of the CPU scheduler patches (which we hope to cover here next week) and, perhaps, the reverse mapping VM changes, 2.6.6 looks likely to contain the bulk of the work that most developers are still hoping to see added to 2.6. The 2.6.6 kernel contains enough big changes that its chances of containing an unpleasant surprise or two are fairly high. Within a few more releases, however, 2.6 may well have stabilized to the point that it can be more widely deployed and the bulk of developer attention can move on to 2.7.


VM changes in 2.6.6

Among the patches merged into the upcoming 2.6.6 release is a set of virtual memory changes. Changes to such a fundamental subsystem are always of interest, especially in the middle of a "stable" kernel series. Here, then, is a quick discussion of what has transpired.

In response to the reverse mapping VM discussions over the last month or so, Hugh Dickins has posted a series of patches which prepare the kernel for a full object-based reverse-mapping scheme and the removal of the per-page PTE chains. Hugh's patches carefully leave room for the inclusion of either his anonmm patches or Andrea Arcangeli's anon_vma work, though he seems to expect that anon_vma will win out. The full set of patches posted so far can be found in the "memory management" part of the "patches and updates" section, below.

Of those patches, the first three have been merged as of this writing. rmap 1 simply creates a new include file (linux/rmap.h) and moves many of the reverse-mapping declarations there. The second patch (rmap 2) changes the way the swap subsystem keeps track of swap cache pages; this change is needed to free up a couple of struct page fields for reverse mapping tasks. Finally, rmap 3 finishes out the struct page work for various architectures.

Later patches in Hugh's series get more ambitious; rmap 7 adds object-based reverse mapping for file-backed memory. Those patches have not been merged as of this writing, however.

A completely different set of patches which changes how the page cache works has been merged. The description of this work, as written by Andrew Morton, reads:

The basic problem which we (mainly Daniel McNeil) have been struggling with is in getting a really reliable fsync() across the page lists while other processes are performing writeback against the same file. It's like juggling four bars of wet soap with your eyes shut while someone is whacking you with a baseball bat.

This work made some fundamental changes in how page cache pages are tracked. The struct page structure has long included a field called "list", being a list_head structure used to track the state of the page. When the page is marked dirty, or placed under I/O, it is put on a list with other such pages. Unfortunately, managing those lists as the state of the page changes proves to be difficult; hence the juggling analogy.

In response, the page lists have been removed altogether; as a side-benefit, this change shrinks struct page by eight bytes - a significant savings, considering that there is one such structure for every physical page in the system. The lists have been replaced with an enhanced radix tree which supports "tagging" of pages. When a page is dirtied, it is simply marked dirty in the radix tree, rather than being added to a list. Similarly, pages which are currently being written back to disk are marked. A new set of radix tree operations allows the kernel to find these pages when the need arises. Searching the tree is not as fast as following a dedicated list, but the radix tree implementation appears to be fast enough that few people will notice the difference.
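The tagging idea can be sketched in userspace as follows. This is a toy, fixed-depth model written for illustration, not the kernel's radix tree: the two-level layout, the 64-slot fan-out, and all of the names (struct tree, tree_tag_set(), tree_lookup_tag(), and so on) are assumptions. What it tries to show is that tag bits are kept at every level, so a search for dirty pages can skip whole subtrees which contain none.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOTS 64                     /* fan-out per node (6 index bits) */
#define MAX_INDEX (SLOTS * SLOTS)    /* two levels: indices 0..4095 */

enum tag { TAG_DIRTY, TAG_WRITEBACK, NR_TAGS };

struct leaf {
    void *page[SLOTS];               /* page pointers, NULL if absent */
    uint64_t tags[NR_TAGS];          /* bit i set: page[i] carries the tag */
};

struct tree {
    struct leaf *leaf[SLOTS];
    uint64_t tags[NR_TAGS];          /* bit i set: leaf[i] holds a tagged page */
};

/* Insert a page at index; returns 0 on success, -1 on allocation failure. */
int tree_insert(struct tree *t, unsigned index, void *page)
{
    assert(index < MAX_INDEX);
    unsigned hi = index / SLOTS, lo = index % SLOTS;
    if (!t->leaf[hi]) {
        t->leaf[hi] = calloc(1, sizeof(struct leaf));
        if (!t->leaf[hi])
            return -1;
    }
    t->leaf[hi]->page[lo] = page;
    return 0;
}

/* Mark the page at index with tag, propagating the bit up to the root. */
void tree_tag_set(struct tree *t, unsigned index, enum tag tag)
{
    assert(index < MAX_INDEX);
    unsigned hi = index / SLOTS, lo = index % SLOTS;
    assert(t->leaf[hi] && t->leaf[hi]->page[lo]);
    t->leaf[hi]->tags[tag] |= UINT64_C(1) << lo;
    t->tags[tag] |= UINT64_C(1) << hi;
}

/* Clear tag on the page at index; drop the root bit once the leaf is clean. */
void tree_tag_clear(struct tree *t, unsigned index, enum tag tag)
{
    assert(index < MAX_INDEX);
    unsigned hi = index / SLOTS, lo = index % SLOTS;
    if (!t->leaf[hi])
        return;
    t->leaf[hi]->tags[tag] &= ~(UINT64_C(1) << lo);
    if (t->leaf[hi]->tags[tag] == 0)
        t->tags[tag] &= ~(UINT64_C(1) << hi);
}

/*
 * Collect up to max indices carrying tag, in ascending index order.
 * Leaves whose tag bit is clear in the root are skipped without a visit.
 */
size_t tree_lookup_tag(const struct tree *t, enum tag tag,
                       unsigned *out, size_t max)
{
    size_t n = 0;
    for (unsigned hi = 0; hi < SLOTS && n < max; hi++) {
        if (!(t->tags[tag] & (UINT64_C(1) << hi)))
            continue;
        for (unsigned lo = 0; lo < SLOTS && n < max; lo++)
            if (t->leaf[hi]->tags[tag] & (UINT64_C(1) << lo))
                out[n++] = hi * SLOTS + lo;
    }
    return n;
}
```

In this model, dirtying a page is a call to tree_tag_set(&t, index, TAG_DIRTY) and nothing more; no list membership needs to be updated. A lookup walks the tree in index order, so the tagged pages come back sorted by their offset in the file.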

These changes required touching a lot of VM and page cache code; every user of the page->list field had to be fixed. As a result of the changes, the order in which dirty pages are written to disk has changed; writing always happens in file-offset order now. This change appears to be an improvement for many applications; Andrew reports as much as 30% faster benchmark results. I/O can slow down for some situations involving parallel writes on SMP systems, however.


Building external modules

Changes in the kernel build process have yielded a number of benefits in 2.6. They have, however, exposed a few rough edges for people building external modules. The required procedure is a bit inelegant, forces the user to ignore warnings from the build code ("you messed with SUBDIRS, do not complain if something goes wrong"), and does not support modversions. It also requires the presence of a configured and built kernel source tree, something which was not necessary with previous kernels, and a build of an external module will often try to rebuild things in the main tree as well. Fixing up the external module build process has been on the "to do" list for some time.

Finally, somebody has done it. Sam Ravnborg has posted a patch which improves the external module build process in a number of ways.

The basic form of a makefile for an external module will not change much. It should still look something like:

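    ifneq ($(KERNELRELEASE),)
    # Invoked from the kernel build system: name the module object.
    obj-m	:= module.o

    else
    # Invoked directly: hand off to the kernel build system.
    KDIR	:= /lib/modules/$(shell uname -r)/build
    PWD		:= $(shell pwd)

    default:
    	$(MAKE) -C $(KDIR) M=$(PWD)
    endif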

The change is in the $(MAKE) line: the parameter that once read SUBDIRS=$(PWD) has changed to M=$(PWD). The older SUBDIRS= format will still work, however. It is also no longer necessary to specify the modules target when invoking the kernel build system.

When the kernel build system is invoked with the M= parameter, it does a number of things differently. It will make no effort to ensure that the built files in the kernel source tree are current; if a developer makes a change to the main tree, it is his or her responsibility to rebuild it before trying to make any external modules. Only a few targets (modules, clean, modules_install) are supported when building external modules. And the modpost program now maintains a file (Module.symvers) containing the symbol version information if modversions is in use; this file is used when postprocessing an external module to note the symbol versions expected by that module.

Among other things, the new scheme will allow distributors to package sufficient information for the building of external modules without the user having to actually configure and build the full kernel source tree. That information can be stored under /lib/modules by replacing the build symbolic link (which currently points back to the source tree) with a directory containing just the required information. That should make life simpler for everybody involved.


What's in the Fedora Core 2 kernel

Fedora Core 2 is scheduled to ship in just over one month. This distribution will be a high-profile deployment of the 2.6 kernel. Red Hat has often shipped highly-patched kernels, and there have been occasional criticisms that the company's kernels are so divergent from the mainline that they are incompatible with other Linux systems. Since we have been messing with the second Fedora Core 2 test release anyway, it seemed like a good time to look and see what sort of kernel it includes. To that end, we pulled down a copy of 2.6.5-1-321 from Arjan van de Ven's directory.

As it turns out, the number of patches contained in this kernel is relatively small. That is not entirely surprising; vendor kernel patch lists tend to get longer as the current development kernel progresses; some vendors, at least, have a tendency to backport features from the development tree. There is no development tree currently, so there is nothing to backport.

That said, the first patch is a big one: it's the full 2.6.5-mc1 tree from Andrew Morton. Now that the merge candidate patches are finding their way into 2.6.6-pre, Red Hat will not need to apply that particular patch itself.

The 2.6.6 kernel will feature an option (on by default) to use 4KB kernel stacks on the i386 architecture. The Fedora kernel has that patch, of course; it also includes a separate patch which takes away the option of using the traditional 8KB stacks. This change has upset some Fedora test users; the 4KB stacks break certain proprietary device drivers (e.g. nVidia) and some users of those drivers would prefer to have the ability to build a kernel that supports them. Red Hat seems determined to follow this path, however, on the assumption that nVidia will fix its drivers (and the general attitude that breaking binary modules is a low-priority problem at best).

Then, there are patches which are true Red Hat stuff. These include "exec shield," which makes buffer overflow attacks harder by enforcing no-execute permissions; the 4G/4G patch which provides expanded 32-bit virtual address spaces to both user space and the kernel; and TUX, the kernel-based high-performance web server. There is also an SELinux/security module patch which allows the kernel to bypass permission checks when creating sockets internally; this one changes the security module interface.

Then, there are various cleanup and safety patches. For example, gcc 3.4 supports a "warn_unused_result" attribute on functions; the compiler will complain when code calls a function marked with this attribute and fails to check the return value. The Red Hat kernel applies that attribute to a few functions (copy_from_user(), pci_enable_device(), etc.) to trap places where the proper checks are not made. Various functions which use too much kernel stack space have been fixed up. There is a patch which fixes some remaining sleep_on() calls and warns about others. The driver for /dev/mem has been fixed to disallow access to most of main memory. And there is a driver for a "crash" device which provides direct read access to main memory, seemingly for use by a crash dump utility.
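As an illustration of the warn_unused_result attribute, here is a minimal userspace sketch. The MUST_CHECK macro and the toy_copy() and fill_buffer() functions are hypothetical stand-ins invented for this example; they are not the definitions used in the Red Hat patch or in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shorthand for the gcc attribute; not the kernel's own macro. */
#define MUST_CHECK __attribute__((warn_unused_result))

/*
 * Toy stand-in for a copy routine that can fail. Copies as much of src as
 * fits into dst and returns the number of bytes NOT copied (0 on success),
 * the same return convention copy_from_user() uses.
 */
size_t toy_copy(void *dst, size_t dst_size,
                const void *src, size_t len) MUST_CHECK;

size_t toy_copy(void *dst, size_t dst_size, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    size_t n = len < dst_size ? len : dst_size;
    memcpy(dst, src, n);
    return len - n;
}

/* A careful caller: the result is checked, so the compiler stays quiet. */
int fill_buffer(char *buf, size_t buf_size, const char *src, size_t len)
{
    if (toy_copy(buf, buf_size, src, len) != 0)
        return -1;              /* short copy: report failure */
    return 0;
}

/*
 * A careless caller that wrote only
 *     toy_copy(buf, buf_size, src, len);
 * would make gcc warn that the return value is being ignored.
 */
```

The attribute costs nothing at run time; it only turns a silent mistake into a compile-time complaint, which is why it suits functions whose failure must not be ignored.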

Finally, there is a small set of bug fixes and patches to ease the build process on various architectures. Overall, the Fedora kernel suggests that, in Red Hat's view, not a whole lot needs to be added to the 2.6 kernel (the upcoming 2.6.6 version, at least) for it to be ready for wide use.


Patches and updates

Kernel trees

  • Andrea Arcangeli: 2.6.5-aa4. (April 8, 2004)
  • Andrea Arcangeli: 2.6.5-aa5. (April 8, 2004)


Build system

Development tools

Device drivers

Filesystems and block I/O

Memory management



Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds