Brief items
The current 2.6 kernel is 2.6.5; there have been no 2.6.6 prepatches
yet. Linus's BitKeeper repository is overflowing with patches for 2.6.6,
however, including much of the material from
2.6.5-mc4, the last "merge candidate" tree from
Andrew Morton. A great deal of new stuff is going into 2.6.6; see the
separate article below for more information.
The current -mm tree is 2.6.5-mm5; recent
additions to -mm include more CPU scheduler work, some of Hugh Dickins's
"prepare for object-based reverse mapping" patches (see below), a new
memory binding API for NUMA systems, and lots of fixes.
The current 2.4 kernel is 2.4.26, which was released on April 14. Among other things,
this release includes the fix for the
iso9660 filesystem buffer overflow vulnerability. Overall, changes in
2.4.26 include the "forcedeth" nVidia Ethernet driver, a big bonding
network driver rework, a lot of XFS work, various architecture updates
(including Intel "IA32e" support), TCP
Westwood support, an ACPI update, and lots of fixes.
Users of x86_64 systems may want to note that, as of 2.4.26, no more development will be done for that
architecture in 2.4.
Kernel development news
While Linus took a week off, Andrew Morton maintained a "merge candidate"
tree full of patches which were to be added to the mainline on Linus's
return. Linus is back; he has been quiet on linux-kernel, but his
BitKeeper repository shows that he has been busy: over 700 patches have
been merged in the first half of this week. Quite a few of these are
significant; there will be a lot of changes in the 2.6.6 kernel. Here's a
quick list of some of the more important additions.
- The usual pile of architecture updates, including x86_64, PPC, ARM,
ia64, m68k-noMMU, S/390, and others.
- POSIX message queue support.
- Changes to the ext2 and ext3 filesystems which provide significant
speedups for the fsync() and fdatasync() calls.
Various other performance improvements have been added to those
filesystems as well.
- The addition of the fcntl() method to the
file_operations structure (see the March 24 Kernel Page).
- The "laptop mode" patch. This patch has evolved somewhat since we
last looked at it, but
the basic idea remains the same: avoid spinning up the disk whenever
possible and, when disk activity cannot be avoided, perform as much of
the pending I/O as possible while the disk is spinning.
- 4KB kernel stacks for the i386 architecture. This patch reduces the
kernel's per-process overhead, which is useful for people trying to
run thousands of threads. It also removes one of the few places where
the kernel needs to allocate multiple, physically-contiguous pages.
In 2.6.6, there is a configuration option allowing the continued use
of 8KB stacks, though the plan is to eventually remove this option.
The configured stack size is stored in modules, so it will not be
possible to load a module which was built for the wrong size stack.
- Non-executable stack support for several architectures. This is not
the full "Exec shield" patch from Ingo Molnar, though parts of that
patch appear here.
- A big reiserfs update, including data=ordered support, space
preallocation, laptop mode support, and more.
- IPv6 support in SELinux.
- The lightweight auditing framework.
- A mechanism which allows block drivers to respond to queries about the
congestion state of their queues. This is useful for higher-level
drivers (e.g. the device mapper) which have a complicated queue state.
- The per-device unplugging patch, which makes substantial changes to the
block layer but yields significant performance improvements. This patch has evolved a lot
since it was originally posted, mostly to deal with complexities in
the device mapper, RAID, and swapping code.
- The "completely fair queueing" (CFQ) I/O scheduler (covered here last November). This scheduler tries to
evenly divide disk bandwidth among all processes on the system. The
CFQ scheduler can be chosen with a configuration option, or by booting
with the elevator=cfq option.
- Some software suspend fixes, including support for systems with high
memory.
- The external module support patch (described in a separate article
below). The behavior of "make clean" has also been reworked
to do a more thorough job while, simultaneously, leaving behind enough
information to allow the building of external modules.
- A new configuration option allowing the building of kernels without
sysfs support. Be sure to read the help text before disabling sysfs,
however; without sysfs the kernel needs more explicit help in finding
its root partition.
- Various libata (serial ATA) improvements and fixes.
- A long list of NFS cleanups and improvements.
- Some cosmetic fixes, such as running devfs and the floppy driver
through Lindent.
- Some significant page cache and virtual memory changes, which we will
get to in the next article.
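The fairness idea behind the CFQ scheduler mentioned in the list above can be illustrated with a toy model. What follows is a minimal sketch, not the kernel's code: it assumes one fixed-size FIFO per process and a plain round-robin dispatcher, and every name in it (toy_sched, toy_add(), toy_dispatch()) is hypothetical. The real scheduler deals with many details (sorting, merging, request allocation) which are ignored here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPROC  3   /* number of per-process queues (illustrative) */
#define TOY_QDEPTH 8   /* capacity of each queue (illustrative) */

struct toy_queue {
    int req[TOY_QDEPTH];   /* pending request ids, kept in FIFO order */
    size_t head, count;
};

struct toy_sched {
    struct toy_queue q[TOY_NPROC];
    size_t next;           /* queue to look at first on the next dispatch */
};

/* Queue a (non-negative) request id for process pid.
 * Returns 0 on success, -1 if pid is out of range or the queue is full. */
static int toy_add(struct toy_sched *s, size_t pid, int req)
{
    if (pid >= TOY_NPROC || s->q[pid].count == TOY_QDEPTH)
        return -1;
    struct toy_queue *q = &s->q[pid];
    q->req[(q->head + q->count++) % TOY_QDEPTH] = req;
    return 0;
}

/* Take one request from the next non-empty queue, in round-robin order,
 * so that a process with a deep queue cannot starve the others.
 * Returns the request id, or -1 when every queue is empty. */
static int toy_dispatch(struct toy_sched *s)
{
    for (size_t tried = 0; tried < TOY_NPROC; tried++) {
        struct toy_queue *q = &s->q[(s->next + tried) % TOY_NPROC];
        if (q->count == 0)
            continue;
        int req = q->req[q->head];
        q->head = (q->head + 1) % TOY_QDEPTH;
        q->count--;
        /* Resume with the queue after the one just served. */
        s->next = (s->next + tried + 1) % TOY_NPROC;
        return req;
    }
    return -1;
}
```

If process 0 queues four requests and process 1 then queues one, the single request from process 1 is dispatched second rather than fifth; that is the sense in which bandwidth is divided among processes rather than among requests.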
Overall, one might be forgiven for thinking that 2.6.6 looks much like a
development kernel release. In fact, most of the more intrusive patches listed
above have been around and tested for some time now; they have just finally
made their escape from the -mm tree. With the exception of the CPU
scheduler patches (which we hope to cover here next week) and, perhaps, the
reverse mapping VM changes, 2.6.6 looks likely to contain the bulk of the
work that most developers are still hoping to see added to 2.6. The 2.6.6 release
contains enough big changes that its chances of containing an unpleasant
surprise or two are fairly high. Within a few more releases, however, 2.6
may well have stabilized to the point that it can be more widely deployed
and the bulk of developer attention can move on to 2.7.
Among the patches merged into the upcoming 2.6.6 release is a set of
virtual memory changes. Changes to such a fundamental subsystem are always
of interest, especially in the middle of a "stable" kernel series. Here,
then, is a quick discussion of what has transpired.
In response to the reverse mapping VM discussions over the last month or
so, Hugh Dickins has posted a series of patches which prepare the kernel
for a full object-based reverse-mapping scheme and the removal of the
per-page PTE chains. Hugh's patches carefully leave room for the inclusion
of either his anonmm patches or Andrea
Arcangeli's anon_vma work,
though he seems to expect that anon_vma will win out. The full set of
patches posted so far can be found in the "memory management" part of the
"patches and updates" section, below.
Of those patches, the first three have been merged as of this writing. rmap 1 simply creates a new
include file (linux/rmap.h) and moves many of the reverse-mapping
declarations there. The second patch (rmap 2) changes the way the
swap subsystem keeps track of swap cache pages; this change is needed to
free up a couple of struct page fields for reverse mapping tasks.
Finally, rmap 3 finishes
out the struct page work for various architectures.
Later patches in Hugh's series get more ambitious; rmap 7 adds object-based reverse mapping
for file-backed memory. Those patches have not been merged as of this
writing, however.
A completely different set of patches which changes how the page cache
works has been merged. The description of
this work, as written by Andrew Morton, reads:
The basic problem which we (mainly Daniel McNeil) have been
struggling with is in getting a really reliable fsync() across the
page lists while other processes are performing writeback against
the same file. It's like juggling four bars of wet soap with your
eyes shut while someone is whacking you with a baseball bat.
This work made some fundamental changes in how page cache pages are
tracked. The struct page structure has long included a list_head
field called "list", which is used to track
the state of the page. When the page is marked dirty, or placed under I/O,
it is put on a list with other such pages. Unfortunately, managing those
lists as the state of the page changes proves to be difficult; hence the
juggling analogy.
In response, the page lists have been removed altogether; as a
side-benefit, this change shrinks struct page by eight bytes, a
significant savings, considering that there is one such structure for every
physical page in the system. The lists have been replaced with an enhanced
radix tree which supports "tagging" of pages. When a page is dirtied, it
is simply marked dirty in the radix tree, rather than being added to a
list. Similarly, pages which are currently being written back to disk are
marked. A new set of radix tree operations allows the kernel to find these
pages when the need arises. Searching the tree is not as fast as following
a dedicated list, but the radix tree implementation appears to be fast
enough that few people will notice the difference.
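To make the tagging idea concrete, here is a minimal sketch of a tagged radix tree. It is a toy under stated assumptions, not the kernel's implementation: the tree has a fixed height of two levels with 16 slots each (so it only handles indices 0 through 255), and all of the names (toy_root, toy_tag_set(), toy_lookup_tag(), and so on) are hypothetical. The point to notice is that each tag bit is propagated up to the root, so a search for dirty pages can skip entire subtrees which contain none, and the results come out in ascending index order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOTS     16u               /* 4 index bits per level (illustrative) */
#define MAX_INDEX (SLOTS * SLOTS)   /* two levels: indices 0..255 */

enum toy_tag { TOY_DIRTY, TOY_WRITEBACK, TOY_NR_TAGS };

struct toy_leaf {
    void *page[SLOTS];
    unsigned tags[TOY_NR_TAGS];   /* bit i set: page[i] carries the tag */
};

struct toy_root {
    struct toy_leaf *leaf[SLOTS];
    unsigned tags[TOY_NR_TAGS];   /* bit i set: leaf[i] holds a tagged page */
};

/* Store a page pointer at index; returns 0 on success, -1 on failure.
 * Leaves are never freed, to keep the sketch short. */
static int toy_insert(struct toy_root *root, unsigned index, void *page)
{
    if (index >= MAX_INDEX || page == NULL)
        return -1;
    struct toy_leaf **slot = &root->leaf[index / SLOTS];
    if (*slot == NULL && (*slot = calloc(1, sizeof(**slot))) == NULL)
        return -1;
    (*slot)->page[index % SLOTS] = page;
    return 0;
}

/* Tag the page at index and propagate the tag bit up to the root. */
static int toy_tag_set(struct toy_root *root, unsigned index, enum toy_tag tag)
{
    if (index >= MAX_INDEX)
        return -1;
    struct toy_leaf *leaf = root->leaf[index / SLOTS];
    if (leaf == NULL || leaf->page[index % SLOTS] == NULL)
        return -1;                /* nothing stored there to tag */
    leaf->tags[tag] |= 1u << (index % SLOTS);
    root->tags[tag] |= 1u << (index / SLOTS);
    return 0;
}

/* Clear the tag; drop the root's bit once the leaf has no tagged pages. */
static void toy_tag_clear(struct toy_root *root, unsigned index, enum toy_tag tag)
{
    if (index >= MAX_INDEX || root->leaf[index / SLOTS] == NULL)
        return;
    struct toy_leaf *leaf = root->leaf[index / SLOTS];
    leaf->tags[tag] &= ~(1u << (index % SLOTS));
    if (leaf->tags[tag] == 0)
        root->tags[tag] &= ~(1u << (index / SLOTS));
}

/* Collect up to max tagged indices, in ascending order; returns the count. */
static size_t toy_lookup_tag(const struct toy_root *root, enum toy_tag tag,
                             unsigned *out, size_t max)
{
    size_t n = 0;
    for (unsigned i = 0; i < SLOTS && n < max; i++) {
        if (!(root->tags[tag] & (1u << i)))
            continue;             /* no tagged page below: skip the subtree */
        const struct toy_leaf *leaf = root->leaf[i];
        for (unsigned j = 0; j < SLOTS && n < max; j++)
            if (leaf->tags[tag] & (1u << j))
                out[n++] = i * SLOTS + j;
    }
    return n;
}
```

Marking a page dirty in this model touches two bitmaps and no lists, and a page can carry the dirty and writeback tags independently of each other.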
These changes required touching a lot of VM and page cache code; every user
of the page->list field had to be fixed. As a result of the
changes, the order in which dirty pages are written to disk has changed;
writing always happens in file-offset order now. This change appears to be
an improvement for many applications; Andrew reports as much as 30% faster
benchmark results. I/O can slow down in some situations involving
parallel writes on SMP systems, however.
Changes in the kernel build process have yielded a number of benefits in
2.6. They have, however, exposed a few rough edges for people building
external modules. The
required procedure is
a bit inelegant, forces the user to ignore warnings from the build code
("you messed with SUBDIRS, do not complain if something goes wrong"),
and does not support modversions. It also requires the presence of a
configured and built kernel source tree, something which was not necessary
with previous kernels, and a build of an external module will often try to
rebuild things in the main tree as well. Fixing up the external module
build process has been on the "to do" list for some time.
Finally, somebody has done it. Sam Ravnborg has posted a patch which improves the external module
build process in a number of ways.
The basic form of a makefile for an external module will not change much.
It should still look something like:
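ifneq ($(KERNELRELEASE),)
# Invoked from within the kernel build system: just name the module object.
obj-m := module.o
else
# Invoked directly: hand control over to the kernel build system.
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
	$(MAKE) -C $(KDIR) M=$(PWD)
endif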
The only change is in the make invocation: the parameter that once read
SUBDIRS=$(PWD) has changed to M=$(PWD). The
older SUBDIRS= format will still work, however. It is
also no longer necessary to specify the modules target when
invoking the kernel build system.
When the kernel build system is invoked with the M= parameter, it
does a number of things differently. It will make no effort to ensure that
the built files in the kernel source tree are current; if a developer makes
a change to the main tree, it is his or her responsibility to rebuild it
before trying to make any external modules. Only a few targets
(modules, clean, modules_install) are supported
when building external modules. And the modpost program
now maintains a file (Module.symvers) containing the symbol
version information if modversions is in use; this file is used when
postprocessing an external module to note the symbol versions expected by
that module.
Among other things, the new scheme will allow distributors to package
sufficient information for the building of external modules without the
user having to actually configure and build the full kernel source tree.
That information can be stored under /lib/modules by replacing the
build symbolic link (which currently points back to the source
tree) with a directory containing just the required information. That
should make life simpler for everybody involved.
Fedora Core 2 is scheduled to ship in just over one month. This distribution
will be a high-profile
deployment of the 2.6 kernel. Red Hat has often shipped highly-patched
kernels, and there have been occasional criticisms that the company's
kernels are so divergent from the mainline that they are incompatible with
other Linux systems. Since we have been messing with the second Fedora
Core 2 test release anyway, it seemed like a good time to look and see
what sort of kernel it includes. To that end, we pulled down a copy of
2.6.5-1-321 from Arjan van de Ven's directory.
As it turns out, the number of patches contained in this kernel is
relatively small. That is not entirely surprising: vendor kernel patch
lists tend to get longer as the current development kernel progresses, since
some vendors, at least, have a tendency to backport features from the
development tree. There is no development tree currently, so there
is nothing to backport.
That said, the first patch is a big one: it's the full 2.6.5-mc1 tree from Andrew Morton. Now that
the merge candidate patches are finding their way into 2.6.6-pre, Red Hat
will not need to apply that particular patch itself.
The 2.6.6 kernel will feature an option (on by default) to use 4KB kernel
stacks on the i386 architecture. The Fedora kernel has that patch, of
course; it also includes a separate patch which takes away the option of
using the traditional 8KB stacks. This change has upset some Fedora test
users; the 4KB stacks break certain proprietary device drivers
(e.g. nVidia) and some users of those drivers would prefer to have the
ability to build a kernel that supports them. Red Hat seems determined to
follow this path, however, on the assumption that nVidia will fix its
drivers (and the general attitude that breaking binary modules is a
low-priority problem at best).
Then, there are patches which are true Red Hat stuff. These include "exec shield," which makes buffer overflow
attacks harder by enforcing no-execute permissions; the 4G/4G patch which provides expanded 32-bit
virtual address spaces to both user space and the kernel; and TUX, the
kernel-based high-performance web server. There is also an
SELinux/security module patch which allows the kernel to bypass permission
checks when creating sockets internally; this one changes the security
module interface.
Then, there are various cleanup and safety patches. For example, gcc 3.4
supports a "warn_unused_result" attribute on functions; the compiler
will complain when code calls a function marked with this attribute and
fails to check the return value. The Red Hat kernel applies that attribute
to a few functions (copy_from_user(),
pci_enable_device(), etc.) to trap places where the proper checks
are not made. Various functions which use too much kernel stack space have
been fixed up. There is a patch which fixes some remaining
sleep_on() calls and warns about others. The driver for
/dev/mem has been fixed to disallow access to most of main
memory. And there is a driver for a "crash" device which provides direct
read access to main memory, seemingly for use by a crash dump utility.
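For those who have not yet encountered the attribute, the following is a minimal user-space sketch of how it is used. The toy_must_check macro and the toy_copy() and toy_read_request() functions are hypothetical names made up for this example; they are not taken from the Red Hat patch. The sketch assumes a compiler which understands GCC-style attributes, and falls back to an empty macro otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* gcc 3.4 and later understand this attribute; elsewhere it becomes a no-op. */
#if defined(__GNUC__)
#define toy_must_check __attribute__((warn_unused_result))
#else
#define toy_must_check
#endif

/* Hypothetical stand-in for a copy routine which can fail part way.
 * Like copy_from_user(), it returns the number of bytes NOT copied,
 * so zero means complete success. */
static toy_must_check size_t toy_copy(void *dst, size_t dst_len,
                                      const void *src, size_t n)
{
    size_t done = n < dst_len ? n : dst_len;
    memcpy(dst, src, done);
    return n - done;
}

/* A well-behaved caller: the return value is examined and acted upon. */
static int toy_read_request(char *buf, size_t len, const char *src, size_t n)
{
    if (toy_copy(buf, len, src, n) != 0)
        return -1;   /* short copy: report the failure to our own caller */
    return 0;

    /* By contrast, a bare statement such as
     *     toy_copy(buf, len, src, n);
     * should draw a compiler warning about the ignored return value. */
}
```

The attribute costs nothing at run time; its only effect is the compile-time warning, which is what makes it attractive for finding unchecked calls across a large body of code.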
Finally, there is a small set of bug fixes and patches to ease the build
process on various architectures.
Overall, the Fedora kernel suggests that, in Red Hat's view, not a whole
lot needs to be added to the 2.6 kernel (the upcoming 2.6.6 version, at
least) for it to be ready for wide use.
Patches and updates
Kernel trees
Build system
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Page editor: Jonathan Corbet