The current development kernel is 2.5.46, which was released by Linus on
November 4. It
includes uClinux (a port of the kernel to systems with no memory management
unit), the "huge TLB" filesystem for working with large pages, more driver
model work, the latest sys_epoll implementation, a big m68k update, the
beginning of initramfs support (see below), an ARM update, extended
attributes and some online resizing support for the ext2 and ext3
filesystems, and numerous other patches. The long-format changelog
has the details.
Linus has taken a break (from kernel development, anyway) since releasing
2.5.46; his BitKeeper tree is almost empty.
The current prepatch from Alan Cox is 2.5.45-ac1; it adds a number of fixes and backs
out some "dangerous looking" SCSI driver changes.
The latest 2.5 status summary from Guillaume
Boissiere is dated November 4.
Dave Jones has posted version 0.10 of his
"post-Halloween" 2.5 kernel document.
The current stable kernel is 2.4.19; there have been no 2.4.20
prepatches released over the last week.
Kernel development news
The performance of a file system is dependent on many things; one of the
crucial factors is just how that filesystem lays out files on the disk. In
general, it is best to keep related items together; a kernel compilation
will go more quickly if the files within the kernel source tree all live
close to each other on the disk. To achieve this goal, the ext2 and ext3
filesystems have long tried to lay out the contents of a directory in the
same cylinder group (or, at least, in nearby groups).
In the real world, however, it turns out to be better, sometimes, to spread
things out. Imagine setting up a system with users' home directories in
/home. If all the first-level directories within /home
(i.e. the home directories for numerous users) are placed next to each
other, there may be no space left for the contents of those directories.
User files thus end up being placed far from the directories that contain
them, and performance suffers. The ext2 filesystem has suffered from this
sort of performance degradation for some time.
The 2.5.46 kernel contains a new block allocator which attempts to address
this problem. The new scheme, borrowed from BSD, is named the "Orlov
allocator," after its creator Grigory Orlov; he has posted a brief
description of the technique as it is used in the BSD kernels. The
Linux implementation, by
Alexander Viro, Andrew Morton, and Ted Ts'o, uses a similar technique but
adds a few changes.
Essentially, the Orlov algorithm tries to spread out "top-level"
directories, on the assumption that they are unrelated to each other.
Directories created in the root directory of a filesystem are considered
top-level directories; Ted has added a special inode flag that allows
the system administrator to mark other directories as being top-level
directories as well. If /home lives in the root filesystem (and
people do set up systems that way), a simple chattr command will
make the system treat it as a top-level directory.
When creating a directory which is not in a top-level directory, the
Orlov algorithm tries, as before, to put it into the same cylinder group as
its parent. A little more care is taken, however, to ensure that the
directory's contents will also be able to fit into that cylinder group; if
there are not many inodes or blocks available in the group, the directory
will be placed in a different cylinder group which has more resources
available. The result of all this, hopefully, is much better locality for
files which are truly related to each other and likely to be accessed
together.
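The group-selection logic described above can be sketched in a few lines of
Python. This is a simplified model only; the names, thresholds, and the
Group structure are illustrative, and the real ext2/ext3 implementation (in
fs/ext2/ialloc.c and friends) weighs several additional factors.

```python
# Toy model of Orlov-style block group selection for new directories.
# All names and thresholds here are illustrative, not the kernel's.
from dataclasses import dataclass

@dataclass
class Group:
    free_inodes: int
    free_blocks: int
    dirs: int        # directories already allocated in this group

def find_group_orlov(groups, parent_group, top_level,
                     min_inodes=16, min_blocks=64):
    if top_level:
        # Spread unrelated top-level directories: pick the least loaded
        # group that still has a reasonable amount of free space.
        candidates = [i for i, g in enumerate(groups)
                      if g.free_inodes >= min_inodes
                      and g.free_blocks >= min_blocks]
        if candidates:
            return min(candidates, key=lambda i: groups[i].dirs)
    else:
        # Keep an ordinary directory near its parent, but only if the
        # parent's group still has room for the directory's contents.
        g = groups[parent_group]
        if g.free_inodes >= min_inodes and g.free_blocks >= min_blocks:
            return parent_group
    # Otherwise fall back to the group with the most free blocks.
    return max(range(len(groups)), key=lambda i: groups[i].free_blocks)
```

A crowded group full of other users' home directories thus loses out to an
emptier one for a top-level directory, while ordinary subdirectories stay
with their parent unless the parent's group is nearly full.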
As of this writing, only one benchmark
result with the new allocator has been posted. The results are
promising: the time required to traverse through a Linux kernel tree (a
dauntingly big thing, these days) was reduced by 30% or so. The Orlov
scheme needs more rigorous benchmarking; it also needs some serious stress
testing to demonstrate that performance does not degrade as the filesystem
is changed over time. But the initial results are encouraging. Linux has, once
again, benefited from the ability to borrow good ideas from other free
operating systems.
One of the many changes rolled into the 2.5.45 kernel was the "hot-n-cold
pages" patch from Martin Bligh, Andrew Morton, and others. It's a
conceptually simple change that shows how far one has to go to deal with
the realities of modern system architecture.
One generally thinks of a system's RAM as being the fastest place to keep
data. But memory is slow; the real speed comes from working out of
the onboard cache in the processor itself. Much effort has, over the
years, gone into trying to optimize the kernel's cache behavior and
avoiding the need to go to main memory. The new page allocation system is
just another step in that direction.
The processor cache contains memory which has been accessed recently. The
kernel often has a good idea of which pages have seen recent accesses and
are thus likely to be present in cache. The hot-n-cold patch tries to take
advantage of that information by adding two per-CPU free page lists (for
each memory zone). When a processor frees a page that is suspected to be
"hot" (i.e. represented in that processor's cache), it gets pushed onto the
hot list; others go onto the cold list. The lists have high and low
limits; after all, if the hot list grows larger than the processor's cache,
the chances of those pages actually being hot start to get pretty small.
When the kernel needs a page of memory, the new allocator
normally tries to get that page from the processor's hot list. Even if the
page is simply going to be overwritten, it's still better to use a
cache-warm page. Interestingly, though, there are times when it makes
sense to use a cold page instead. If the page is to be used for DMA read
operations, it will be filled by the device performing the operation and
the cache will be invalidated anyway. So 2.5.45 includes a new
GFP_COLD page allocation flag for the situations where using a
cold page makes more sense.
The use of per-CPU page lists also cuts down on lock contention, which
helps performance. When pages must be moved between the hot/cold lists and
the main memory allocator, they are transferred in multi-page chunks, which
further reduces locking overhead and makes things go faster.
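The mechanism just described can be modeled in miniature. The sketch below
is a toy simulation, not the kernel code: the watermark and batch values
are made up, and the real 2.5 implementation in mm/page_alloc.c works with
struct page and per-zone locking rather than integers and lists.

```python
# Toy model of per-CPU hot/cold free-page lists with batched transfers.
# Constants and class names are illustrative, not the kernel's.
from collections import deque

BATCH = 16        # pages moved to/from the global allocator at a time
HIGH_WATER = 32   # trim a per-CPU list back once it grows past this

class PerCPUPages:
    def __init__(self, buddy):
        self.buddy = buddy    # global allocator, modeled as a list of pages
        self.hot = deque()    # pages likely still in this CPU's cache
        self.cold = deque()   # pages probably evicted from the cache

    def free_page(self, page, hot=True):
        lst = self.hot if hot else self.cold
        lst.append(page)
        if len(lst) > HIGH_WATER:
            # Return a whole batch at once; in the kernel this means one
            # lock acquisition instead of one per page.
            for _ in range(BATCH):
                self.buddy.append(lst.popleft())

    def alloc_page(self, cold=False):
        # DMA-read buffers prefer cold pages (the GFP_COLD case);
        # everything else prefers cache-warm pages.
        lst = self.cold if cold else self.hot
        if not lst:
            # Refill from the global allocator in one batch.
            n = min(BATCH, len(self.buddy))
            lst.extend(self.buddy.pop() for _ in range(n))
        return lst.pop() if lst else None
```

Note that a freshly freed hot page is the first one handed back out, which
is exactly the cache-warm reuse the patch is after.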
Andrew Morton has benchmarked this patch, and included a number of results
with one of the patchsets. Performance
benefits vary from a mere 1-2% on the all-important kernel compilation time
to 12% on the SDET test. That was enough, apparently, to convince Linus.
The "initramfs" concept has been in the 2.5 plans since back before there
was a 2.5 kernel. Things have been very quiet on the initramfs
front, however, until the first patch
showed up and was merged into the 2.5.46 tree.
The basic idea behind initramfs is that a cpio archive can be attached to
the kernel image itself. At boot time, the kernel unpacks that archive
into a RAM-based disk, which is then mounted and used as the initial root
filesystem. Much of the kernel initialization and bootstrap code can then
be moved into this disk and run in user mode. Tasks like finding the real
root disk, boot-time networking setup, handling of initrd-style ramdisks,
ACPI setup, etc. will be shifted out of the kernel in this way.
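The archive format the kernel unpacks is cpio's ASCII "newc" format, and a
minimal archive is simple enough to build by hand. The sketch below is an
illustration, not a tool: the file names, modes, and inode numbers are
arbitrary, and a real early-userspace image would carry an actual /init
binary and its support files.

```python
# Minimal builder for a "newc"-format cpio archive of the kind the
# kernel's initramfs code unpacks at boot.  Contents are illustrative.

def newc_entry(name, data=b"", mode=0o100644, ino=1):
    # 110-byte ASCII header: "070701" magic + 13 eight-digit hex fields
    # (ino, mode, uid, gid, nlink, mtime, filesize, devmajor, devminor,
    #  rdevmajor, rdevminor, namesize, checksum).
    fields = (ino, mode, 0, 0, 1, 0, len(data), 0, 0, 0, 0,
              len(name) + 1, 0)
    header = b"070701" + b"".join(b"%08X" % f for f in fields)
    out = header + name.encode() + b"\x00"
    out += b"\x00" * (-len(out) % 4)        # pad name to a 4-byte boundary
    out += data
    out += b"\x00" * (-len(data) % 4)       # pad data to a 4-byte boundary
    return out

def build_archive(files):
    out = b"".join(newc_entry(n, d, ino=i + 1)
                   for i, (n, d) in enumerate(files))
    # The archive ends with the conventional trailer record.
    out += newc_entry("TRAILER!!!", ino=0, mode=0)
    return out
```

An archive built this way, appended to or linked into the kernel image, is
what the boot-time unpacking code turns into the initial root filesystem.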
An obvious advantage of this scheme is that the size of the kernel code
itself can shrink. That does not free memory for a running system, since
the Linux kernel already dumps initialization code when it is no longer
needed. But a smaller code base for the kernel itself makes the whole
thing a little easier to maintain, and that is always a good thing. The
real advantages of initramfs, however, are:
- Customizing the early boot process becomes much easier. Anybody who
needs to change how the system boots can now do so with user-space
code; patching the kernel itself will no longer be required.
- Moving the initialization code into user space makes it easier to
write that code - it has a full C library, memory protection, etc.
- As pointed out by Alexander Viro:
user-space code is required to deal with the kernel via system calls.
This requirement will flush a lot of in-kernel "magic" currently used
by the initialization code; the result will be cleaner, safer code.
The patch, as found in 2.5.46, does not do a whole lot; it adds the basic
mechanism but only removes "three simple lines" from the current
initialization code. The bulk of the code will be added in the coming
weeks - now that the "feature" is in the kernel, the details can be filled
in without, technically, breaking the feature freeze. The plan for those
steps has been laid out by Jeff Garzik:
- A small C library ("klibc") will be merged to support initramfs.
- A small "kinit" application will be created with klibc. In the
beginning, it will only do enough work to show that the mechanism works.
- The "initrd" (initial ramdisk) subsystem will be moved into kinit,
and out of the kernel itself.
- The mounting of the root filesystem will be moved to user space. A
lot of code for dealing with things like NFS-mounted root filesystems
will go away.
That is as far as the plan goes, for now. There is no doubt that other
parts of the initialization process will be moved to user space, however;
it will be interesting to see how that process goes.
There are a couple of fundamental open questions that will have to be answered
during the remaining 2.5 development period. One is whether the
initialization process should be handled by a single "kinit" application,
or whether it should be a collection of programs, and, probably, shell
scripts. Then, there is the question of what to do with klibc. It will be
packaged with the kernel for now, but a number of kernel developers think
that klibc (and the whole user-space initialization setup) should
eventually be split off into a separate project. These decisions might not
be made until very shortly before the stable release.
The EVMS project is an
IBM-sponsored effort to provide volume management services for Linux. EVMS
had high hopes for inclusion in the 2.5 kernel, but, when it came down to
the wire, Linus opted to merge LVM2 instead. LVM2 lacks many of the
features and fancy GUI management tools found in EVMS, but the kernel
developers found the code to be much more to their liking. So EVMS got
left out in the cold.
Some developers, when their work is passed over for inclusion, complain at
length on the linux-kernel list. Others simply take their marbles and go
home. The EVMS project, instead, has decided to
take a different approach: they will drop their kernel driver and
rework their administration tools to work on top of LVM2 instead. The
result, with luck, should be the best of both worlds for EVMS users: they
get the well-respected management tools on top of the in-kernel LVM2 base.
This decision has been strongly applauded on the kernel list; the EVMS team
even got a rare note of respect from
Alexander Viro. It takes class to pick yourself up from a big
disappointment and move forward with a new, better plan. EVMS should have
a lot of support as it moves into the future.
Patches and updates
Core kernel code
Filesystems and block I/O
Benchmarks and bugs
- Rusty Russell: What's left over.. "Here is the list of features which are being actively
pushed, not NAK'ed, and are not in 2.5.45. There are 13 of them, as
appropriate for Halloween."
(October 31, 2002)
Page editor: Jonathan Corbet