Kernel development

Brief items

Kernel release status

The current development kernel is 2.5.46, which was released by Linus on November 4. It includes uClinux (a port of the kernel to systems with no memory management unit), the "huge TLB" filesystem for working with large pages, more driver model work, the latest sys_epoll implementation, a big m68k update, the beginning of initramfs support (see below), an ARM update, extended attributes and some online resizing support for the ext2 and ext3 filesystems, and numerous other patches. The long-format changelog has the details.

Linus has taken a break (from kernel development, anyway) since releasing 2.5.46; his BitKeeper tree is almost empty.

The current prepatch from Alan Cox is 2.5.45-ac1; it adds a number of fixes and backs out some "dangerous looking" SCSI driver changes.

The latest 2.5 status summary from Guillaume Boissiere is dated November 4.

Dave Jones has posted version 0.10 of his "post-Halloween" 2.5 kernel document.

The current stable kernel is 2.4.19; there have been no 2.4.20 prepatches released over the last week.

Kernel development news

The Orlov block allocator

The performance of a filesystem depends on many things; one crucial factor is how that filesystem lays out files on the disk. In general, it is best to keep related items together; a kernel compilation will go more quickly if the files within the kernel source tree all live close to each other on the disk. To achieve this goal, the ext2 and ext3 filesystems have long tried to lay out the contents of a directory in the same cylinder group (or, at least, in nearby groups).

In the real world, however, it sometimes turns out to be better to spread things out. Imagine setting up a system with users' home directories in /home. If all the first-level directories within /home (i.e. the home directories for numerous users) are placed next to each other, there may be no space left for the contents of those directories. User files thus end up being placed far from the directories that contain them, and performance suffers. The ext2 filesystem has suffered from this sort of performance degradation for some time.

The 2.5.46 kernel contains a new block allocator which attempts to address this problem. The new scheme, borrowed from BSD, is named the "Orlov allocator," after its creator Grigory Orlov; he has posted a brief description of the technique as it is used in the BSD kernels. The Linux implementation, the work of Alexander Viro, Andrew Morton, and Ted Ts'o, uses a similar technique but adds a few changes.

Essentially, the Orlov algorithm tries to spread out "top-level" directories, on the assumption that they are unrelated to each other. Directories created in the root directory of a filesystem are considered top-level directories; Ted has added a special inode flag that allows the system administrator to mark other directories as being top-level directories as well. If /home lives in the root filesystem (and people do set up systems that way), a simple chattr command will make the system treat it as a top-level directory.

When creating a directory which is not in a top-level directory, the Orlov algorithm tries, as before, to put it into the same cylinder group as its parent. A little more care is taken, however, to ensure that the directory's contents will also be able to fit into that cylinder group; if there are not many inodes or blocks available in the group, the directory will be placed in a different cylinder group which has more resources available. The result of all this, hopefully, is much better locality for files which are truly related to each other and likely to be accessed together.
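The placement decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the kernel code: the Group class, the pick_group() function, and the min_inodes and min_blocks thresholds are all hypothetical names and numbers, and "pick the emptiest group" is a simplifying assumption about how top-level directories get spread out.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """One cylinder group, reduced to the two counters this sketch needs."""
    free_inodes: int
    free_blocks: int


def pick_group(groups, parent, top_level, min_inodes=16, min_blocks=256):
    """Return the index of the group in which to create a new directory.

    parent    -- index of the group holding the parent directory
    top_level -- True if the parent is the root or is flagged as top-level
    min_*     -- hypothetical thresholds for "enough room for the contents"
    """
    def roomy(i):
        g = groups[i]
        return g.free_inodes >= min_inodes and g.free_blocks >= min_blocks

    if top_level:
        # Unrelated directories: spread them out by choosing the group
        # with the most free resources (a simplifying assumption).
        return max(range(len(groups)),
                   key=lambda i: (groups[i].free_inodes, groups[i].free_blocks))

    if roomy(parent):
        # Related directories stay with their parent when there is room.
        return parent

    # Otherwise look for the nearest following group with enough resources.
    n = len(groups)
    for step in range(1, n):
        candidate = (parent + step) % n
        if roomy(candidate):
            return candidate
    return parent  # nothing better is available


groups = [Group(4, 100), Group(500, 5000), Group(100, 1000), Group(900, 9000)]
print(pick_group(groups, parent=0, top_level=True))   # emptiest group
print(pick_group(groups, parent=2, top_level=False))  # stays with the parent
print(pick_group(groups, parent=0, top_level=False))  # parent group is full
```

The fallback search is where the "little more care" comes in: a subdirectory only leaves its parent's group when that group could not hold the directory's likely contents.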

As of this writing, only one benchmark result with the new allocator has been posted. The results are promising: the time required to traverse through a Linux kernel tree (a dauntingly big thing, these days) was reduced by 30% or so. The Orlov scheme needs more rigorous benchmarking; it also needs some serious stress testing to demonstrate that performance does not degrade as the filesystem is changed over time. But the initial results are encouraging. Linux has, once again, benefitted from the ability to borrow good ideas from other free kernels.

Hot and cold pages

One of the many changes rolled into the 2.5.45 kernel was the "hot-n-cold pages" patch from Martin Bligh, Andrew Morton, and others. It's a conceptually simple change that shows how far one has to go to deal with the realities of modern system architecture.

One generally thinks of a system's RAM as being the fastest place to keep data. But memory is slow; the real speed comes from working out of the onboard cache in the processor itself. Much effort has, over the years, gone into trying to optimize the kernel's cache behavior and avoiding the need to go to main memory. The new page allocation system is just another step in that direction.

The processor cache contains memory which has been accessed recently. The kernel often has a good idea of which pages have seen recent accesses and are thus likely to be present in cache. The hot-n-cold patch tries to take advantage of that information by adding two per-CPU free page lists (for each memory zone). When a processor frees a page that is suspected to be "hot" (i.e. represented in that processor's cache), it gets pushed onto the hot list; others go onto the cold list. The lists have high and low limits; after all, if the hot list grows larger than the processor's cache, the chances of those pages actually being hot start to get pretty small.

When the kernel needs a page of memory, the new allocator normally tries to get that page from the processor's hot list. Even if the page is simply going to be overwritten, it's still better to use a cache-warm page. Interestingly, though, there are times when it makes sense to use a cold page instead. If the page is to be used for DMA read operations, it will be filled by the device performing the operation and the cache will be invalidated anyway. So 2.5.45 includes a new GFP_COLD page allocation flag for the situations where using a cold page makes more sense.

The use of per-CPU page lists also cuts down on lock contention, which helps performance further. When pages must be moved between the hot/cold lists and the main memory allocator, they are transferred in multi-page chunks, which reduces the number of times the allocator's lock must be taken.
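The mechanism can be modeled with a toy Python class. This is a sketch under stated assumptions, not the kernel's data structure: PerCpuPages, main_pool, and the low, high, and batch parameters are hypothetical names, and a plain Python list stands in for the main memory allocator (locking is left out entirely).

```python
from collections import deque


class PerCpuPages:
    """Toy model of one CPU's hot and cold free-page lists."""

    def __init__(self, main_pool, low=2, high=8, batch=4):
        self.main_pool = main_pool  # stand-in for the main page allocator
        self.hot = deque()          # right end = most recently freed
        self.cold = deque()
        self.low, self.high, self.batch = low, high, batch

    def free(self, page, hot=True):
        lst = self.hot if hot else self.cold
        lst.append(page)
        if len(lst) > self.high:
            # Over the high limit: return the oldest pages to the main
            # allocator in one multi-page chunk.
            for _ in range(self.batch):
                self.main_pool.append(lst.popleft())

    def alloc(self, cold=False):
        lst = self.cold if cold else self.hot
        if len(lst) <= self.low:
            # At or below the low limit: refill in one chunk. Refilled pages
            # go to the old end so recently freed pages are handed out first.
            for _ in range(min(self.batch, len(self.main_pool))):
                lst.appendleft(self.main_pool.pop())
        if not lst:
            raise MemoryError("no free pages")
        return lst.pop()


pool = list(range(100, 110))
cpu = PerCpuPages(pool, low=0, high=4, batch=2)
cpu.free("A")
cpu.free("B")
print(cpu.alloc())           # the page freed most recently
print(cpu.alloc(cold=True))  # e.g. a buffer about to be filled by DMA
```

In this model the main pool is only touched when a list crosses one of its limits, and then for a whole batch at once; that is the property which keeps lock traffic down in the real allocator.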

Andrew Morton has benchmarked this patch, and included a number of results with one of the patchsets. Performance benefits vary from a mere 1-2% on the all-important kernel compilation time to 12% on the SDET test. That was enough, apparently, to convince Linus.

Initramfs arrives

The "initramfs" concept has been in the 2.5 plans since back before there was a 2.5 kernel. Things have been very quiet on the initramfs front, however, until the first patch showed up and was merged into the 2.5.46 tree.

The basic idea behind initramfs is that a cpio archive can be attached to the kernel image itself. At boot time, the kernel unpacks that archive into a RAM-based disk, which is then mounted and used as the initial root filesystem. Much of the kernel initialization and bootstrap code can then be moved into this disk and run in user mode. Tasks like finding the real root disk, boot-time networking setup, handling of initrd-style ramdisks, ACPI setup, etc. will be shifted out of the kernel in this way.
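To give a feel for what "unpacking a cpio archive" involves, here is a small Python sketch that builds and then walks an archive in memory. It assumes the ASCII "newc" cpio layout (a "070701" magic followed by thirteen 8-digit hex fields, with names and data padded to four-byte boundaries); the text above does not say which cpio variant is used, so treat that choice as an assumption. The helper names make_entry() and list_entries() are hypothetical.

```python
def pad4(n):
    """Round n up to the next multiple of four."""
    return (n + 3) & ~3


def make_entry(name, data=b"", mode=0o100644):
    """Build one newc-style cpio entry (only the fields this sketch needs)."""
    name_b = name.encode() + b"\0"
    # ino, mode, uid, gid, nlink, mtime, filesize,
    # devmajor, devminor, rdevmajor, rdevminor, namesize, check
    fields = [0, mode, 0, 0, 1, 0, len(data), 0, 0, 0, 0, len(name_b), 0]
    out = b"070701" + b"".join(b"%08X" % f for f in fields) + name_b
    out += b"\0" * (pad4(len(out)) - len(out))   # pad header + name
    out += data
    out += b"\0" * (pad4(len(out)) - len(out))   # pad file data
    return out


def list_entries(archive):
    """Walk the archive and return {name: contents} for each member."""
    entries = {}
    pos = 0
    while True:
        if archive[pos:pos + 6] != b"070701":
            raise ValueError("bad cpio magic at offset %d" % pos)
        fields = [int(archive[pos + 6 + 8 * i:pos + 14 + 8 * i], 16)
                  for i in range(13)]
        filesize, namesize = fields[6], fields[11]
        name_start = pos + 110                    # header is 110 bytes
        name = archive[name_start:name_start + namesize - 1].decode()
        if name == "TRAILER!!!":                  # end-of-archive marker
            return entries
        data_start = pad4(name_start + namesize)
        entries[name] = archive[data_start:data_start + filesize]
        pos = pad4(data_start + filesize)


archive = (make_entry("init", b"#!/bin/sh\n", 0o100755)
           + make_entry("etc/motd", b"hi")
           + make_entry("TRAILER!!!"))
print(sorted(list_entries(archive)))
```

A real unpacker would also create directories, device nodes, and links from the mode field; the sketch only recovers names and file contents.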

An obvious advantage of this scheme is that the size of the kernel code itself can shrink. That does not free memory for a running system, since the Linux kernel already dumps initialization code when it is no longer needed. But a smaller code base for the kernel itself makes the whole thing a little easier to maintain, and that is always a good thing. But the real advantages of initramfs are:

  • Customizing the early boot process becomes much easier. Anybody who needs to change how the system boots can now do so with user-space code; patching the kernel itself will no longer be required.

  • Moving the initialization code into user space makes it easier to write that code - it has a full C library, memory protection, etc.

  • As pointed out by Alexander Viro: user-space code is required to deal with the kernel via system calls. This requirement will flush a lot of in-kernel "magic" currently used by the initialization code; the result will be cleaner, safer code.

The patch, as found in 2.5.46, does not do a whole lot; it adds the basic mechanism but only removes "three simple lines" from the current initialization code. The bulk of the code will be added in the coming weeks - now that the "feature" is in the kernel, the details can be filled in without, technically, breaking the feature freeze. The plan for those steps has been laid out by Jeff Garzik:

  • A small C library ("klibc") will be merged to support initramfs applications.

  • A small "kinit" application will be created with klibc. In the beginning, it will only do enough work to show that the mechanism is functioning properly.

  • The "initrd" (initial ramdisk) subsystem will be moved into kinit, and out of the kernel itself.

  • The mounting of the root filesystem will be moved to user space. A lot of code for dealing with things like NFS-mounted root filesystems will go away.

That is as far as the plan goes, for now. There is no doubt that other parts of the initialization process will be moved to user space, however; it will be interesting to see how that process goes.

There are a couple of fundamental open questions that will have to be answered during the remaining 2.5 development period. One is whether the initialization process should be handled by a single "kinit" application, or whether it should be a collection of programs and, probably, shell scripts. Then there is the question of what to do with klibc. It will be packaged with the kernel for now, but a number of kernel developers think that klibc (and the whole user-space initialization setup) should eventually be split off into a separate project. These decisions might not be made until very shortly before the stable release.

EVMS changes direction

The EVMS project is an IBM-sponsored effort to provide volume management services for Linux. EVMS had high hopes for inclusion in the 2.5 kernel, but, when it came down to the wire, Linus opted to merge LVM2 instead. LVM2 lacks many of the features and fancy GUI management tools found in EVMS, but the kernel developers found the code to be much more to their liking. So EVMS got left out in the cold.

Some developers, when their work is passed over for inclusion, complain at length on the linux-kernel list. Others simply take their marbles and go home. The EVMS project, instead, has decided to take a different approach: they will drop their kernel driver and rework their administration tools to work on top of LVM2 instead. The result, with luck, should be the best of both worlds for EVMS users: they get the well-respected management tools on top of the in-kernel LVM2 base.

This decision has been strongly applauded on the kernel list; the EVMS team even got a rare note of respect from Alexander Viro. It takes class to pick yourself up from a big disappointment and move forward with a new, better plan. EVMS should have a lot of support as it moves into the future.

Patches and updates

Kernel trees

Marc-Christian Petersen [PATCH] Linux-2.5.45-mcp1
Marc-Christian Petersen [PATCH] Linux-2.5.45-mcp2
Stephen Hemminger linux-2.5.46-dcl1
Andrea Arcangeli 2.4.20rc1aa1

Core kernel code

Matthew Dobson node_online_map 2.5.45 (5/5)
Eric W. Biederman kexec for 2.5.45
Eric W. Biederman kexec for 2.5.46
Rusty Russell More modules fun!
Michael Hohnbaum NUMA Scheduler (1/2)
Michael Hohnbaum NUMA Scheduler (2/2)
Davide Libenzi total-epoll r2 ...
Davide Libenzi total-epoll r3 ...
Manfred Spraul slab ctor prototype change

Miscellaneous

Rusty Russell What's left over. "Here is the list of features which have are being actively pushed, not NAK'ed, and are not in 2.5.45. There are 13 of them, as appropriate for Halloween."

Page editor: Jonathan Corbet

Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds