Kernel development

Brief items

Kernel release status

The current 2.6 release remains 2.6.6; no 2.6.7 prepatches have been released as of this writing.

Linus's BitKeeper repository contains over 650 changesets, however, indicating that work is proceeding even in the absence of formal releases. These patches include a generic msleep() function for millisecond-scale waits, a CPU frequency control update, a set of autofs4 patches, del_singleshot_timer() (covered here last week), a set of patches to shrink the heavily-used dentry structure, the "filtered wakeup" mechanism (see the May 5 Kernel Page), a libata update, some architecture updates, the scheduling domains patch set (covered here last month), the removal of the InterMezzo filesystem due to lack of use and support (see below), a sysctl variable giving "huge page" access to an administrator-specified group, the ability to re-enable interrupts while waiting in spin_lock_irqsave() (for all architectures now), support in reiserfs for quotas and extended attributes (added over Hans Reiser's objections), and lots of fixes.

The current kernel prepatch from Andrew Morton is 2.6.6-mm4. Recent additions to -mm include the anon-vma reverse mapping code (see below), a fix for the "phenomenally broken" ramdisk driver, the reservation of a system call number for the "kexec" functionality, and lots of fixes.

The current 2.4 prepatch is 2.4.27-pre3, which was released by Marcelo on May 18. Changes this time around include a JFS update, some driver updates, a big serial ATA update, and a number of fixes.

Kernel development news

The status of object-based reverse mapping

The discussion has been quiet in recent times, but work on replacing the low-level reverse-mapping virtual memory code in the 2.6 kernel continues. When we last looked at the new, object-based reverse mapping ("objrmap") approach, there were two competing implementations:

  • Andrea Arcangeli's anon-vma, which adds a data structure creating a connection between each physical page and the virtual memory area (VMA) structures which reference it.

  • Hugh Dickins's anonmm, which associates pages with the top-level memory management ("mm") structure instead.

The two approaches are conceptually similar, but each has its strong and weak points. Their performance is essentially equivalent. Thus far, there has not been any sort of spirited debate over which should be included; most kernel developers, if they have a preference, have kept it to themselves.

Hugh has been busy over the last few weeks, however, creating a series of 40 patches aimed at slowly moving the reverse mapping code over to the object-based approach. The first five of those patches, which are restricted to cleanup and preparatory work, have been merged into the 2.6 mainline. "rmap-10" added anonmm; it was promptly merged into the -mm tree. This action did not imply that anonmm had been chosen over anon-vma, however; it was simply the first step in the testing process which would lead to a final decision.

Hugh's final series of patches (rmap-34 to rmap-40) completes the process by replacing anonmm with anon-vma; these patches are present in 2.6.6-mm4. Hugh introduces the patch set by saying:

Judge for yourselves which you prefer. I do think I was wrong to call anon_vma more complex than anonmm (its lists are easier to understand than my refcounting), and I'm happy with its vma merging after the last patch. It just comes down to whether we can spare the extra 24 bytes (maximum, on 32-bit) per vma for its advantages in swapout and mremap.

As Hugh notes, anon-vma should have better swapping performance, since its structures make it easier to find the VMA for a given page. Additionally, the anonmm code works best when shared anonymous pages have the same virtual address in each address space that uses them; if a process moves pages with mremap(), some relatively complicated work must be performed to make things work. The anon-vma solution does not have that particular problem.

On the other hand, expanding the VMA structure is not something which should be done lightly; some loads can use huge numbers of VMAs, and they must all be located in low memory. That said, either reverse mapping scheme should free far more low memory than it consumes; that is, after all, one of the main points behind this entire exercise.

There still has been no public word on which scheme will be chosen, or when it might be merged. The current state of affairs suggests, however, that anon-vma will be the one that goes in unless some sort of major problem turns up. As for timing: enough major work has already gone into 2.6.7 that it's hard to imagine throwing major VM surgery into the mix. So 2.6.8 is the earliest such a merge could possibly happen. A couple of 2.6 releases after that, the forking of the 2.7 tree might just become a possibility.

4K stacks: some issues remain

Last week's Kernel Page talked about the push toward 4K stacks on the i386 architecture. While most of the problems with the smaller stack size have been worked out, a few remain. Witness, for example, this problem report; it would appear that the 2.6.6 Radeon framebuffer driver is overflowing the 4K stack.

The problem was quickly narrowed down to a couple of new fields added to the radeon_regs structure:

struct radeon_regs {
        ....
        u32             palette[256];
        u32             palette2[256];
};

If one of these structures is placed on the kernel stack (as happens in the radeonfb driver), those two arrays, by themselves, take half of the available space. If that weren't sufficiently annoying, there is the little fact that those arrays are part of an ongoing development and are not actually used for anything in 2.6.6.
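The arithmetic behind "half of the available space" can be checked with a small user-space sketch. The struct name palette_fields and the helper functions below are hypothetical stand-ins; only the two array declarations come from the driver excerpt above, and uint32_t is used in place of the kernel's u32.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

#define KERNEL_STACK_SIZE 4096u   /* the 4K stack discussed above */

/* Stand-in holding only the two fields shown in the excerpt; the real
 * radeon_regs structure has further members (elided as "...." above). */
struct palette_fields {
        uint32_t palette[256];
        uint32_t palette2[256];
};

/* 2 arrays * 256 entries * 4 bytes each = 2048 bytes. */
static_assert(sizeof(struct palette_fields) == 2048,
              "two 256-entry u32 arrays occupy 2048 bytes");

/* Bytes consumed by the two palette arrays alone. */
static size_t palette_bytes(void)
{
        return sizeof(struct palette_fields);
}

/* Share of a 4K stack taken by those arrays, in percent. */
static unsigned palette_stack_percent(void)
{
        return (unsigned)(palette_bytes() * 100 / KERNEL_STACK_SIZE);
}
```

Calling palette_stack_percent() yields 50: any function declaring one of these structures as a local variable has given up half of its stack before doing any work.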

Fixing this particular problem is relatively easy, but this episode has reawakened interest in finding large stack users automatically. One never knows when a developer will expand a data structure without realizing that it is used on the stack in some other place; rather than letting users find this sort of mistake the hard way, it would be better to look for such mistakes explicitly earlier in the development process. To that end, several scripts have been posted which seek out large stack users in a compiled Linux kernel. A quick look at these scripts makes it clear that kernel code is, by no means, the scariest code out there:

objdump --disassemble "$@" | \
sed -ne '/>:/{s/[<>:]*//g; h; }
 /subl\?.*\$0x[^,][^,][^,].*,%esp/{
 s/.*\$0x\([^,]*\).*/\1/; /^[89a-f].......$/d; G; s/\(.*\)\n.* \(.*\)/\1 \2/; p; };
 /subl\?.*%.*,%esp/{ G; s/\(.*\)\n\(.*\)/Dynamic \2 \1/; p; }; ' | \
 sort | \
perl -e 'while (<>) { if (/^([0-9a-f]+)(.*)/) { $decn = hex("0x" . $1);\
     if ($decn > 400) { print "$decn $2\n";} } }'

(from a script by Keith Owens and Arjan van de Ven). Several variants have been posted, most of which are trying to support multiple architectures. None yet have solved the full problem, however: finding full call chains whose cumulative stack usage exceeds the space available. With or without that feature, some sort of stack usage checker is likely to be merged into the kernel build system before too long. That should help the developers to trap the most obvious problems before they find their way into a released kernel.
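To make the unsolved part of the problem concrete, here is a minimal sketch of a cumulative check over a call graph. It is not taken from any of the posted scripts: the table layout, the function names, and the frame sizes are all invented for illustration, and the sketch assumes a call graph without recursion or indirect calls, which a real kernel does not guarantee.

```c
#include <stddef.h>

#define MAX_CALLEES 4
#define NO_CALLEE   (-1)

/* One entry per function: its own frame size and the functions it calls. */
struct func_info {
        const char *name;
        unsigned frame_bytes;          /* stack used by this function's frame */
        int callees[MAX_CALLEES];      /* table indices, ended by NO_CALLEE */
};

/* Worst-case stack use starting at function idx: its own frame plus the
 * deepest chain below it. Assumes the graph contains no cycles. */
static unsigned worst_case_stack(const struct func_info *tab, int idx)
{
        unsigned deepest = 0;
        int i;

        for (i = 0; i < MAX_CALLEES && tab[idx].callees[i] != NO_CALLEE; i++) {
                unsigned d = worst_case_stack(tab, tab[idx].callees[i]);
                if (d > deepest)
                        deepest = d;
        }
        return tab[idx].frame_bytes + deepest;
}

/* Hypothetical call graph; names and sizes are made up. */
static const struct func_info example_graph[] = {
        { "ioctl_entry",   200, { 1, 2, NO_CALLEE } },
        { "set_mode",     2300, { 3, NO_CALLEE } },
        { "small_helper",   64, { NO_CALLEE } },
        { "write_regs",   1800, { NO_CALLEE } },
};
```

In this made-up graph no single frame exceeds 4096 bytes, so a per-function check would flag the large frames without revealing an overflow. worst_case_stack(example_graph, 0) returns 4300, however, because the chain ioctl_entry, set_mode, write_regs adds up to more than a 4K stack can hold.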

Module parameters in sysfs

In the 2.6 kernel, parameters to loadable modules are set up with the module_param() macro:

    module_param(name, type, perm);

The perm parameter was set aside for the sysfs representation of this parameter but has, until now, been unused; almost every declared parameter simply sets it to zero in the 2.6.6 kernel. A new patch has been posted, however, which makes module parameters in sysfs a reality.

This patch creates a new /sys/module directory; a subdirectory will be created for each module loaded into the system. For modules which can be unloaded, a read-only attribute (called refcnt) will be set up which contains the module's current reference count. There will also be attributes for every module parameter whose perm value is not zero; that value will, as expected, set the permissions mask for that parameter.

If the permissions mask allows, module parameters will be writable. In theory, this will give module authors an easy way to export administrator-tweakable knobs to user space. It is worth noting, however, that there is no mechanism for notifying a module that one of its parameters has been changed. Module authors, thus, will have to be careful to ensure that their modules will properly detect and respond to changes to parameters at any time before exporting those parameters in a writable mode. Even so, this patch represents the tying-up of yet another 2.6 loose end.

Goodbye to old code

One of the most important tasks in kernel maintenance is not the addition of new code, but removal of old code that is no longer useful. Unused code bloats the kernel and, potentially, becomes a breeding ground for bugs and security problems. Getting that code out of the way helps keep the kernel cruft level down.

In recent times, the ax has fallen on two subsystems. The first is the InterMezzo filesystem, which has been removed for 2.6.7. InterMezzo is a distributed filesystem from Peter Braam and company with a number of interesting ideas, but, apparently, few users. Maintenance has been lacking, and Mr. Braam finally agreed that it should be removed, noting "In the past 4 years nobody has supported InterMezzo sufficiently for it to become successful." The Lustre filesystem, which is Mr. Braam's current project, appears to be headed for greater success.

The second is the PC9800 architecture; a patch has been posted which removes support for it. There have been a few small objections to this removal, drawing this response from Alexander Viro:

So are you volunteering to maintain the port? Maintainers are MIA; the damn thing doesn't compile; all patches it gets are basically blind ones ("we have that API change, this ought to take care of those drivers and let's hope that possible mistakes will be caught by testers"). Considering the lack of testers (kinda hard to test something that refuses to build), the above actually spells in one word: "bitrot".

There has been a rather conspicuous shortage of people stepping up to maintain the PC9800 port, so chances are that it will be going away soon.

Patches and updates

Kernel trees

Andrew Morton 2.6.6-mm2
Andrew Morton 2.6.6-mm3
Andrew Morton 2.6.6-mm4
Marcelo Tosatti Linux 2.4.27-pre3

Architecture-specific

Fruhwirth Clemens AES i586 optimized
Mikael Pettersson perfctr-2.7.2 for 2.6.6-mm2: i386

Development tools

Device drivers

Filesystems and block I/O

Janitorial

Randy.Dunlap kill off PC9800

Memory management

Security-related

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds