Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.7-rc1, which was announced by Linus on May 22. The most significant changes are certainly the scheduling domains patch, and, surprisingly, the full set of object-based reverse mapping patches, including the anon_vma work. This patch also includes a generic msleep() function for millisecond-scale waits, a CPU frequency control update, a set of autofs4 patches, a set of patches to shrink the heavily-used dentry structure, the "filtered wakeup" mechanism (see the May 5 Kernel Page), a libata update, some architecture updates, the removal of the Intermezzo filesystem due to lack of use and support, a sysctl variable giving "huge page" access to an administrator-specified group, the ability to re-enable interrupts while waiting in spin_lock_irqsave() (for all architectures now), support in reiserfs for quotas and extended attributes, the NUMA API, a big ramdisk fixup, and lots of fixes. See the long-format changelog for the details.

Linus's BitKeeper repository contains an implementation of separate interrupt stacks for the PPC64 architecture, an ALSA update, and a fair number of fixes.

The current tree from Andrew Morton is 2.6.6-mm5. Recent additions to -mm include a reworking of the symbolic link following code (allowing the eventual increase of the maximum symbolic link depth from five to eight), a new block I/O request barrier implementation (for IDE and SCSI), and the usual collection of fixes. Andrew has also quietly restored the 8KB stack option on x86 systems.

The current 2.4 prepatch is 2.4.27-pre3; no prepatches have been released since May 18.

Kernel development news

The merging of anon_vma and 4G/4G

Immediately prior to releasing 2.6.7-rc1, Linus merged the full remaining set of virtual memory patches from Andrea Arcangeli and Hugh Dickins, including the anon_vma code. This action has raised eyebrows in some quarters; some developers had been under the impression that 2.6 was a stable kernel series. Nobody seems to doubt that the object-based reverse mapping code is a good idea in the long run, but merging it now strikes some developers as unlikely to increase the stability of the 2.6 kernel in the near future.

Linus defends the change in this way:

It's not "fundamental", in that the reverse mapping is still done. It's just done in a slightly different way. Going to rmap was a _fundamental_ change to how we did VM. In contrast, this was just an "implementation detail".

Most "implementation details" fit into rather less than 40 individual patches, do not involve difficult special cases (such as making all uses of mremap() work correctly), and avoid making significant changes to core parts of the virtual memory subsystem. That said, one should note that the core decision-making VM code has not been changed; the algorithm for choosing pages to move into and out of memory is the same as before. It is also notable that there have been almost no VM-related problem reports since 2.6.7-rc1 was released. This particular change may just work out in the short term after all.

A related topic is the 4G/4G patch, which separates kernel and user space entirely so that each can make full use of the 4G virtual address space on 32-bit systems. This patch has been considered for merging for some time, but has never quite found its way in. Most developers see it as an ugly hack (though, perhaps, a necessary one), and there is fear of the (possibly overstated) performance overhead that the 4G/4G mode imposes. Even so, some people wonder when this patch might be merged.

The answer seems to be "never, if at all possible." The motivations behind this patch are (1) to make more kernel-space low memory available on large-memory systems, and (2) to provide a larger virtual address space for applications. The first reason may well have just become moot; the anon_vma patch was merged because, among other things, it significantly reduces the amount of low memory used by the VM subsystem. The initial reports suggest that the current VM code handles 32GB of memory nicely on 32-bit systems. Since 32-bit systems rarely come more heavily loaded than that (so far), it is thought that the VM has gotten as good as it needs to be on those systems.

The real hope, however, is that a serious transition to 64-bit systems will happen before too long. The x86 architecture has been stretched much further than anybody would have expected it to go, and x86_64 makes the transition so easy that there is very little reason not to do it. The 4G/4G patch is likely to hang around (and be included by some distributors) for some time; if nothing else, all of the currently-deployed monster x86 systems are likely to go on running for a while yet. But the mainline kernel may just get away with saying "switch to 64-bit" and leaving that particular patch out.

The Big Kernel Lock lives on

It was recently noted that ioctl() system calls are still executed with the Big Kernel Lock (BKL) held. A suggestion was made that drivers which can implement ioctl() without the BKL held should be specially flagged as a way of increasing parallelism. That suggestion looks like it will not get very far. But it did pique your editor's interest in current use of the BKL. Besides, there hasn't been a whole lot else going on this week.

The BKL is an artifact from when the Linux kernel first supported multiprocessor systems. Making the kernel safe for concurrent access from multiple CPUs has been a multi-year task; it is not a job that could have been done all at once at the beginning. So Linux 2.0 supported SMP systems by way of the BKL, which only allowed one processor to be running kernel code at any given time. The BKL is essentially a spinlock, but with a couple of interesting properties:

  • The BKL can be taken recursively; the kernel remembers how many times a given thread has called lock_kernel() and does the right thing. Normal spinlocks are rather less forgiving.

  • Code holding the BKL can sleep. The lock is released while the given thread sleeps, and reacquired upon awakening.
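The two properties above can be illustrated with a toy user-space model. This is only a sketch in Python, not kernel code: the class name BigLock and the sleep_holding() helper are hypothetical names invented for this illustration, and while lock_kernel() and unlock_kernel() borrow the kernel's function names, they share nothing with the kernel's implementation. In the kernel, the release-on-sleep step happens inside the scheduler; here it is modeled as an explicit call.

```python
import threading


class BigLock:
    """Toy model of BKL semantics (hypothetical; not how the kernel does it)."""

    def __init__(self):
        self._lock = threading.Lock()   # the single, system-wide lock
        self._owner = None              # ident of the thread holding it
        self._depth = 0                 # how many times the owner took it

    def lock_kernel(self):
        me = threading.get_ident()
        if self._owner == me:
            # Recursive acquisition: just count it.
            self._depth += 1
            return
        self._lock.acquire()
        self._owner = me
        self._depth = 1

    def unlock_kernel(self):
        if self._owner != threading.get_ident():
            raise RuntimeError("unlock_kernel() without holding the lock")
        self._depth -= 1
        if self._depth == 0:
            # Only the outermost unlock really releases the lock.
            self._owner = None
            self._lock.release()

    def sleep_holding(self, wait):
        """Model a holder going to sleep: drop the lock completely,
        run the blocking call wait(), then retake the lock at the
        same recursion depth as before."""
        me = threading.get_ident()
        if self._owner != me:
            raise RuntimeError("sleep_holding() without holding the lock")
        saved_depth = self._depth
        self._depth = 0
        self._owner = None
        self._lock.release()
        try:
            wait()
        finally:
            self._lock.acquire()
            self._owner = me
            self._depth = saved_depth
```

A thread that has called lock_kernel() twice and then calls sleep_holding() lets other threads take the lock while it waits, and finds itself holding the lock at depth two again when it wakes. An ordinary non-recursive lock would instead deadlock on the second acquisition.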

The BKL made SMP Linux possible, but it didn't scale very well. Its overhead could be felt even with two processors, and it made running on anything larger problematic. So the kernel developers have been breaking the BKL into finer-grained locks ever since. Thus, for example, the block I/O subsystem went from the BKL to its own lock (io_request_lock) in 2.2, and from that to individual queue locks in 2.6. The kernel now has thousands of locks, and some people had assumed that the BKL would be gone by 2.6.

As it turns out, there are still over 500 lock_kernel() calls in the 2.6.6 kernel. For the curious, here are some of the places which still rely on this old, system-wide lock:

  • The core kernel retains a few calls. The implementation of the reboot() system call is one of them; this is, of course, not one of the more performance-sensitive parts of the kernel. The boot-time early initialization process is also run with the BKL held. The sysctl() system call is run under the BKL; interestingly, while much of /proc is also implemented under the BKL, it appears that reads and writes to /proc/sys do not run with the BKL held.

  • Many older filesystems (UFS, Coda, HPFS, FAT, NCP, SMB, Minix, etc.) make heavy use of the BKL for serialization. The UnixWare "Boot File System" implementation has several calls; somehow, they seem unlikely to be fixed anytime soon. There are also lock_kernel() calls in NFS, UDF, isofs, the reiserfs journaling code, autofs, and some others. The ext2 filesystem uses the BKL to protect modifications to the superblock; ext3, instead, had all of its lock_kernel() calls purged during the 2.5 development process.

  • The rpciod kernel thread spends its entire life with the BKL held.

  • Core dumps are created with the BKL held.

  • Block and character devices have their open() methods called under the BKL. Block release() methods are also called this way, but that is not true for char drivers. The default llseek() method runs under the BKL, but, if a driver or filesystem provides its own llseek() method, that method will not be called with the BKL held. The fasync() method is always called under the BKL. As noted at the beginning, ioctl() methods are called with the lock held; additionally, the ugly code which does 32-bit emulation on 64-bit systems needs the BKL.

  • The file locking code still requires the BKL.

  • Almost 10% of the lock_kernel() calls can be found in the (old, deprecated) OSS sound code. The ALSA code has no BKL calls, with one exception: the implementation of its /proc files.

  • Most of the architectures retain some calls in the arch-specific code. The ptrace() system call is one common place for these calls. i386 also uses the BKL to protect llseek() calls on the CPUID and MSR pseudo-devices. uClinux performs execve() calls under the BKL.

  • Almost all of the remaining BKL calls are to be found in device drivers. The TTY subsystem still has quite a few of them, as does USB. Many of these calls are protecting llseek() implementations. Quite a few of the rest are for the creation of special-purpose kernel threads: the daemonize() function needs to be called with the BKL held. Those calls can, presumably, go away as the driver code is (slowly) migrated over to the new kthread calls.

Given how poorly the BKL is viewed, it may be surprising that so many places in the kernel still use it. The simple fact is that, with regard to the BKL, all of the low-hanging fruit has long since been taken. For most of the remaining calls, removing the BKL is not worth the trouble and code churn. So, while removal of the remaining calls over the 2.7 development series looks entirely possible, it would not be surprising if that does not happen.

The new Debian kernel team

Herbert Xu was the maintainer of a surprising number of core Debian packages, including the i386 and Alpha kernels. Unfortunately, Mr. Xu became upset over the Debian Project's perceived recognition of Taiwan as a separate country, and resigned from the project on May 5. Many of his packages have been picked up by others or have gone into the orphan state, but the kernel packages are important enough to require more careful handling.

The actual process of selecting the new kernel maintainer would appear to have been done in private; we were not able to get an answer from the Debian leader about just how it was done. The results have now been made public, however. The Debian kernel will now be maintained by a team, with William Lee Irwin and Al Viro at the core. Additional helpers include Troy Benjegerdes, Dann Frazier, Goto Masanori, Christoph Hellwig, Benjamin Herrenschmidt, Anton Blanchard, and Arjan van de Ven.

In other words, Debian will now have a set of kernel packages maintained by active kernel developers. This should help to improve the quality of Debian's kernels (though, it should be said, complaints about Mr. Xu's kernels were rare) and to improve the feedback from Debian into the kernel development process. Mr. Irwin's plans include "aggressive mainline tracking" and, eventually, a unified source package for all architectures supported by Debian. Expect some interesting things from the Debian kernel in the near future.

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds