Brief items
The current 2.6 prepatch is 2.6.7-rc1, which was announced by Linus on
May 22. The most significant changes are certainly the scheduling domains
patch and, surprisingly, the full set of object-based reverse mapping
patches, including the anon_vma work. This patch also includes a generic
msleep() function for
millisecond-scale waits, a CPU frequency control update, a set of autofs4
patches, a set of patches to shrink the heavily-used dentry structure, the
"filtered wakeup" mechanism (see the May 5 Kernel Page), a libata update,
some architecture updates, the removal of the Intermezzo filesystem due to
lack of use and support, a sysctl variable giving "huge page" access to an
administrator-specified group, the ability to re-enable interrupts while
waiting in spin_lock_irqsave() (for all architectures now),
support in reiserfs for quotas and extended attributes, the NUMA API, a big
ramdisk fixup, and lots of fixes. See the long-format changelog for the
details.
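As a rough illustration of what a millisecond-scale wait means on a kernel whose clock advances in timer ticks, the sketch below converts a millisecond count into ticks, rounding up so that the wait is never shorter than requested. This is a user-space approximation and not the kernel's msleep(); the helper name ms_to_ticks() and the idea of passing the tick rate as a parameter are assumptions made for illustration.

```c
/*
 * Hypothetical helper (not a kernel function): convert a wait given in
 * milliseconds into timer ticks for a clock running at `hz` ticks per
 * second. The division rounds up, so a caller never waits for less
 * time than it asked for.
 */
unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
{
	return (ms * hz + 999) / 1000;
}
```

With a 100-ticks-per-second clock, for example, a 15ms request cannot be expressed exactly and becomes two ticks.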
Linus's BitKeeper repository contains an implementation of separate
interrupt stacks for the PPC64 architecture, an ALSA update, and a fair
number of fixes.
The current tree from Andrew Morton is 2.6.6-mm5. Recent additions to -mm include a
reworking of the symbolic link following code (allowing the eventual
increase of the maximum symbolic link depth from five to eight), a new
block I/O request
barrier implementation (for IDE and SCSI), and the usual collection of
fixes. Andrew has also quietly restored the 8KB stack option on x86 systems.
The current 2.4 prepatch is 2.4.27-pre3; no prepatches have been
released since May 18.
Kernel development news
Immediately prior to releasing 2.6.7-rc1, Linus merged the full remaining
set of virtual memory patches from Andrea Arcangeli and Hugh Dickins,
including the anon_vma code. This action has raised eyebrows in some
quarters; some developers had been under the impression that 2.6 was a
stable kernel series. Nobody seems to doubt that the object-based reverse
mapping code is a good idea in the long run, but merging it now strikes some
developers as unlikely to increase the stability of the 2.6 kernel in the
near future.
Linus defends the change in this way:
It's not "fundamental", in that the reverse mapping is still
done. It's just done in a slightly different way. Going to rmap
was a _fundamental_ change to how we did VM. In contrast, this was
just an "implementation detail".
Most "implementation details" fit into rather fewer than 40 individual
patches, do not involve difficult special cases (such as making all uses of
mremap() work correctly), and avoid making significant changes to
core parts of the virtual memory subsystem. That said, one should note
that the core decision-making VM code has not been changed; the
algorithm for choosing pages to move into and out of memory is the same as
before. It is also notable that there have been almost no VM-related
problem reports since 2.6.7-rc1 was released. This particular change may
just work out in the short term after all.
A related topic is the 4G/4G patch, which separates kernel and user space
entirely so that each can make full use of the 4G virtual address space on
32-bit systems. This patch has been considered for merging for some time,
but has never quite found its way in. Most developers see it as an ugly
hack (though, perhaps, a necessary one), and there is fear of the
(possibly overstated) performance overhead that the 4G/4G mode imposes.
Even so, some people wonder when this patch might be merged.
The answer seems to be "never, if at all possible." The motivations behind
this patch are (1) to make more kernel-space low memory available on
large-memory systems, and (2) to provide a larger virtual address
space for applications. The first reason may well have just become moot;
the anon_vma patch was merged because, among other things, it significantly
reduces the amount of low memory used by the VM subsystem. The initial reports suggest that the current VM code
handles 32GB of memory nicely on 32-bit systems. Since 32-bit systems
rarely come more heavily loaded than that (so far), it is thought that the VM has
gotten as good as it needs to be on those systems.
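To give a sense of why low memory is the scarce resource on large 32-bit machines, here is a back-of-the-envelope calculation of how much memory the kernel needs just to describe every physical page. The figures used (4KB pages, a 32-byte page descriptor, and roughly 896MB of directly-mapped low memory) are assumptions typical of 32-bit x86 configurations, not numbers taken from this article, and the function names are invented for the sketch.

```c
/* Assumed figures for a typical 32-bit x86 configuration. */
#define PAGE_SIZE_BYTES   4096ULL         /* assumed: 4KB pages */
#define STRUCT_PAGE_BYTES 32ULL           /* assumed: size of one page descriptor */
#define LOWMEM_BYTES      (896ULL << 20)  /* assumed: directly-mapped low memory */

/* Number of physical pages in `ram_bytes` of memory. */
unsigned long long page_count(unsigned long long ram_bytes)
{
	return ram_bytes / PAGE_SIZE_BYTES;
}

/* Low memory consumed by the page descriptors alone. */
unsigned long long page_array_bytes(unsigned long long ram_bytes)
{
	return page_count(ram_bytes) * STRUCT_PAGE_BYTES;
}
```

Under these assumptions a 32GB system has 8,388,608 pages and spends 256MB of its roughly 896MB of low memory on page descriptors before anything else is allocated, which is why any per-mapping overhead in the VM subsystem matters so much on such machines.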
The real hope, however, is that a serious transition to 64-bit systems will
happen before too long. The x86 architecture has been stretched much
further than anybody would have expected it to go, and x86_64 makes the
transition so easy that there is very little reason not to do it. The
4G/4G patch is likely to hang around (and be included by some distributors)
for some time; if nothing else, all of the currently-deployed monster x86
systems are likely to go on running for a while yet. But the mainline
kernel may just get away with saying "switch to 64-bit" and leaving that
particular patch out.
It was recently noted that ioctl() system calls are still executed with the
Big Kernel Lock
(BKL) held. A suggestion was made that drivers which can implement
ioctl() without the BKL held should be specially flagged as a way
of increasing parallelism. That suggestion looks like it will not get very
far. But it did pique your editor's interest in current use of the BKL.
Besides, there hasn't been a whole lot else going on this week.
The BKL is an artifact from when the Linux kernel first supported
multiprocessor systems. Making the kernel safe for concurrent access from
multiple CPUs has been a multi-year task; it is not a job that
could have been done all at once at the beginning. So Linux 2.0 supported
SMP systems by way of the BKL, which only allowed one processor to be
running kernel code at any given time. The BKL is essentially a spinlock,
but with a couple of interesting properties:
- The BKL can be taken recursively; the kernel remembers how many times
a given thread has called lock_kernel() and does the right
thing. Normal spinlocks are rather less forgiving.
- Code holding the BKL can sleep. The lock is released while the given
thread sleeps, and reacquired upon awakening.
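The two properties above can be modeled in a few lines of user-space C. This is a single-threaded sketch under stated assumptions, not the kernel's implementation: the global lock is reduced to a flag, each task carries a nesting depth that is -1 when it does not hold the lock, and all of the sketch_*() names are invented for illustration.

```c
#include <assert.h>

/* Stand-in for the one global lock: 1 when held, 0 when free. */
int big_lock_held;

/* Hypothetical task: lock_depth is -1 when the task does not hold the lock. */
struct task {
	int lock_depth;
};

void sketch_lock_kernel(struct task *t)
{
	/* Only the outermost call actually takes the global lock. */
	if (++t->lock_depth == 0) {
		assert(!big_lock_held);	/* a real lock would spin here */
		big_lock_held = 1;
	}
}

void sketch_unlock_kernel(struct task *t)
{
	assert(t->lock_depth >= 0);
	/* Only the outermost unlock releases it. */
	if (--t->lock_depth < 0)
		big_lock_held = 0;
}

/* Going to sleep drops the lock but keeps the nesting depth... */
void sketch_sleep(struct task *t)
{
	if (t->lock_depth >= 0)
		big_lock_held = 0;
}

/* ...so that waking up can retake it at the same depth. */
void sketch_wake(struct task *t)
{
	if (t->lock_depth >= 0) {
		assert(!big_lock_held);	/* a real lock would spin here */
		big_lock_held = 1;
	}
}
```

Keeping the depth in the task rather than in the lock is what makes both behaviors cheap in this model: nested calls only touch a per-task counter, and the sleep path needs nothing more than that counter to know whether the lock must be handed back.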
The BKL made SMP Linux possible, but it didn't scale very well. Its
overhead could be felt even with two processors, and it made running on
anything larger problematic. So the kernel developers have been breaking
the BKL into finer-grained locks ever since. Thus, for example, the block
I/O subsystem went from the BKL to its own lock (io_request_lock)
in 2.2, and from that to individual queue locks in 2.6. The kernel now has
thousands of locks, and some people had assumed that the BKL would be gone
by 2.6.
As it turns out, there are still over 500 lock_kernel() calls in
the 2.6.6 kernel. For the curious, here are some of the places which still
rely on this old, system-wide lock:
- The core kernel retains a few calls. The implementation of the
reboot() system call is one of them; this is, of course, not
one of the more performance-sensitive parts of the kernel. The
boot-time early initialization process is also run with the BKL held. The
sysctl() system call is run under the BKL;
interestingly, while much of /proc is also implemented under
the BKL, it appears that reads and writes to /proc/sys do not
run with the BKL held.
- Many older filesystems (UFS, coda, HPFS, FAT, NCP, SMB, Minix, etc.)
make heavy use of the BKL for serialization. The UnixWare "Boot File
System" implementation has several calls; somehow, they seem unlikely
to be fixed anytime soon. There are also lock_kernel() calls
in NFS, UDF, isofs, the reiserfs journaling code, autofs, and some others.
The ext2 filesystem uses the BKL to protect modifications to the
superblock; ext3, instead, had all of its lock_kernel() calls
purged during the 2.5 development process.
- The rpciod kernel thread spends its entire life with the BKL
held.
- Core dumps are created with the BKL held.
- Block and character devices have their open() methods called under the
BKL. Block release() methods are also called this way, but that is not
true for char drivers.
The default llseek() method runs under the BKL, but, if
a driver or filesystem provides its own llseek() method, that
method will not be called with the BKL held. The fasync()
method is always called under the BKL. As noted at the beginning,
ioctl() methods are called with the lock held; additionally,
the ugly code which does 32-bit emulation on 64-bit systems needs
the BKL.
- The file locking code still requires the BKL.
- Almost 10% of the lock_kernel() calls can be found in the
(old, deprecated) OSS sound code. The ALSA code has no BKL calls,
with one exception: the implementation of its /proc files.
- Most of the architectures retain some calls in the arch-specific
code. The ptrace() system call is one common place for these
calls. i386 also uses the BKL to protect llseek() calls on
the CPUID and MSR pseudo-devices. uClinux performs execve()
calls under the BKL.
- Almost all of the remaining BKL calls are to be found in device
drivers. The TTY subsystem still has quite a few of them, as does
USB. Many of these calls are protecting llseek()
implementations. Quite a few of the rest are for the creation of
special-purpose kernel threads: the daemonize() function
needs to be called with the BKL held. Those calls can, presumably, go
away as the driver code is (slowly) migrated over to the new kthread
calls.
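The llseek() behavior described in the list above (the default method takes the BKL, while a method supplied by a driver or filesystem is called without it) comes down to a simple dispatch. The following user-space sketch models it under stated assumptions: the BKL is reduced to a counter of acquisitions, and the structure and function names are invented for illustration rather than taken from the kernel source.

```c
#include <stddef.h>

/* Counts how many times the stand-in for lock_kernel() was taken. */
int bkl_acquisitions;

struct sketch_file;
typedef long (*seek_fn)(struct sketch_file *f, long offset);

/* Hypothetical, much-reduced file object. */
struct sketch_file {
	long pos;
	seek_fn llseek;		/* NULL: fall back to the default method */
};

/* The default method serializes the position update with the big lock. */
long sketch_default_llseek(struct sketch_file *f, long offset)
{
	bkl_acquisitions++;	/* stands in for lock_kernel() */
	f->pos = offset;
	/* unlock_kernel() would go here */
	return f->pos;
}

/* A driver-supplied method is responsible for its own locking. */
long sketch_driver_llseek(struct sketch_file *f, long offset)
{
	f->pos = offset;
	return f->pos;
}

/* Dispatch: use the driver's method if there is one, else the default. */
long sketch_vfs_llseek(struct sketch_file *f, long offset)
{
	seek_fn fn = f->llseek ? f->llseek : sketch_default_llseek;

	return fn(f, offset);
}
```

Seen this way, a driver can leave the BKL behind for seeks simply by providing its own method, which may explain why so many of the remaining driver calls sit in llseek() implementations.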
Given how poorly the BKL is viewed, it may be surprising that so many
places in the kernel still use it. The simple fact is that, with regard to
the BKL, all of the low-hanging fruit has long since been taken. For most
of the remaining calls, removing the BKL is not worth the trouble and code
churn. So, while removal of the remaining calls over the 2.7 development
series looks entirely possible, it would not be surprising if that did not
happen.
Herbert Xu was the maintainer of a surprising number of core Debian
packages, including the i386 and Alpha kernels. Unfortunately, Mr. Xu
became upset over the Debian Project's perceived recognition of Taiwan as a
separate country, and resigned from the project on May 5. Many of his
packages have been picked up by
others or have gone into the orphan state, but the kernel packages are
important enough to require more careful handling.
The actual process of selecting the new kernel maintainer would appear to
have been done in private; we were not able to get an answer from the
Debian leader about just how it was done. The results have now been made public, however. The Debian kernel
will now be maintained by a team, with William Lee Irwin and
Al Viro at the core. Additional helpers include Troy Benjegerdes, Dann
Frazier, Goto Masanori, Christoph Hellwig, Benjamin Herrenschmidt, Anton
Blanchard, and Arjan van de Ven.
In other words, Debian will now have a set of kernel packages maintained by
active kernel developers. This should help to improve the quality of
Debian's kernels (though, it should be said, complaints about Mr. Xu's
kernels were rare) and to improve the feedback from Debian into the kernel
development process. Mr. Irwin's plans include "aggressive mainline
tracking" and, eventually, a unified source package for all architectures
supported by Debian. Expect some interesting things from the Debian kernel
in the near future.
Page editor: Jonathan Corbet