Brief items
The current development kernel is 2.5.34,
released by Linus on September 9. People
who had trouble with 2.5.33 may want to give this one a try; it has some
important per-CPU fixes, and the floppy driver is said to
really
work this time. Also included is a bunch of block I/O work from Al Viro,
memory management work from Andrew Morton, a JFS update, and quite a few
other fixes and updates. The
long-format
changelog is available, as usual. Note that this kernel has a bug
which can cause IDE partitions to disappear.
Linus's BitKeeper tree, which may be 2.5.35 by the time you read this,
contains a large set of patches including a new sys_exit_group() system call (more
thread work by Ingo Molnar), a major merge of IDE code from the 2.4-ac tree
(which, according to Alan Cox, works "better
than expected," but one should still be careful), yet more VM changes via
Andrew Morton (see below), and a number of other fixes and updates.
The current 2.5 status summary from Guillaume
Boissiere came out on September 10.
The current stable kernel is 2.4.19. Marcelo released 2.4.20-pre6 on September 10; it
adds a number of updates and a couple of bugs which make it fail to compile
or boot for a number of users.
Alan Cox's current prepatch is 2.4.20-pre5-ac5, which is given over mostly to
new IDE code. "You can now load ide pci drivers at boot time or as
modules. Don't try unloading the modules yet"
Kernel development news
People sending mail to Linus may want to cut back on their LINES OF
YELLING, keep an eye on vulgar words in their code comments, and so on. It seems
that Linus has started using SpamAssassin, and it is
causing him to lose a few patches due to false
positives. The number of false positives is small enough that he intends
to continue using the filter. And, in the end, most developers probably
agree that kernel development benefits if Linus spends less time wading
through spam.
Ever since Rik van Riel's reverse mapping VM implementation was merged into
the kernel, people have wondered how it could be made to work more
quickly. The rmap code accelerates many memory management operations, but slows down
others. It would be nice to get to the point where the performance
regressions have been mitigated (or eliminated) while keeping the benefits
of the rmap code. Linus's current BitKeeper tree contains one patch from
Andrew Morton which is a big step in that direction.
As described here last
January, the rmap code works by keeping track of which page tables
reference every physical page on the system. This is done by adding a
linked list of rmap entries to the page structure; each entry in
the list points to one page table entry referencing the page. The
maintenance of this list is the source of the bulk of the rmap code's
overhead. The many thousands of these pte_chain
structures require a lot of processing to keep current, are inefficient
(the structure contains two pointers; the one which points to the next
pte_chain entry is pure overhead), and put lots of pressure on the
memory allocation subsystem.
Andrew's solution to this problem is simply to expand the
pte_chain structures to hold multiple page table pointers.
Anywhere between seven and 31 PTE pointers can be stored in a single
pte_chain entry, depending on the architecture. The
chain overhead is reduced accordingly, and the system's cache behavior is
improved. This change, it is claimed, takes 10% off that all-important
kernel compile time - at least on Andrew's wimpy little 16-processor NUMA
system.
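The space saving can be illustrated with a minimal C sketch. The structure layouts, the NRPTE value, and all helper names below are assumptions made for illustration, not the kernel's actual definitions; the only figure taken from the text is that a single entry holds between seven and 31 PTE pointers, depending on the architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table entry; not the kernel's type. */
typedef unsigned long pte_t;

/* Assumed slot count; the text gives a range of 7 to 31 by architecture. */
#define NRPTE 7

/* Old-style entry: one PTE pointer plus one link, so half of it is overhead. */
struct pte_chain_single {
    struct pte_chain_single *next;
    pte_t *ptep;
};

/* Expanded entry: a single link is amortized over NRPTE PTE pointers. */
struct pte_chain_multi {
    struct pte_chain_multi *next;
    pte_t *ptes[NRPTE];          /* NULL marks an unused slot */
};

/* Entries needed to record nr_mappings with each layout. */
size_t nodes_single(size_t nr_mappings)
{
    return nr_mappings;
}

size_t nodes_multi(size_t nr_mappings)
{
    return (nr_mappings + NRPTE - 1) / NRPTE;   /* round up */
}

/*
 * Store ptep in the first free slot of one entry.
 * Returns 0 on success, -1 if the entry is full and the caller
 * would have to allocate and link a new one.
 */
int chain_add(struct pte_chain_multi *pc, pte_t *ptep)
{
    assert(pc != NULL);
    for (size_t i = 0; i < NRPTE; i++) {
        if (pc->ptes[i] == NULL) {
            pc->ptes[i] = ptep;
            return 0;
        }
    }
    return -1;
}

/* Count the mappings recorded along a whole chain. */
size_t chain_count(const struct pte_chain_multi *pc)
{
    size_t n = 0;
    for (; pc != NULL; pc = pc->next)
        for (size_t i = 0; i < NRPTE; i++)
            if (pc->ptes[i] != NULL)
                n++;
    return n;
}
```

With these assumed sizes, a page mapped 100 times needs 100 single-pointer entries but only 15 of the expanded ones, and a walk over the chain touches correspondingly fewer cache lines.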
One other optimization, which has been in the kernel for a while, is to
eliminate the PTE chain entirely for pages which are only mapped into a
single process - of which there are many on a typical system. In that
case, a flag is set in the page structure, and the pointer for the
PTE chain points, instead, directly at the page table entry of interest.
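A rough C sketch of that single-mapping shortcut follows. The flag name, the union layout, and the helper functions are assumptions chosen for illustration and should not be read as the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table entry. */
typedef unsigned long pte_t;

struct pte_chain;                 /* chain entries are opaque in this sketch */

#define PG_DIRECT_SKETCH 0x1UL    /* assumed flag bit: "pointer is a lone PTE" */

/* Simplified page structure; the real one has many more fields. */
struct page_sketch {
    unsigned long flags;
    union {
        struct pte_chain *chain;  /* used when the page has several mappings */
        pte_t *direct;            /* used when exactly one mapping exists */
    } pte;
};

/* Record the first mapping of a page without allocating any chain entry. */
void page_add_first_mapping(struct page_sketch *page, pte_t *ptep)
{
    assert(page->pte.direct == NULL);   /* the page must be unmapped so far */
    page->flags |= PG_DIRECT_SKETCH;
    page->pte.direct = ptep;
}

/*
 * Return the sole PTE when the page is singly mapped, or NULL when
 * the caller would have to walk a chain (or the page is unmapped).
 */
pte_t *page_single_pte(const struct page_sketch *page)
{
    return (page->flags & PG_DIRECT_SKETCH) ? page->pte.direct : NULL;
}
```

Reusing the same pointer field for both cases keeps the page structure from growing, at the cost of a flag test on every lookup.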
The rmap code still has its performance costs, especially in the
fork system call. But those costs are shrinking - as are
inefficiencies throughout the kernel.
Lest one think that tweaking rmap is all that is happening in the memory
management world: a great deal of code is currently circulating which makes
big changes, and it has been finding its way into Linus's kernel.
For example, 2.5.34 includes Patricia Gaughen's discontiguous memory patch,
which is aimed at the needs of large, NUMA systems. On such systems, you
no longer just have a simple array of memory; instead, the system's RAM is
broken up into zones, each of which is attached to a particular NUMA node.
Memory accesses within a node are faster than cross-node references, so the
kernel needs to know where any given page resides. Memory on these systems
can also have address holes between each node's zone.
The discontiguous memory patch does away with the classic mem_map
array, which contained one struct page structure for each
page on the system. The memory map is now split into separate, per-node
maps, and all references to mem_map in the kernel have
been changed. Rather than dealing with simple indexes into mem_map,
the kernel now works with page frame numbers; an old reference to
mem_map+i is now pfn_to_page(i). For the most part, code
which did not access mem_map directly will likely require no
changes in response to the discontiguous memory patches. But there will be
exceptions...
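The idea of translating a page frame number through per-node maps can be sketched in C as below. Everything here (the node descriptor, its field names, the fixed node count, and the linear search) is a simplifying assumption for illustration; the patch's real lookup is architecture-specific and is not shown in this article.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified page structure for the sketch. */
struct page_sketch {
    unsigned long flags;
};

/* Hypothetical per-node descriptor: a PFN range and that node's own map. */
struct node_sketch {
    unsigned long start_pfn;
    unsigned long nr_pages;
    struct page_sketch *node_mem_map;
};

#define MAX_NODES_SKETCH 2        /* assumed node count */

struct node_sketch nodes[MAX_NODES_SKETCH];

/* Describe one node's memory range and its page array. */
void node_register(int nid, unsigned long start_pfn,
                   unsigned long nr_pages, struct page_sketch *map)
{
    assert(nid >= 0 && nid < MAX_NODES_SKETCH);
    nodes[nid].start_pfn = start_pfn;
    nodes[nid].nr_pages = nr_pages;
    nodes[nid].node_mem_map = map;
}

/*
 * Translate a page frame number into its page structure by finding
 * the node whose range contains it. Returns NULL for a PFN that
 * falls into an address hole between nodes.
 */
struct page_sketch *pfn_to_page_sketch(unsigned long pfn)
{
    for (int n = 0; n < MAX_NODES_SKETCH; n++) {
        const struct node_sketch *nd = &nodes[n];

        if (nd->node_mem_map != NULL &&
            pfn >= nd->start_pfn &&
            pfn - nd->start_pfn < nd->nr_pages)
            return &nd->node_mem_map[pfn - nd->start_pfn];
    }
    return NULL;
}
```

With one node registered at PFN 0 and another at PFN 100, four pages each, PFN 101 resolves to the second entry of the second node's map, while PFN 50 lands in the hole and yields NULL. A flat mem_map+i expression cannot represent that hole without wasting page structures on memory that does not exist.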
Andrew Morton's "-mm" patches have become the staging area for memory
management changes. The current patch as of this writing (2.5.34-mm1) contains a long list of other
changes, including:
- Directory indexes for the ext3 filesystem (by Daniel Phillips).
Calling this one "memory
management" is a bit of a stretch, of course, but it is a definite
performance improver when large directories are used.
- A patch by William Lee Irwin which lets the i386 architecture
maintain page tables in high memory.
- A change to the readv and writev system calls
(by Janet Morgan) which submits all segments for I/O in parallel; this
patch greatly speeds up direct disk I/O operations.
- Rohit Seth's large page patch for the i386 architecture (covered here
last month).
- A patch which allows copy_from_user and copy_to_user
to be called in atomic (non-blocking) situations. If the copy
operation encounters a page fault, it simply fails.
- ...and many other changes.
One interesting side result from work like the atomic copy_*_user
functions and the preemptible kernel is a formalization of just when the
kernel is performing an atomic operation. Code in the 2.4 (and prior)
kernel could check for certain situations where atomic operation was
required, such as when servicing an interrupt. In 2.5, other atomic
situations (holding a spinlock, for example) are tracked as well, so code
that needs to can easily say "don't interrupt me or sleep now." The result should be
more explicit code and fewer bugs.
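One way to picture this bookkeeping is a nesting counter that is raised whenever the kernel enters a section where sleeping is forbidden, together with a copy routine that gives up rather than fault while the counter is nonzero. The C sketch below is a user-space model under those assumptions: the counter, the function names, and the "resident" parameter that simulates a page fault are all hypothetical, and only the convention of returning the number of bytes not copied is borrowed from copy_from_user.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical nesting counter; nonzero means "must not sleep". */
int atomic_depth_sketch;

/* Called on entering an atomic section, e.g. when taking a spinlock. */
void enter_atomic(void)
{
    atomic_depth_sketch++;
}

/* Called on leaving the section; sections may nest. */
void leave_atomic(void)
{
    assert(atomic_depth_sketch > 0);
    atomic_depth_sketch--;
}

int in_atomic_sketch(void)
{
    return atomic_depth_sketch != 0;
}

/*
 * Copy n bytes from src to dst. The resident argument simulates
 * whether the source page is in memory. Returns the number of bytes
 * NOT copied, so 0 means success.
 */
size_t copy_sketch(void *dst, const void *src, size_t n, int resident)
{
    if (!resident && in_atomic_sketch())
        return n;          /* handling the fault could sleep: fail instead */

    /* Outside atomic context a fault could be serviced normally;
     * the model simply performs the copy. */
    memcpy(dst, src, n);
    return 0;
}
```

A caller holding a lock can then try the copy, see a nonzero return, drop the lock, and retry in a context where sleeping is allowed.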
Paolo Ciarrocchi recently posted
an article
giving some benchmark results on his laptop; these results generally show
that 2.5.33 performs a little more slowly than the 2.4 kernels. Given that
much of the work in 2.5 has been oriented around performance, what is
happening here? Daniel Phillips
summarized
things as follows:
"I suspect the overall performance loss on the laptop has more to do
with several months of focussing exclusively on the needs of 4-way
and higher smp machines."
The fear that large systems performance work would slow things down on the
hardware that most of us actually use has been present for years. Could it
be that the big iron is finally taking over the kernel?
The answer, for now, is probably "no." 2.5 development efforts have indeed
emphasized large systems performance so far. The small-systems performance
has not been impaired so much as simply passed over for now. As Andrew
Morton put it:
"It's on the larger machines where 2.4 has problems. Fixing them up
makes the kernel broader, more general purpose. We're seeing
50-100% gains in some areas there. Giving away a few percent on
smaller machines at this stage is OK. But yup, we need to go and
get that back later"
Small-systems tuning, of course, is work that can mostly happen after next
month's feature freeze. Expect some serious efforts in that direction -
small and embedded systems, after all, are a huge part of the Linux user
base. It wouldn't do to leave them out in the cold.
Patches and updates
Core kernel code
- Andrew Morton: readv/writev rework. "This is Janet Morgan's patch which converts the readv/writev code
to submit all segments for IO before waiting on them, rather than
submitting each segment separately."
(September 11, 2002)
Device drivers
- Jens Axboe: 2.5.34 IDE. "I've updated 2.5 IDE code to match what is currently in 2.4.20-pre5-ac4,
since is much nicer and better structured."
(September 11, 2002)
Memory management
- Ed Tomlinson: slabnow.
(September 10, 2002)
Page editor: Jonathan Corbet