Kernel development [LWN.net]

Kernel release status

The current development kernel is 2.5.34, released by Linus on September 9. People who had trouble with 2.5.33 may want to give this one a try; it has some important per-CPU fixes, and the floppy driver is said to really work this time. Also included is a bunch of block I/O work from Al Viro, memory management work from Andrew Morton, a JFS update, and quite a few other fixes and updates. The long-format changelog is available, as usual. Note that this kernel has a bug which can cause IDE partitions to disappear.

Linus's BitKeeper tree, which may be 2.5.35 by the time you read this, contains a large set of patches including a new sys_exit_group() system call (more thread work by Ingo Molnar), a major merge of IDE code from the 2.4-ac tree (which, according to Alan Cox, works "better than expected," but one should still be careful), yet more VM changes via Andrew Morton (see below), and a number of other fixes and updates.

The current 2.5 status summary from Guillaume Boissiere came out on September 10.

The current stable kernel is 2.4.19. Marcelo released 2.4.20-pre6 on September 10; it adds a number of updates and a couple of bugs which make it fail to compile or boot for a number of users.

Alan Cox's current prepatch is 2.4.20-pre5-ac5, which is given over mostly to new IDE code. "You can now load ide pci drivers at boot time or as modules. Don't try unloading the modules yet"


Linus gets spam filtering

People sending mail to Linus may want to cut back on their LINES OF YELLING, keep an eye on vulgar words in their code comments, and so on. It seems that Linus has started using SpamAssassin, and it is causing him to lose a few patches due to false positives. The number of false positives is small enough that he intends to continue using the filter. And, in the end, most developers probably agree that kernel development benefits if Linus spends less time wading through spam.


Speeding up reverse mapping

Ever since Rik van Riel's reverse mapping VM implementation was merged into the kernel, people have wondered how it could be made to work more quickly. The rmap code accelerates many memory management operations, but slows down others. It would be nice to get to the point where the performance regressions have been mitigated (or eliminated) while keeping the benefits of the rmap code. Linus's current BitKeeper tree contains one patch from Andrew Morton which is a big step in that direction.

As described here last January, the rmap code works by keeping track of which page tables reference every physical page on the system. This is done by adding a linked list of rmap entries to the page structure; each entry in the list points to one page table entry referencing the page. The maintenance of this list is the source of the bulk of the rmap code's overhead. The many thousands of these pte_chain structures require a lot of processing to keep current, are inefficient (the structure contains two pointers; the one which points to the next pte_chain entry is pure overhead), and put lots of pressure on the memory allocation subsystem.
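To make that overhead concrete, here is a minimal user-space sketch of a one-pointer-per-entry chain of the kind described above. The type and function names, and the cut-down page structure, are assumptions made for illustration; the in-kernel definitions carry more state and need proper locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned long pte_t;            /* stand-in for a page table entry */

/* One node per mapping: two pointers, of which `next` is pure overhead. */
struct pte_chain {
    struct pte_chain *next;
    pte_t *ptep;
};

/* Hypothetical, cut-down page structure holding the head of the chain. */
struct page {
    struct pte_chain *chain;
};

/* Record that *ptep maps this page; 0 on success, -1 if out of memory. */
int page_add_rmap(struct page *page, pte_t *ptep)
{
    struct pte_chain *pc = malloc(sizeof(*pc));

    if (pc == NULL)
        return -1;
    pc->ptep = ptep;
    pc->next = page->chain;
    page->chain = pc;
    return 0;
}

/* Forget the mapping through *ptep; 0 if it was found, -1 otherwise. */
int page_remove_rmap(struct page *page, pte_t *ptep)
{
    struct pte_chain **link = &page->chain;

    while (*link != NULL) {
        struct pte_chain *pc = *link;

        if (pc->ptep == ptep) {
            *link = pc->next;
            free(pc);
            return 0;
        }
        link = &pc->next;
    }
    return -1;
}

/* Count the page table entries which reference the page. */
size_t page_mapcount(const struct page *page)
{
    size_t n = 0;

    for (const struct pte_chain *pc = page->chain; pc != NULL; pc = pc->next)
        n++;
    return n;
}

/* Self-check: map a page twice, unmap once, report what is left. */
size_t rmap_demo(void)
{
    struct page page = { NULL };
    pte_t a = 0, b = 0;
    size_t left;
    int rc;

    rc = page_add_rmap(&page, &a);
    assert(rc == 0);
    rc = page_add_rmap(&page, &b);
    assert(rc == 0);
    (void)rc;
    page_remove_rmap(&page, &a);
    left = page_mapcount(&page);
    page_remove_rmap(&page, &b);        /* release the last node */
    return left;
}
```

In this sketch every new mapping costs one allocation, and half of each node is bookkeeping rather than payload.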

Andrew's solution to this problem is simply to expand the pte_chain structures to hold multiple page table pointers. Anywhere between seven and 31 PTE pointers can be stored in a single pte_chain entry, depending on the architecture. The chain overhead is reduced accordingly, and the system's cache behavior is improved. This change, it is claimed, takes 10% off that all-important kernel compile time - at least on Andrew's wimpy little 16-processor NUMA system.
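A multi-slot chain entry might look something like the sketch below. Sizing each node to fill one cache line is an assumption on our part (it is one plausible way to arrive at an architecture-dependent slot count), and the 64-byte line size, the names, and the free-slot convention are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned long pte_t;            /* stand-in for a page table entry */

#define CACHE_LINE_BYTES 64             /* assumed; varies by architecture */
/* One slot is given up for the `next` pointer; the rest hold PTE pointers. */
#define NRPTE (CACHE_LINE_BYTES / sizeof(void *) - 1)

struct pte_chain {
    struct pte_chain *next;
    pte_t *ptes[NRPTE];                 /* NULL marks a free slot */
};

struct page {
    struct pte_chain *chain;
};

/* Record a mapping, reusing a free slot when one exists. */
int page_add_rmap(struct page *page, pte_t *ptep)
{
    struct pte_chain *pc;

    assert(ptep != NULL);
    for (pc = page->chain; pc != NULL; pc = pc->next) {
        for (size_t i = 0; i < NRPTE; i++) {
            if (pc->ptes[i] == NULL) {
                pc->ptes[i] = ptep;
                return 0;
            }
        }
    }

    /* Every node is full (or there are none yet): add one at the head. */
    pc = malloc(sizeof(*pc));
    if (pc == NULL)
        return -1;
    for (size_t i = 0; i < NRPTE; i++)
        pc->ptes[i] = NULL;
    pc->ptes[0] = ptep;
    pc->next = page->chain;
    page->chain = pc;
    return 0;
}

/* Self-check: how many chain nodes does a given number of mappings need? */
size_t nodes_needed(size_t mappings)
{
    struct page page = { NULL };
    pte_t *ptes = calloc(mappings ? mappings : 1, sizeof(*ptes));
    size_t nodes = 0;

    assert(ptes != NULL);
    for (size_t i = 0; i < mappings; i++) {
        int rc = page_add_rmap(&page, &ptes[i]);

        assert(rc == 0);
        (void)rc;
    }
    while (page.chain != NULL) {
        struct pte_chain *pc = page.chain;

        page.chain = pc->next;
        free(pc);
        nodes++;
    }
    free(ptes);
    return nodes;
}
```

With these assumed numbers, a node holds seven pointers on a 64-bit machine, so seven mappings fit in a single allocation where the one-entry-per-node layout would need seven.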

One other optimization, which has been in the kernel for a while, is to eliminate the PTE chain entirely for pages which are only mapped into a single process - of which there are many on a typical system. In that case, a flag is set in the page structure, and the pointer for the PTE chain points, instead, directly at the page table entry of interest.
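The single-mapping shortcut can be sketched as a flag plus a union in the page structure. The flag name, the union layout, and the helpers below are assumptions for illustration only; a simple one-pointer chain node is used to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned long pte_t;            /* stand-in for a page table entry */

struct pte_chain {
    struct pte_chain *next;
    pte_t *ptep;
};

#define PG_DIRECT 0x1UL                 /* hypothetical "no chain" flag */

struct page {
    unsigned long flags;
    union {
        struct pte_chain *chain;        /* valid when PG_DIRECT is clear */
        pte_t *direct;                  /* valid when PG_DIRECT is set */
    } pte;
};

static struct pte_chain *new_node(pte_t *ptep, struct pte_chain *next)
{
    struct pte_chain *pc = malloc(sizeof(*pc));

    if (pc != NULL) {
        pc->ptep = ptep;
        pc->next = next;
    }
    return pc;
}

int page_add_rmap(struct page *page, pte_t *ptep)
{
    struct pte_chain *pc;

    if (!(page->flags & PG_DIRECT) && page->pte.chain == NULL) {
        /* First mapping: no allocation, point straight at the PTE. */
        page->pte.direct = ptep;
        page->flags |= PG_DIRECT;
        return 0;
    }
    if (page->flags & PG_DIRECT) {
        /* Second mapping: move the direct pointer into a real chain. */
        pc = new_node(page->pte.direct, NULL);
        if (pc == NULL)
            return -1;
        page->pte.chain = pc;
        page->flags &= ~PG_DIRECT;
    }
    pc = new_node(ptep, page->pte.chain);
    if (pc == NULL)
        return -1;
    page->pte.chain = pc;
    return 0;
}

/* Self-check: is the page still using the direct pointer? */
bool uses_direct_pointer(size_t mappings)
{
    struct page page = { 0, { NULL } };
    pte_t ptes[8] = { 0 };
    bool direct;

    assert(mappings <= 8);
    for (size_t i = 0; i < mappings; i++) {
        int rc = page_add_rmap(&page, &ptes[i]);

        assert(rc == 0);
        (void)rc;
    }
    direct = (page.flags & PG_DIRECT) != 0;
    if (!direct) {
        while (page.pte.chain != NULL) {
            struct pte_chain *pc = page.pte.chain;

            page.pte.chain = pc->next;
            free(pc);
        }
    }
    return direct;
}
```

A page with exactly one mapping never touches the allocator in this sketch; the chain only comes into existence when a second mapping shows up.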

The rmap code still has its performance costs, especially in the fork system call. But those costs are shrinking - as are inefficiencies throughout the kernel.


Other memory management work

Lest one think that tweaking rmap is all that is happening in the memory management world: a great deal of code is currently circulating which makes big changes, and it has been finding its way into Linus's kernel.

For example, 2.5.34 includes Patricia Gaughen's discontiguous memory patch, which is aimed at the needs of large NUMA systems. On such systems, you no longer just have a simple array of memory; instead, the system's RAM is broken up into zones, each of which is attached to a particular NUMA node. Memory accesses within a node are faster than cross-node references, so the kernel needs to know where any given page resides. The physical address space on these systems can also have holes between one node's zone and the next.

The discontiguous memory patch does away with the classic mem_map array, which contained one struct page for each page on the system. The memory map is now split into separate, per-node maps, and all references to mem_map in the kernel have been changed. Rather than dealing with simple indexes into mem_map, the kernel now works with page frame numbers; an old reference to mem_map+i is now pfn_to_page(i). Code which did not access mem_map directly will, for the most part, require no changes in response to the discontiguous memory patches. But there will be exceptions...
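The idea behind a per-node lookup can be shown with a toy model. Only the name pfn_to_page() comes from the patch; the node descriptor, the two-node layout, and the linear search are assumptions made for this sketch, and the real implementation is architecture-specific.

```c
#include <assert.h>
#include <stddef.h>

struct page {
    unsigned long flags;
};

/* Hypothetical per-node descriptor: a range of page frames and its map. */
struct node_mem {
    unsigned long start_pfn;
    unsigned long nr_pages;
    struct page *node_mem_map;
};

#define MAX_NODES 2

struct page node0_map[4];
struct page node1_map[4];

/* Two small nodes with a hole (page frames 4..15) between them. */
struct node_mem nodes[MAX_NODES] = {
    { 0,  4, node0_map },
    { 16, 4, node1_map },
};

/*
 * Translate a page frame number into its struct page by finding the node
 * which owns it. Returns NULL for a frame number which falls into a hole.
 */
struct page *pfn_to_page(unsigned long pfn)
{
    for (size_t n = 0; n < MAX_NODES; n++) {
        const struct node_mem *nm = &nodes[n];

        if (pfn >= nm->start_pfn && pfn - nm->start_pfn < nm->nr_pages)
            return &nm->node_mem_map[pfn - nm->start_pfn];
    }
    return NULL;
}
```

With a single flat array, mem_map+17 would require struct page entries for the twelve nonexistent frames in the hole; here, frame 17 simply resolves to the second entry of the second node's map.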

Andrew Morton's "-mm" patches have become the staging area for memory management changes. The current patch as of this writing (2.5.34-mm1) contains a long list of other changes, including:

Directory indexes for the ext3 filesystem (by Daniel Phillips). Calling this one "memory management" is a bit of a stretch, of course, but it is a definite performance improver when large directories are used.
A patch by William Lee Irwin which lets the i386 architecture maintain page tables in high memory.
A change to the readv and writev system calls (by Janet Morgan) which submits all segments for I/O in parallel; this patch greatly speeds up direct disk I/O operations.
Rohit Seth's large page patch for the i386 architecture (covered here last month).
A patch which allows copy_from_user and copy_to_user to be called in atomic (non-blocking) situations. If the copy operation encounters a page fault, it simply fails.
...and many other changes.
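For readers who have not used the vectored I/O calls mentioned in the list above, here is a small user-space example of a two-segment readv() call. It shows only the standard POSIX interface, which the patch does not change; how the kernel schedules the segments internally is not visible here. The helper names and the use of a pipe are choices made for this demonstration.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fill two separate buffers from one file descriptor with a single call. */
ssize_t read_two_segments(int fd, void *hdr, size_t hdr_len,
                          void *body, size_t body_len)
{
    struct iovec iov[2];

    iov[0].iov_base = hdr;
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = body;
    iov[1].iov_len = body_len;
    return readv(fd, iov, 2);
}

/* Push eight bytes through a pipe and scatter them into two buffers. */
int readv_demo(void)
{
    int fds[2];
    char hdr[4], body[4];
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "HEADBODY", 8) != 8) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    n = read_two_segments(fds[0], hdr, sizeof(hdr), body, sizeof(body));
    close(fds[0]);
    close(fds[1]);
    if (n != 8)
        return -1;
    /* Segments are filled in order: first hdr, then body. */
    if (memcmp(hdr, "HEAD", 4) != 0 || memcmp(body, "BODY", 4) != 0)
        return -1;
    return 0;
}
```

Each iovec entry is one "segment" in the sense used above; the reworked code is about getting all of them into flight before waiting on any one.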

One interesting side result from work like the atomic copy_*_user functions and the preemptible kernel is a formalization of just when the kernel is performing an atomic operation. Code in the 2.4 (and prior) kernels could check for only certain situations where atomic operation was required, such as when servicing an interrupt. In 2.5, other atomic situations (e.g. holding a spinlock) are tracked as well, and code which needs to can easily say "don't interrupt me or sleep now." The result should be more explicit code and fewer bugs.
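As a rough model of how these pieces fit together, consider a nesting counter which marks atomic sections and a copy routine which refuses to "fault" while the counter is nonzero. Everything here is a user-space stand-in with hypothetical names; the kernel tracks this state per CPU, and a real page fault is rather more involved than flipping a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Nesting depth of atomic sections (per-CPU state in a real kernel). */
static unsigned int atomic_depth;

void enter_atomic(void) { atomic_depth++; }

void leave_atomic(void)
{
    assert(atomic_depth > 0);
    atomic_depth--;
}

bool in_atomic(void) { return atomic_depth != 0; }

/* Toy "user page" which may or may not be resident in memory. */
struct user_buf {
    const char *data;
    bool resident;
};

/*
 * Copy len bytes from a toy user buffer. Returns the number of bytes
 * which could NOT be copied: 0 on success, len if a fault would have
 * been needed while in atomic context.
 */
size_t copy_from_user_sketch(char *dst, struct user_buf *src, size_t len)
{
    if (!src->resident) {
        if (in_atomic())
            return len;                 /* cannot sleep here: just fail */
        src->resident = true;           /* pretend the fault was handled */
    }
    memcpy(dst, src->data, len);
    return 0;
}

/* Self-check: try the copy with and without an atomic section around it. */
size_t copy_demo(bool atomic)
{
    char dst[6];
    struct user_buf src = { "hello", false };
    size_t left;

    if (atomic)
        enter_atomic();
    left = copy_from_user_sketch(dst, &src, sizeof(dst));
    if (atomic)
        leave_atomic();
    return left;
}
```

The caller of an atomic copy is expected to notice the failure, drop out of the atomic section, and retry by a slower path which is allowed to sleep.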


Too much attention on large systems?

Paolo Ciarrocchi recently posted an article giving some benchmark results on his laptop; these results generally show that 2.5.33 performs a little more slowly than the 2.4 kernels. Given that much of the work in 2.5 has been oriented around performance, what is happening here? Daniel Phillips summarized things as follows:

I suspect the overall performance loss on the laptop has more to do with several months of focussing exclusively on the needs of 4-way and higher smp machines.

The fear that large systems performance work would slow things down on the hardware that most of us actually use has been present for years. Could it be that the big iron is finally taking over the kernel?

The answer, for now, is probably "no." 2.5 development efforts have indeed emphasized large-systems performance so far. Small-systems performance has not been impaired so much as simply passed over for now. As Andrew Morton put it:

It's on the larger machines where 2.4 has problems. Fixing them up makes the kernel broader, more general purpose. We're seeing 50-100% gains in some areas there. Giving away a few percent on smaller machines at this stage is OK. But yup, we need to go and get that back later

Small-systems tuning, of course, is work that can mostly happen after next month's feature freeze. Expect some serious efforts in that direction - small and embedded systems, after all, are a huge part of the Linux user base. It wouldn't do to leave them out in the cold.


Patches and updates

Marc-Christian Petersen [PATCH] Linux-2.5.34-mcp1

Andrea Arcangeli 2.4.20pre5aa1

Andrea Arcangeli 2.4.20pre5aa2

Jeff Dike UML 2.5.33

john stultz linux-2.5.34_timer-change_A2

Greg Ungerer : linux-2.5.34uc0 (MMU-less support)

Roman Zippel linux kernel conf 0.4

Roman Zippel linux kernel conf 0.5

Sam Ravnborg kbuild: Make filelist for clean and mrproper distributed 0/6

Sam Ravnborg kbuild: clean and mrproper file list created dynamically 1/6

Sam Ravnborg atm: List files to be deleted during clean and mrproper 2/6

Sam Ravnborg sound/oss: Files to be deleted during mrproper 3/6

Sam Ravnborg drivers/pci,hamradio,scsi,aic7xxx,video,zorro clean and mrproper files 4/6

Sam Ravnborg scripts: Removed mrproper targets, they are now handled automagically 5/6

Sam Ravnborg Distibuted clean and mrproper handling 6/6

Andrew Morton atomic copy_*_user() infrastructure

Robert Love 2.4: updated preemptable kernel

Rusty Russell 1/2 Initialization Ordering For Built-in Modules

Rusty Russell 2/2 Initialization Ordering For Built-in Modules

Andrew Morton readv/writev rework "This is Janet Morgan's patch which converts the readv/writev code to submit all segments for IO before waiting on them, rather than submitting each segment separately."

Dipankar Sarma Read-Copy Update 2.5.34

Ingo Molnar sys_exit_group(), threading, 2.5.34

Karim Yaghmour 1/8 LTT for 2.5.33: Core infrastructure

Karim Yaghmour 2/8 LTT for 2.5.33: Trace driver

Karim Yaghmour 3/8 LTT for 2.5.33: Core trace statements

Karim Yaghmour 4/8 LTT for 2.5.33: i386 trace support

Karim Yaghmour 5/8 LTT for 2.5.33: PowerPC trace support

Karim Yaghmour 6/8 LTT for 2.5.33: S/390 trace support

Karim Yaghmour 7/8 LTT for 2.5.33: SuperH trace support

Karim Yaghmour 8/8 LTT for 2.5.33: MIPS trace support

Rik van Riel iowait stats for 2.5.33

Robert Williamson ltp-20020910 released

Greg KH USB changes for 2.5.33

Greg KH USB changes for 2.5.34

Corey Minyard Version 2 of the Linux IPMI driver

Greg KH PCI hotplug changes for 2.5.34

Jeroen Vreeken [ANNOUNCE] epcam 0.5

Jens Axboe 2.5.34 IDE "I've updated 2.5 IDE code to match what is currently in 2.4.20-pre5-ac4, since is much nicer and better structured."

Matt_Domsch@Dell.com x86 BIOS Enhanced Disk Device (EDD) polling

Christoph Hellwig XFS filesystem support

Anton Altaparmakov Introduce fs/inode.c::ilookup(). (1/3)

Anton Altaparmakov Introduce fs/inode.c::ilookup(). (2/3)

Anton Altaparmakov Introduce fs/inode.c::ilookup(). (3/3)

Hans Reiser [BK] ReiserFS changesets for 2.4 (performs writes more than 4k at a time)

Alexander Viro Re: Missing IDE partition 3 of 3 on 2.5.34? (This patch fixes the 2.5.34 missing partition problem).

David Kleikamp [PATCH] JFS acls

David Kleikamp [PATCH] JFS extended attributes

Andrew Morton infrastruture for monitoring request queue congestion

Andrew Morton truncate/invalidate_inode_pages[2] rewrite

Christoph Hellwig rework inode allocation to allow filesystems more control

Daniel Phillips Alternative raceless page free

Andrew Morton 2.5.33-mm3

Andrew Morton 2.5.33-mm5

Andrew Morton 2.5.34-mm1

David Woodhouse On paging of kernel VM.

Dave McCracken Rough cut at shared page tables

Ed Tomlinson slabnow

Andrew Morton pdflush congestion avoidance

Andrew Morton struct writeback_control

Andrew Morton radix_tree_gang_lookup

Jean Tourrilhes Wireless Extensions v15

Hirokazu Takahashi zerocopy NFS for 2.5.33

nf@hipac.org Release of nf-hipac v0.0.2

Oliver Xymoron Entropy fixes, take 3

Oliver Xymoron Entropy fixes - fls

Oliver Xymoron Entropy fixes - batch cleanup

Oliver Xymoron Entropy fixes - xfer cleanup *important*

Oliver Xymoron Entropy fixes - trust_pct for headless

Oliver Xymoron Entropy fixes - untrusted sources

Oliver Xymoron Entropy fixes - core accounting

Oliver Xymoron Entropy fixes - update users

Oliver Xymoron Entropy fixes - remove legacy

Oliver Xymoron Entropy fixes - refactor reseeding

Oliver Xymoron Oops - [PATCH 10/11] Entropy fixes -

Oliver Xymoron Entropy fixes - independent pool for /dev/urandom

Benjamin LaHaise libaio 0.3.90 -- test release for sync up

Stephen Hemminger non syscall gettimeofday

Thomas Molina 2.5 Problem Status Report
