Speeding up reverse mapping
As described here last January, the rmap code works by keeping track of which page tables reference each physical page on the system. This is done by adding a linked list of rmap entries to the page structure; each entry in the list points to one page table entry referencing the page. Maintaining this list accounts for the bulk of the rmap code's overhead: the many thousands of pte_chain structures on a running system require a lot of processing to keep current, are space-inefficient (each structure contains two pointers, and the one pointing to the next pte_chain entry is pure overhead), and put heavy pressure on the memory allocation subsystem.
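In rough terms, the original structure can be pictured like this; the sketch below is simplified user-space C with illustrative names, not the kernel's exact definitions:

    /*
     * Simplified sketch of the one-PTE-per-entry scheme. pte_t is a
     * stand-in typedef; field names are illustrative.
     */
    typedef unsigned long pte_t;

    struct pte_chain {
            struct pte_chain *next;   /* list linkage: pure overhead */
            pte_t *ptep;              /* the single PTE this entry tracks */
    };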
Andrew's solution to this problem is simply to expand the pte_chain structures to hold multiple page table pointers. Anywhere between seven and 31 PTE pointers can be stored in a single pte_chain entry, depending on the architecture. The chain overhead is reduced accordingly, and the system's cache behavior is improved. This change, it is claimed, takes 10% off that all-important kernel compile time - at least on Andrew's wimpy little 16-processor NUMA system.
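The batched version might look something like the following sketch. The entry is sized to fill a cache line, so the number of slots depends on the architecture's cache line and pointer sizes; the constants here are assumptions for illustration, and the real code is more clever (it packs an index into the low bits of the next pointer, for example):

    /* Sketch of the batched scheme: one "next" pointer now amortizes
     * over NRPTE page table pointers. With 64-byte cache lines and
     * 8-byte pointers NRPTE works out to seven; 128-byte lines and
     * 32-bit pointers would give 31. */
    typedef unsigned long pte_t;

    #define L1_CACHE_BYTES 64    /* assumed for illustration */
    #define NRPTE ((L1_CACHE_BYTES - sizeof(void *)) / sizeof(pte_t *))

    struct pte_chain {
            struct pte_chain *next;  /* chain linkage */
            pte_t *ptes[NRPTE];      /* up to NRPTE PTE pointers per entry */
    };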
One other optimization, which has been in the kernel for a while, is to eliminate the PTE chain entirely for pages which are only mapped into a single process - of which there are many on a typical system. In that case, a flag is set in the page structure, and the pointer for the PTE chain points, instead, directly at the page table entry of interest.
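In sketch form, that shortcut looks roughly like this; the flag name, union layout, and helper below are assumptions shaped by the description above, not the kernel's actual definitions:

    typedef unsigned long pte_t;
    struct pte_chain;                     /* as in the sketch above */

    #define PG_direct 0x1UL               /* assumed flag bit for illustration */

    struct page {
            unsigned long flags;
            union {
                    pte_t *direct;        /* single mapping: the PTE itself */
                    struct pte_chain *chain;  /* multiple mappings: full chain */
            } pte;
    };

    /* Hypothetical helper: no chain allocation or walk is needed
     * when the direct flag is set. */
    static int page_mapped_once(const struct page *page)
    {
            return (page->flags & PG_direct) != 0;
    }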
The rmap code still has its performance costs, especially in the fork system call. But those costs are shrinking - as are inefficiencies throughout the kernel.
