Speeding up reverse mapping
[Posted September 11, 2002 by corbet]
Ever since Rik van Riel's reverse mapping VM implementation was merged into
the kernel, people have wondered how it could be made to work more
quickly. The rmap code accelerates many memory management operations, but slows down
others. It would be nice to get to the point where the performance
regressions have been mitigated (or eliminated) while keeping the benefits
of the rmap code. Linus's current BitKeeper tree contains one patch from
Andrew Morton which is a big step in that direction.
As described here last
January, the rmap code works by keeping track of which page tables
reference every physical page on the system. This is done by adding a
linked list of rmap entries to the page structure; each entry in
the list points to one page table entry referencing the page. The
maintenance of this list is the source of the bulk of the rmap code's
overhead. The many thousands of these pte_chain
structures require a lot of processing to keep current, are inefficient
(the structure contains two pointers; the one which points to the next
pte_chain entry is pure overhead), and put lots of pressure on the
memory allocation subsystem.
Andrew's solution to this problem is simply to expand the
pte_chain structures to hold multiple page table pointers.
Anywhere between seven and 31 PTE pointers can be stored in a single
pte_chain entry, depending on the architecture. The
chain overhead is reduced accordingly, and the system's cache behavior is
improved. This change, it is claimed, takes 10% off that all-important
kernel compile time - at least on Andrew's wimpy little 16-processor NUMA
system.
One other optimization, which has been in the kernel for a while, is to
eliminate the PTE chain entirely for pages which are only mapped into a
single process - of which there are many on a typical system. In that
case, a flag is set in the page structure, and the pointer for the
PTE chain points, instead, directly at the page table entry of interest.
The rmap code still has its performance costs, especially in the
fork system call. But those costs are shrinking - as are
inefficiencies throughout the kernel.
(
Log in to post comments)