Speeding up reverse mapping
As described here last January, the rmap code works by keeping track of which page tables reference each physical page on the system. This is done by adding a linked list of rmap entries to the page structure; each entry in the list points to one page table entry referencing the page. Maintaining this list accounts for the bulk of the rmap code's overhead: the many thousands of pte_chain structures on a running system require a lot of processing to keep current, are space-inefficient (each structure contains two pointers, and the one pointing to the next pte_chain entry is pure overhead), and put heavy pressure on the memory allocation subsystem.
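In rough terms, the original structure can be pictured like this; the sketch below is simplified user-space C with illustrative names, not the kernel's exact definitions:

    /*
     * Simplified sketch of the one-PTE-per-entry scheme. pte_t is a
     * stand-in typedef; field names are illustrative.
     */
    typedef unsigned long pte_t;

    struct pte_chain {
            struct pte_chain *next;   /* list linkage: pure overhead */
            pte_t *ptep;              /* the single PTE this entry tracks */
    };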
Andrew's solution to this problem is simply to expand the pte_chain structures to hold multiple page table pointers. Anywhere between seven and 31 PTE pointers can be stored in a single pte_chain entry, depending on the architecture. The chain overhead is reduced accordingly, and the system's cache behavior is improved. This change, it is claimed, takes 10% off that all-important kernel compile time - at least on Andrew's wimpy little 16-processor NUMA system.
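The batched version might look something like the following sketch. The entry is sized to fill a cache line, so the number of slots depends on the architecture's cache line and pointer sizes; the constants here are assumptions for illustration, and the real code is more clever (it packs an index into the low bits of the next pointer, for example):

    /* Sketch of the batched scheme: one "next" pointer now amortizes
     * over NRPTE page table pointers. With 64-byte cache lines and
     * 8-byte pointers NRPTE works out to seven; 128-byte lines and
     * 32-bit pointers would give 31. */
    typedef unsigned long pte_t;

    #define L1_CACHE_BYTES 64    /* assumed for illustration */
    #define NRPTE ((L1_CACHE_BYTES - sizeof(void *)) / sizeof(pte_t *))

    struct pte_chain {
            struct pte_chain *next;  /* chain linkage */
            pte_t *ptes[NRPTE];      /* up to NRPTE PTE pointers per entry */
    };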
One other optimization, which has been in the kernel for a while, is to eliminate the PTE chain entirely for pages which are only mapped into a single process - of which there are many on a typical system. In that case, a flag is set in the page structure, and the pointer for the PTE chain points, instead, directly at the page table entry of interest.
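In sketch form, that shortcut looks roughly like this; the flag name, union layout, and helper below are assumptions shaped by the description above, not the kernel's actual definitions:

    typedef unsigned long pte_t;
    struct pte_chain;                     /* as in the sketch above */

    #define PG_direct 0x1UL               /* assumed flag bit for illustration */

    struct page {
            unsigned long flags;
            union {
                    pte_t *direct;        /* single mapping: the PTE itself */
                    struct pte_chain *chain;  /* multiple mappings: full chain */
            } pte;
    };

    /* Hypothetical helper: no chain allocation or walk is needed
     * when the direct flag is set. */
    static int page_mapped_once(const struct page *page)
    {
            return (page->flags & PG_direct) != 0;
    }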
The rmap code still has its performance costs, especially in the fork system call. But those costs are shrinking - as are inefficiencies throughout the kernel.
