Another attempt at speculative page-fault handling

August 14, 2017

This article was contributed by Nur Hussein

While the best way to avoid performance problems associated with page faults is usually to avoid faulting altogether, that is not always an option. Thus, it is important that the kernel handle page faults with a minimum of overhead. One particular pain point in current kernels comes about in multi-threaded workloads that are all incurring faults in the same address space. Speculative page-fault handling is an old idea for improving the scalability of such workloads that may finally be approaching a point where it can be considered for inclusion.

Memory-management performance with multi-threaded workloads is often bounded by the ability to acquire the mmap_sem reader/writer semaphore, which serializes access to the data structures describing a process's address space. The handling of page faults, in particular, requires mmap_sem, so a process with a large number of threads generating faults will see contention. Speculative page faults are an attempt to alleviate this contention by doing lockless reads of a process's virtual memory areas (VMA) without holding mmap_sem.

The speculative page-fault patches first appeared in 2009 and have been discussed and improved upon intermittently by various kernel developers through the years, but this work has not found its way into the kernel. Laurent Dufour has recently revived this effort by resubmitting the patches with fixes and improvements of his own, and an active discussion took place on the linux-kernel mailing list on its inclusion. Notably, Dufour reports a 20% speed improvement loading a 2TB database with the speculative page fault code.

As noted above, the mmap_sem semaphore can be a point of significant contention for multi-threaded workloads. In particular, page-fault handling requires access to a process's VMA structures that describe its memory layout, and that access requires mmap_sem. Even when only read locking is required (as is the case for page faults), frequent access to mmap_sem leads to cache-line bouncing and poor performance. The idea behind speculative page faults is to increase memory-management performance by avoiding the use of the mmap_sem in page-fault handling. Doing so requires a way to perform a lockless walk of the VMAs; that, in turn, means facing a number of problems that mmap_sem is explicitly there to prevent.

The first of these problems, naturally, is that the VMA describing the area where a fault occurs may, if mmap_sem is not held, change during the handling of the fault. The strategy taken to address this problem is to do as much work as possible that doesn't depend on the state of the VMA, then checking to see if anything has changed before changing the process's address space directly. So, for example, a free page can be allocated and its data read in from disk independently of the address space, but actually putting that page into the address space requires a consistent view of the VMA.

The kernel has a longstanding mechanism for this kind of access: the seqlock. So the patch set adds a seqlock to the VMA structure, along with the code to increment its sequence count everywhere that the VMA is changed. The speculative fault-handling code can then record the sequence number before doing any work and verify that the number has not changed at the end. If the sequence number does change, the VMA has been changed and the speculative work was done in vain; in this case the attempt fails and the fault is retried in the old-fashioned way.

The second problem is a bit trickier; without mmap_sem, a VMA may disappear entirely while a fault is being handled. This situation is avoided by using read-copy-update (RCU) to keep VMA structures around while a fault is being handled. SRCU (the sleepable variant of RCU) is used to serialize VMA updates and accommodate lockless reads of the VMA. SRCU is required because a number of fault-handling operations can sleep.

When handling a page fault speculatively, the kernel will do a lockless page-table walk and grab the finer-grained page-table lock. Then it does a srcu_read_lock() to do a VMA lookup, and checks the write-sequence count of the VMA. It will need to use the VMA to find the page that the address faulted on, but to do this it needs to drop the page table lock to honor locking order rules. Once the page is found, the VMA is validated again by repeating the page table walk, obtaining the page-table lock, and validating that the VMA sequence number did not change. If it didn't, the page is installed in the page table, and the page-table lock is released.

Another pitfall with speculative page-fault handling has to do with translation lookaside buffer (TLB) invalidation. Various actions, such as unmapping a memory area, can call for a TLB invalidation; that is handled by sending inter-processor interrupts (IPIs) to tell each CPU to invalidate its own TLB. The unmap path can lock specific page-table entries and perform the invalidation while that lock is held. The speculative fault-handling path, meanwhile, will attempt to take the page-table lock it needs with interrupts disabled. Should that attempt happen on a page-table entry that is held by the unmap path, the processor will spin in a loop with interrupts disabled, meaning it will never receive the TLB-invalidation IPI. The result is a deadlock, a situation that is even worse for performance than mmap_sem contention. Once this problem is understood the solution is straightforward: use a "trylock" operation to acquire the lock in the speculative path, and fall back to traditional fault handling if it fails.

The first speculative page fault patches were posted in 2009 by Hiroyuki Kamezawa. The ensuing discussion led Peter Zijlstra to create his own implementation based on the idea of using RCU to enable lockless reading of the VMA structures. Zijlstra's implementation had some issues of its own; no code was merged, and the discussion stalled and fizzled out. The idea was revived again by Zijlstra in 2014, because many of the issues blocking the progress of previous attempts had been solved. However, the discussion fizzled out again, and there was no push for merging the code. In June, Dufour forward-ported Zijlstra's patches and added some of his own. One of the problems Dufour addresses with his patches is the TLB invalidation issue.

Despite the two previous abandoned speculative page fault implementations, the idea is useful enough for another attempt to be made at trying to get this code merged into the kernel. Dufour's database-loading speed improvements led Michal Hocko to ask if there were any other tests or benchmarks that were run, such as kernbench or other highly threaded workloads. In response, the August 8 posting (linked above) includes a number of results from different benchmarks, showing modest to significant improvements depending on the test.

At this point, it would appear that the significant issues with this patch set have been addressed. Given that speculative page-fault handling shows significant improvements for some workloads, one might not be faulted for speculating that there is a reasonable chance that this work will be merged in the relatively near future — a mere eight years after the initial idea was floated.

[Thanks to Peter Zijlstra for answering my questions about this work.]

Index entries for this article
Kernel	Memory management/Scalability
GuestArticles	Hussein, Nur

Another attempt at speculative page-fault handling

Posted Aug 18, 2017 15:58 UTC (Fri) by smitty_one_each (subscriber, #28989) [Link]

I saw what you did in the last sentence.