
The future of the page cache

Posted Jan 27, 2017 0:21 UTC (Fri) by dgc (subscriber, #6611)
In reply to: The future of the page cache by Frogging101
Parent article: The future of the page cache

Sitting in the audience during Willy's talk, I felt that what I said had either been misunderstood or lost its meaning when taken out of its original context. Now that I've read the article and the responses, I think there's a bit of both here.

The missing context was a discussion about how the "page cache" is central to the /IO path/ and how that is detrimental to making efficient, optimal decisions about how access to file data is arbitrated. e.g. buffered IO always iterates pages first and maps them one at a time, so the filesystem has no opportunity to optimise multi-page IO accesses, nor can the page cache do things like transparently allocate large pages for caching, because the filesystem might not be able to allocate a contiguous extent to back the large page. Direct IO would iterate per-iovec, per-page to get mappings, again giving the fs much more mapping work to do than is necessary to map the user data to the underlying storage. And DAX was different again, but it too iterated page by page and called into the filesystem for each page.
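To make that per-page pattern concrete, here's a minimal userspace sketch of it. The names (fs_map_block, do_buffered_write) and the identity mapping are made up for illustration only; they are not kernel interfaces. The point is simply that the filesystem mapping routine gets called once per page, so it never sees the full extent of the IO:

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Pretend filesystem block-mapping call: file offset -> device address. */
static unsigned long fs_map_block(unsigned long file_off)
{
	printf("fs mapping call for offset %lu\n", file_off);
	return file_off;	/* identity mapping for the toy model */
}

static void do_buffered_write(unsigned long pos, unsigned long len)
{
	/* One mapping call per page, no matter how large the write is. */
	for (unsigned long off = pos; off < pos + len; off += PAGE_SIZE)
		fs_map_block(off);
}

int main(void)
{
	do_buffered_write(0, 16 * PAGE_SIZE);	/* 16 mapping calls */
	return 0;
}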

So what I was really pointing out is that we have 3 different IO paths, each with their own quirks and differences, but all with the same fundamental problem - that in-memory data structure iteration determines the IO patterns that the filesystem is asked to map. What I was talking about is that we need to invert the order of access - to first map the /entire/ IO region, then iterate the in-memory data structures that need to be manipulated. IOWs, what I was saying is that we need to move to filesystem-based extent IO mapping infrastructure, not that there was "no longer a need for a page cache".
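As a rough illustration of that inversion - one mapping call up front, then iterating the in-memory structures inside the returned extent - here's a toy model. It only loosely mirrors the fs/iomap.c begin/actor/end pattern; the names (struct extent, fs_map_extent, extent_write) are illustrative, not the real interfaces:

#include <stdio.h>

#define PAGE_SIZE 4096UL

struct extent {
	unsigned long file_off;	/* start of the mapped range */
	unsigned long len;	/* how much the fs could map contiguously */
	unsigned long dev_addr;	/* where that range lives on the device */
};

/* One mapping call covers as much of the IO as the fs can map in one go. */
static void fs_map_extent(unsigned long pos, unsigned long len,
			  struct extent *ext)
{
	ext->file_off = pos;
	ext->len = len;			/* assume the fs mapped it all */
	ext->dev_addr = pos;		/* identity mapping for the toy model */
	printf("fs mapping call for [%lu, %lu)\n", pos, pos + len);
}

/* "Actor": walk the pages inside an already-mapped extent. */
static void extent_write(const struct extent *ext)
{
	for (unsigned long off = 0; off < ext->len; off += PAGE_SIZE)
		;	/* copy/dirty one page; no filesystem call needed here */
}

int main(void)
{
	struct extent ext;

	fs_map_extent(0, 16 * PAGE_SIZE, &ext);	/* one call, not sixteen */
	extent_write(&ext);
	return 0;
}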

That's a /very different/ message, hence my concern that key developers have not understood why we've implemented the fs/iomap.c infrastructure for extent based IO mapping.

If you think I'm still wrong, look at the numbers: when we switched XFS buffered writes (i.e. through the page cache) to this mechanism, we saw a 20-30% improvement in high-bandwidth, large-IO throughput because we now make one mapping call per userspace IO rather than one per page being copied via the page cache. When we switched XFS direct IO to use iomap, Jens Axboe reported a 20% increase in small read/write IOPS on high end flash devices and a significant reduction in per-IO latency, all without needing to optimise the code to within an inch of being unmaintainable (*cough* fs/direct-io.c *cough*). And when we switched DAX to use the iomap code, we saw both ext4 and XFS read and write throughput increase by 50-500% (depending on hardware and operations being performed).

Even better, this new infrastructure allows the iomap actor function to determine what size data structure best matches the extent size the filesystem mapped, allowing transparent use of different size pages in the page cache (for buffered IO) or page tables (for DAX) without the filesystem having to do anything special to enable it.
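A hypothetical sketch of that decision - picking the caching granularity from the size and alignment of the extent the filesystem returned - might look like the following. The 2MiB threshold and the helper name are assumptions for illustration, not what the kernel actually does:

#include <stdio.h>

#define SMALL_PAGE	4096UL
#define HUGE_PAGE	(2UL * 1024 * 1024)

/* Pick a caching granularity from the extent the filesystem handed back. */
static unsigned long pick_page_size(unsigned long ext_off, unsigned long ext_len)
{
	/* Only use a large page if the extent is big enough and aligned. */
	if (ext_len >= HUGE_PAGE && (ext_off % HUGE_PAGE) == 0)
		return HUGE_PAGE;
	return SMALL_PAGE;
}

int main(void)
{
	printf("%lu\n", pick_page_size(0, 4 * HUGE_PAGE));	/* 2MiB */
	printf("%lu\n", pick_page_size(4096, 3 * SMALL_PAGE));	/* 4KiB */
	return 0;
}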

And to clarify a common misconception: DAX does not use the /page cache/. It does not use struct pages, it doesn't cache pages on LRUs for aging and reclaim when memory is low, etc. What DAX uses is the per-inode address space infrastructure for managing file mappings. i.e. it uses the mapping tree to map page table entries directly to the data the file contains, not to a page that contains a cached copy. IOWs, with DAX the mapping tree is used to arbitrate access to the data store rather than to manage the life cycle of cached data.
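One way to picture the distinction: the same per-inode mapping structure can hold either a cached copy of the data or a direct reference to the storage backing it. The tagged union below is purely illustrative - it is not how the kernel actually encodes its mapping tree entries:

#include <stdio.h>

enum entry_kind { CACHED_PAGE, DAX_DIRECT };

struct mapping_entry {
	enum entry_kind kind;
	union {
		void *page;		/* page cache: a cached copy of the data */
		unsigned long pfn;	/* DAX: the persistent memory holding the data */
	};
};

int main(void)
{
	struct mapping_entry cached = { .kind = CACHED_PAGE, .page = NULL };
	struct mapping_entry dax    = { .kind = DAX_DIRECT, .pfn = 0x1234 };

	printf("cached=%d dax=%d\n", cached.kind, dax.kind);
	return 0;
}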

We know that caching data is not necessary for DAX - caching the mappings and other metadata needed to access the data rapidly is all we need to do. Even for high-end SSDs, caching data can be harmful to performance, regardless of what other people say, because the overhead of page cache allocation, data copies, and memory reclaim can be vastly higher than simply reading the data from storage again.

So call this splitting hairs, but the term "page cache" is used to encompass lots of different pieces that aren't actually pages or caches. What Willy was suggesting is that we should be unifying the caching and arbitration of the metadata that gets us to the data efficiently through the common address space infrastructure. I think the message is being mangled and/or misunderstood by calling this "page cache functionality" when what we are actually talking about is optimising file address space management, not the caching of pages...

-Dave

