The future of the page cache
This was, Matthew Wilcox began, his first talk ever as a Microsoft employee — something he thought he would never find himself saying. He then launched into his topic by asserting that computing is all about caching. His new laptop can execute 10 billion instructions per second, but only as long as it doesn't take a cache miss. Memory on that system can only deliver 530 million cache lines per second, so it doesn't take many cache misses to severely impact its performance. Things get even worse if the data you want isn't cached in main memory and must be read from a storage device, even a fast solid-state device.
It has always been that way; a PDP-11 was also significantly slowed by cache misses. But the problem is getting worse. CPU speeds have increased more than memory speeds, which, in turn, have increased more than storage speeds. The cost of not caching your data properly is thus going up.
The page cache
Unix systems have had a buffer cache, which sits between the filesystem and the disk for the purpose of caching disk blocks in memory, for a long time. While preparing the talk, he went back to look at Sixth-edition Unix (released in 1975) and found a buffer cache there. Linux has had a buffer cache since the beginning. In the 1.3.50 release in 1995, Linus Torvalds added a significant innovation in the form of the page cache. This cache differs from the buffer cache in that it sits between the virtual filesystem (VFS) layer and the filesystem itself. With the page cache, there is no need to call into filesystem code at all if the desired page is present already. Initially, the page and buffer caches were entirely separate, but Ingo Molnar unified them in 1999. Now, the buffer cache still exists, but its entries point into the page cache.
The page cache has a great deal of functionality built into it. There are some obvious functions, like finding a page at a given index; if the page doesn't exist, it can be created and optionally filled from disk. Dirty pages can be pushed back to disk. Pages can be locked, unlocked, and removed from the cache. Threads can wait for changes in a page's state, and there are interfaces to search for pages in a given state. The page cache is also able to keep track of errors associated with persistent storage.
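That lookup-or-create behavior can be illustrated with a toy user-space model in Python. All the names here (`ToyPageCache`, `find_or_create`, `fill_from_disk`) are invented for the sketch; the kernel's real page cache is a radix tree keyed by file offset and is far more involved:

```python
class ToyPageCache:
    """Toy model of per-file page-cache lookup (not the kernel API)."""

    def __init__(self, fill_from_disk):
        self.pages = {}                    # index -> page contents
        self.fill_from_disk = fill_from_disk

    def find_or_create(self, index, fill=True):
        # Fast path: the page is already cached; the filesystem is not involved.
        if index in self.pages:
            return self.pages[index]
        # Slow path: create the page, optionally filling it from "disk".
        data = self.fill_from_disk(index) if fill else b"\x00" * 4096
        self.pages[index] = data
        return data

reads = []
def fake_disk_read(index):
    reads.append(index)
    return b"x" * 4096

cache = ToyPageCache(fake_disk_read)
cache.find_or_create(0)
cache.find_or_create(0)    # second lookup is a cache hit
assert reads == [0]        # the "disk" was read only once
```

The point of the sketch is the fast path: once a page is present, repeated lookups never call back into the fill function.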
Locking for the page cache is handled internally. There tends to be disagreement in the kernel community over the level at which locking should be handled; in this case it has been settled in favor of internal locking. There is a spinlock to control access when changes are being made to the page cache, but lookups are handled using the lockless read-copy-update (RCU) mechanism.
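The split between locked updates and lockless lookups can be approximated in user space by publishing immutable snapshots; this is a loose RCU-flavored analogy only (real RCU, and the page cache's actual locking, are far more sophisticated), with all names invented:

```python
import threading

class RcuLikeMap:
    """RCU-flavoured lookup table: lockless readers, serialized writers."""

    def __init__(self):
        self._snapshot = {}                   # treated as immutable once published
        self._write_lock = threading.Lock()   # plays the role of the spinlock

    def lookup(self, key):
        # Readers take no lock: they just dereference the current snapshot.
        return self._snapshot.get(key)

    def insert(self, key, value):
        with self._write_lock:
            new = dict(self._snapshot)        # copy ...
            new[key] = value                  # ... update ...
            self._snapshot = new              # ... publish atomically

m = RcuLikeMap()
m.insert(3, "a page")
assert m.lookup(3) == "a page"
assert m.lookup(7) is None
```

Readers always see either the old snapshot or the new one, never a half-updated structure, which is the essential property RCU provides.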
Caching is the art of predicting the future, he said. When the cache grows too large, various heuristics come into play to decide which pages should be removed. Pages used only once are likely to not be used again, so those are kept in the "inactive" list and pushed out relatively quickly. A second use will promote a page from the inactive list to the active list. Unused pages eventually age off the active list and are put back onto the inactive list. Exceptional "shadow" entries are used to track pages that have fallen off the end of the inactive list and have been reclaimed; these entries have the effect of lengthening the kernel's memory about pages that were used in the relatively distant past.
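The two-list scheme described above can be sketched as a small simulation. The names are invented and the model ignores almost everything the kernel's reclaim code actually does; it only shows promotion on second use, aging off the inactive list, and shadow entries left behind by eviction:

```python
from collections import OrderedDict

class TwoListLru:
    """Toy active/inactive lists with shadow entries (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = OrderedDict()     # pages used more than once
        self.inactive = OrderedDict()   # pages seen once so far
        self.shadows = set()            # memory of recently evicted pages

    def touch(self, page):
        if page in self.active:
            self.active.move_to_end(page)     # still hot
        elif page in self.inactive:
            del self.inactive[page]           # second use: promote
            self.active[page] = True
        else:
            self.shadows.discard(page)        # a refault clears the shadow
            self.inactive[page] = True        # first (recent) use
        self._shrink()

    def _shrink(self):
        while len(self.active) + len(self.inactive) > self.capacity:
            victim_list = self.inactive or self.active
            page, _ = victim_list.popitem(last=False)
            self.shadows.add(page)            # leave a shadow entry behind

lru = TwoListLru(capacity=2)
lru.touch("a"); lru.touch("a")   # "a" is promoted to the active list
lru.touch("b"); lru.touch("c")   # "b" ages off the inactive list
assert "a" in lru.active
assert "b" in lru.shadows        # the shadow remembers the eviction
```

The `shadows` set is the part that "lengthens the kernel's memory": a later touch of "b" would be recognizable as a refault of a recently evicted page.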
Huge pages have been a challenge for the page cache for a while. The kernel's transparent huge page feature initially only worked with anonymous (non file-backed) memory. There are good reasons for using huge pages in the page cache, though. Initial work in this area simply adds a large set of single-page entries to the page cache to correspond to a single huge page. Wilcox concluded that this approach was "silly"; he enhanced the radix tree code, used to track pages in the page cache, to be able to handle huge-page entries directly. Pending patches will cause the page cache to use a single entry for huge pages.
Do we still need the page cache?
Recently, Dave Chinner asserted that there was no longer a need for a page cache. He noted that the DAX subsystem, initially created by Wilcox to provide direct access to file data stored in persistent memory, bypasses the page cache entirely. "There is nothing like having your colleagues question your entire motivation", Wilcox said. There are people who disagree with Chinner, though, including Torvalds, who popped up in a separate forum saying that the page cache is important because good things don't come from having low-level filesystem code in the critical path for data access.
With that last statement in mind, Wilcox delved into how an I/O request using DAX works now. He designed the original DAX code and, in so doing, concluded that there was no need to use the page cache. That decision, he said, was wrong.
In current kernels, when an application makes a system call like read() to read some data from a file stored in persistent memory, DAX gets involved. Since the requested data is not present in the page cache, the VFS layer calls the filesystem-specific read_iter() function. That, in turn, calls into the DAX code, which will call back into the filesystem to turn the file offset into a block number. Then the block layer is queried to get the location of that block in persistent memory (mapping it into the kernel's address space if need be) so that the block's contents can be copied back to the application.
That is "not awful", but it should work differently, he said. The initial steps would be the same, in that the read_iter() function would still be called, and it would call into the DAX code. But, rather than calling back into the filesystem, DAX should call into the page cache to get the physical address associated with the desired offset in the file. The data is then copied back to user space from that address. This all assumes that the information is already present in the page cache but, when that is the case, the low-level filesystem code need not get involved at all. The filesystem had already done the work, and the page cache had cached the result.
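The proposed flow amounts to caching file-offset-to-physical-address mappings rather than data. In this hypothetical sketch (all names invented; not the actual DAX code), the filesystem callback runs only on the first access to an offset:

```python
PMEM = {0x1000: b"persistent data"}   # pretend persistent-memory contents

class MappingCache:
    """Toy stand-in for the page cache's role in the revised DAX path:
    it caches offset -> physical address, not the data itself."""

    def __init__(self, fs_resolve):
        self.fs_resolve = fs_resolve   # slow call into the filesystem
        self.mappings = {}

    def read(self, offset):
        addr = self.mappings.get(offset)
        if addr is None:
            # Only on a miss does the low-level filesystem get involved.
            addr = self.fs_resolve(offset)
            self.mappings[offset] = addr
        return PMEM[addr]              # copy straight from "pmem" to the caller

fs_calls = []
def resolve(offset):
    fs_calls.append(offset)
    return 0x1000

f = MappingCache(resolve)
assert f.read(4096) == b"persistent data"
assert f.read(4096) == b"persistent data"   # hit: filesystem not called again
assert fs_calls == [4096]
```

On the cached path there is no data copy into a cache and no filesystem call; the "filesystem had already done the work, and the page cache had cached the result".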
When Torvalds wrote the above-mentioned post about the page cache, he said:
This, Wilcox said, was "so right"; the locking in DAX has indeed been disastrous. He originally thought it would be possible to get away with relatively simple locking, but complexity crept in with each new edge case that was discovered. DAX locking is now "really ugly" and he is sorry that he made the mistake of thinking that he could bypass the page cache. Now, he said, he has to fix it.
Future work
He concluded with a number of enhancements he would like to see made around DAX and the page cache. The improved huge-page support mentioned above is one of them; that is already sitting in the -mm tree and should be done soon. The use of page-frame numbers instead of page structures has been under discussion for a while since there is little desire to make the kernel store vast numbers of page structures for large persistent memory arrays.
He would like to revisit the idea of filesystems with a block size larger than the system's page size. That is something that people have wanted for many years; now that the page cache can handle more than one page size, it should be possible. "A simple matter of coding", he said. He is looking for other interested developers to work with on this project.
Huge swap entries are also an area of interest. We have huge anonymous pages in memory but, when it comes time to swap them out, they get broken up into regular pages. "That is probably the wrong answer". There is ongoing work to improve swap performance, but it needs to be reoriented toward keeping huge pages together. That might help with the associated idea of swapping to persistent memory. Data in a persistent-memory swap space can still be accessed, so it may well make sense to just leave it there, especially if it is not being heavily modified.
The video of this talk, including a bonus section on page-cache locking, is available.
[Your editor would like to thank linux.conf.au and the Linux Foundation
for assisting with his travel to the event.]
Index entries for this article:
Kernel: Memory management/Page cache
Conference: linux.conf.au/2017
Posted Jan 25, 2017 23:40 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (5 responses)
Maybe it is just a different perspective, but this statement - which jumped out at me as I watched the talk - seems like a poor description of the truth.
As Matthew said elsewhere in the talk, the page cache can be seen as a large collection of individual caches, one for each file. One of those files that gets cached is the "file" you get when you open a block device, e.g. open("/dev/sda"). This creates a page-cache for that block device.
The buffer_head api that used to give you access to the buffer cache, now gives you access to the page cache for the block device. Some filesystems use this for accessing the super block and inode tables. Other filesystems just go straight to the device, and do their own caching.
So the buffer cache doesn't exist, but the API does and it gives access to part of the page cache. I'm sure that is what Matthew meant, but I didn't think it came across very clearly.
Posted Jan 26, 2017 19:12 UTC (Thu)
by willy (subscriber, #9762)
[Link] (4 responses)
I thought I did say that! I haven't rewatched my talk yet, but it's certainly what I intended to say.
What I didn't mention is that there are actually two potential places to look for data in the page cache; one for the file which contains this data block, and one for the block device's address space. So there's the potential for aliasing in the page cache in a way that there wasn't in the block cache. It doesn't usually cause any problems because metadata is in different blocks from file data. It's only problematic if someone's expecting to modify file blocks and disc blocks and have them be coherent with each other's changes.
I've also been thinking about my statement that it's a lot of little page caches. That's correct from the point of view of "doing a lookup", but from the point of view of maintaining the active/inactive list, it's a global cache.
Posted Jan 27, 2017 7:05 UTC (Fri)
by geuder (subscriber, #62854)
[Link] (3 responses)
> I thought I did say that! I haven't rewatched my talk yet, but it's certainly what I intended to say.
Well, if kernel hackers have difficulty finding the right words to explain the subsystems they are working on...
Of course the implementation can be tricky to get optimal in all cases, but userspace should remain unaffected. However, I haven't found it easy either to use it correctly. I'm not working with DAX here, but quite the opposite speed-wise: USB2 memory sticks and eMMC...
To "flash" an embedded system I do something like
curl .../img.xz | xz -d | dd of=/dev/sda obs=1M
So when is the writing "done"?
What people typically recommend is calling sync(1). However, sync is documented to work on filesystems, and /dev/sda is not a filesystem. So sync(1) did not sound like the right tool here. Instead I changed my code to
blockdev --flushbufs /dev/sda
It seems to work in practice, but is it the right thing to do?
Interestingly enough, when running the code on a rather recent Core i7, the dd completes fast and the flushbufs takes 5-7 seconds. Fair enough: buffering happened and it takes time to get the data out over USB to not-so-great flash memory. But if I run the same code on a modest ARM system, the dd seems to hang before completion and then the flushbufs completes in no time. So it looks like on Intel the dd doesn't wait for buffers to be flushed but on ARM it does. For sure the difference cannot be Intel vs. ARM as such, but maybe some sysctl knob? I haven't compared whether both use the same IO scheduler.
A third option would be to run dd with oflag=direct. With a simple "watch free -w" I can see that this does not use buffer entries.
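For what it's worth, the usual user-space pattern for knowing a write has reached the medium is write-then-fsync. A small Python illustration, using a temporary file as a stand-in for the block device (the same write/fsync/close sequence applies to a device node opened for writing):

```python
import os, tempfile

# A temporary file stands in for /dev/sda here; on a real block device
# fsync() forces the kernel's cached dirty pages out to the medium.
fd, path = tempfile.mkstemp()
os.write(fd, b"image data")
os.fsync(fd)      # returns only once the data has reached the device
os.close(fd)

with open(path, "rb") as f:
    assert f.read() == b"image data"
os.unlink(path)
```

This is essentially what dd's conv=fsync option does at the end of the transfer.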
Writing being "done" could mean different things. In the most common case the user wants to remove the USB stick.
I am looking at a different case right now:
... | dd of=/dev/sda obs=1M
fsck.ext4 /dev/sda1
resize2fs /dev/sda1
So I don't necessarily need the data to be written to media (yet), but I need the two block devices /dev/sda and /dev/sda1 to be "cache coherent". Linux does not seem to guarantee this; I can see the fsck fail rather frequently on the Intel machine (never seen it on the ARM machine). In practice, blockdev --flushbufs /dev/sda before running fsck /dev/sda1 seems to do it. But I'm still a bit puzzled why it takes 5-7 seconds on one system and 0 on another.
Posted Jan 27, 2017 7:53 UTC (Fri)
by geuder (subscriber, #62854)
[Link]
Posted Feb 8, 2017 11:36 UTC (Wed)
by abelloni (subscriber, #89904)
[Link] (1 responses)
Posted Feb 8, 2017 11:54 UTC (Wed)
by geuder (subscriber, #62854)
[Link]
However, it seems like a big waste to handle the whole x GiB synchronously, just to know when the last block has been written.
I have not measured though. And I have not tested whether it solves the /dev/sdX -- /dev/sdX1 coherence issue.
For the more common case of USB removal (not my problem here) I have found
udisksctl power-off
in the meantime. At least for everybody using systemd.
Posted Jan 26, 2017 0:00 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (4 responses)
> "A simple matter of coding", he said.
It has always struck me as strange that people see this as a big issue. It has always (at least since 2.4 when the page-cache unification happened) seemed like a simple matter of coding.
When accessing regular files, it doesn't matter at all if the filesystem has a block size larger than the page size. You can still determine, from the indexing that the filesystem keeps, where on the device to find the data for a particular page in the page cache. With a large-block-size filesystem, consecutive pages will be consecutive on the device more often, but that doesn't make anything harder.
The only place I can see that makes a meaningful difference is when the filesystem has complex data structures that span a whole filesystem block. For example, if the filesystem stores directories with some sort of search-tree in each block, then performing operations on that search tree requires having all pages of that tree present at once.
Given the strong desire for large-block-filesystems which Matthew reported, and given the fact that we still don't have them, I can't help thinking that there is some extra complexity that I cannot see. Could someone enlighten me?
Posted Jan 26, 2017 9:46 UTC (Thu)
by kiryl (subscriber, #41516)
[Link] (3 responses)
I see two options:
- Allocating one high-order page per block. In this case it's relatively trivial to find all relevant pages and maintain consistency -- compound pages would do the job. But having a steady supply of high-order pages is hard. IIUC, the most common case for block size > page size is 64k blocks with 4k pages. Order-4 pages are already large enough to cause problems. So we need the second option at least as a fallback.
- Allocate a set of order-0 pages and fan the block out across them. It would also require some additional data structure to bundle them together. Maintaining consistency across pages would be a mess. I wouldn't volunteer to implement this...
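The second option can be sketched as a toy model in which one 64k filesystem block fans out over sixteen order-0 pages (names invented; this deliberately ignores all the locking and consistency issues raised above):

```python
PAGE_SIZE = 4096
BLOCK_SIZE = 65536   # 64k filesystem block, the common bs > ps case

class FanOutBlock:
    """Toy 'fan-out' block: one fs block backed by order-0 pages (sketch)."""

    def __init__(self):
        self.pages = [bytearray(PAGE_SIZE)
                      for _ in range(BLOCK_SIZE // PAGE_SIZE)]

    def write(self, offset, data):
        # A write within the block fans out across the member pages.
        for i, byte in enumerate(data):
            pos = offset + i
            self.pages[pos // PAGE_SIZE][pos % PAGE_SIZE] = byte

    def read(self, offset, length):
        out = bytearray()
        for pos in range(offset, offset + length):
            out.append(self.pages[pos // PAGE_SIZE][pos % PAGE_SIZE])
        return bytes(out)

blk = FanOutBlock()
blk.write(4090, b"spans a page boundary")
assert blk.read(4090, 21) == b"spans a page boundary"
assert len(blk.pages) == 16      # sixteen order-0 pages back one 64k block
```

Even this trivial model shows the cost: every access has to be translated through the bundling structure, which is exactly where the consistency mess would live.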
Posted Jan 27, 2017 2:51 UTC (Fri)
by neilbrown (subscriber, #359)
[Link]
> It would also require some additional data structure to bundle them together.
True. This is not unlike the 'struct buffer_head' used when the fs blocksize is smaller than the page size. It is a little bit of work to do that, but it is not out of proportion to the sort of work already done for similar goals.
If those pages are not contiguous in kernel memory, then it would be a little bit harder to search and update the tree, but only a little bit. Using accessor-functions is not rocket-science, and completely bypassing them when not needed is quite easy to arrange.
> We also need to maintain consistent metadata across all pages for a block (i.e. locking, flags, etc.).
I don't see that.
When a page is read, there is no need to also read other pages in the same filesystem block. So different pages in the one block do not all need to be up-to-date for example.
When writing to a block we would need to pre-read all pages which were not being over-written, but that is not intrinsically different to current code.
Certainly all the pages in a block need to be dirtied at the same time. Maybe there would be some awkwardness there...
Thanks.
Posted Jan 28, 2017 22:05 UTC (Sat)
by dgc (subscriber, #6611)
[Link] (1 responses)
> - Allocation one high-order page per block
Was done in 2007 using compound pages. Rejected over (valid) fears of memory fragmentation taking systems down.
> - Allocate a set of order-0 pages and fan-out the block to them
Was also done in 2007 for ext2, using a struct fsblock to replace the struct buffer_head. Never gained traction because it required rewriting the entire IO path of each filesystem that wanted to support bs > ps.
All of this, and more, can be read about from this LWN article from 2007:
https://lwn.net/Articles/250335/
As it is, for bs > ps we can still do page granularity data IO /if/ the underlying storage has a sector size <= ps. Only block allocation/freeing/zeroing in the filesystem needs to be done on block boundaries....
-Dave.
Posted Jan 29, 2017 21:50 UTC (Sun)
by giraffedata (guest, #1954)
[Link]
> As it is, for bs > ps we can still do page granularity data IO /if/ the underlying storage has a sector size <= ps. Only block allocation/freeing/zeroing in the filesystem needs to be done on block boundaries....
Yes, "filesystem block size" must imply different things to people, because I wrote a filesystem driver in 2001 that had filesystem blocks larger than the 4K page size and it's hard to believe there aren't lots of them today. I remember some hairiness in making sure a person couldn't see a previous owner's data in a new block in his file (considering system interruptions and everything), but other than that, it was pretty straightforward.
Filesystem block size normally means the unit of allocation from the storage device address space. I have the impression people here are talking instead about the unit of I/O to the storage device.
Posted Jan 26, 2017 0:57 UTC (Thu)
by Frogging101 (guest, #113180)
[Link] (2 responses)
I don't get this at all. How can the page cache be a bad thing? Having a page cache means that you can, for example, recursively grep the kernel source much faster the second time around. Or if you read a big file and then want to read it again, you can do it without waiting for the hard drive every single time. From his message:
> We're rapidly moving away from the world where a page cache is
> needed to give applications decent performance. DAX doesn't have a
> page cache, applications wanting to use high IOPS (hundreds of
> thousands to millions) storage are using direct IO, because the page
> cache just introduces latency, memory usage issues and
> non-deterministic IO behaviour.
>
> If we try to make the page cache the "one true IO optimisation source"
> then we're screwing ourselves because the incoming IO technologies
> simply don't require it anymore. We need to be ahead of that curve,
> not playing catchup,
What on earth is he talking about? In most scenarios at the user level, direct I/O will *not* be faster than a page cache. Even on an SSD. I use an SSD and I experience the benefit of the page cache every single day. What is this magical incoming storage technology that has faster access times than RAM and will completely supplant everything we use now to the point that we don't need a page cache at all?
If I'm misunderstanding this, please do set me straight. But what I'm reading is very puzzling.
Posted Jan 26, 2017 1:58 UTC (Thu)
by corbet (editor, #1)
[Link]
I think what you're missing is that this is talking about persistent memory. The data is already present in the processor's address space and can be accessed at near-RAM speeds. Copying that data into the page cache would be wasteful, which is why DAX doesn't do it. The point of the talk was that, while we don't need the data in the page cache, the page cache also manages metadata, and that we still want.
Posted Jan 27, 2017 0:21 UTC (Fri)
by dgc (subscriber, #6611)
[Link]
The context this is missing was a discussion about how the "page cache" is central to the /IO path/ and how that was detrimental to making efficient, optimal decisions about how access to file data was arbitrated. e.g. buffered IO always iterates pages first and maps them one at a time, so the filesystem has no opportunity to optimise multi-page IO accesses, nor can the page cache do things like transparently allocate large pages for caching because the filesystem might not be able to allocate a contiguous extent to back the large page. Direct IO would iterate per-iovec, per-page to get mappings, again giving the fs much more mapping work to do than is necessary to map the user data to the underlying storage. And DAX was different again, but it too iterated page by page and called into the filesystem for each page.
So what I was really pointing out is that we have 3 different IO paths, each with their own quirks and differences, but all with the same fundamental problem - that in-memory data structure iteration determined the IO patterns that the filesystem is asked to map. What I was talking about is that we need to invert the order of access - to first map the /entire/ IO region, then iterate the in-memory data structures that need to be manipulated. IOWs, what I was saying is that we need to move to filesystem-based extent IO mapping infrastructure, not that there was "no longer a need for a page cache".
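The inversion being described, map the whole IO region first and only then iterate the in-memory structures, can be illustrated with a toy model that counts filesystem mapping calls. This is a sketch of the idea only, not the actual fs/iomap.c interface, and all the names are invented:

```python
calls = {"per_page": 0, "per_extent": 0}

def fs_map_block(file_block):
    """Old style: one filesystem call per page being mapped."""
    calls["per_page"] += 1
    return 1000 + file_block        # pretend the file is contiguous on disk

def fs_map_extent(first_block, count):
    """Inverted style: map the whole region with a single filesystem call."""
    calls["per_extent"] += 1
    return [1000 + first_block + i for i in range(count)]

# Iterating memory first means one mapping call per page...
old = [fs_map_block(b) for b in range(8)]
# ...while mapping the extent first needs only one call for the whole IO.
new = fs_map_extent(0, 8)

assert old == new                                   # same mapping either way
assert calls == {"per_page": 8, "per_extent": 1}    # far fewer filesystem calls
```

The mapping produced is identical; what changes is the number of round trips into the filesystem, which is where the throughput and latency wins quoted below come from.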
That's a /very different/ message, hence my concern that key developers have not understood why we've implemented the fs/iomap.c infrastructure for extent based IO mapping.
If you think I'm still wrong, look at the numbers: when we switched XFS buffered writes (i.e. through the page cache) to this mechanism, we saw a 20-30% improvement in high bandwidth, large IO throughput because we now do a per userspace IO mapping call rather than one per page being copied via the page cache. When we switched XFS direct IO to use iomap, Jens Axboe reported a 20% increase in small read/write IOPS on high end flash devices and a significant reduction in per-IO latency, all without needing to optimise the code to within an inch of being unmaintainable (*cough* fs/directio.c *cough*). And when we switched DAX to use the iomap code, we saw both ext4 and XFS read and write throughput increase by 50-500% (depending on hardware and operations being performed).
Even better, this new infrastructure allows the iomap actor function to determine what size data structure best matches the extent size the filesystem mapped, allowing transparent use of different size pages in the page cache (for buffered IO) or page tables (for DAX) without the filesystem having to do anything special to enable it.
And to clarify a common misconception: DAX does not use the /page cache/. It does not use struct pages, it doesn't cache the pages on LRUs for aging and reclaim when memory is low, etc. What DAX uses is the per-inode address space infrastructure for managing file mappings. i.e. it uses the mapping tree to map page table entries directly to the data the file contains, not a page that contains a cached copy. IOWs, with DAX the mapping tree is used to arbitrate access to the data store rather than managing the life cycle of cached data.
We know that caching data is not necessary for DAX - caching mappings and other metadata needed to access the data rapidly is all we need to do. Even for high-end SSDs, caching data can be harmful to performance, regardless of what other people say, because the overhead of page cache allocation, data copies, and memory reclaim can be vastly higher than simply reading the data from storage again.
So call this splitting hairs, but the term "page cache" is used to encompass lots of different pieces that aren't actually pages or caches. What Willy was suggesting is that we should be unifying the caching and arbitration of metadata that gets us to the data efficiently through the common address space infrastructure. I think the message is being mangled and/or misunderstood by calling this "page cache functionality" when what we are actually talking about is optimising file address space management, not the caching of pages...
-Dave