
Sharing pages between mappings

Posted Mar 26, 2017 22:42 UTC (Sun) by dgc (subscriber, #6611)
Parent article: Sharing pages between mappings

> Per-page sharing, instead, was widely agreed to be insane.

The information in the article leads me to believe that this conclusion is wrong, simply because nobody in the discussion has realised what is actually being shared. IMO, this is not a "mapping level" problem, because what we are trying to share is a cached range of /physical storage/, not a random set of disjoint page cache pages spread across multiple inodes/mappings. I.e., we want to share information across all references to a physical block, not references between page cache mappings.

Keep in mind that the page cache is divorced from the physical location and sharing aspects of blocks on disk - it is indexed and accessed entirely via {inode, logical file offset}. It is (clearly) very difficult to retrofit physical sharing information into the structure of the cache, because it was specifically designed to avoid knowing anything about the physical location and characteristics of the underlying physical data layout. The information about what is shared and how it is managed is hidden /entirely/ by the filesystem. The filesystem is what shares and unshares the physical storage, so cache sharing really needs to be driven by those filesystem operations, not mm/ level operations and structures.

So: if we had a filesystem that cached the physical location data indexed by {device, block address}, then we could simply do a lookup in that cache first to find whether we can share existing cached pages into the new inode/mapping. If it exists, simply take a reference to it and share the pages that object holds. Then things like writes and breaking of COW information (which are managed by the filesystem) only require mapping invalidation (which removes the reference to the existing underlying physical cache object) and replacing it with the new physical cache object and reference.
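
A rough sketch of what that lookup might look like; every structure and function name here is hypothetical rather than existing kernel API:

    #include <linux/types.h>
    #include <linux/refcount.h>

    /* A cached range of physical storage, keyed by {device, block}. */
    struct phys_cache_object {
            dev_t           dev;    /* backing device */
            sector_t        block;  /* physical block address */
            refcount_t      refs;   /* one ref per sharing inode/mapping */
            /* ... the cached pages for this physical range ... */
    };

    /*
     * Check the physical cache first.  If another inode has already
     * cached this range, the new inode/mapping just takes a reference
     * and shares the pages the object holds.
     */
    static struct phys_cache_object *phys_cache_get(dev_t dev, sector_t block)
    {
            struct phys_cache_object *pco;

            pco = phys_cache_lookup(dev, block);    /* hypothetical lookup */
            if (pco)
                    refcount_inc(&pco->refs);
            return pco;
    }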

This doesn't require any changes to the higher-layer mm code, nor even for it to be aware of sharing - it's all filesystem-level stuff and is completely managed by existing callouts into the filesystems (i.e. the same callouts that result in breaking COW extents on disk). The rest of the system does not even need to be aware that the filesystem is using a physical block-layer cache....

FWIW, I think I'll call this cache construct by its traditional name: a "buffer cache".



Sharing pages between mappings

Posted Mar 27, 2017 8:38 UTC (Mon) by mszeredi (guest, #19041) [Link] (3 responses)

Dave, I think the big issue is not how we find pages to share or unshare (which is what you are talking about), but how to deal with assumptions in the mm code. E.g. having page->mapping leads to the assumption that a page can only belong to a single inode (okay, we have inode->i_mapping, which is what viro was referring to, but that only helps if we never want to break the sharing). Having page->index leads to the assumption that the page can only be at a certain offset in the file. Whatever you do in your filesystem, those assumptions must be dealt with in the mm, which is where the major pain is, I think.
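
To make those assumptions concrete: generic mm code derives both the owning inode and the file offset from the page itself, roughly as below. The second helper mirrors the real page_offset() in <linux/pagemap.h>; both function names here are illustrative, not kernel API. Sharing a page between two inodes, or at two offsets, breaks both derivations at once.

    #include <linux/mm_types.h>
    #include <linux/pagemap.h>

    /* A page has exactly one owning inode... */
    static inline struct inode *page_owner_inode(struct page *page)
    {
            return page->mapping->host;
    }

    /* ...and exactly one file offset, derived from page->index.
     * This mirrors page_offset() in <linux/pagemap.h>. */
    static inline loff_t page_file_pos(struct page *page)
    {
            return (loff_t)page->index << PAGE_SHIFT;
    }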

> If it exists, simply take a reference to it and share the pages that object
> holds. Then things like writes and breaking of COW information (which are
> managed by the filesystem) only require mapping invalidation (which removes
> the reference to the existing underlying physical cache object) and replacing
> it with the new physical cache object and reference.

So how is this supposed to work on the level of vm structures (inode->i_mapping, file->f_mapping, mapping->i_mmap, etc)?

Sharing pages between mappings

Posted Mar 27, 2017 23:42 UTC (Mon) by dgc (subscriber, #6611) [Link] (2 responses)

When you are talking about reflink or snapshots, the shared pages in shared files are /always/ at the same file offset.
If you do something that changes the file offset of a shared extent (which can only be done by special fallocate() modes), then the first thing the filesystem does is invalidate the page cache over that range on that inode; then it breaks the extent-level sharing. The next data access will repopulate the page cache with unshared pages. IOWs, it breaks any sharing that has previously occurred.
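
A minimal sketch of that sequence, assuming an offset-changing mode like FALLOC_FL_COLLAPSE_RANGE; truncate_pagecache_range() is a real helper, while fs_break_extent_sharing() is a hypothetical stand-in for the filesystem's own unsharing operation:

    #include <linux/fs.h>
    #include <linux/mm.h>

    static int fs_move_shared_range(struct inode *inode, loff_t start,
                                    loff_t len)
    {
            int error;

            /* 1. Invalidate the page cache over the range on this inode. */
            truncate_pagecache_range(inode, start, start + len - 1);

            /* 2. Break the extent-level sharing on disk (hypothetical). */
            error = fs_break_extent_sharing(inode, start, len);

            /* The next data access repopulates the cache with unshared pages. */
            return error;
    }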

Now, of course, if you are doing de-duplication then that's a different matter, but that's easily handled at lookup time - if the inode file offset is different between two shared data extents, then simply treat them as "not sharable". Hence we can simply ignore that case altogether.

As for inode->i_mapping/page->mapping/file->f_mapping, all I figured was that we use a shadow mechanism. page->mapping/file->f_mapping always points to the inode->i_mapping, but we add a second layer to it. The inode->i_mapping points to all its private pages, and contains a pointer to a mapping containing shared pages. That "shared page" mapping would be the i_mapping of the "new buffer cache" address space inode. We always look up private pages first, and only if nothing is found do we do a shared-mapping page lookup.
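
A rough sketch of that two-level lookup, assuming a hypothetical i_shared_mapping pointer hung off the address_space; find_get_page() is the normal page cache lookup:

    #include <linux/pagemap.h>

    static struct page *shadow_find_page(struct address_space *mapping,
                                         pgoff_t index)
    {
            struct page *page;

            /* Private pages take precedence... */
            page = find_get_page(mapping, index);

            /* ...and only on a miss do we try the shared mapping
             * (i_shared_mapping is a hypothetical field). */
            if (!page && mapping->i_shared_mapping)
                    page = find_get_page(mapping->i_shared_mapping, index);

            return page;    /* pages from the shared mapping are read-only */
    }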

Pages in the shared mapping are always read-only. Hence when we get a write fault on a shared page, we COW the page and insert the new page into the inode->i_mapping private space, then call ->fault, and the filesystem does a COW operation to break the extent sharing. Lookups on that file offset will now find the private writable page rather than the shared page. Lookups on other inodes that share the original page won't even know that this happened.
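
A sketch of that write-fault path under the same assumptions; cow_shared_page() and fs_break_cow_extent() are hypothetical stand-ins for the page copy and the filesystem's extent-unsharing operation, while file_inode() and add_to_page_cache_lru() are real interfaces:

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pagemap.h>

    static int shadow_page_mkwrite(struct vm_fault *vmf)
    {
            struct inode *inode = file_inode(vmf->vma->vm_file);
            struct page *new;
            int error;

            /* Copy the read-only shared page (hypothetical helper)... */
            new = cow_shared_page(vmf->page);
            if (!new)
                    return VM_FAULT_OOM;

            /* ...insert the copy into the inode's private mapping... */
            error = add_to_page_cache_lru(new, inode->i_mapping,
                                          vmf->page->index, GFP_KERNEL);
            if (error)
                    return VM_FAULT_OOM;

            /* ...then break the on-disk extent sharing (hypothetical).
             * Later lookups at this offset find the private writable
             * page; other inodes sharing the original never notice. */
            return fs_break_cow_extent(inode, vmf->page->index) ?
                    VM_FAULT_SIGBUS : 0;
    }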

Also, from the mmap side of things, we get told by the mm/ subsystem what the "private" inode is (via file_inode(vmf->vma->vm_file), which is set to the "private" inode when the region is mmap()d), so the page fault paths should never need to care that page->mapping->host points at a shared address space. And for things like invalidation, we already have hooks into the filesystems to handle internal references to pages (e.g. buffer heads), so the majority of mm/ level operations really won't need to care what we do with the mappings.

Sure, there are some dragons to be defeated in all this, but when I looked at it a year ago there weren't any obvious showstopper issues....

Sharing pages between mappings

Posted Mar 28, 2017 4:26 UTC (Tue) by roc (subscriber, #30627) [Link] (1 responses)

BTRFS_IOC_CLONE_RANGE lets you share pages at different file offsets.

Sharing pages between mappings

Posted Mar 28, 2017 21:57 UTC (Tue) by dgc (subscriber, #6611) [Link]

BTRFS_IOC_CLONE_RANGE to different offsets (a.k.a. FICLONERANGE at the VFS level) is exactly the same case as extents shared by FIDEDUPERANGE - page->index does not match for the two files, so don't share cached pages.
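
As a trivial illustration of that lookup-time gate (helper name hypothetical):

    #include <linux/types.h>

    static bool can_share_cached_pages(pgoff_t index_a, pgoff_t index_b)
    {
            /* FICLONERANGE to a different offset, or FIDEDUPERANGE with
             * mismatched offsets, fail this check: treat as not sharable. */
            return index_a == index_b;
    }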

These operations are, however, quite rare, and they are not used for setting up exact directory-tree clones for containers, so they aren't really a significant optimisation target that people are concerned about....

