Sharing pages between mappings

Posted Mar 26, 2017 22:42 UTC (Sun) by dgc (subscriber, #6611)
Parent article: Sharing pages between mappings

The information in the article leads me to believe that this conclusion is wrong, simply because nobody in the discussion has realised what is actually being shared. IMO, this is not a "mapping level" problem, as what we are trying to share is a cached range of /physical storage/, not a random set of disjoint page cache pages spread across multiple inodes/mappings. i.e. we want to share information across all references to a physical block, not references between page cache mappings.
Keep in mind that the page cache is divorced from the physical location and sharing aspects of blocks on disk - it is indexed and accessed entirely via {inode, logical file offset}. It is (clearly) very difficult to retrofit physical sharing information into the structure of the cache, because it was specifically designed to avoid knowing anything about the physical location and characteristics of the underlying physical data layout. The information about what is shared and how it is managed is hidden /entirely/ by the filesystem. The filesystem is what shares and unshares the physical storage, so cache sharing really needs to be driven by those filesystem operations, not mm/ level operations and structures.
So: if we had a filesystem that cached the physical location data indexed by {device, block address}, then we could simply do a lookup in that cache first to find out whether we can share existing cached pages into the new inode/mapping. If it exists simply take a reference to it and share the pages that object holds. Then things like writes and breaking of COW information (which are managed by the filesystem) only require mapping invalidation (which removes the reference to the existing underlying physical cache object) and replacing it with the new physical cache object and reference.
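As a rough sketch of what that lookup side might look like (none of these structure or function names exist in the kernel; they are purely illustrative):

    /* Sketch only: a physical block cache indexed by {device, block address}. */
    struct pblock_key {
            dev_t dev;              /* block device */
            sector_t block;         /* physical block address */
    };

    struct pblock_object {
            struct pblock_key key;
            refcount_t refcount;            /* one ref per sharing inode/mapping */
            struct address_space *pages;    /* cached pages for this physical range */
    };

    /*
     * Look up the physical cache first; only allocate a new object (and
     * read from disk) if nothing is cached for this block yet.
     * pblock_cache_lookup()/pblock_cache_create() are hypothetical helpers.
     */
    struct pblock_object *pblock_get(dev_t dev, sector_t block)
    {
            struct pblock_object *po;

            po = pblock_cache_lookup(dev, block);
            if (po) {
                    refcount_inc(&po->refcount);    /* share the already-cached pages */
                    return po;
            }
            return pblock_cache_create(dev, block);
    }

Unsharing (the write/COW-break case) is then just dropping that reference and taking a new one on a different physical cache object - exactly the "mapping invalidation plus replace" step described above.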
This doesn't require any changes to the higher layer mm/ code, nor does it require that code to be aware of sharing at all - it's all filesystem level stuff and is completely managed by existing callouts into the filesystems (i.e. the same callouts that result in breaking COW extents on disk). The rest of the system does not even need to be aware that the filesystem is using a physical block layer cache....
FWIW, I think I'll call this cache construct by its traditional name: a "buffer cache".

Sharing pages between mappings

Posted Mar 27, 2017 8:38 UTC (Mon) by mszeredi (guest, #19041)

> If it exists simply take a reference to it and share the pages that object holds. 
So how is this supposed to work on the level of vm structures (inode->i_mapping, file->f_mapping, mapping->i_mmap, etc)?     
 
     
    
Sharing pages between mappings

Posted Mar 27, 2017 23:42 UTC (Mon) by dgc (subscriber, #6611)

Now, of course, if you are doing de-duplication then that's a different matter, but that's easily handled at lookup time - if the inode file offset is different between two shared data extents, then simply treat them as "not sharable". Hence we can simply ignore that case altogether. 
As for inode->i_mapping/page->mapping/file->f_mapping, all I figured was that we use a shadow mechanism. page->mapping/file->f_mapping always points to the inode->i_mapping, but we add a second layer to it. The inode->i_mapping points to all its private pages, and contains a pointer to a mapping containing shared pages. That "shared page" mapping would be the "new buffer cache" address space inode->i_mapping. We always look up private pages first, and only if nothing is found do we do a shared mapping page lookup.
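As a sketch, that two-level lookup might look something like the following, with the shared address space passed in from wherever the inode keeps its pointer to it (a hypothetical helper, not real kernel code; find_get_page() is the normal page cache lookup):

    /*
     * Sketch: look up the inode's private pages first, and only fall back
     * to the shared "buffer cache" mapping if nothing private exists.
     * Assumes sharing is only attempted when the file offsets match
     * (extents shared at different offsets are treated as "not sharable").
     */
    static struct page *shadowed_lookup(struct address_space *private,
                                        struct address_space *shared,
                                        pgoff_t index)
    {
            struct page *page;

            page = find_get_page(private, index);   /* private pages always win */
            if (page)
                    return page;

            if (!shared)                            /* inode has no shared extents */
                    return NULL;

            /* read-only pages shared by every inode referencing this extent */
            return find_get_page(shared, index);
    }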
Pages in the shared mapping are always read-only. Hence when we get a write fault on a shared page, we COW the page and insert the new page into the inode->i_mapping private space, then call ->fault and the filesystem does a COW operation to break the extent sharing. Lookups on that file offset will now find the private writable page rather than the shared page. Lookups on other inodes that share the original page won't even know that this happened.
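A minimal sketch of that write-fault path (error handling trimmed; break_extent_sharing() is a made-up stand-in for whatever filesystem callout actually breaks the on-disk COW extent):

    /*
     * Sketch: COW a shared, read-only page into the inode's private
     * mapping on write fault, then have the filesystem break the
     * extent-level sharing.
     */
    static int cow_shared_page(struct address_space *mapping, pgoff_t index,
                               struct page *shared_page)
    {
            struct page *new_page;

            new_page = alloc_page(GFP_KERNEL);
            if (!new_page)
                    return -ENOMEM;

            copy_highpage(new_page, shared_page);   /* private copy of the data */

            /* lookups on this offset now find the private, writable page */
            if (add_to_page_cache_lru(new_page, mapping, index, GFP_KERNEL)) {
                    put_page(new_page);
                    return -ENOMEM;
            }

            /* the filesystem unshares the extent on disk, as on any COW write */
            return break_extent_sharing(mapping->host, index);
    }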
Also, from the mmap side of things, we get told by the mm/ subsystem what the "private" inode is (via file_inode(vmf->vma->vm_file), which is set to the "private" inode when the region is mmap()d), so the page fault paths should never need to care that page->mapping->host points at a shared address space. And for things like invalidation, we already have hooks into the filesystems to handle internal references to pages (e.g. buffer heads), so the majority of mm/ level operations really won't need to care what we do with the mappings.
Sure, there are some dragons to be defeated in all this, but when I looked at it a year ago there weren't any obvious showstopper issues....
 
 
 
 
 
     
    
Sharing pages between mappings

Posted Mar 28, 2017 4:26 UTC (Tue) by roc (subscriber, #30627)

> Then things like writes and breaking of COW information (which are managed
> by the filesystem) only require mapping invalidation (which removes the
> reference to the existing underlying physical cache object) and replacing it
> with the new physical cache object and reference.

Sharing pages between mappings

Posted Mar 28, 2017 21:57 UTC (Tue) by dgc (subscriber, #6611)

If you do something that changes the file offset of a shared extent (which can only be done by special fallocate() modes), then the first thing the filesystem does is invalidate the page cache over that range on that inode; then it breaks the extent level sharing. The next data access will repopulate the page cache with unshared pages. IOWs, it breaks any sharing that has previously occurred.

These operations are, however, quite rare and are not used for setting up exact directory tree clones for containers, so they aren't really a significant optimisation target that people are concerned about....
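The order of operations for that is roughly the following (sketch only; fs_break_extent_sharing() is a stand-in name for the filesystem's internal unsharing path):

    /*
     * Sketch: an fallocate() mode that changes the file offset of a
     * shared extent first invalidates the cached pages over the range,
     * then breaks the extent sharing. The next access repopulates the
     * page cache with unshared pages.
     */
    static int move_shared_extent(struct inode *inode, loff_t start, loff_t len)
    {
            /* throw away any cached (possibly shared) pages over the range */
            truncate_inode_pages_range(inode->i_mapping, start, start + len - 1);

            /* break the extent-level sharing in the filesystem metadata */
            return fs_break_extent_sharing(inode, start, len);
    }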
     