Sharing pages between mappings
Szeredi is working with the overlayfs filesystem, which works by stacking a virtual filesystem on top of another filesystem to provide a modified view of that lower filesystem. When pages from the real file in the lower filesystem are read, they show up in the page cache. When the upper filesystem is accessed, the virtual file at that level is a separate mapping, so the same pages show up a second time in the page cache. The same sort of problem can arise in a single copy-on-write (COW) filesystem like Btrfs; different files can share the same data on disk, but that data is duplicated in the page cache. At best, the result of this duplication is wasted memory.
![Miklos Szeredi](https://static.lwn.net/images/conf/2017/lsfmm/MiklosSzeredi-sm.jpg)

Kirill Shutemov noted that anonymous memory (program data that does not have a file behind it) has similar semantics; a page can appear in many different address spaces. For anonymous pages, the anon_vma mechanism allows the kernel to keep track of everything and provides proper COW semantics. Perhaps something similar could be done with file-backed pages.
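For readers who have not met the reverse-mapping machinery, the structures below are a heavily simplified, illustrative sketch of the anon_vma idea (not the kernel's actual definitions): each anonymous page points at a shared object that knows every VMA which might map it, so the kernel can walk from a page back to all of the address spaces using it.

```c
/* Heavily simplified sketch of the anon_vma idea; the real kernel structures
 * (struct anon_vma, struct anon_vma_chain, the rmap walkers) are far more
 * involved than this. */
struct vma;	/* one mapping of the region in one address space */

struct anon_vma_sketch {
	struct vma	**vmas;		/* every VMA that may map these pages */
	unsigned long	nr_vmas;
};

struct page_sketch {
	struct anon_vma_sketch	*anon;	/* reverse map: page -> its mappers */
	unsigned long		index;	/* page's offset within the region */
};

/*
 * Reverse-mapping walk: visit every address space that might hold the page.
 * This is what lets the kernel find, unmap, or COW an anonymous page that is
 * shared by many processes after fork().
 */
static void for_each_mapper(struct page_sketch *page,
			    void (*fn)(struct vma *vma, unsigned long index))
{
	unsigned long i;

	for (i = 0; i < page->anon->nr_vmas; i++)
		fn(page->anon->vmas[i], page->index);
}
```

Doing "something similar" for file-backed pages would mean giving a page-cache page a comparable way to name all of the mappings that share it.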
James Bottomley said that the important questions were how much it would cost to maintain these complicated mappings, and how coherence would be maintained. He pointed out that pages could be shared, go out of sharing for a while, then become shared again. Perhaps, he said, the KSM mechanism could be used to keep things in order. Szeredi said he hadn't really thought about all of those issues yet.
On the question of cost, Josef Bacik said that his group had tried to implement this sort of feature and found it to be "insane". There are a huge number of places in the code that would need to be audited for correct behavior. There would be a lot of real-world benefits, he said, but he decided that it simply wasn't worth it.
Matthew Wilcox suggested a scheme where there would be a master inode on each filesystem with other inodes sharing pages linked off of it. But Al Viro responded that this approach has its own challenges, since the inodes involved do not all have to be on the same filesystem. Given that, he asked, where would this master inode be? Bacik agreed, saying that he had limited his exploration to single-filesystem sharing; things get "even more bonkers" if multiple filesystems are involved. If this is going to be done at all, he said, it should be done on a single filesystem first.
Bottomley said that the problems come from attempting to manage the sharing at the page level. If it were done at the inode level instead, things would be easier. Viro said that inodes can actually share data now, but it's an all-or-nothing deal; there is no way to share only a subset of pages. At that level, this functionality has worked for the last 15 years. But, since the entire file must be shared, Szeredi pointed out, the scheme falls down if the sharing must be broken at some point — if the file is written, for example. Viro suggested trying to steal all of the pages when that happens, but Szeredi said that memory mappings would still point to the shared pages.
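One plausible reading of the existing all-or-nothing mechanism Viro refers to (this sketch is my assumption, with simplified stand-in types rather than the kernel's real declarations) is that a second inode simply aims its page-cache handle at the first inode's cache, so either every page is shared or none is:

```c
/* Simplified stand-ins for struct address_space and struct inode; the point
 * is only that whole-file sharing amounts to pointing one inode's i_mapping
 * at another inode's cache, which is inherently all-or-nothing. */
struct address_space_sketch {
	void *pages;			/* stand-in for the tree of cached pages */
};

struct inode_sketch {
	struct address_space_sketch *i_mapping;	/* cache used for all I/O */
	struct address_space_sketch i_data;	/* this inode's own cache */
};

static void share_whole_file(struct inode_sketch *upper,
			     struct inode_sketch *lower)
{
	/* Every lookup through 'upper' now hits 'lower's pages; there is no
	 * way to redirect only a subset of offsets, which is why the scheme
	 * breaks down as soon as part of the file must diverge. */
	upper->i_mapping = lower->i_mapping;
}
```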
Bottomley then suggested stepping back and considering the use cases for this feature. Users with lots of containers, he said, want to transparently share a lot of the same files between those containers; this sort of feature would be useful in such settings. Bacik added that doing this sharing at the inode level would lose a lot of flexibility, but it might be enough for the container case which, he said, might be the most important case. Jan Kara suggested simply breaking the sharing when a file is opened for write, or even requiring that users explicitly request sharing, but Bottomley responded that container users would not do that.
The conclusion from the discussion is that per-inode sharing of pages between mappings is probably possible if somebody were sufficiently motivated to implement it. Per-page sharing, instead, was widely agreed to be insane.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/Union | 
| Conference | Storage, Filesystem, and Memory-Management Summit/2017 | 
Sharing pages between mappings
Posted Mar 26, 2017 22:42 UTC (Sun) by dgc (subscriber, #6611) [Link] (4 responses)
       
The information in the article leads me to believe that this conclusion is wrong, simply because nobody in the discussion has realised what is actually being shared. IMO, this is not a "mapping level" problem, as what we are trying to share is a cached range of /physical storage/, not a random set of disjoint page cache pages spread across multiple inodes/mappings. i.e. we want to share information across all references to a physical block, not references between page cache mappings.

Keep in mind that the page cache is divorced from the physical location and sharing aspects of blocks on disk - it is indexed and accessed entirely via {inode, logical file offset}. It is (clearly) very difficult to retrofit physical sharing information into the structure of the cache because it was specifically designed to avoid knowing anything about the physical location and characteristics of the underlying data layout. The information about what is shared and how it is managed is hidden /entirely/ by the filesystem. The filesystem is what shares and unshares the physical storage, so cache sharing really needs to be driven by those filesystem operations, not by mm/ level operations and structures.

So: if we had a filesystem that cached the physical location data indexed by {device, block address}, then we could simply do a lookup in that cache first to find whether we can share existing cached pages into the new inode/mapping. If it exists, simply take a reference to it and share the pages that object holds. Then things like writes and breaking of COW information (which are managed by the filesystem) only require mapping invalidation (which removes the reference to the existing underlying physical cache object) and replacing it with the new physical cache object and reference.

This doesn't require any changes to the higher-layer mm code, or even for it to be aware of sharing - it's all filesystem-level stuff and is completely managed by existing callouts into the filesystems (i.e. the same callouts that result in breaking COW extents on disk). The rest of the system does not even need to be aware that the filesystem is using a physical block-layer cache....

FWIW, I think I'll call this cache construct by its traditional name: a "buffer cache".
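A minimal sketch of the idea being described, with hypothetical names (nothing below is an existing kernel interface): cached data is owned by an object keyed by {device, physical block}, and each inode's mapping simply takes a reference to it.

```c
#include <sys/types.h>	/* dev_t */

/* Hypothetical interface for a {device, block}-indexed cache object; the
 * names are invented for illustration and do not exist in the kernel. */
struct pblock_key {
	dev_t			dev;
	unsigned long long	block;		/* physical block address */
};

struct pblock_cache_object {
	struct pblock_key	key;
	long			refcount;	/* one per sharing inode/mapping */
	void			*pages;		/* cached data for this extent */
};

/*
 * Look up (or create) the shared object for a physical extent.  A filesystem
 * that knows two inodes reference the same {dev, block} would call this from
 * its read path and share the pages the object already holds.
 */
struct pblock_cache_object *pblock_cache_get(struct pblock_key key);

/*
 * Called when the filesystem breaks sharing (e.g. COW on write): the inode
 * drops its reference and repopulates its own mapping with private pages.
 */
void pblock_cache_put(struct pblock_cache_object *obj);
```

The design choice here is that only the filesystem, which already knows which extents are shared, ever consults this cache; the mm layer keeps seeing ordinary per-inode mappings.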
 
     
    
Sharing pages between mappings
Posted Mar 27, 2017 8:38 UTC (Mon) by mszeredi (guest, #19041) [Link] (3 responses)
       
> If it exists simply take a reference to it and share the pages that object holds. 
So how is this supposed to work on the level of vm structures (inode->i_mapping, file->f_mapping, mapping->i_mmap, etc)?     
 
     
    
Sharing pages between mappings
Posted Mar 27, 2017 23:42 UTC (Mon) by dgc (subscriber, #6611) [Link] (2 responses)
       
Now, of course, if you are doing de-duplication then that's a different matter, but that's easily handled at lookup time - if the inode file offset is different between two shared data extents, then simply treat them as "not sharable". Hence we can simply ignore that case altogether.

As for inode->i_mapping/page->mapping/file->f_mapping, all I figured was that we use a shadow mechanism. page->mapping/file->f_mapping always points to the inode->i_mapping, but we add a second layer to it. The inode->i_mapping points to all of its private pages, and contains a pointer to a mapping containing shared pages. That "shared page" mapping would be the "new buffer cache" address space inode->i_mapping. We always look up private pages first, and only if nothing is found do we do a shared-mapping page lookup.

Pages in the shared mapping are always read-only. Hence when we get a write fault on a shared page, we COW the page and insert the new page into the inode->i_mapping private space, then call ->fault and the filesystem does a COW operation to break the extent sharing. Lookups on that file offset will now find the private writable page rather than the shared page. Lookups on other inodes that share the original page won't even know that this happened.

Also, from the mmap side of things, we get told by the mm/ subsystem what the "private" inode is (via file_inode(vmf->vma->vm_file), which is set to the "private" inode when the region is mmap()ed), so the page fault paths should never need to care that page->mapping->host points at a shared address space. And for things like invalidation, we already have hooks into the filesystems to handle internal references to pages (e.g. buffer heads), so the majority of mm/ level operations really won't need to care what we do with the mappings.

Sure, there are some dragons to be defeated in all this, but when I looked at it a year ago there weren't any obvious showstopper issues....
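Here is a rough sketch of the two-level lookup and write-fault path described above; every helper name is hypothetical, invented only to make the flow concrete.

```c
/* Hypothetical sketch of the "private first, then shared" lookup and the
 * COW write fault described above; none of these helpers exist as such. */
struct page_s;
struct mapping_s {
	struct mapping_s *shared;	/* read-only "buffer cache" mapping */
	/* ... private page tree ... */
};

struct page_s *lookup_private(struct mapping_s *m, unsigned long index);
struct page_s *lookup_shared(struct mapping_s *m, unsigned long index);
struct page_s *cow_copy(struct page_s *page);
void insert_private(struct mapping_s *m, unsigned long index, struct page_s *p);
void fs_break_extent_sharing(struct mapping_s *m, unsigned long index);

/* Read path: private pages win; fall back to the shared mapping. */
static struct page_s *find_page(struct mapping_s *m, unsigned long index)
{
	struct page_s *page = lookup_private(m, index);

	return page ? page : lookup_shared(m, index);
}

/*
 * Write fault on a shared, read-only page: copy it into the private mapping
 * and let the filesystem break the on-disk extent sharing.  Other inodes
 * sharing the original page never notice.
 */
static struct page_s *handle_write_fault(struct mapping_s *m, unsigned long index)
{
	struct page_s *shared = lookup_shared(m, index);
	struct page_s *copy = cow_copy(shared);

	insert_private(m, index, copy);
	fs_break_extent_sharing(m, index);
	return copy;
}
```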
 
 
 
 
 
     
    
Sharing pages between mappings
Posted Mar 28, 2017 4:26 UTC (Tue) by roc (subscriber, #30627) [Link] (1 responses)

> Then things like writes and breaking of COW information (which are managed
> by the filesystem) only requires mapping invalidation ( which removes the
> reference to the existing underlying physical cache object) and replacing it
> with the new physical cache object and reference.
     
    
Sharing pages between mappings
Posted Mar 28, 2017 21:57 UTC (Tue) by dgc (subscriber, #6611) [Link]
       
If you do something that changes the file offset of a shared extent (which can only be done by special fallocate() modes) then the first thing the filesystem does is invalidate the page cache over that range on that inode, then it breaks the extent level sharing. The next data access will repopulate the page cache with unshared pages. IOWs, it breaks any sharing that has previously occurred.

These operations are, however, quite rare and are not used for setting up exact directory tree clones for containers, so they aren't really a significant optimisation target that people are concerned about....
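For concreteness, the sequence being described might look roughly like this; the helpers are stand-ins for the filesystem's own invalidation and extent-management routines, not real kernel functions.

```c
/* Hypothetical outline of handling an offset-changing operation (such as an
 * insert- or collapse-range fallocate) on a shared extent. */
struct inode_s;

void invalidate_range(struct inode_s *inode, long long start, long long end);
void break_extent_sharing(struct inode_s *inode, long long start, long long len);

static void change_offset_of_shared_extent(struct inode_s *inode,
					   long long start, long long len)
{
	/* 1. Drop the cached pages over the affected range, which also drops
	 *    any reference to a shared cache object for that extent. */
	invalidate_range(inode, start, start + len - 1);

	/* 2. Break the extent-level sharing in the filesystem's metadata. */
	break_extent_sharing(inode, start, len);

	/* 3. Nothing more to do: the next access repopulates the page cache
	 *    with private, unshared pages. */
}
```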
     
On drugs, or confused reporting?
Posted Mar 27, 2017 5:09 UTC (Mon) by ncm (guest, #165) [Link] (3 responses)
       
The reporting seems to refer to something else entirely, but it is hard to see any difference.  What am I missing?  Is everyone in kernel-land talking about the same thing, or are they as confused as we are? 
     
    
On drugs, or confused reporting?
Posted Mar 27, 2017 5:29 UTC (Mon) by dgc (subscriber, #6611) [Link] (2 responses)
       
Same thing goes for sparse directory tree clones generated with reflink - the VFS/OS thinks they are separate files and only the underlying filesystem knows that those cloned files share their data blocks with the other files..... 
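As a concrete user-space illustration of a reflink (error handling trimmed to the essentials; the filenames come from the command line): the FICLONE ioctl asks the filesystem to share the source file's blocks with the destination, after which the VFS still sees two unrelated inodes.

```c
/* Minimal reflink example: after the FICLONE ioctl the two files share data
 * blocks on a filesystem that supports it (e.g. Btrfs, XFS), but the VFS and
 * the page cache treat them as independent inodes. */
#include <fcntl.h>
#include <linux/fs.h>		/* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int src, dst;

	if (argc != 3) {
		fprintf(stderr, "usage: %s SOURCE CLONE\n", argv[0]);
		return 1;
	}

	src = open(argv[1], O_RDONLY);
	dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0 || ioctl(dst, FICLONE, src) < 0) {
		perror("reflink");
		return 1;
	}

	close(src);
	close(dst);
	return 0;
}
```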
 
     
    
On drugs, or confused reporting?
Posted Mar 27, 2017 16:33 UTC (Mon) by ms-tg (subscriber, #89231) [Link] (1 responses)
       
     
    
On drugs, or confused reporting?
Posted Apr 3, 2017 3:49 UTC (Mon) by kingdon (guest, #4526) [Link]
       
Docker (and similar systems) use a union mount so they don't need every container to have its own copy of all the files in the operating system. 
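As a concrete illustration of that setup (the paths here are made-up examples), a container runtime performs an overlay mount along these lines, letting many containers reuse a single read-only lower layer while each gets its own writable upper layer:

```c
/* Example overlay (union) mount as a container runtime might perform it;
 * the directories are placeholders. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	const char *opts =
		"lowerdir=/var/lib/image/rootfs,"
		"upperdir=/var/lib/container1/upper,"
		"workdir=/var/lib/container1/work";

	if (mount("overlay", "/var/lib/container1/merged", "overlay", 0, opts)) {
		perror("mount");
		return 1;
	}
	return 0;
}
```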
     