
Sharing pages between mappings

By Jonathan Corbet
March 26, 2017

LSFMM 2017
In the memory-management subsystem, the term "mapping" refers to the connection between pages in memory and their backing store — the file that represents them on disk. One of the fundamental assumptions in the kernel is that a given page in the page cache belongs to exactly one mapping. But, as Miklos Szeredi explained in a plenary session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit, there are situations where it would be desirable to associate the same page with multiple mappings. Achieving this goal may not be easy, though.
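The single-mapping assumption is baked directly into the core data structures. A simplified sketch (fields paraphrased from include/linux/mm_types.h; the real definitions carry many more members, unions, and locks):

    /* Simplified sketch -- many fields, unions, and locks omitted. */
    struct page {
            unsigned long flags;
            struct address_space *mapping;  /* the one mapping this page belongs to */
            pgoff_t index;                  /* its page-sized offset within that mapping */
            /* ... */
    };

    struct address_space {
            struct inode *host;                /* the backing inode */
            struct radix_tree_root page_tree;  /* cached pages, indexed by page->index */
            /* ... */
    };

Since each page carries exactly one mapping pointer and one index, there is no natural place to record that the same data is also cached on behalf of a second inode.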

Szeredi is working with the overlayfs filesystem, which works by stacking a virtual filesystem on top of another filesystem to provide a modified view of that lower filesystem. When pages from the real file in the lower filesystem are read, they show up in the page cache. When the upper filesystem is accessed, the virtual file at that level is a separate mapping, so the same pages show up a second time in the page cache. The same sort of problem can arise in a single copy-on-write (COW) filesystem like Btrfs; different files can share the same data on disk, but that data is duplicated in the page cache. At best, the result of this duplication is wasted memory.

Kirill Shutemov noted that anonymous memory (program data that does not have a file behind it) has similar semantics; a page can appear in many different address spaces. For anonymous pages, the anon_vma mechanism allows the kernel to keep track of everything and provides proper COW semantics. Perhaps something similar could be done with file-backed pages.
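For comparison, a rough sketch of the anonymous-page machinery (heavily simplified; the real structures in include/linux/rmap.h also carry locks, reference counts, and parent pointers):

    /* Simplified reverse-mapping structures for anonymous pages. */
    struct anon_vma {
            struct anon_vma *root;    /* shared root for a family of forked processes */
            struct rb_root rb_root;   /* interval tree leading to every VMA involved */
            /* lock, refcount, ... */
    };

    /*
     * An anonymous page stores a pointer to its anon_vma in page->mapping
     * (tagged with the PAGE_MAPPING_ANON bit), so the reverse-mapping code
     * can walk the tree above and find every VMA, in every address space,
     * that maps the page.
     */

The suggestion, then, was that file-backed pages might need a similar layer of indirection between the page and the set of inodes interested in it.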

James Bottomley said that the important questions were how much it would cost to maintain these complicated mappings, and how coherence would be maintained. He pointed out that pages could be shared, go out of sharing for a while, then become shared again. Perhaps, he said, the KSM mechanism could be used to keep things in order. Szeredi said he hadn't really thought about all of those issues yet.

On the question of cost, Josef Bacik said that his group had tried to implement this sort of feature and found it to be "insane". There are a huge number of places in the code that would need to be audited for correct behavior. There would be a lot of real-world benefits, he said, but he decided that it simply wasn't worth it.

Matthew Wilcox suggested a scheme where there would be a master inode on each filesystem with other inodes sharing pages linked off of it. But Al Viro responded that this approach has its own challenges, since the inodes involved do not all have to be on the same filesystem. Given that, he asked, where would this master inode be? Bacik agreed, saying that he had limited his exploration to single-filesystem sharing; things get "even more bonkers" if multiple filesystems are involved. If this is going to be done at all, he said, it should be done on a single filesystem first.
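As a purely hypothetical illustration of the idea (none of these names exist in the kernel), the master inode would have to anchor the list of inodes borrowing its pages:

    /* Hypothetical only -- a sketch of the "master inode" idea, not real code. */
    struct shared_page_master {
            struct inode     *master;   /* inode that actually owns the cached pages */
            struct list_head  sharers;  /* inodes whose mappings borrow those pages */
            spinlock_t        lock;
    };

The question Viro raised is precisely which filesystem this structure, and the master inode itself, would live on.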

Bottomley said that the problems come from attempting to manage the sharing at the page level. If it were done at the inode level instead, things would be easier. Viro said that inodes can actually share data now, but it's an all-or-nothing deal; there is no way to share only a subset of pages. At that level, this functionality has worked for the last 15 years. But, since the entire file must be shared, Szeredi pointed out, the scheme falls down if the sharing must be broken at some point — if the file is written, for example. Viro suggested trying to steal all of the pages when that happens, but Szeredi said that memory mappings would still point to the shared pages.
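The whole-file sharing Viro described works because an inode does not have to use its own embedded address_space; a filesystem can point i_mapping at somebody else's, roughly like this (a minimal sketch with lifetime and locking omitted; share_whole_file_cache() is an illustrative name, not a kernel function):

    /*
     * Every struct inode embeds an address_space (i_data) but reaches the
     * page cache through its i_mapping pointer, which normally points back
     * at its own i_data. Redirecting it shares the entire file's cache.
     */
    static void share_whole_file_cache(struct inode *inode, struct inode *master)
    {
            inode->i_mapping = master->i_mapping;
    }

Since only one address_space is ever in play, every page still has exactly one mapping; the price is that sharing cannot be broken for a subset of the file, which is exactly the limitation Szeredi pointed out.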

Bottomley then suggested stepping back and considering the use cases for this feature. Users with lots of containers, he said, want to transparently share a lot of the same files between those containers; this sort of feature would be useful in such settings. Bacik added that doing this sharing at the inode level would lose a lot of flexibility, but it might be enough for the container case which, he said, might be the most important case. Jan Kara suggested simply breaking the sharing when a file is opened for write, or even requiring that users explicitly request sharing, but Bottomley responded that container users would not do that.

The conclusion from the discussion is that per-inode sharing of pages between mappings is probably possible if somebody were sufficiently motivated to try to implement it. Per-page sharing, instead, was widely agreed to be insane.

Index entries for this article
Kernel: Filesystems/Union
Conference: Storage, Filesystem, and Memory-Management Summit/2017



Sharing pages between mappings

Posted Mar 26, 2017 22:42 UTC (Sun) by dgc (subscriber, #6611) (4 responses)

> Per-page sharing, instead, was widely agreed to be insane.

The information in the article leads me to believe that this conclusion is wrong, simply because nobody in the discussion has realised what is actually being shared. IMO, this is not a "mapping level" problem, as what we are trying to share is a cached range of /physical storage/, not a random set of disjoint page cache pages spread across multiple inodes/mappings. i.e. we want to share information across all references to a physical block, not references between page cache mappings.

Keep in mind that the page cache is divorced from the physical location and sharing aspects of blocks on disk - it is indexed and accessed entirely via {inode, logical file offset}. It is (clearly) very difficult to retrofit physical sharing information into the structure of the cache because it was specifically designed to avoid knowing anything about the physical location and characteristics of the underlying physical data layout. The information about what is shared and how it is managed is hidden /entirely/ by the filesystem. The filesystem is what shares and unshares the physical storage, so cache sharing really needs to be driven by those filesystem operations, not mm/ level operations and structures.

So: if we had a filesystem that cached the physical location data indexed by {device, block address}, then we could simply do a lookup in that cache first to find out whether we can share existing cached pages into the new inode/mapping. If it exists, simply take a reference to it and share the pages that object holds. Then things like writes and breaking of COW information (which are managed by the filesystem) only require mapping invalidation (which removes the reference to the existing underlying physical cache object) and replacing it with the new physical cache object and reference.

This doesn't require any changes to the higher-layer mm code, or even for it to be aware of the sharing - it's all filesystem-level stuff and is completely managed by existing callouts into the filesystems (i.e. the same callouts that result in breaking COW extents on disk). The rest of the system does not even need to be aware that the filesystem is using a physical block layer cache....
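A loose sketch of the lookup being described, with every name hypothetical (this is the proposal rendered as pseudo-C, not an existing kernel interface):

    /* Hypothetical: a cache object holds the cached pages for one physical
     * extent and is indexed by {block device, block number} rather than by
     * {inode, file offset}. */
    struct phys_cache_object;

    struct phys_cache_object *phys_cache_lookup(struct block_device *bdev,
                                                sector_t block);
    int share_pages_from(struct phys_cache_object *pco,
                         struct inode *inode, pgoff_t index);
    int phys_cache_populate(struct inode *inode, pgoff_t index, sector_t block);

    /* On a page-cache miss for (inode, index), the filesystem maps the offset
     * to a physical block, then checks whether that block is already cached
     * on behalf of some other inode before going to disk. */
    static int fill_page_shared(struct inode *inode, pgoff_t index, sector_t block)
    {
            struct phys_cache_object *pco;

            pco = phys_cache_lookup(inode->i_sb->s_bdev, block);
            if (pco)
                    return share_pages_from(pco, inode, index);  /* take a reference */

            return phys_cache_populate(inode, index, block);     /* read from disk */
    }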

FWIW, I think I'll call this cache construct by its traditional name: a "buffer cache".

Sharing pages between mappings

Posted Mar 27, 2017 8:38 UTC (Mon) by mszeredi (guest, #19041) (3 responses)

Dave, I think the big issue is not how we find pages to share or unshare (which is what you are talking about), but how to deal with assumptions in the mm code. E.g. having a page->mapping leads to the assumption that a page can only belong to a single inode (okay, we have inode->i_mapping, which is what Viro was referring to, but that only helps if we never want to break the sharing). Having a page->index leads to the assumption that the page can only be at a certain offset in the file. Whatever you do in your filesystem, those assumptions must be dealt with in the mm, which is what the major pain is, I think.
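To make the assumption concrete: generic code all over mm/ turns a page back into "the one file and offset it caches" directly from those two fields, along these lines (the first is paraphrased from include/linux/pagemap.h, the second is an illustrative helper rather than a kernel API):

    /* One offset per page... */
    #define page_offset(page)  ((loff_t)(page)->index << PAGE_SHIFT)

    /* ...and one owning inode per page. */
    static inline struct inode *owning_inode(struct page *page)
    {
            return page->mapping->host;
    }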

> If it exists simply take a reference to it and share the pages that object holds.
> Then things like writes and breaking of COW information (which are managed
> by the filesystem) only requires mapping invalidation ( which removes the
> reference to the existing underlying physical cache object) and replacing it
> with the new physical cache object and reference.

So how is this supposed to work on the level of vm structures (inode->i_mapping, file->f_mapping, mapping->i_mmap, etc)?

Sharing pages between mappings

Posted Mar 27, 2017 23:42 UTC (Mon) by dgc (subscriber, #6611) (2 responses)

When you are talking about reflink or snapshots, the shared pages in shared files are /always/ at the same file offset.
If you do something that changes the file offset of a shared extent (which can only be done by special fallocate() modes) then the first thing the filesystem does is invalidate the page cache over that range on that inode, then it breaks the extent level sharing. The next data access will repopulate the page cache with unshared pages. IOWs, it breaks any sharing that has previously occurred.
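Roughly, the existing pattern for such operations looks like this (a sketch; truncate_pagecache_range() is real, but the extent-sharing step is a stand-in for per-filesystem code):

    /* Sketch of the pattern used when a shared extent changes file offset
     * (e.g. fallocate() collapse/insert range); details vary by filesystem. */
    static int move_shared_range(struct inode *inode, loff_t start, loff_t end)
    {
            int ret;

            /* 1. Throw away any cached pages over the affected range. */
            truncate_pagecache_range(inode, start, end);

            /* 2. Break the extent-level sharing in the filesystem metadata
             *    (hypothetical helper standing in for fs-specific code). */
            ret = fs_break_extent_sharing(inode, start, end);

            /* 3. The next data access repopulates the cache with unshared pages. */
            return ret;
    }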

Now, of course, if you are doing de-duplication then that's a different matter, but that's easily handled at lookup time - if the inode file offset is different between two shared data extents, then simply treat them as "not sharable". Hence we can simply ignore that case altogether.

As for inode->i_mapping/page->mapping/file->f_mapping, all I figured was that we use a shadow mechanism. page->mapping/file->f_mapping always points to the inode->i_mapping, but we add a second layer to it. The inode->i_mapping points to all its private pages, and contains a pointer to a mapping containing shared pages. That "shared page" mapping would be the "new buffer cache" address space inode->i_mapping. We always look up private pages first, and only if nothing is found do we do a shared mapping page lookup.

Pages in the shared mapping are always read-only. Hence when we get a write fault on a shared page, we COW the page and insert the new page into the inode->mapping private space, then call ->fault and the filesystem does a COW operation to break the extent sharing. Lookups on that file offset will now find the private writable page rather than the shared page. Lookups on other inodes that share the original page won't even know that this happened.
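A loose sketch of the two-level lookup and of the write-fault behaviour being described (find_get_page() is real; everything else here is hypothetical):

    /* Look in the inode's private mapping first, then fall back to the
     * shared ("buffer cache") mapping, whose pages are always read-only. */
    static struct page *lookup_cached_page(struct address_space *private,
                                           struct address_space *shared,
                                           pgoff_t index)
    {
            struct page *page = find_get_page(private, index);

            if (!page && shared)
                    page = find_get_page(shared, index);
            return page;
    }

    /*
     * Write fault on a shared page: copy the data into a new page inserted
     * into the private mapping, then have the filesystem break the on-disk
     * extent sharing (COW). Other inodes sharing the original page never
     * notice; their lookups still find the read-only shared copy.
     */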

Also, from the mmap side of things, we get told by the mm/ subsystem what the "private" inode is (via file_inode(vmf->vma->vm_file), which is set to the "private" inode when the region is mmap()d), so the page fault paths should never need to care that page->mapping->host points at a shared address space. And for things like invalidation, we already have hooks into the filesystems to handle internal references to pages (e.g. buffer heads), so the majority of mm/ level operations really won't need to care what we do with the mappings.

Sure, there are some dragons to be defeated in all this, but when I looked at it a year ago there weren't any obvious showstopper issues....

Sharing pages between mappings

Posted Mar 28, 2017 4:26 UTC (Tue) by roc (subscriber, #30627) (1 response)

BTRFS_IOC_CLONE_RANGE lets you share pages at different file offsets.

Sharing pages between mappings

Posted Mar 28, 2017 21:57 UTC (Tue) by dgc (subscriber, #6611)

BTRFS_IOC_CLONE_RANGE to different offsets (a.k.a. FICLONERANGE at the VFS level) is exactly the same case as extents shared by FIDEDUPERANGE - page->index does not match for the two files, so don't share cached pages.
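For reference, cloning to a different offset is requested with an ioctl() along these lines (userspace sketch; error handling and alignment checks trimmed):

    #include <sys/ioctl.h>
    #include <linux/fs.h>       /* FICLONERANGE, struct file_clone_range */

    /* Clone 1 MiB from offset 0 of src_fd to offset 4 MiB of dst_fd. Because
     * the offsets differ, page->index can never line up between the two files,
     * so the pages would not be shareable under the scheme discussed above. */
    static int clone_to_other_offset(int src_fd, int dst_fd)
    {
            struct file_clone_range fcr = {
                    .src_fd      = src_fd,
                    .src_offset  = 0,
                    .src_length  = 1024 * 1024,
                    .dest_offset = 4 * 1024 * 1024,
            };

            return ioctl(dst_fd, FICLONERANGE, &fcr);
    }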

These operations are, however, quite rare and are not used for setting up exact directory tree clones for containers and so aren't really a significant optimisation target that people are concerned about....

On drugs, or confused reporting?

Posted Mar 27, 2017 5:09 UTC (Mon) by ncm (guest, #165) (3 responses)

A hundred processes routinely share each of hundreds of pages of a file, and hundreds of the next file, mapped at random starting addresses. That is what .so files are for.

The reporting seems to refer to something else entirely, but it is hard to see any difference. What am I missing? Is everyone in kernel-land talking about the same thing, or are they as confused as we are?

On drugs, or confused reporting?

Posted Mar 27, 2017 5:29 UTC (Mon) by dgc (subscriber, #6611) (2 responses)

Containers often don't share library files, so they don't share file data mappings in the traditional way; e.g. two containers using snapshot-based filesystem images. The read-only files in each container image may share the same disk blocks, but to the user/VFS/mm subsystems they are different files because they live in different paths, in different mount namespaces, accessed through different superblocks. Only the layer that is aware of the snapshot knows that the data extents the two inodes point to are actually the same blocks on disk and so share data....

Same thing goes for sparse directory tree clones generated with reflink - the VFS/OS thinks they are separate files and only the underlying filesystem knows that those cloned files share their data blocks with the other files.....

On drugs, or confused reporting?

Posted Mar 27, 2017 16:33 UTC (Mon) by ms-tg (subscriber, #89231) (1 response)

Thanks for an informative and concise answer

On drugs, or confused reporting?

Posted Apr 3, 2017 3:49 UTC (Mon) by kingdon (guest, #4526)

Here's some of the background on union mounts: https://lwn.net/Articles/396020/

Docker (and similar systems) uses a union mount so that each container does not need its own copy of all the files in the operating system.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds