Page pinning and filesystems
File-backed pages, naturally, have a filesystem behind them that manages the movement of data to and from persistent storage. The root of the problem with page pinning is that the kernel uses it to operate on the contents of pages outside of the filesystem's purview; that can lead to unpleasant surprises when those contents change at times when the filesystem is not expecting it. If filesystems were aware of pinned pages then they could at least attempt to take evasive action, but pinning is generally invisible to filesystems.
The approach that has been taken is to try to make pinning explicit and
visible; to that end, the new pin_user_pages()
API was added. The effect is about the same as with
get_user_pages(), but this interface attempts to mark pages to
show that they have been pinned.
There is still one little problem, though: there are no
spare bits in struct page to track the number of times a given
page has been pinned, so the developers had to hack things by using
a bias value (1024) in the page reference count. As a result, if a
page has at least that many references, it will appear to be pinned even if
it is not. For that reason, the function to query whether a page is
pinned is called page_maybe_dma_pinned().
Hubbard complained about the "maybe" in the name; it seems inadequate, but
it is the best that the development community has been able to achieve so
far.
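The ambiguity behind that "maybe" can be seen in a small userspace model. This is only a sketch: `struct model_page` and the `model_*` functions are hypothetical names invented for illustration, not kernel interfaces, and the only detail taken from the discussion is the bias of 1024 added to a shared reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; the kernel's struct page is far more involved. */
#define MODEL_PIN_BIAS 1024	/* the bias value mentioned in the session */

struct model_page {
	int refcount;	/* one counter shared by ordinary references and pins */
};

/* An ordinary reference, such as one more mapping of the page. */
void model_get_page(struct model_page *page)
{
	assert(page != NULL);
	page->refcount += 1;
}

/* A pin is recorded by adding the whole bias to the same counter. */
void model_pin_page(struct model_page *page)
{
	assert(page != NULL);
	page->refcount += MODEL_PIN_BIAS;
}

/* Dropping a pin subtracts the bias again. */
void model_unpin_page(struct model_page *page)
{
	assert(page != NULL && page->refcount >= MODEL_PIN_BIAS);
	page->refcount -= MODEL_PIN_BIAS;
}

/*
 * The query can only compare the counter against the bias, so it cannot
 * tell one pin apart from 1024 ordinary references: hence "maybe".
 */
bool model_page_maybe_pinned(const struct model_page *page)
{
	return page->refcount >= MODEL_PIN_BIAS;
}
```

In this model, a page that has been pinned once reports as pinned, but so does a page that has never been pinned and has simply accumulated 1024 ordinary references.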
Matthew Wilcox said that the folio work might be able to provide a few extra bits for a pin counter, but probably not enough of them. There was some talk of putting this count into a side structure, moving it out of struct page entirely. David Howells noted that the sorts of accesses that pin pages (DMA and direct I/O, primarily) are not that common, so the side-structure idea might be the best approach. David Hildenbrand wasn't sure of the scope of the problem; since the bias is 1024, it takes a lot of references for a page to appear to be pinned. Wilcox pointed out that frequently mapped pages, such as those containing the C library, will have high reference counts.
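As a rough illustration of what a side structure could look like, the sketch below keeps exact pin counts in a small table keyed by page frame number, separate from any per-page reference count. Everything here is an assumption made for the example: the session did not settle on a design, and `struct pin_table`, `side_pin()`, `side_unpin()` and `side_pin_count()` are hypothetical names, not kernel APIs. A real implementation would also need locking and a structure that scales.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIDE_SLOTS 64	/* tiny fixed-size table; purely illustrative */

struct pin_entry {
	uint64_t pfn;	/* page frame number used as the key */
	unsigned pins;	/* exact pin count; 0 means the slot is free */
};

struct pin_table {
	struct pin_entry slots[SIDE_SLOTS];
};

/*
 * Return the slot tracking pfn if there is one, otherwise the first free
 * slot found while probing, or NULL if the table is full.
 */
struct pin_entry *pin_table_slot(struct pin_table *t, uint64_t pfn)
{
	struct pin_entry *free_slot = NULL;

	for (size_t i = 0; i < SIDE_SLOTS; i++) {
		struct pin_entry *e = &t->slots[(pfn + i) % SIDE_SLOTS];

		if (e->pins > 0 && e->pfn == pfn)
			return e;
		if (e->pins == 0 && free_slot == NULL)
			free_slot = e;
	}
	return free_slot;
}

/* Record one more pin on pfn; returns 0 on success, -1 if the table is full. */
int side_pin(struct pin_table *t, uint64_t pfn)
{
	struct pin_entry *e = pin_table_slot(t, pfn);

	if (e == NULL)
		return -1;
	e->pfn = pfn;
	e->pins++;
	return 0;
}

/* Drop one pin on pfn; the page must currently be pinned. */
void side_unpin(struct pin_table *t, uint64_t pfn)
{
	struct pin_entry *e = pin_table_slot(t, pfn);

	assert(e != NULL && e->pins > 0 && e->pfn == pfn);
	e->pins--;
}

/* Exact number of pins on pfn; unpinned pages cost no table space. */
unsigned side_pin_count(struct pin_table *t, uint64_t pfn)
{
	struct pin_entry *e = pin_table_slot(t, pfn);

	return (e != NULL && e->pins > 0 && e->pfn == pfn) ? e->pins : 0;
}
```

The point of the sketch is the trade-off raised in the discussion: because pins are comparatively rare, only pinned pages occupy space in the table, and the answer to "is this page pinned?" becomes exact instead of a guess based on a high reference count.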
Hubbard returned to the status of the pinning work, saying that developers need to think about why they need access to the pages in question. If they are doing something that will touch a page's contents, then pin_user_pages() should be used. In other cases, often where the intent is to make changes to the underlying page structures, get_user_pages() is the right function to call. The process of converting filesystems to deal with pinning is ongoing; it is not a small job. There are also a lot of cases where kernel code uses set_page_dirty() to mark a page as having been modified, then unpins the page. He made a helper for that case, but it feels wrong to him; each call site is a place where pages are marked dirty outside of the filesystem that is responsible for them, a filesystem that is still not actively involved in (or aware of) the operation. At least the evil is concentrated now, he said.
At this point, Hubbard noted that he had 12 minutes left in his session to come up with a finished design that fully solves the pinning problem. Even after everything has been converted to the new pinning API, he said, the problems that drove all this work in the first place will remain unfixed. His suggestion was to add an API allowing kernel code to take a lease on a range within a file, and to require leases to be taken before pinning file pages. There is a significant advantage to this approach: it is a correct solution that would connect the filesystem and memory-management code. There has to be some way of communicating information about changes to file-backed pages to filesystems, and this is the only proposal he has seen so far.
Ira Weiny objected that leases are hard to get right; there are lots of roadblocks that come up. He agreed, though, that some way of making filesystems aware of pins is needed. But filesystems, he said, don't like letting go of pages they are responsible for. Ext4 maintainer Ted Ts'o answered, though, that the prospect doesn't worry him. A lease (or equivalent mechanism) would be telling the filesystem that the pages in the affected range may be marked dirty at some point in the future; that implies that if the indicated pages don't have blocks allocated in the filesystem, that must be rectified immediately. A copy-on-write filesystem might have to copy the whole range, even if the pages are never dirtied in the end. If the pages are dirtied, all of this is fine, and perhaps even better, since allocating for the whole range at once might lead to better layout.
Howells said that he did an implementation of file leases for network filesystems, but he ended up abandoning it. There were just too many problems with truncate(), and with direct I/O; it is hard to get it right. Hubbard suggested just ignoring truncate(), to general laughter.
Chris Mason said that, for Btrfs, marking the page dirty is not the only problem. Btrfs has to lock pages before doing I/O so that their contents can't change; among other things, a checksum of the page's contents must be taken, and that checksum must continue to match those contents. Josef Bacik said, in his classic way, that the situation was bad in general, and that this problem is the biggest barrier to the sharing of pages in the page cache. The memory-management subsystem going behind a filesystem's back is a huge problem, he said, that has to be fixed. He is not thrilled with the lease idea, though; it would conflict with the way that Btrfs manages ranges of dirty pages.
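The checksum concern can be shown with a toy example. This is a sketch under stated assumptions: the checksum below is a plain 32-bit FNV-1a hash used as a stand-in, not the algorithm Btrfs actually uses, and `model_checksum()` and `model_verify()` are hypothetical helpers written for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096

/* 32-bit FNV-1a over a buffer; a stand-in for a real filesystem checksum. */
uint32_t model_checksum(const unsigned char *data, size_t len)
{
	uint32_t hash = 2166136261u;	/* FNV offset basis */

	assert(data != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		hash ^= data[i];
		hash *= 16777619u;	/* FNV prime */
	}
	return hash;
}

/* Check a page's current contents against a previously stored checksum. */
bool model_verify(const unsigned char *page, uint32_t stored)
{
	return model_checksum(page, MODEL_PAGE_SIZE) == stored;
}
```

If the filesystem computes the checksum and the holder of a pin then changes even one byte before the write reaches storage, the data on disk no longer matches the stored checksum, and a later `model_verify()` of that data fails. That is why the contents must be held stable for the duration of the I/O.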
Kent Overstreet answered, though, that user space has always been able to modify pages at inopportune times. Direct I/O, for example, bypasses the page cache, and a buffer used for direct I/O can itself be a mapping of a file. This can actually lead to deadlocks in some situations, and could be an attack vector. Bacik said that Btrfs has a special path for just this case.
Hubbard, having not really gotten the 12-minute design he was after, closed
the session by noting that fixing these problems may cause performance
regressions. There may be objections but, he said, the higher performance
was an "illusion" and the system was not correct. There was some brief
discussion of ways to mitigate some of the resulting performance loss, but
developers may still find themselves having to explain to users why their
I/O should not have been as fast as it was before the system was made to
work correctly.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/get_user_pages() |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2022 |
