|
|
Log in / Subscribe / Register

Page pinning and filesystems

By Jonathan Corbet
May 10, 2022

LSFMM
It would have been surprising indeed if the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM) did not include a session working toward solutions to the longstanding problems with get_user_pages(), an internal function that locks user-space pages in memory for access by the kernel. The issue has, after all, come up numerous times over the years. This year's event duly contained a session in the joint filesystem and memory-management track, led by John Hubbard, with a focus on page pinning and how it interacts with filesystems.

File-backed pages, naturally, have a filesystem behind them that manages the movement of data to and from persistent storage. The root of the problem with page pinning is that the kernel uses it to operate on the contents of pages outside of the filesystem's purview; that can lead to unpleasant surprises when those contents change at times when the filesystem is not expecting it. If filesystems were aware of pinned pages then they could at least attempt to take evasive action, but pinning is generally invisible to filesystems.

The approach that has been taken is to try to make pinning explicit and visible; to that end, the new pin_user_pages() API was added. The effect is about the same as with get_user_pages(), but this interface attempts to mark pages to show that they have been pinned. There is still one little problem, though: there are no spare bits in struct page to track the number of times a given page has been pinned, so the developers had to hack things by using a bias value (1024) in the page reference count. As a result, if a page has at least that many references, it will appear to be pinned even if [John Hubbard] it is not. For that reason, the function to query whether a page is pinned is called page_maybe_dma_pinned(). Hubbard complained about the "maybe" in the name; it seems inadequate, but it is the best that the development community has been able to achieve so far.

Matthew Wilcox said that the folio work might be able to provide a few extra bits for a pin counter, but it is probably not enough. There was some talk of putting this count into a side structure, moving it out of struct page entirely. David Howells noted that the sorts of accesses that pin pages (DMA and direct I/O, primarily) are not that common, so the side-structure idea might be the best approach. David Hildenbrand wasn't sure of the scope of the problem, since the bias is 1024, it takes a lot of references for a page to appear to be pinned. Wilcox pointed out that frequently mapped pages, such as those containing the C library, will have high reference counts.

Hubbard returned to the status of the pinning work, saying that developers need to think about why they need access to the pages in question. If they are doing something that will touch a page's contents, then pin_user_pages() should be used. In other cases, often where the intent is to make changes to the underlying page structures, get_user_pages() is the right function to call. The process of converting filesystems to deal with pinning is ongoing; it is not a small job. There are also a lot of cases where kernel code uses set_page_dirty() to mark a page as having been modified, then unpins the page. He made a helper for that case, but it feels wrong to him; each one is a place where pages are being marked "dirty" outside of the filesystem that is responsible for them, but which is still not actively involved in (or aware of) the operation. At least the evil is concentrated now, he said.

At this point, Hubbard noted that he had 12 minutes left in his session to come up with a finished design that fully solves the pinning problem. Even after everything has been converted to the new pinning API, he said, the problems that drove all this work in the first place will remain unfixed. His suggestion was to add an API allowing kernel code to take a lease on a range within a file, and to require leases to be taken before pinning file pages. There is a significant advantage to this approach: it is a correct solution that would connect the filesystem and memory-management code. There has to be some way of communicating information about changes to file-backed code to filesystems, and this is the only proposal he has seen so far.

Ira Weiny objected that leases are hard to get right; there are lots of roadblocks that come up. He agreed, though, that some way of making filesystems aware of pins is needed. But filesystems, he said, don't like letting go of pages they are responsible for. Ext4 maintainer Ted Ts'o answered, though, that the prospect doesn't worry him. A lease (or equivalent mechanism) would be telling the filesystem that the pages in the affected range may be marked dirty at some point in the future; that implies that if the indicated pages don't have blocks allocated in the filesystem, that must be rectified immediately. A copy-on-write filesystem might have to copy the whole range, even if the pages are never dirtied in the end. If the pages are dirtied all of this is fine, and perhaps even better, since allocating for the whole range at once might lead to better layout.

Howells said that he did an implementation of file leases for network filesystems, but he ended up abandoning it. There were just too many problems with truncate(), and with direct I/O; it is hard to get it right. Hubbard suggested just ignoring truncate(), to general laughter.

Chris Mason said that, for Btrfs, marking the page dirty is not the only problem. Btrfs has to lock pages before doing I/O so that their contents can't change; among other things, a checksum of the page's contents must be taken, and that checksum must continue to match those contents. Josef Bacik said, in his classic way, that the situation was bad in general, and that this problem is the biggest barrier to the sharing of pages in the page cache. The memory-management subsystem going behind a filesystem's back is a huge problem, he said, that has to be fixed. He is not thrilled with the lease idea, though; it would conflict with the way that Btrfs manages ranges of dirty pages.

Kent Overstreet answered, though, that user space has always been able to modify pages at inopportune times. Direct I/O, for example, bypasses the page cache, and a buffer used for direct I/O can be mapped into a file as well. This can actually lead to deadlocks in some situations, and could be an attack vector. Bacik said that Btrfs has a special path for just this case.

Hubbard, having not really gotten the 12-minute design he was after, closed the session by noting that fixing these problems may cause performance regressions. There may be objections but, he said, the higher performance was an "illusion" and the system was not correct. There was some brief discussion of ways to mitigate some of any future performance loss, but developers may still find themselves having to explain to users why their I/O should not have been as fast as it was before the system was made to work correctly.

Index entries for this article
KernelMemory management/get_user_pages()
ConferenceStorage, Filesystem, Memory-Management and BPF Summit/2022


to post comments

Page pinning and filesystems

Posted May 11, 2022 0:22 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> Wilcox pointed out that frequently mapped pages, such as those containing the C library, will have high reference counts.

What is the downside of that? It's not like C library can be swapped out or reclaimed meaningfully.

Page pinning and filesystems

Posted May 11, 2022 3:25 UTC (Wed) by nickodell (subscriber, #125165) [Link] (1 responses)

Page pinning means that a page cannot be unloaded, but it also means that the page cannot be moved to a different location in physical memory. The kernel might want to move a page if it's trying to allocate a physically contiguous region of memory, for example for hugepages. If there are false positives (pages that appear to be pinned but aren't) then the kernel will have a hard time allocating this memory.

Page pinning and filesystems

Posted May 11, 2022 16:37 UTC (Wed) by awww (guest, #122021) [Link]

How well does moving a page with 1024 mappings work? That's a lot of tlb flushes and mmap locks to take...


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds