The ongoing trouble with get_user_pages()
The problem
The get_user_pages() API comes in a number of variants; this API family is often referred to as "GUP". Its purpose is to provide the kernel with direct access to user-space memory; that involves ensuring that the relevant pages are resident in RAM and cannot be evicted for as long as that access is needed. The root of the problem with get_user_pages() is that it creates a situation where there are two separate paths to the memory in question.
User-space processes access their memory by way of virtual addresses mapped in their page tables. Those addresses are only valid within the process owning the memory. The page tables provide a convenient handle when the kernel needs to control access to specific ranges of user-space memory for a while. A common example is writing dirty file pages back to persistent storage. A filesystem will mark those pages (in the relevant page table) as read-only, preventing further modification while the writeback is underway. If the owning process attempts to write to those pages, it will be blocked until writeback completes; thereafter, the read-only protection will cause a page fault, allowing the filesystem to be notified that the page has been dirtied again.
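The write-protect-and-fault pattern described above has a rough user-space analogue, which may make the mechanism easier to picture. The sketch below is not kernel code: it assumes Linux, uses an anonymous page in place of a file page, and lets a SIGSEGV handler stand in for the kernel's page-fault path. The names count_write_faults(), on_write_fault(), and the tracked_* variables are invented for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *tracked_page;                 /* the write-protected page */
static size_t tracked_len;
static volatile sig_atomic_t dirty_faults; /* "page was dirtied" notifications */

/* Stand-in for the kernel's fault path: note the write, then allow it. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
    char *addr = info->si_addr;

    (void)sig;
    (void)ctx;
    if (addr < tracked_page || addr >= tracked_page + tracked_len)
        _exit(1);                          /* unrelated fault: give up */
    dirty_faults++;
    /* mprotect() is not formally async-signal-safe; acceptable for a demo. */
    if (mprotect(tracked_page, tracked_len, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Returns the number of write faults seen for two writes, or -1 on error. */
int count_write_faults(void)
{
    struct sigaction sa, old;
    volatile char *p;
    int faults;

    tracked_len = (size_t)sysconf(_SC_PAGESIZE);
    tracked_page = mmap(NULL, tracked_len, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tracked_page == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_write_fault;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(tracked_page, tracked_len);
        return -1;
    }

    dirty_faults = 0;
    p = tracked_page;
    p[0] = 'x';   /* read-only page: faults; handler records it and unprotects */
    p[1] = 'y';   /* page is now writable: no further notification */
    assert(p[0] == 'x' && p[1] == 'y');    /* both writes landed */

    faults = dirty_faults;
    sigaction(SIGSEGV, &old, NULL);        /* restore the previous handler */
    munmap(tracked_page, tracked_len);
    return faults;
}
```

Only the first write to the protected page produces a notification; once the protection is lifted, later writes go unobserved until the page is protected again. That is the property a filesystem relies on when it write-protects pages during writeback.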
A call to get_user_pages() will return pointers to the kernel's page structures representing the physical pages holding the user-space memory, which can be used to obtain kernel-virtual addresses for those pages. Those addresses are in the kernel's address space, usually in the kernel's direct map that covers all of physical memory (on 64-bit systems). They are not the same as the user-space addresses, and are not subject to the same control. Direct-mapped memory that does not hold executable text is (almost) always writable by the kernel.
User space can use mmap() to map a file into its address space, creating a range of file-backed pages. Those pages will be initially marked read-only, even if mapped for write access, so that the filesystem can be notified when one of them is changed. If the kernel uses get_user_pages() to obtain write access to file-backed pages, the underlying filesystem will be duly notified that the pages have been made writable. At some future time, that filesystem will write the pages back to persistent storage, making them read-only. That protection change, though, applies only to the user-space addresses for those pages. The mapping in the kernel's address space remains writable.
That is where the problem arises: if the kernel writes to its mapping after the filesystem has written the pages back, the filesystem will be unaware that those pages have changed. Kernel code can mark the pages dirty, possibly leading to filesystem-level surprises when a seemingly read-only page has been written to. There are also a few scenarios in which the pages may never get marked as dirty, despite having been written to, in which case the written data may never find its way out to disk. Either way, the consequences are unfortunate.
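The "two paths to the same memory" situation can be imitated, loosely, from user space by mapping one file page twice. This is only an analogy under stated assumptions: it assumes Linux and a writable /tmp, the function name two_mappings_demo() and the scratch-file name are invented here, and both mappings are ordinary user-space mappings, so the kernel still sees the write through the second one and no data is lost. The point is narrower: changing the protection on one mapping does nothing to the other.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one file page twice, write-protect the first mapping, then write
 * through the second. Copies what the first mapping sees into out.
 * Returns 0 on success, -1 on error.
 */
int two_mappings_demo(char *out, size_t outlen)
{
    char path[] = "/tmp/gup-demo-XXXXXX";   /* hypothetical scratch file */
    long page = sysconf(_SC_PAGESIZE);
    char *a = MAP_FAILED, *b = MAP_FAILED;
    int fd, ret = -1;

    assert(out != NULL && outlen > 0);
    if (page <= 0)
        return -1;
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                 /* file persists until fd and mappings go away */
    if (ftruncate(fd, page) != 0)
        goto out_close;

    a = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED)
        goto out_unmap;

    strcpy(a, "before");
    if (mprotect(a, page, PROT_READ) != 0)  /* protect path A only */
        goto out_unmap;
    strcpy(b, "after");           /* path B is unaffected and still writable */

    snprintf(out, outlen, "%s", a);         /* path A observes the change */
    ret = 0;
out_unmap:
    if (a != MAP_FAILED)
        munmap(a, page);
    if (b != MAP_FAILED)
        munmap(b, page);
out_close:
    close(fd);
    return ret;
}
```

Called with a small buffer, the function reports that the read-only mapping sees the data written through the other one. In the get_user_pages() case the second path is the kernel's own mapping, which bypasses the page-table protection the filesystem depends on.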
This problem has been the subject of a long series of LSFMM discussions and an equally interminable set of LWN articles, but it is not an easy one to solve. There are times when the kernel simply needs access to user-space memory, often for performance purposes. A frequently repeated example is using RDMA to read data directly into file-backed pages. Allowing a DMA-capable device to write data directly into a user-space page requires pinning that page, perhaps for a long time. Finding a reliable way to enable this kind of back channel into user space has proved difficult.
A partial solution?
In late April, Lorenzo Stoakes decided to face part of the problem head-on, posting a patch that would simply disallow get_user_pages() calls that request write access to file-backed pages. Recognizing, though, that there are some cases that require exactly this kind of mapping, he also included a new flag, FOLL_ALLOW_BROKEN_FILE_MAPPING, to override the prohibition; some InfiniBand controllers were updated to use that flag. Making this change, Stoakes said, "is an important step towards a more reliable GUP, and explicitly indicates which callers might encounter issues moving forward".
Over the following week or so, the series went through several revisions. The most significant, perhaps, was to drop the FOLL_ALLOW_BROKEN_FILE_MAPPING flag and, instead, only prohibit get_user_pages() calls that provide the FOLL_LONGTERM flag (and which request write access to file-backed pages), indicating that the mapping is likely to persist for a long time. Shorter-term mappings are not immune to the problem but, by virtue of being short-lived, they are much less likely to trigger it. This change was an acknowledgment that it is still not possible to fully solve — or even block — the problem.
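The difference between the first posting and the later revisions comes down to which combination of flags is refused. The following is a simplified model of that decision as this article describes it, not the code from the patch series, which may apply further conditions. FOLL_WRITE is the kernel's flag for requesting write access; the bit values below are made up, and the function names gup_allowed_first() and gup_allowed_revised() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real definitions differ. */
#define FOLL_WRITE                      (1u << 0)
#define FOLL_LONGTERM                   (1u << 1)
#define FOLL_ALLOW_BROKEN_FILE_MAPPING  (1u << 2)

static_assert((FOLL_WRITE & FOLL_LONGTERM) == 0, "flag bits must be distinct");

/*
 * First posting: refuse any write access to file-backed pages unless
 * the caller explicitly opts in with the override flag.
 */
bool gup_allowed_first(unsigned int flags, bool file_backed)
{
    if (!file_backed || !(flags & FOLL_WRITE))
        return true;
    return (flags & FOLL_ALLOW_BROKEN_FILE_MAPPING) != 0;
}

/*
 * Later revisions: no override flag; refuse only requests that are
 * both long-term and writable on file-backed pages.
 */
bool gup_allowed_revised(unsigned int flags, bool file_backed)
{
    const unsigned int risky = FOLL_WRITE | FOLL_LONGTERM;

    if (!file_backed)
        return true;
    return (flags & risky) != risky;
}
```

Under this model, a short-term writable request on a file-backed page is refused by the first version but accepted by the revised one, which is exactly the gap the article notes: short-lived mappings remain exposed, just less likely to hit the problem.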
This proposal has provoked a fair amount of discussion. Christoph Hellwig worried that it would break users who are using direct I/O to write into file-backed mappings, but Jason Gunthorpe questioned whether any such users exist, saying that people who tried it "didn't get very far before testing uncovered data corruption and kernel crashes". David Hildenbrand, instead, suggested that some virtualization setups could be broken by the change; once again, Gunthorpe doubted that any such use cases could be working successfully now.
Hildenbrand had some other concerns about the patch, including the fact that it does not solve the full problem: "If we want to sell this as a security thing, we have to block it *completely* and then CC stable. Everything else sounds like band-aids to me." He complained that it does not address the "GUP-fast" subset of get_user_pages() APIs, an omission that Stoakes later fixed. He suggested that bringing the topic to this year's LSFMM+BPF conference (which starts on May 8) would be a logical next step.
Ted Ts'o described an ext4 bug that had resulted from this problem; the filesystem was not prepared for pages to be marked dirty at unexpected times and could be made to crash. A fix was merged into 5.18 to prevent the crash but, Ts'o said, that might not have been the right thing to do, since it "has apparently removed some of the motivation of really fixing the problem instead of papering over it". He stated that, in the view of the filesystem developers, writing to file-backed pages via get_user_pages() is a bug and "you get to keep both pieces".
Gunthorpe took Ts'o's words as yet another reason to block write access to file-backed pages:
This alone is enough reason to block it. I'm tired of this round and round and I think we should just say enough, the mm will work to enforce this view point. Files can only be written through PTEs. If this upsets people they can work on fixing it, but at least we don't have these kernel problems and inconsistencies to deal with.
There is still not complete agreement, though, that even the partial block that is on the table should be merged. The worries that it could end up breaking user-space applications, or that merging the relatively easy fix could delay the implementation of a complete solution, are not going to just vanish. So it seems that yet another LSFMM+BPF discussion is inevitable; indeed, Stoakes seems to be looking forward to it: "I think discussion at LSF/MM is also a sensible idea, also you know, if beers were bought too it could all work out nicely :]". So this long-term discussion is, it seems, not over yet.