get_user_pages(), pinned pages, and DAX

By Jonathan Corbet
May 7, 2019

LSFMM
The problems associated with the kernel's internal get_user_pages() function have been a topic of discussion at the Linux Storage, Filesystem, and Memory-Management Summit for a few years. At the 2019 event, Jan Kara began a plenary session by saying that it would be "like last year's session". It turned out rather differently, though, perhaps due to the plenary setting; this discussion (along with the related session that followed) turned out to be one of the most heated at the entire conference.

Pinned pages

Kara said that he didn't really want to waste a lot of time explaining the problem yet again, so he went over the background in a hurry; readers who are not familiar with the problem can learn more in this article. In the end, it comes down to filesystems that are unprepared for pages of data to be modified behind their back, leading to problems like data corruption and kernel crashes.

He has developed a list of frequently asked questions over the years of working on this problem. Could get_user_pages() simply be disabled for file-backed pages? No, because real applications use it. They map shared buffers for direct I/O, for example; these applications were written before the addition of the splice() system call, but they are still being used. Could the PageLocked flag be used to lock the pages entirely? That would lead to pages being locked for long periods of time, and could deadlock the system in a number of scenarios. How about using MMU notifiers to get callers to drop their references to pages on demand? "Good luck converting over 300 call sites" to the new scheme, Kara said; that would also increase the overhead for short-lived mappings that are used for tasks like direct I/O.

So something else must be done. The plan, Kara said, is to create a new type of page reference for get_user_pages() called a "pin". It would be obtained by calling get_pin(), which would increase the target page's reference count by a large bias value. Then, any page with a reference count greater than the bias value will be known to have pin references to it. There are some details to deal with, including the possible overflow of the reference count, though it shouldn't be a problem for real use cases. A large number of references could occasionally cause false positives, but that would not create problems either.
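The biased-count idea can be illustrated with a toy user-space model. This is only a sketch of the scheme as described in the session, not kernel code: the `Page` class, its method names other than `get_pin()`, and the bias of 1024 are assumptions made for illustration (the talk only says the bias would be "large").

```python
from dataclasses import dataclass

# Hypothetical bias; the article only says "a large bias value".
PIN_BIAS = 1024


@dataclass
class Page:
    """Toy stand-in for a page with a single shared reference count."""
    refcount: int = 1  # one ordinary reference held by the owner

    def get_ref(self) -> None:
        # Ordinary reference: bump the count by one.
        self.refcount += 1

    def put_ref(self) -> None:
        self.refcount -= 1

    def get_pin(self) -> None:
        # Pin reference: bump the count by the whole bias.
        self.refcount += PIN_BIAS

    def put_pin(self) -> None:
        self.refcount -= PIN_BIAS

    def maybe_pinned(self) -> bool:
        # Heuristic from the talk: a count above the bias implies a pin.
        return self.refcount > PIN_BIAS


def demo() -> dict:
    pinned = Page()
    pinned.get_pin()

    busy = Page()
    for _ in range(PIN_BIAS):  # many ordinary references, but no pin
        busy.get_ref()

    return {
        "fresh": Page().maybe_pinned(),
        "pinned": pinned.maybe_pinned(),
        "false_positive": busy.maybe_pinned(),
    }
```

In this model a page holding many ordinary references is reported as pinned even though nobody pinned it, which is the false-positive case mentioned above; the check can never miss a real pin.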

Kirill Shutemov suggested just making the PagePinned flag reliable, but Kara responded that doing so would require adding another reference count for each page. Space is not readily available for such a count. He had actually looked at schemes involving taking pinned pages out of the least-recently-used (LRU) lists, at which point the list pointers could be used, but it was not an easy task and conflicted with what the heterogeneous memory management code is doing.

The next step is to convert get_user_pages() users to release their pages with put_user_page(), which has already been merged into the mainline. There are a lot of call sites, so it will take a while to get this job done.

Christoph Hellwig jumped in to say that there are other problems with get_user_pages(), and that this might be a good time to do a general spring cleaning. Many of the users, he said, are "bogus"; many of the callers never actually look at the pages they have mapped. For the cases where a scatter/gather list is needed for DMA I/O, a helper could be provided. But Jérôme Glisse said that it would not, in the end, be possible to remove that many callers.

Once the system can tell which pages are pinned, it's just a matter of deciding what to do with that information. Kara talked mostly about the writeback code; in some cases, it will simply skip pages that are pinned. But there are cases where those pages must be written out — "somebody has called fsync(), and they expect something to be saved". In this case, pinned pages will be written, but they will not be marked clean at the end of the operation; they will still be write-protected in the page tables while writeback is underway, though. In situations where stable pages are required (when a checksum of the data must be written with the data, for example), bounce buffers will be used. There are over 40 places to fix up, not all of which are trivial.
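The writeback policy Kara outlined can be summarized as a small decision function. The sketch below is a simplification under stated assumptions: the names are hypothetical, it treats every non-integrity writeback of a pinned page as skippable (the talk said only "in some cases"), and it assumes the page also stays dirty when a bounce buffer is used.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip the page"
    WRITE_AND_CLEAN = "write in place, mark clean"
    WRITE_KEEP_DIRTY = "write in place, leave dirty"
    BOUNCE_KEEP_DIRTY = "write a bounce-buffer copy, leave dirty"


def writeback_action(pinned: bool, must_write: bool, stable_pages: bool) -> Action:
    """Pick a writeback action for one dirty page.

    must_write   -- data-integrity writeback, e.g. triggered by fsync()
    stable_pages -- the data must not change during I/O (e.g. checksums)
    """
    if not pinned:
        # Ordinary page: normal writeback path.
        return Action.WRITE_AND_CLEAN
    if not must_write:
        # Pinned page, nobody is waiting for the data: leave it alone.
        return Action.SKIP
    if stable_pages:
        # The pin holder may still modify the page, so write a copy.
        return Action.BOUNCE_KEEP_DIRTY
    # Pinned and fsync()-style writeback: write it, but never mark it clean.
    return Action.WRITE_KEEP_DIRTY
```

For example, `writeback_action(pinned=True, must_write=True, stable_pages=False)` yields `Action.WRITE_KEEP_DIRTY`, the fsync() case described above.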

DAX and long-term pins

At this point, Ira Weiny took over the leadership of the discussion to talk about the problems related to long-lasting page pins and persistent memory in particular. The DAX interface allows user space to operate directly on the memory in which files are stored (when the filesystem is on a persistent-memory device), which brings some interesting challenges. Writeback is not normally a problem with DAX, he said, but other operations, such as truncation and hole-punching, are. When pages are removed from a file, they are normally taken out of the page cache as well, but with DAX those pages were never in the page cache to begin with. Instead, users are dealing directly with the storage media.

As a result, a truncate operation may have removed pages from a file while some of the pages in that file are pinned in memory. In this case, the filesystem has to leave the pages in place on the device. When pages are pinned for a long time (as RDMA will do, for example), those pages can be left there indefinitely.

Weiny said that what is needed to solve this problem is some sort of a lease mechanism so that pinned pages can be unpinned on demand. He has a prototype implementation working now; it implements leases with get_user_pages(), with individual call sites being able to indicate whether they support this mechanism. If a user has a file mapped and pages are removed from it, that user will get a SIGBUS signal to indicate that this has happened. Weiny asked whether the group thought this approach was reasonable.
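For comparison, SIGBUS is already what a process receives today when it touches an ordinary shared file mapping beyond the end of a truncated file. The sketch below demonstrates that existing behavior from user space; it does not exercise DAX, get_user_pages(), or Weiny's prototype, and the function name is made up for illustration. It assumes a Unix-like system where fork() is available.

```python
import mmap
import os
import signal
import tempfile


def signal_after_truncate() -> int:
    """Return the signal that kills a child touching a truncated mapping.

    Returns 0 if the child exits without being signalled.
    """
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * mmap.PAGESIZE)
        f.flush()  # make sure the data reaches the file before mapping it

        pid = os.fork()
        if pid == 0:
            # Child: map the page, shrink the file to nothing, touch the page.
            try:
                m = mmap.mmap(f.fileno(), mmap.PAGESIZE)
                os.ftruncate(f.fileno(), 0)
                _ = m[0]  # the mapped page now lies past end-of-file
            finally:
                os._exit(0)  # only reached if no fatal signal arrived

        _, status = os.waitpid(pid, 0)
        return os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0
```

On Linux, comparing the return value with `signal.SIGBUS` should show that the child was killed by the access rather than by the truncation itself.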

One attendee noted that NFS seems to handle this case now; it can lose a file delegation on a truncate event. Perhaps the kernel should use the NFS (or the SMB direct) mechanism? There are challenges to implementing that, Weiny said, and in any case it's not the approach that people have been looking at. DAX is fundamentally different, he said, in that it uses a filesystem to control memory semantics.

Boaz Harrosh said that the approach was wrong, and that the truncate or hole-punch operation should simply fail when pinned pages are present. Others agreed that it wasn't right to just randomly kill processes because somebody truncated a file somewhere. Whether that is truly "random" was a matter of debate, but that was a secondary issue.
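The two positions can be put side by side in a toy model. Everything here is hypothetical: the `ToyFile` class, the policy names, and the choice of EBUSY as the error for a failed truncate (EBUSY comes up in the comments below, but no error code was agreed on in the session). "Signalled" holders stand in for processes that would receive SIGBUS.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class ToyFile:
    """Toy DAX file: a page count plus long-term pins keyed by holder name."""
    npages: int
    pins: dict = field(default_factory=dict)  # holder -> set of page indexes

    def pin(self, holder: str, page: int) -> None:
        self.pins.setdefault(holder, set()).add(page)

    def truncate(self, new_npages: int, policy: str):
        """Shrink the file; return (error code or 0, signalled holders)."""
        if policy not in ("fail", "sigbus"):
            raise ValueError(f"unknown policy: {policy}")

        # Holders with at least one pin beyond the new end of file.
        victims = sorted(
            h for h, pages in self.pins.items()
            if any(p >= new_npages for p in pages)
        )
        if victims and policy == "fail":
            return errno.EBUSY, []  # file and pins are left untouched

        for h in victims:  # "sigbus": revoke pins past the new size
            self.pins[h] = {p for p in self.pins[h] if p < new_npages}
        self.npages = new_npages
        return 0, victims
```

With a pin on page 6 of an eight-page file, `truncate(4, "fail")` returns `(errno.EBUSY, [])` and changes nothing, while `truncate(4, "sigbus")` succeeds and reports the pin holder as signalled.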

The rest of the session was an increasingly loud argument between those who favored sending SIGBUS and those who thought that the file operation should fail. At one point Hellwig suggested that people who were yelling really didn't need to use the microphone as well. Your editor will confess to having simply stopped taking notes partway through; it was one of those disagreements where opinions are strong and nobody is prepared to change their mind. About the only point of agreement was that the current behavior in this situation, wherein a call to truncate() will simply hang, is not good.

Toward the end, Ted Ts'o said that it would probably prove necessary to implement both options sooner or later. The session ended, rather later than scheduled, with no agreement as to what the right solution is. This topic seems likely to light up the mailing lists in the future.

Index entries for this article
Kernel: Memory management/get_user_pages()
Conference: Storage, Filesystem, and Memory-Management Summit/2019



get_user_pages(), pinned pages, and DAX

Posted May 7, 2019 21:42 UTC (Tue) by roc (subscriber, #30627) [Link] (3 responses)

> Others agreed that it wasn't right to just randomly kill processes because somebody truncated a file somewhere.

Seems to me that truncating a file is likely to already "randomly kill processes" that are using it via mmap.

get_user_pages(), pinned pages, and DAX

Posted May 7, 2019 23:29 UTC (Tue) by jgg (subscriber, #55211) [Link] (2 responses)

For executable mmaps the ftruncate can return ETXTBSY. For others you could get SIGBUS on next access. So POSIX endorses both options.

The SIGBUS idea here is different, it would instakill other processes. It also isn't clear how it would know which processes to kill since all things doing this link the pin lifetime to a FD lifetime..

get_user_pages(), pinned pages, and DAX

Posted Jun 28, 2019 5:16 UTC (Fri) by shentino (guest, #76459) [Link]

Accessing a range of an MMAP'ed file that's been truncated at the filesystem level is analogous to trying to access a range of physical memory that does not exist.

In this case it is very much the virtual equivalent of a bus error so a SIGBUS is completely appropriate given the insanity already present of whatever got the file truncated to begin with.

get_user_pages(), pinned pages, and DAX

Posted Jun 28, 2019 5:17 UTC (Fri) by shentino (guest, #76459) [Link]

SIGBUS shouldn't instakill processes. However, it does, by default, terminate the receiving process with a core dump, just like SIGSEGV.

get_user_pages(), pinned pages, and DAX

Posted May 8, 2019 0:57 UTC (Wed) by dgc (subscriber, #6611) [Link] (8 responses)

> One attendee noted that NFS seems to handle this case now; it can lose a file delegation
> on a truncate event. Perhaps the kernel should use the NFS (or the SMB direct)
> mechanism?

Yup, that's exactly what I've been saying for the past ~3 years or so.

> There are challenges to implementing that, Weiny said, and in any
> case it's not the approach that people have been looking at.

<sigh>

Here's my rant on /exactly this topic/ from a day or two before LSFMM:

https://lore.kernel.org/linux-fsdevel/20190501234740.GN14...

I've said "file layout leases solve the revocation problem" and used the pNFS remote direct block device access example so many times now that I'm quite peeved that high level pmem/DAX developers are /still/ saying "people are not looking at it".

> DAX is fundamentally
> different, he said, in that it uses a filesystem to control memory semantics.

From the filesystem perspective it's no different to any other persistent storage - it's just that the backing store is closer to the CPU than DMA based storage devices.

This is the fundamental disagreement DAX causes: some people say it's storage whose access is arbitrated by the filesystem and the filesystem syscall API (it's called FS-DAX for a reason!). The people who want to do RDMA, peer-to-peer DMA, etc all see it as roughly equivalent to page cache memory and so they can largely ignore the filesystem.

IOWs, this is all an argument over /who controls direct access to the storage/.

And the really dumb thing in this argument?

The FS-DAX core code itself depends on the filesystem layout break notification mechanisms to break and wait on DAX page table mappings to be released before truncate/hole punching can go ahead (see BREAK_UNMAP usage w/ xfs_break_layouts()). Yup, kernel FS-DAX effectively already uses the direct access arbitration model and interface we want userspace applications to use.....

IOWs, I'm not at all surprised by the fact this session turned into a yelling match....

-Dave.

get_user_pages(), pinned pages, and DAX

Posted May 8, 2019 5:36 UTC (Wed) by willy (subscriber, #9762) [Link]

Many of us walked out when the yelling started. It was clear this was not a productive use of anybody's time.

get_user_pages(), pinned pages, and DAX

Posted May 8, 2019 14:34 UTC (Wed) by djbw (subscriber, #78104) [Link] (1 responses)

Hi dgc, I don't know where the "people are not looking at it" interpretation is coming from, and I can't parse that "not the approach that people have been looking at" quote because the fact is Ira and I were the only ones in the room besides Christoph trying to advance the XFS position. Ira has taken it further with a proof-of-concept implementation posted to the list, but the conversation devolved into a shouting match before we could even get to those details.

Given the strong feelings, and that POSIX seems to allow both behaviors, I'm of the opinion that the fsdax core should implement the lease capability as library functionality that a filesystem can opt-in to, and let the RDMA folks take it up directly with the filesystems if they disagree with the local fs policy, or otherwise leave the status quo of simply blocking indefinite pinning on dax mappings. I'm personally of the opinion that failing truncate() is an abdication of the kernel's responsibility to try to ensure forward progress of the most recently requested system state, but it's clear there will not be wider consensus on that opinion any time soon. So, in the interest of moving forward with *any* lease implementation for this problem I'd suggest just starting with XFS for now.

get_user_pages(), pinned pages, and DAX

Posted May 8, 2019 16:35 UTC (Wed) by jgg (subscriber, #55211) [Link]

You can't implement leases at the FS layer alone - there is no way to implement a generic 'revoke' that is guaranteed to trigger the kernel to release the GUP. Ira should understand this problem well now.

This was mentioned during the session, but was also shouted down as 'broken'.

get_user_pages(), pinned pages, and DAX

Posted May 8, 2019 16:24 UTC (Wed) by jgg (subscriber, #55211) [Link] (3 responses)

I think layout leases and some mmap of the block device could certainly address some use cases, and it would be better than the nothing we have today..

But the entire objection to any sort of lease has always been that the plan to SIGKILL userspace after a timeout is horrible and unworkable in the real world. If someone came up with a better alternative I haven't heard it..

I gather the way pNFS/etc handle layout lease revoke in-kernel is generally OK, but those techniques don't translate to userspace??

get_user_pages(), pinned pages, and DAX

Posted May 8, 2019 19:11 UTC (Wed) by djbw (subscriber, #78104) [Link]

pNFS puts the revocation into the protocol so the lease holder is prepared to get out of the way. This was the motivation for the original lease proposal to make it an explicit interface rather than implicit so it is required of the registrant to be prepared to get out of the way. Then the secondary discussion is what happens when the application does not respond to revocation, and we fall back to the "unpin mechanism of last resort".

get_user_pages(), pinned pages, and DAX

Posted May 8, 2019 19:39 UTC (Wed) by iweiny (guest, #129274) [Link] (1 responses)

Layout leases in the kernel are different from a layout lease being taken by user space. While it is easy to force user space to take a lease prior to a GUP (be it through RDMA, DirectIO, etc) it is more difficult to ensure they react to revocation of that lease. Basically there were a couple of simple scenarios I thought of.

1) User takes lease, GUP's pages, removes lease
2) User takes lease, GUP's pages, ignores SIGIO
3) User doesn't take a lease at all.

Number 3 is a nicety to have so I will accept ignoring it. But 1 and 2 are more serious because now the lease means nothing. The application is free to "run away" with this memory. Or can we allow the truncate to hang/fail?

Therefore, I came up with taking the lease in the GUP code (Which also supported 3 above). The use of SIGBUS was the way to ensure truncate would not fail. I'm not sure I agree that failing truncate blocks forward progress as some filesystems already do this.

The SIGBUS solution is all shown in the prototype[1]. However, during the session Jason brought up the fact that with RDMA it would be difficult, if not impossible, to properly ID the process which holds the GUP pin. At the time I was thinking of the FD of the file being mmap'ed rather than the FD representing the RDMA context. After the session, Jason and I spoke and he clued me into what he was talking about. This does put a wrinkle in the ability to send a SIGBUS to the proper application. I've spent some time looking into how difficult it would be to get this right and it would be difficult.

So if we don't (or can't) send a SIGBUS to the application holding the pin, we have a few options. (Potentially implemented as a "library" as Dan mentioned.)

1) Allow each file system to determine what their truncate behavior is. For example, ext4 does return EBUSY on truncate. So it may be easiest there to detect the GUP pin and just return EBUSY.

2) For other file systems which want to pursue the lease path we can...
2a) force a lease to be taken, and don't allow the lease to be removed if pages have been pinned
2b) and allow a user to ignore SIGIO
2c) Hang and/or fail the truncate

After a couple of days thinking about it I don't see a way that the truncate behavior for FS DAX "longterm" pinned pages is not going to be different from a page cache backed FS.

So one possible path forward would be to force the user to take, and maintain, a lease and let truncate hang (or return EBUSY) under these conditions. This ensures that only applications which specifically "opt in" to this behavior are allowed to do this.

Would that be acceptable?

[1] https://lwn.net/ml/linux-fsdevel/20190429045359.8923-1-ir...

get_user_pages(), pinned pages, and DAX

Posted May 9, 2019 12:04 UTC (Thu) by jgg (subscriber, #55211) [Link]

I think anything done with leases has to interact with GUP - so the lease is not fully released until any GUP on the constituent DAX pages of the block device is also released. Otherwise we have user-triggered data corruption problems if the lease goes away while DMA is in progress, and real apps could unknowingly hit these corner cases..

i.e., without a linkage, process exit might mis-order lease release and DMA fence in the kernel, creating data corruption races!

IMHO this brings us right back to the start of the discussion where the FS is blocked on progress because of GUP - just the blocked progress is lease revoke now.

Dan's long-ago original idea of having the GUP caller provide an in-kernel revoke GUP callback is, IMHO, the only way to make userspace leases work. Keep revoke in the kernel.

The problem with Dan's idea is that we couldn't find any way to actually implement revoke in vfio and rdma that wasn't also data corrupting. :(

RDMA *might* have some hope here, at least for some cases, on a driver by driver basis, but it requires someone to convince the popular driver vendors to implement a special rereg MR verb.

Which is where I disagree with DaveC's assessment - IMHO, the real disagreement is that the two good solutions require either XFS or RDMA to do something *really hard*.

Boaz was right: the FS could solve this by orphaning GUP'd pages, but Dan said this was very, very hard for XFS.

Dan is right: RDMA could solve this by supporting ODP or MR revoke on all hardware, but this requires new standards and HW.

get_user_pages(), pinned pages, and DAX

Posted May 12, 2019 22:58 UTC (Sun) by dgc (subscriber, #6611) [Link]

<sigh>

And what we have is a range of responses showing why this topic ends up with people shouting - everyone (except Dan) has (again) missed the point I'm trying to make. Layout leases have /nothing/ to do with specific technology issues (RDMA, GUP pinning, whether SIGBUS is appropriate or not, etc). I'm concerned about how we fulfill the primary responsibility of the kernel: managing user access to hardware robustly and safely.

We've been doing remote direct access to storage in the filesystem/storage world for a long, long time. e.g. clustered SAN filesystems. They have a protocol for access rights, and the node responsible for arbitration can kick out and fence any node that goes rogue or doesn't play by the rules. File layout leases are a similar abstract management protocol to co-ordinate 3rd party access to block devices and to provide notifications that collisions have occurred. The only difference is that layout leases don't do any of the enforcement side of the protocol - they are just tracking and notification and everything else is left up to the application.

So you can all argue over how hard it is to revoke GUP, but that completely misses the point - the point is that without a /notification/ that an access collision has occurred and we need to recall and re-let the leases to resolve the issue, we can't even begin to solve this co-ordination problem.

Indeed, what happens when we have multiple different technologies all doing direct access to the storage hardware and they have to be co-ordinated? e.g. pNFS clients push data into the local filesystem, local GPUs or ASICs pull the data direct from the storage, then push the result back direct to the storage, then another remote node pulls the result direct from the storage via pNFS... How are these all supposed to interoperate safely and correctly?

The fact we have new technologies like DAX, pmem, p2p DMA, etc does not change the fact we still need to co-ordinate all the different direct accesses being made by applications to the storage. If anything, the new technologies make access arbitration even more important to get right.

So, please, if you have an access co-ordination mechanism that can tie pNFS, SMB-direct, DAX, p2p hardware DMA, RDMA push and pull direct to/from storage (e.g. NVMeOF), etc together with the local filesystem that manages the storage allocations with a coherent, sane, workable management strategy, I'm all ears. Nobody else is suggesting any sort of solution to these high level architecture issues....

If we are going to end up with systems that can make use of all these direct storage access technologies in a reliable, robust manner then we have to solve the general access arbitration and notification problem before anything else. Otherwise we'll just keep going down the road of "nothing works properly with anything else" as we have been doing for the past few years.

-Dave.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds