
get_user_pages(), pinned pages, and DAX


Posted May 8, 2019 16:24 UTC (Wed) by jgg (subscriber, #55211)
In reply to: get_user_pages(), pinned pages, and DAX by dgc
Parent article: get_user_pages(), pinned pages, and DAX

I think layout leases and some mmap of the block device could certainly address some use cases, and it would be better than the nothing we have today.

But the entire objection to any sort of lease has always been that the plan to SIGKILL userspace after a timeout is horrible and unworkable in the real world. If someone has come up with a better alternative, I haven't heard it.

I gather the way pNFS etc. handle layout lease revoke in-kernel is generally OK, but those techniques don't translate to userspace?



get_user_pages(), pinned pages, and DAX

Posted May 8, 2019 19:11 UTC (Wed) by djbw (subscriber, #78104) [Link]

pNFS puts the revocation into the protocol, so the lease holder is prepared to get out of the way. This was the motivation for making the original lease proposal an explicit interface rather than an implicit one: the registrant is required to be prepared to get out of the way. The secondary discussion is then what happens when the application does not respond to revocation and we fall back to the "unpin mechanism of last resort".

get_user_pages(), pinned pages, and DAX

Posted May 8, 2019 19:39 UTC (Wed) by iweiny (guest, #129274) [Link] (1 responses)

Layout leases in the kernel are different from a layout lease being taken by user space. While it is easy to force user space to take a lease prior to a GUP (be it through RDMA, DirectIO, etc.), it is more difficult to ensure that it reacts to revocation of that lease. There were a few simple scenarios I thought of:

1) User takes lease, GUPs pages, removes lease.
2) User takes lease, GUPs pages, ignores SIGIO.
3) User doesn't take a lease at all.

Handling number 3 is a nice-to-have, so I will accept ignoring it. But 1 and 2 are more serious, because then the lease means nothing: the application is free to "run away" with this memory. Or can we allow the truncate to hang or fail?

Therefore, I came up with taking the lease in the GUP code (which also covered 3 above). The use of SIGBUS was the way to ensure truncate would not fail. I'm not sure I agree that failing truncate blocks forward progress, as some filesystems already do this.

The SIGBUS solution is all shown in the prototype[1]. However, during the session Jason brought up the fact that with RDMA it would be difficult, if not impossible, to properly identify the process which holds the GUP pin. At the time I was thinking of the FD of the file being mmap'ed rather than the FD representing the RDMA context. After the session, Jason and I spoke and he clued me in to what he was talking about. This does put a wrinkle in the ability to send a SIGBUS to the proper application. I've spent some time looking into how hard it would be to get this right, and it would be difficult.

So if we don't (or can't) send a SIGBUS to the application holding the pin, we have a few options (potentially implemented as a "library", as Dan mentioned).

1) Allow each file system to determine what its truncate behavior is. For example, ext4 does return EBUSY on truncate, so it may be easiest there to detect the GUP pin and just return EBUSY.

2) For other file systems which want to pursue the lease path we can...
2a) force a lease to be taken, and don't allow the lease to be removed if pages have been pinned
2b) allow a user to ignore SIGIO
2c) hang and/or fail the truncate
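
The options above can be sketched as a toy user-space model. This is only an illustration of the state checks involved, not kernel code and not the prototype's implementation: `DaxFile`, `take_lease`, `pin_pages`, `unpin_pages`, `release_lease`, and `truncate` are hypothetical names, the EPERM returned for a pin without a lease is an arbitrary choice, and only the "fail" variant of 2c is shown.

```python
import errno
from dataclasses import dataclass


@dataclass
class DaxFile:
    """Hypothetical stand-in for a DAX file's lease and pin state."""
    lease_held: bool = False
    pinned_pages: int = 0


def take_lease(f: DaxFile) -> None:
    f.lease_held = True


def pin_pages(f: DaxFile, count: int) -> None:
    # 2a: a longterm pin is only allowed while a lease is held.
    # EPERM is an assumption made for this sketch.
    if not f.lease_held:
        raise OSError(errno.EPERM, "longterm pin requires a lease")
    f.pinned_pages += count


def unpin_pages(f: DaxFile, count: int) -> None:
    f.pinned_pages -= min(count, f.pinned_pages)


def release_lease(f: DaxFile) -> None:
    # 2a: the lease cannot be removed while pages remain pinned.
    if f.pinned_pages:
        raise OSError(errno.EBUSY, "pages still pinned")
    f.lease_held = False


def truncate(f: DaxFile) -> None:
    # Option 1 and the "fail" half of 2c: refuse the truncate while pins
    # exist. Per 2b, nothing here waits for the lease holder to react.
    if f.pinned_pages:
        raise OSError(errno.EBUSY, "file has longterm-pinned pages")
    # The actual truncate work would happen here.
```

In this model a process that takes a lease and pins pages can neither drop the lease nor have the file truncated underneath it until it unpins, which is the "opt in" behavior the options aim for.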

After a couple of days of thinking about it, I don't see how the truncate behavior for FS DAX "longterm" pinned pages can avoid being different from that of a page-cache-backed FS.

So one possible path forward would be to force the user to take, and maintain, a lease and let truncate hang (or return EBUSY) under these conditions. This ensures that only applications which specifically "opt in" to this behavior are allowed to do this.

Would that be acceptable?

[1] https://lwn.net/ml/linux-fsdevel/20190429045359.8923-1-ir...

get_user_pages(), pinned pages, and DAX

Posted May 9, 2019 12:04 UTC (Thu) by jgg (subscriber, #55211) [Link]

I think anything done with leases has to interact with GUP - so the lease is not fully released until any GUP on the constituent DAX pages of the block device is also released. Otherwise we have user-triggered data corruption problems if the lease goes away while DMA is in progress, and real apps could unknowingly hit these corner cases.

i.e., without a linkage, process exit might mis-order the lease release and the DMA fence in the kernel, creating data corruption races!

IMHO this brings us right back to the start of the discussion, where the FS is blocked from making progress because of GUP - the blocked operation is just lease revoke now.
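
A minimal sketch of that linkage, as a toy model rather than anything from the kernel: the lease only counts as gone once release has been requested and the last pin has been dropped. `LinkedLease` and its methods are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class LinkedLease:
    """Hypothetical lease whose release is tied to outstanding GUP pins."""
    pins: int = 0
    release_requested: bool = False

    @property
    def active(self) -> bool:
        # The lease stays in force until release was requested AND no
        # pinned pages remain, so DMA never outlives the lease.
        return not self.release_requested or self.pins > 0

    def pin(self, count: int = 1) -> None:
        # No new pins once the holder has started to let go of the lease.
        if self.release_requested:
            raise RuntimeError("lease is being released")
        self.pins += count

    def unpin(self, count: int = 1) -> None:
        self.pins = max(0, self.pins - count)

    def request_release(self) -> None:
        self.release_requested = True
```

With this ordering, a process exit that drops the lease before its DMA has been fenced leaves the lease active until the final unpin, at the cost of the filesystem having to wait for that unpin.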

Dan's long-ago original idea of having the GUP caller provide an in-kernel revoke GUP callback is, IMHO, the only way to make userspace leases work. Keep revoke in the kernel.

The problem with Dan's idea is that we couldn't find any way to actually implement revoke in vfio and rdma that wasn't also data corrupting. :(

RDMA *might* have some hope here, at least for some cases, on a driver by driver basis, but it requires someone to convince the popular driver vendors to implement a special rereg MR verb.

Which is where I disagree with DaveC's assessment - IMHO, the real disagreement is that the two good solutions require either XFS or RDMA to do something *really hard*.

Boaz was right: the FS could solve this by orphaning GUP'd pages, but Dan said this was very, very hard for XFS.

Dan is right: RDMA could solve this by supporting ODP or MR revoke on all hardware; however, this requires new standards and HW.

