LWN: Comments on "get_user_pages(), pinned pages, and DAX" https://lwn.net/Articles/787636/ This is a special feed containing comments posted to the individual LWN article titled "get_user_pages(), pinned pages, and DAX". en-us Fri, 10 Oct 2025 13:18:40 +0000 Fri, 10 Oct 2025 13:18:40 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/792296/ https://lwn.net/Articles/792296/ shentino <div class="FormattedComment"> SIGBUS shouldn't instakill processes; however, by default it does terminate the receiving process with a core dump, just like SIGSEGV.<br> </div> Fri, 28 Jun 2019 05:17:45 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/792295/ https://lwn.net/Articles/792295/ shentino <div class="FormattedComment"> Accessing a range of an mmap'ed file that's been truncated at the filesystem level is analogous to trying to access a range of physical memory that does not exist.<br> <p> In this case it is very much the virtual equivalent of a bus error, so a SIGBUS is completely appropriate given the insanity already present of whatever got the file truncated to begin with.<br> </div> Fri, 28 Jun 2019 05:16:59 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/788190/ https://lwn.net/Articles/788190/ dgc <div class="FormattedComment"> &lt;sigh&gt;<br> <p> And what we have is a range of responses showing why this topic ends up with people shouting - everyone (except Dan) has (again) missed the point I'm trying to make. Layout leases have /nothing/ to do with specific technology issues (RDMA, GUP pinning, whether SIGBUS is appropriate or not, etc). I'm concerned about how we fulfill the primary responsibility of the kernel: managing user access to hardware robustly and safely. <br> <p> We've been doing remote direct access to storage in the filesystem/storage world for a long, long time, e.g. clustered SAN filesystems. 
They have a protocol for access rights, and the node responsible for arbitration can kick out and fence any node that goes rogue or doesn't play by the rules. File layout leases are a similar abstract management protocol to co-ordinate 3rd party access to block devices and to provide notifications that collisions have occurred. The only difference is that layout leases don't do any of the enforcement side of the protocol - they are just tracking and notification and everything else is left up to the application.<br> <p> So you can all argue over how hard it is to revoke GUP, but that completely misses the point - the point is that without a /notification/ that an access collision has occurred and we need to recall and re-let the leases to resolve the issue, we can't even begin to solve this co-ordination problem.<br> <p> Indeed, what happens when we have multiple different technologies all doing direct access to the storage hardware and they have to be co-ordinated? E.g. pNFS clients push data into the local filesystem, local GPUs or ASICs pull the data direct from the storage, then push the result back direct to the storage, then another remote node pulls the result direct from the storage via pNFS... How are these all supposed to interoperate safely and correctly? <br> <p> The fact we have new technologies like DAX, pmem, p2p DMA, etc does not change the fact we still need to co-ordinate all the different direct accesses being made by applications to the storage. If anything, the new technologies make access arbitration even more important to get right.<br> <p> So, please, if you have an access co-ordination mechanism that can tie pNFS, SMB-direct, DAX, p2p hardware DMA, RDMA push and pull direct to/from storage (e.g. NVMeOF), etc together with the local filesystem that manages the storage allocations with a coherent, sane, workable management strategy, I'm all ears. 
Nobody else is suggesting any sort of solution to these high level architecture issues....<br> <p> If we are going to end up with systems that can make use of all these direct storage access technologies in a reliable, robust manner then we have to solve the general access arbitration and notification problem before anything else. Otherwise we'll just keep going down the road of "nothing works properly with anything else" as we have been doing for the past few years.<br> <p> -Dave.<br> </div> Sun, 12 May 2019 22:58:12 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/787911/ https://lwn.net/Articles/787911/ jgg <div class="FormattedComment"> I think anything done with leases has to interact with GUP - so the lease is not fully released until any GUP on the constituent DAX pages of the block device is also released. Otherwise we have user-triggered data corruption problems if the lease goes away while DMA is in progress, and real apps could unknowingly hit these corner cases. <br> <p> i.e., without such a linkage, process exit might mis-order the lease release and the DMA fence in the kernel, creating data corruption races!<br> <p> IMHO this brings us right back to the start of the discussion where the FS is blocked on progress because of GUP - just the blocked progress is lease revoke now.<br> <p> Dan's long-ago original idea of having the GUP caller provide an in-kernel revoke GUP callback is, IMHO, the only way to make userspace leases work. Keep revoke in the kernel.<br> <p> The problem with Dan's idea is that we couldn't find any way to actually implement revoke in vfio and rdma that wasn't also data corrupting. 
:(<br> <p> RDMA *might* have some hope here, at least for some cases, on a driver by driver basis, but it requires someone to convince the popular driver vendors to implement a special rereg MR verb.<br> <p> Which is where I disagree with DaveC's assessment - IMHO, the real disagreement is that the two good solutions require either XFS or RDMA to do something *really hard*. <br> <p> Boaz was right: the FS could solve this by orphaning GUP'd pages, but Dan said this was very, very hard for XFS.<br> <p> Dan is right: RDMA could solve this by supporting ODP or MR revoke on all hardware; however, this requires new standards and HW.<br> </div> Thu, 09 May 2019 12:04:40 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/787866/ https://lwn.net/Articles/787866/ iweiny <div class="FormattedComment"> Layout leases in the kernel are different from a layout lease being taken by user space. While it is easy to force user space to take a lease prior to a GUP (be it through RDMA, DirectIO, etc) it is more difficult to ensure they react to revocation of that lease. Basically there were a couple of simple scenarios I thought of.<br> <p> 1) User takes lease, GUP's pages, removes lease<br> 2) User takes lease, GUP's pages, ignores SIGIO<br> 3) User doesn't take a lease at all.<br> <p> Number 3 is a nicety to have so I will accept ignoring it. But 1 and 2 are more serious because now the lease means nothing. The application is free to "run away" with this memory. Or can we allow the truncate to hang/fail?<br> <p> Therefore, I came up with taking the lease in the GUP code (which also supported 3 above). The use of SIGBUS was the way to ensure truncate would not fail. I'm not sure I agree that failing truncate blocks forward progress as some filesystems already do this.<br> <p> The SIGBUS solution is all shown in the prototype[1]. 
However, during the session Jason brought up the fact that with RDMA it would be difficult, if not impossible, to properly ID the process which holds the GUP pin. At the time I was thinking of the FD of the file being mmap'ed rather than the FD representing the RDMA context. After the session, Jason and I spoke and he clued me into what he was talking about. This does put a wrinkle in the ability to send a SIGBUS to the proper application. I've spent some time looking into how difficult it would be to get this right and it would be difficult.<br> <p> So if we don't (or can't) send a SIGBUS to the application holding the pin, we have a few options. (Potentially implemented as a "library" as Dan mentioned.)<br> <p> 1) Allow each file system to determine what their truncate behavior is. For example, ext4 does return EBUSY on truncate. So it may be easiest there to detect the GUP pin and just return EBUSY.<br> <p> 2) For other file systems which want to pursue the lease path we can...<br> 2a) force a lease to be taken, and don't allow the lease to be removed if pages have been pinned<br> 2b) and allow a user to ignore SIGIO<br> 2c) Hang and/or fail the truncate<br> <p> After a couple of days thinking about it I don't see a way that the truncate behavior for FS DAX "longterm" pinned pages is not going to be different from a page cache backed FS.<br> <p> So one possible path forward would be to force the user to take, and maintain, a lease and let truncate hang (or return EBUSY) under these conditions. 
This ensures that only applications which specifically "opt in" to this behavior are allowed to do this.<br> <p> Would that be acceptable?<br> <p> [1] <a href="https://lwn.net/ml/linux-fsdevel/20190429045359.8923-1-ira.weiny@intel.com/">https://lwn.net/ml/linux-fsdevel/20190429045359.8923-1-ir...</a><br> <p> <p> </div> Wed, 08 May 2019 19:39:26 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/787865/ https://lwn.net/Articles/787865/ djbw <div class="FormattedComment"> pNFS puts the revocation into the protocol so the lease holder is prepared to get out of the way. This was the motivation for the original lease proposal to make it an explicit interface rather than implicit so it is required of the registrant to be prepared to get out of the way. Then the secondary discussion is what happens when the application does not respond to revocation, and we fall back to the "unpin mechanism of last resort".<br> </div> Wed, 08 May 2019 19:11:47 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/787855/ https://lwn.net/Articles/787855/ jgg <div class="FormattedComment"> You can't implement leases at the FS layer alone - there is no way to implement a generic 'revoke' that is guaranteed to trigger the kernel to release the GUP. Ira should understand this problem well now.<br> <p> This was mentioned during the session, but was also shouted down as 'broken'.<br> </div> Wed, 08 May 2019 16:35:04 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/787850/ https://lwn.net/Articles/787850/ jgg <div class="FormattedComment"> I think layout leases and some mmap of the block device could certainly address some use cases, and it would be better than the nothing we have today.<br> <p> But the entire objection to any sort of lease has always been that the plan to SIGKILL userspace after a timeout is horrible and unworkable in the real world. 
If someone came up with a better alternative I haven't heard it..<br> <p> I gather the way pNFS/etc handle layout lease revoke in-kernel is generally OK, but those techniques don't translate to userspace??<br> </div> Wed, 08 May 2019 16:24:43 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/787843/ https://lwn.net/Articles/787843/ djbw <div class="FormattedComment"> Hi dgc, I don't know where the "people are not looking at it" interpretation is coming from, and I can't parse that "not the approach that people have been looking at" quote because the fact is Ira and I were the only ones in the room besides Christoph trying to advance the XFS position. Ira has taken it further with a proof-of-concept implementation posted to the list, but the conversation devolved into a shouting match before we could even get to those details.<br> <p> Given the strong feelings, and that POSIX seems to allow both behaviors, I'm of the opinion that the fsdax core should implement the lease capability as library functionality that a filesystem can opt-in to, and let the RDMA folks take it up directly with the filesystems if they disagree with the local fs policy, or otherwise leave the status quo of simply blocking indefinite pinning on dax mappings. I'm personally of the opinion that failing truncate() is an abdication of the kernel's responsibility to try to ensure forward progress of the most recently requested system state, but it's clear there will not be wider consensus on that opinion any time soon. So, in the interest of moving forward with *any* lease implementation for this problem I'd suggest just starting with XFS for now. <br> </div> Wed, 08 May 2019 14:34:12 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/787785/ https://lwn.net/Articles/787785/ willy <div class="FormattedComment"> Many of us walked out when the yelling started. 
It was clear this was not a productive use of anybody's time.<br> </div> Wed, 08 May 2019 05:36:22 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/787773/ https://lwn.net/Articles/787773/ dgc <div class="FormattedComment"> <font class="QuotedText">&gt; One attendee noted that NFS seems to handle this case now; it can lose a file delegation</font><br> <font class="QuotedText">&gt; on a truncate event. Perhaps the kernel should use the NFS (or the SMB direct)</font><br> <font class="QuotedText">&gt; mechanism?</font><br> <p> Yup, that's exactly what I've been saying for the past ~3 years or so.<br> <p> <font class="QuotedText">&gt; There are challenges to implementing that, Weiny said, and in any</font><br> <font class="QuotedText">&gt; case it's not the approach that people have been looking at.</font><br> <p> &lt;sigh&gt;<br> <p> Here's my rant on /exactly this topic/ from a day or two before LSFMM:<br> <p> <a href="https://lore.kernel.org/linux-fsdevel/20190501234740.GN1454@dread.disaster.area/">https://lore.kernel.org/linux-fsdevel/20190501234740.GN14...</a><br> <p> I've said "file layout leases solve the revocation problem" and used the pNFS remote direct block device access example so many times now that I'm quite peeved that high level pmem/DAX developers are /still/ saying "people are not looking at it".<br> <p> <font class="QuotedText">&gt; DAX is fundamentally</font><br> <font class="QuotedText">&gt; different, he said, in that it uses a filesystem to control memory semantics. </font><br> <p> From the filesystem perspective it's no different to any other persistent storage - it's just that the backing store is closer to the CPU than DMA based storage devices.<br> <p> This is the fundamental disagreement DAX causes: some people say it's storage whose access is arbitrated by<br> the filesystem and the filesystem syscall API (it's called FS-DAX for a reason!). 
The people who want to do RDMA, peer-to-peer DMA, etc all see it as roughly equivalent to page cache memory and so they can largely ignore the filesystem.<br> <p> IOWs, this is all an argument over /who controls direct access to the storage/.<br> <p> And the really dumb thing in this argument?<br> <p> The FS-DAX core code itself depends on the filesystem layout break notification mechanisms to break and wait on DAX page table mappings to be released before truncate/hole punching can go ahead (see BREAK_UNMAP usage w/ xfs_break_layouts()). Yup, kernel FS-DAX effectively already uses the direct access arbitration model and interface we want userspace applications to use.....<br> <p> IOWs, I'm not at all surprised by the fact this session turned into a yelling match....<br> <p> -Dave.<br> </div> Wed, 08 May 2019 00:57:49 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/787772/ https://lwn.net/Articles/787772/ jgg <div class="FormattedComment"> For executable mmaps the ftruncate can return ETXTBSY. For others you could get SIGBUS on next access. So POSIX endorses both options.<br> <p> The SIGBUS idea here is different: it would instakill other processes. It also isn't clear how it would know which processes to kill since all things doing this link the pin lifetime to a FD lifetime.<br> </div> Tue, 07 May 2019 23:29:29 +0000 get_user_pages(), pinned pages, and DAX https://lwn.net/Articles/787767/ https://lwn.net/Articles/787767/ roc <div class="FormattedComment"> <font class="QuotedText">&gt; Others agreed that it wasn't right to just randomly kill processes because somebody truncated a file somewhere.</font><br> <p> Seems to me that truncating a file is likely to already "randomly kill processes" that are using it via mmap.<br> </div> Tue, 07 May 2019 21:42:44 +0000
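The existing behavior that roc and jgg refer to above (a plain mmap user getting SIGBUS on its next access after the file is truncated) can be reproduced with a small script. This is only a minimal sketch, assuming Linux and an ordinary page-cache-backed temporary file; it does not involve DAX, GUP pins, or leases. The names `run_truncate_demo` and `CHILD` are illustrative and not from the thread. The faulting access runs in a child process so that the parent can observe how the child died.

```python
import mmap
import signal
import subprocess
import sys
import tempfile
import textwrap

# Child program: map one page of a file, shrink the file to zero bytes,
# then touch the mapping. The mapped page now lies beyond end-of-file.
CHILD = textwrap.dedent("""
    import mmap, os, sys
    fd = os.open(sys.argv[1], os.O_RDWR)
    m = mmap.mmap(fd, mmap.PAGESIZE)   # shared, file-backed mapping
    assert m[0] == 0x41                # readable while the file has data
    os.ftruncate(fd, 0)                # truncate underneath the mapping
    m[0]                               # no file data behind this page any more
""")


def run_truncate_demo() -> int:
    """Run the child and return its exit status as reported by subprocess.

    A negative value -N means the child was terminated by signal N.
    """
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"A" * mmap.PAGESIZE)  # exactly one page of data
        f.flush()
        result = subprocess.run([sys.executable, "-c", CHILD, f.name])
        return result.returncode


if __name__ == "__main__":
    status = run_truncate_demo()
    if status < 0:
        print("child killed by", signal.Signals(-status).name)
    else:
        print("child exited with status", status)
```

The child installs no signal handler, so the default disposition applies and the process is terminated, which is the "randomly kill processes" effect under discussion. The GUP case debated in the thread differs because the pinned pages are accessed by device DMA rather than through the process's page tables.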