Write-protect for userfaultfd()
The core idea behind userfaultfd() is that, when the process of interest incurs a page fault due to a non-present page, the monitoring process receives a notification and is given the opportunity to provide a page to satisfy that fault. Once pages are present, though, the monitoring process is no longer involved. The new proposal changes that by allowing the monitoring process to mark pages as being write-protected, even though the owning process has write permission to them. That allows writes to be intercepted, and some extra processing performed.
There are a number of use cases for this feature, Arcangeli said. They include:
- Live snapshotting of running processes. QEMU would like this feature so that it can be sure of copying pages in a stable state.
- The Redis database system performs checkpointing by forking, then writing out its memory in the child process, taking advantage of the kernel's copy-on-write behavior to keep the memory stable while it is being written. Among other things, this technique requires disabling transparent huge pages in order to work, but that has a performance impact. It also potentially doubles memory use, since the parent could write to all of its data pages while the child is running. The write-protect feature could eliminate the need to fork to get a stable set of pages to write.
- There are schemes for implementing distributed shared memory that could use it to detect (and distribute) changes. This can, in theory, be done now by using mprotect() and handling segmentation-violation signals, but it's slow and creates vast numbers of virtual memory areas in the kernel.
- The soft dirty feature, used by the checkpoint-restore in user space (CRIU) mechanism, could be replaced by this write-protect mechanism. It would be far more efficient, since it would eliminate the need to scan through the page tables.
- Similarly, it could be used to replace the dirty bits used by language runtime systems, such as the Java virtual machine, to catch changes.
Many other cases that use mprotect() now would likely run faster with userfaultfd(), which never needs to acquire the highly contended mmap_sem lock for writes.
Processes wanting to use this feature must call userfaultfd() to
obtain a file descriptor, then perform an UFFDIO_REGISTER
ioctl() call with the UFFDIO_REGISTER_MODE_WP option.
Then UFFDIO_WRITEPROTECT operations, which supply a start address
and a size, can be used to change the write-protect bit on one or more
pages. Events will show up with the UFFD_PAGEFAULT_FLAG_WP flag
set. For now, write protection only works with anonymous pages.
There are some loose ends that still need to be worked out. For example, the write-protect feature requires a new bit in the page-table entries; it has taken the last available bit for this purpose, which may not be entirely welcome. Arcangeli said that it may be possible to find an alternative to using that bit later on. Dave Hansen said that there might just be a couple of other bits available if one looks hard enough, but Mel Gorman warned that they might not be available on all architectures. Some of the apparently unused bits have been claimed by CPU vendors for uses that have not yet been made public, it seems.
Another issue is that page-fault handlers can fail with a VM_FAULT_RETRY return code, indicating that the memory-management subsystem should restart the process of handling the fault from the beginning. But that can only happen twice for any given fault before the whole thing fails, sending a (probably fatal) signal to the faulting process. This is a problem for the write-protect feature, which can generate more retries than that. Ideally, an unlimited number of retries would be allowed, Arcangeli said.
Currently privilege is required to use userfaultfd(), but the desire is to make it available for unprivileged use as well; a sysctl knob would be used to control whether privilege is needed or not, allowing "paranoid" system administrators to restrict its use. Jérôme Glisse suggested that perhaps the seccomp mechanism would be a better way to control access to this feature.
At the end of the session, Arcangeli asked whether the patches need more
review before being merged. "Always" was the inevitable answer from
memory-management subsystem maintainer Andrew Morton.
Index entries for this article | |
---|---|
Kernel | userfaultfd() |
Conference | Storage, Filesystem, and Memory-Management Summit/2019 |
Posted May 2, 2019 22:56 UTC (Thu)
by unixbhaskar (guest, #44758)
[Link]
Posted May 3, 2019 7:14 UTC (Fri)
by Freeaqingme (guest, #103259)
[Link] (3 responses)
Posted May 3, 2019 11:00 UTC (Fri)
by matthias (subscriber, #94967)
[Link] (1 responses)
Posted May 6, 2019 22:55 UTC (Mon)
by nix (subscriber, #2304)
[Link]
Posted May 3, 2019 11:01 UTC (Fri)
by corbet (editor, #1)
[Link]
Posted May 16, 2019 23:33 UTC (Thu)
by robert_s (subscriber, #42402)
[Link]
Write-protect for userfaultfd()
Write-protect for userfaultfd()
Write-protect for userfaultfd()
Write-protect for userfaultfd()
The PTE format is defined by the hardware (on many architectures, at least); it's not something that can be readily extended.
PTE bits
Write-protect for userfaultfd()