
User-space page fault handling

By Jonathan Corbet
May 14, 2013
Page fault handling is normally the kernel's responsibility. When a process attempts to access an address that is not currently mapped to a location in RAM, the kernel responds by mapping a page to that location and, if needed, filling that page with data from secondary storage. But what if that data is not in a location that is easily reachable by the kernel? Then, perhaps, it's time to outsource the responsibility for handling the fault to user space.

One situation where user-space page fault handling can be useful is for the live migration of virtual machines from one physical host to another. Migration can be done by stopping the machine, copying its full address space to the new host, and restarting the machine there. But address spaces may be large and sparsely used; copying a full address space can result in a lot of unnecessary work and a noticeable pause before the migrated system restarts. If, instead, the virtual machine's address space could be demand-paged from the old host to the new, it could restart more quickly and the copying of unused data could be avoided.

Live migration with KVM is currently managed with an out-of-tree char device. This scheme works, but, once the device takes over a range of memory, that memory is removed from the memory management subsystem. So it cannot be swapped out, transparent huge pages don't work, and so on. Clearly it would be better to come up with a solution that, while allowing user-space handling of demand paging, does not remove the affected memory from the kernel's management altogether. A patch set recently posted by Andrea Arcangeli aims to resolve those issues with a couple of new system call options.

The first of those is to extend the madvise() system call, adding a new command called MADV_USERFAULT. Processes can use this operation to tell the kernel that user space will handle page faults on a range of memory. After this call, any access to an unmapped address in the given range will result in a SIGBUS signal; the process is then expected to respond by mapping a real page into the unmapped space as described below. The madvise(MADV_USERFAULT) call should be made immediately after the memory range is created; user-space fault handling will not work if the kernel handles any page faults before it is told that user space will be doing the job.
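In rough outline, the setup side might look like the sketch below. This is illustrative only: MADV_USERFAULT is defined by the patch set, not by stock kernel headers, so a placeholder value is used here.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_USERFAULT
    #define MADV_USERFAULT 18   /* placeholder; the real value comes from the patch set */
    #endif

    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
        /* si->si_addr is the faulting address; map a real page there
           (see the handler sketch below). */
    }

    int main(void)
    {
        size_t len = 16 * 4096;
        struct sigaction sa;
        void *region;

        region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED)
            return 1;

        /* Claim the faults immediately, before any page is touched. */
        if (madvise(region, len, MADV_USERFAULT))
            perror("madvise");

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

        /* Any access to a not-yet-mapped page in "region" now
           delivers SIGBUS instead of being satisfied by the kernel. */
        return 0;
    }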

The SIGBUS signal handler's job is to handle the page fault by mapping a real page to the faulting address. That can be done in current kernels with the mremap() system call. The problem with mremap() is that it works by splitting the virtual memory area (VMA) structure used to describe the memory range within the kernel. Frequent mremap() calls will result in the kernel having to manage a large number of VMAs, which is an expensive proposition. mremap() will also happily overwrite existing memory mappings, making it harder to detect errors (or race conditions) in user-space handlers. For these reasons, mremap() is not an ideal solution to the problem.

Andrea's answer to this problem is a new system call:

    int remap_anon_pages(void *dest, void *source, unsigned long len);

This call will cause the len bytes of memory starting at source to be mapped into the process's address space starting at dest. At the same time, the source memory range will be unmapped — the pages previously found there will be atomically moved to the dest range.

Andrea has posted a small test program that demonstrates how these APIs are meant to be used.

As one might expect, some restrictions apply: source and dest must be page-aligned, len should be a multiple of the page size, the dest range must be completely unmapped, and the source range must be fully mapped. The mapping requirements exist to catch bugs in user-space fault handlers; remapping pages on top of existing memory has a high risk of causing memory corruption.
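Putting the pieces together, a fault handler might look like the rough sketch below (this is a sketch, not Andrea's test program). The proposed system call has no glibc wrapper or allocated syscall number, so __NR_remap_anon_pages is hypothetical here, as is the fetch_page_from_source() helper that stands in for whatever code populates a staging page:

    #include <signal.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096UL    /* assumed; query sysconf(_SC_PAGESIZE) in real code */

    /* Hypothetical wrapper; no syscall number has been allocated yet. */
    static long remap_anon_pages(void *dest, void *source, unsigned long len)
    {
        return syscall(__NR_remap_anon_pages, dest, source, len);
    }

    /* Hypothetical helper: returns a fully mapped staging page filled
       with the data that belongs at "addr" (fetched from the migration
       source, for example). */
    extern void *fetch_page_from_source(void *addr);

    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
        void *fault = (void *)((uintptr_t)si->si_addr & ~(PAGE_SIZE - 1));
        void *staging = fetch_page_from_source(fault);

        /* Atomically move the populated page to the faulting address;
           when the handler returns, the interrupted access is retried
           and now succeeds. */
        remap_anon_pages(fault, staging, PAGE_SIZE);
    }

Note how the restrictions are satisfied naturally: the destination page is unmapped (that is why the access faulted in the first place), while the staging page is fully mapped.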

One nice feature of the patch set is that, on systems where transparent huge pages are enabled, huge pages can be remapped with remap_anon_pages() without the need to split them apart. For that to work, of course, the length and alignment of the range to move must be compatible with huge pages.
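Concretely, assuming x86's 2MB transparent huge pages and the hypothetical wrapper from the sketch above, a handler that wants to keep huge pages intact would move them in huge-page-aligned, huge-page-sized units:

    #define HPAGE_SIZE (2UL << 20)  /* assumed x86 THP size */

    /* dest and src both 2MB-aligned, length one whole huge page, so
       the kernel can move the page without splitting it. */
    remap_anon_pages(dest, src, HPAGE_SIZE);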

There are a number of limitations in the current patch set. The MADV_USERFAULT option can only be used on anonymous (swap-backed) memory areas. A more complete implementation could conceivably support this feature for file-backed pages as well. The mechanism offers support for demand paging of data into RAM, but there is no user-controllable mechanism for pushing data back out; instead, those pages are swapped out along with all other anonymous pages. So it is not a complete user-space paging mechanism; it is more of a hook for loading the initial contents of anonymous pages from an outside source.

But, even with those limitations, the feature is useful for the intended virtualization use case. Andrea suggests it could possibly have other uses as well; remote RAM applications come to mind. First, though, it needs to get into the mainline, and that, in turn, suggests that the proposed ABI needs to be reviewed carefully. Thus far, this patch set has not gotten a lot of review attention; that will need to change before it can be considered for mainline merging.




User-space page fault handling

Posted May 16, 2013 1:23 UTC (Thu) by nelhage (subscriber, #59579) [Link] (3 responses)

Perhaps I'm misunderstanding something here, but I don't understand how `MADV_USERFAULT` is different from, or superior to, doing an `mprotect(PROT_NONE)` and then handling the `SIGSEGV`. Can someone help me out?

User-space page fault handling

Posted May 16, 2013 10:49 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

> Perhaps I'm misunderstanding something here, but I don't understand how `MADV_USERFAULT` is different from, or superior to, doing an `mprotect(PROT_NONE)` and then handling the `SIGSEGV`. Can someone help me out?

For one, there is the ugliness of properly handling SIGSEGVs, which requires sigaltstack() et al. and is far from easy.

For another, if you went that way you would need to call mmap() for every single page fault, which would probably end up being horrendously expensive: you would end up with thousands of separate mappings set up in the kernel. With the patch set, as far as I understand it, there is just one memory region set up in the kernel, and only when it cannot find backing memory does it fall back to the user-space page fault handler.

User-space page fault handling

Posted May 16, 2013 11:23 UTC (Thu) by pbonzini (subscriber, #60935) [Link] (1 responses)

First, MADV_USERFAULT will tell userspace if a fault happens, but it has no effect if the page is already present. (Yes, this is unexpected).

Second, mprotect(PROT_NONE) creates a VMA, so you'll end up with a ton of VMAs. MADV_USERFAULT doesn't.

User-space page fault handling

Posted May 20, 2013 15:18 UTC (Mon) by lacos (guest, #70616) [Link]

This is freaking awesome.

> First, MADV_USERFAULT will tell userspace if a fault happens, but it has
> no effect if the page is already present. (Yes, this is unexpected).

Anthony's response in this article helps make sense of this (and also helps interpret the test program linked from the patch series). The address range (VMA) whose faults we care about (which is the guest's RAM on the target host) is actually allocated, just not yet set up in the page tables / populated with data. Once such a page has been populated during live migration, we really don't care about accesses to it any longer.

So basically we don't want to re-map such a page (and split the containing VMA), we just want to catch the first access (the first page-in) of any page in such a VMA and fill it with real data. During this procedure we're not manipulating the process's address space at all, just its page tables.

(mremap() seems like the opposite approach: make the same backing store available at a potentially different virtual address, for a potentially different size. In this case, however, we want to cause a different piece of backing store to appear at the same virtual address.)

(BTW I wonder what happens if such an anon page is paged out to swap *after* we filled it with data. When we access it for the second time, will the userspace sighandler be invoked again, or (what seems more correct) will it silently come back from swap?)

... I'm sure I'm about 95% inexact (or wildly raving even :)) in the above, but I'm still in awe of what I believe to understand from the idea.

User-space page fault handling

Posted May 16, 2013 1:24 UTC (Thu) by geofft (subscriber, #59789) [Link]

I'm confused about something basic here -- how is this different from, like, accessing an unmapped page (or a page in an anonymous PROT_NONE range), catching the SIGSEGV that is already delivered to you today, filling in the data, and returning?

User-space page fault handling

Posted May 16, 2013 1:48 UTC (Thu) by aliguori (subscriber, #30636) [Link] (1 responses)

This is not quite right with respect to live migration.

Live migration works perfectly fine today without any out-of-tree drivers, but the algorithm we use is convergent. First we enable dirty tracking of the guest's memory, then we transfer all memory; after that, we check which pages have been dirtied and, for the next round, transfer only those dirty pages.

We keep doing this until there is a sufficiently small number of dirty pages left, and then we stop the VM and transfer those final pages. This algorithm can run into trouble, though, if the guest dirties pages faster than we can transfer them.
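
Schematically, the pre-copy loop looks something like this (the helpers here are illustrative stand-ins, not QEMU's real migration API):

    /* Illustrative stand-ins, not QEMU's real migration code. */
    extern void enable_dirty_tracking(void);
    extern void transfer_all_pages(void);
    extern unsigned long transfer_dirty_pages(void); /* returns pages still dirty */
    extern void stop_vm(void);
    extern void start_vm_on_destination(void);

    void precopy_migrate(unsigned long threshold)
    {
        enable_dirty_tracking();
        transfer_all_pages();               /* round 0: copy everything */

        /* Each round re-sends the pages the guest dirtied during the
           previous round; this converges only if the guest dirties
           pages more slowly than the link can carry them. */
        while (transfer_dirty_pages() > threshold)
            ;

        stop_vm();                          /* brief downtime window */
        transfer_dirty_pages();             /* the final remainder */
        start_vm_on_destination();
    }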

Andrea's patches enable a new form of migration called "post-copy", which immediately starts the guest on the destination machine and then transfers pages from the source to the destination as the destination guest accesses them.

This is why userspace page fault tracking is needed (to fetch those pages). The advantage of this approach is that migration is deterministic--you don't have to rely on convergence.

That said, there's very little concrete data showing that post-copy migration results in less guest downtime than the various techniques for forcing convergence with pre-copy migration, and I'm not convinced that we should support it at all. So the jury is still out on whether this feature is actually needed for virt.

Nonetheless, it's an interesting kernel interface, and the rest of the write-up is spot-on.

User-space page fault handling

Posted May 27, 2013 21:51 UTC (Mon) by ccurtis (guest, #49713) [Link]

> This algorithm can run into trouble, though, if the guest dirties pages faster than we can transfer them.

It sounds like you might want an algorithm that enforces some level of determinism, like:

void shrink(void)
{
    for (;;) {
        if (throttle_cpu)
            cpu_speed *= 0.9;     /* slow the guest another 10% */
        if (dirty_pages)
            throttle_cpu = 1;

        if (migrate_pages())      /* done once a round transfers cleanly */
            break;

        dirty_pages = 1;
    }
}

User-space page fault handling

Posted May 16, 2013 3:31 UTC (Thu) by jmorris42 (guest, #2203) [Link] (6 responses)

I'm sure there is a reason I'm not seeing offhand, but why are we still segregating memory into 'anonymous' pages and file-backed ones?

The main executables and the libraries are already mapped to files. Mandate a directory on a real drive that contains a sparse file backing each running process's read/write memory. No more out-of-memory errors or swap partitions/files: as long as you have disk space, you can malloc(). Depending on how you use it, you might still bring the system down in swap death, so you would still need something like the out-of-memory killer, but things should generally get simpler. You would need to make the actual writeback of these files very low priority unless there was memory pressure, but that wouldn't change the mental model or the API.

This new feature under discussion would become a way to detect when a sparse file was trying to read a hole.

Hibernation would get easier too: stop all processes, write the kernel's memory to disk, sync dirty blocks, put devices to sleep, sync the drive's buffer, and kill power. For all intents and purposes, RAM would be treated as just another layer of cache between the on-die cache and the hard disk/SSD.

User-space page fault handling

Posted May 16, 2013 11:20 UTC (Thu) by Funcan (subscriber, #44209) [Link] (5 responses)

Many programs allocate larger memory ranges than they need... way, way larger. They then sparsely fill them. The amount of drive space necessary to support some scientific codes would be uneconomical if not impossible... Petabyte sparse matrices are far from unknown.

User-space page fault handling

Posted May 16, 2013 15:42 UTC (Thu) by dgm (subscriber, #49227) [Link] (2 responses)

Yet this is a problem that would be better solved in user space. There are plenty of data structures for storing sparse matrices.

User-space page fault handling

Posted May 16, 2013 17:34 UTC (Thu) by ejr (subscriber, #51652) [Link]

Sparse matrices themselves, yes. For sparse direct factorization, however, having a region of memory with a large hole in the middle can save 5x-10x in time over the "optimal" representation that requires careful repacking. But these examples are relying on over-committing memory and not the difference between anonymous disk residency on swap v. named disk residency with page-out.

IMHO, the two cases differ in traditional policy more than anything. Swap is written only when absolutely necessary (in theory). Disk images often are written far more frequently to keep the disk version sanely representative of some consistent state (in theory). Non-volatile memory, if it really becomes the norm, could remove the difference.

User-space page fault handling

Posted May 17, 2013 13:49 UTC (Fri) by Tuna-Fish (guest, #61751) [Link]

This is true. However, the software already exists, and Linux just can't break large classes of software like that.

There is a certain elegance to file-backing everything, but it frankly can't be done because of backwards compatibility.

User-space page fault handling

Posted May 16, 2013 23:06 UTC (Thu) by jmorris42 (guest, #2203) [Link] (1 responses)

Which is why the files would be sparse. If actual physical RAM wouldn't be allocated now, actual disk space wouldn't be committed either.

not distinguishing anonymous and file-backed memory

Posted May 19, 2013 17:06 UTC (Sun) by giraffedata (guest, #1954) [Link]

> Which is why the files would be sparse. If actual physical RAM wouldn't be allocated now, actual disk space wouldn't be committed either.

And that's why I don't get the claim "no more out of memory errors."

User-space page fault handling

Posted May 16, 2013 16:32 UTC (Thu) by alkbyby (subscriber, #61687) [Link]

It appears to be a nice way to fix the performance issues of mremap()/mprotect(). But it looks like it is unable to handle the case where only writes need to be trapped, which the underlying hardware does support, and which is very handy, e.g., for a generational GC.

User-space page fault handling

Posted May 18, 2013 13:24 UTC (Sat) by civodul (guest, #58311) [Link]

This reminds me of what Mach and co. have supported since their inception decades ago...

User-space page fault handling

Posted May 20, 2013 1:03 UTC (Mon) by nteon (subscriber, #53899) [Link] (1 responses)

MADV_USERFAULT sounds like it could be useful for compacting GCs too

User-space page fault handling

Posted May 20, 2013 2:04 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

One page is rather coarse granularity for that. There have been attempts to build a high-throughput concurrent compacting GC with the help of memory protection; they work, somewhat, but it's not really worth it. Most collectors simply use memory protection to stop mutators from accessing a heap that is being compacted, and that doesn't require user-level page fault handling.


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds