
A last-minute MMU notifier change

By Jonathan Corbet
September 5, 2017
One does not normally expect to see significant changes to an important internal memory-management mechanism in the time between the ‑rc7 prepatch and the final release for a development cycle, but that is exactly what happened just before 4.13 was released. A regression involving the memory-management unit (MMU) notifier mechanism briefly threatened to delay this release, but a last-minute scramble kept 4.13 on schedule and also resulted in a cleanup of that mechanism. This seems like a good time to look at a mechanism that Linus Torvalds called "a badly designed mistake" and how it was made to be a bit less mistaken.

MMU Notifiers

A computer's memory-management unit handles the mapping between virtual and physical addresses, tracks the presence of physical pages in memory, handles memory-access permissions, and more. Much of the work of the memory-management subsystem is concerned with keeping the MMU properly configured in response to workload changes on the system. The details of MMU management are nicely hidden, so that the rest of the kernel does not (most of the time) have to worry about it, and neither does user space.

Things have changed over the last ten years or so in ways that have rendered the concept of "the MMU" rather more fuzzy. The initial driver of this change was virtualization; a mechanism like KVM must ensure that the host's and the guest's views of the MMU are consistent. That typically involves managing a set of shadow page tables within the guest. More recently, other devices have appeared on the memory bus with their own views of memory; graphics processing units (GPUs) have led this trend with technologies like GPGPU, but others exist as well. To function properly, these non-CPU MMUs must be updated when the memory-management subsystem makes changes, but the memory-management code is not able (and should not be able) to make changes directly within the subsystems that maintain those other MMUs.

To address this problem, Andrea Arcangeli added the MMU notifier mechanism during the 2.6.27 merge window in 2008. This mechanism allows any subsystem to hook into memory-management operations and receive a callback when changes are made to a process's page tables. One could envision a wide range of callbacks for swapping, protection changes, etc., but the actual approach was simpler. The main purpose of an MMU notifier callback is to tell the interested subsystem that something has changed with one or more pages; that subsystem should respond by simply invalidating its own mapping for those pages. The next time a fault occurs on one of the affected pages, the mapping will be re-established, reflecting the new state of affairs.
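The invalidate-then-refault pattern described above can be illustrated with a small user-space model. This is only a sketch and not kernel code: every name in it (`toy_mm`, `toy_secondary_mmu`, `toy_invalidate()`, and so on) is hypothetical, and the "fault" is simply a re-read of the primary page table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8   /* size of the toy address space, in pages */

/* Primary page table: page_frame[i] is the frame backing virtual page i. */
struct toy_mm {
    int page_frame[NPAGES];
};

/* Secondary MMU (a device, say) caching translations from the primary. */
struct toy_secondary_mmu {
    bool valid[NPAGES];
    int  frame[NPAGES];
    int  faults;       /* number of times a mapping had to be (re)built */
};

/* Notifier callback: simply forget the cached translation for one page. */
void toy_invalidate(struct toy_secondary_mmu *s, size_t page)
{
    assert(page < NPAGES);
    s->valid[page] = false;
}

/* Device access: on a miss, "fault" and re-read the primary page table. */
int toy_device_translate(struct toy_secondary_mmu *s,
                         const struct toy_mm *mm, size_t page)
{
    assert(page < NPAGES);
    if (!s->valid[page]) {
        s->faults++;
        s->frame[page] = mm->page_frame[page];
        s->valid[page] = true;
    }
    return s->frame[page];
}

/* Primary-side change: update the page table, then notify the secondary. */
void toy_remap(struct toy_mm *mm, struct toy_secondary_mmu *s,
               size_t page, int new_frame)
{
    assert(page < NPAGES);
    mm->page_frame[page] = new_frame;
    toy_invalidate(s, page);
}
```

The secondary side never needs to know why the page changed; it only drops its stale entry and picks up the new state on the next access.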

There are a few ways of signaling the need for invalidation, though, starting with the invalidate_page() callback:

    void (*invalidate_page)(struct mmu_notifier *mn, struct mm_struct *mm,
			    unsigned long address);

This callback can be invoked after the page-table entry for the page at address in the address space indicated by mm has been removed, but while the page itself still exists. That is not the only notification mechanism, though; larger operations can be signaled with:

    void (*invalidate_range_start)(struct mmu_notifier *mn, struct mm_struct *mm,
				   unsigned long start, unsigned long end);
    void (*invalidate_range_end)(struct mmu_notifier *mn, struct mm_struct *mm,
				 unsigned long start, unsigned long end);

In this case, invalidate_range_start() is called while all pages in the affected range are still mapped; no more mappings for pages in the region should be added in the secondary MMU after the call. When the unmapping is complete and the pages have been freed, invalidate_range_end() is called to allow any necessary cleanup to be done.
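One way a listener might honor the "no new mappings after invalidate_range_start()" rule is to keep a count of invalidations in progress plus a sequence number that its fault path checks before installing a mapping. The following user-space sketch shows that idea under stated assumptions: the structure and function names are hypothetical, locking is omitted entirely, and the whole secondary MMU is blocked rather than just the affected range.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-listener state; not a kernel structure. */
struct toy_secondary {
    int in_progress;     /* range_start calls not yet matched by range_end */
    unsigned long seq;   /* bumped every time an invalidation completes */
};

/* Called while the pages in [start, end) are still mapped. */
void toy_range_start(struct toy_secondary *s,
                     unsigned long start, unsigned long end)
{
    (void)start; (void)end;  /* a real listener would drop its mappings here */
    s->in_progress++;
}

/* Called once the unmapping is complete. */
void toy_range_end(struct toy_secondary *s,
                   unsigned long start, unsigned long end)
{
    (void)start; (void)end;
    assert(s->in_progress > 0);  /* every end must follow a start */
    s->seq++;
    s->in_progress--;
}

/*
 * Fault path: the caller samples s->seq before looking up the page, then
 * asks whether the result may still be installed. The answer is "no" if an
 * invalidation is running now or one completed since the sample was taken.
 */
bool toy_may_install(const struct toy_secondary *s, unsigned long seq_snapshot)
{
    return s->in_progress == 0 && s->seq == seq_snapshot;
}
```

A fault handler built on this would take a fresh snapshot and retry whenever `toy_may_install()` returns false.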

Finally, there is also:

    void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm,
			     unsigned long start, unsigned long end);

This callback is invoked when a range of pages is actually being unmapped. It can be called between calls to invalidate_range_start() and invalidate_range_end(), but it can also be called independently of them in some situations. One might wonder why both invalidate_page() and invalidate_range() exist and, indeed, that is where the trouble started.

The end of invalidate_page()

In late August, Adam Borowski reported that he was getting warnings from the 4.13-rc kernel when using KVM, followed by the quick demise of the host system. Others had been experiencing similar strangeness, including a related crash that seemed to be tied to the out-of-memory handler. After testing and bisection, this commit, fixing another bug, was identified as the culprit.

The problem came down to a difference between the invalidate_page() and invalidate_range() callbacks: the former is allowed to sleep, while the latter is not. The offending commit was trying to fix a problem where invalidate_page() was called with a spinlock held (a context where sleeping is not allowed) by calling invalidate_range() instead. But, as Arcangeli pointed out, that will not lead to joy, since not all users implement invalidate_range(); it is necessary to call invalidate_range_start() and invalidate_range_end() instead.

The real fix turned out to not be quite so simple, though. Among other things, the fact that invalidate_page() can sleep makes it fundamentally racy. It cannot be called while the page-table spinlock affecting the page to be invalidated is held, meaning that the page-table entry can change before or during the call. This sort of issue is why Torvalds complained about the MMU notifiers in general and stated that they simply should not be able to sleep at all. But, as Jérôme Glisse pointed out, some use cases absolutely require the ability to sleep:

There is no way around sleeping if we ever want to support thing like GPU. To invalidate page table on GPU you need to schedule commands to do so on GPU command queue and wait for the GPU to signal that it has invalidated its page table/tlb and caches.

We had this discussion before. Either we want to support all the new fancy GPGPU, AI and all the API they rely on or we should tell them sorry guys not on linux.

Torvalds later backed down a little, making a distinction between two cases. Anything dealing with virtual addresses and the mm_struct structure can sleep, while anything dealing with specific pages and page-table entries cannot. Thus, the invalidate_range_start() and invalidate_range_end() callbacks, which deal with ranges of addresses and are called without any spinlocks held, can sleep. But invalidate_range() and invalidate_page() cannot.

That, in turn, suggests that invalidate_page() is fundamentally wrong by design. After some discussion, Torvalds concluded that the best thing to do would be to remove it entirely. But, as the bug that started the discussion showed, replacing it with invalidate_range() calls is not a complete solution to the problem. To make things work again in all settings, including those that need to be able to sleep, the invalidate_range() calls must always be surrounded by calls to invalidate_range_start() and invalidate_range_end().
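The reason the bracketing matters can be seen in a small user-space model of callback dispatch. In this sketch (hypothetical names throughout, not the kernel's data structures) each callback is optional, so a listener that implements only the start/end pair never hears about a bare range invalidation.

```c
#include <assert.h>
#include <stddef.h>

struct toy_notifier;

/* Each callback is optional; a NULL pointer means "not implemented". */
struct toy_notifier_ops {
    void (*range_start)(struct toy_notifier *n,
                        unsigned long start, unsigned long end);
    void (*range)(struct toy_notifier *n,
                  unsigned long start, unsigned long end);
    void (*range_end)(struct toy_notifier *n,
                      unsigned long start, unsigned long end);
};

struct toy_notifier {
    const struct toy_notifier_ops *ops;
    int notified;   /* how many callbacks this listener has seen */
};

static void count_call(struct toy_notifier *n,
                       unsigned long start, unsigned long end)
{
    (void)start; (void)end;
    n->notified++;
}

/* A listener that may need to sleep: implements only the start/end pair. */
const struct toy_notifier_ops toy_start_end_only_ops = {
    .range_start = count_call,
    .range_end   = count_call,
};

/* A listener that implements only the non-sleeping range callback. */
const struct toy_notifier_ops toy_range_only_ops = {
    .range = count_call,
};

/* The broken pattern: a bare range call, which skips start/end-only users. */
void toy_invalidate_range_only(struct toy_notifier **list, size_t n,
                               unsigned long start, unsigned long end)
{
    assert(list != NULL);
    for (size_t i = 0; i < n; i++)
        if (list[i]->ops->range)
            list[i]->ops->range(list[i], start, end);
}

/* The fixed pattern: the range call is always bracketed by start and end. */
void toy_invalidate_bracketed(struct toy_notifier **list, size_t n,
                              unsigned long start, unsigned long end)
{
    assert(list != NULL);
    for (size_t i = 0; i < n; i++)
        if (list[i]->ops->range_start)
            list[i]->ops->range_start(list[i], start, end);
    toy_invalidate_range_only(list, n, start, end);
    for (size_t i = 0; i < n; i++)
        if (list[i]->ops->range_end)
            list[i]->ops->range_end(list[i], start, end);
}
```

With one listener of each kind registered, the bare call reaches only the second, while the bracketed call reaches both.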

Glisse quickly implemented that idea and, after a round of review, his patch set was fast-tracked into the 4.13 kernel three days before its release. So, as a last-minute surprise, the invalidate_page() MMU notifier is gone; out-of-tree modules that used it will not work with 4.13 until they are updated. It is rare to see a change of this nature merged so late in the development cycle, but the alternative was to release with real regressions and the confidence in the fix was high. With luck, this fix will prevent similar problems from occurring in the future.

There is still one problem related to MMU notifiers in the 4.13 kernel, though: it turns out that the out-of-memory reaper, which tries to recover memory more quickly from processes that have been killed in an out-of-memory situation, does not invoke the notifiers. That, in turn, can lead to corruption on systems where notifiers are in use and memory runs out. Michal Hocko has responded with a patch to disable the reaper on processes that have MMU notifiers registered. He took that approach because the notifier implementations are out of the memory-management subsystem's control, and he worried about what could happen in an out-of-memory situation, where the system is already in a difficult state. This patch has not been merged as of this writing, but something like it will likely get in soon and find its way into the stable trees.

Notifier callbacks have a bit of a bad name in the kernel community. Kernel developers like to know exactly what will happen in response to a given action, and notifiers tend to obscure that information. As can be seen in the original bug and the reaper case, notifiers may also not be called consistently throughout a subsystem. But they can be hard to do without, especially as the complexity of the system grows. Sometimes the best that can be done is to be sure that the semantics of the notifiers are clear from the outset, and to be willing to make fundamental changes when the need becomes clear — even if that happens right before a release.


Managed memory on GPU

Posted Sep 6, 2017 1:02 UTC (Wed) by jnareb (subscriber, #46500)

I wonder if this is, or would be a mechanism to implement the system side of the "managed memory" on GPU for GPGPU purposes, which (for devices that support it) make GPU memory support on-demand paging and on-demand copy from CPU memory. This is an extension of unified virtual memory addressing for GPU; the driver side of this "managed memory" appeared, if I remember it correctly, in CUDA 8.0.

Managed memory on GPU

Posted Sep 6, 2017 19:08 UTC (Wed) by mwsealey (subscriber, #71282)

Yep this is exactly one use case for it. It requires that the GPU have an MMU of its own, though, which most do these days - the idea being that when the kernel changes a mapping allocated to a GPU, the GPU driver is notified, makes the appropriate changes so it'll fault in a similar way. When the GPU runs over that memory region, it can then trap into the GPU driver, and the GPU driver can then cause the MM subsystem to pull the memory back in.

I have some code I wrote half a decade ago that does it with very early MMU notifiers and it somewhat works.. but it's not really covered by the IOMMU on the GPU I chose. ARM SMMU, for example, have a "stall fault mode" which will block a transaction from the GPU from getting to the interconnect while a translation is fixed up by the kernel.

Actually, what we're talking about here is "uncore MMUs" in a very generic way, really, rather than GPUs in particular.

A last-minute MMU notifier change

Posted Sep 6, 2017 7:38 UTC (Wed) by pbonzini (subscriber, #60935)

Due to maintainers on vacation, the patch is wrong for KVM. (The code from the invalidate page callback should have been moved to the range end callback). Oh well, we'll fix it for 4.13.1...

A last-minute MMU notifier change

Posted Sep 6, 2017 23:00 UTC (Wed) by glisse (guest, #44837)

Places that used to call invalidate_page are now surrounded by calls to mmu_notifier_invalidate_range_start/end(), so I don't think there is any need to create an invalidate_range callback for KVM.

A last-minute MMU notifier change

Posted Sep 7, 2017 6:50 UTC (Thu) by pbonzini (subscriber, #60935)

Yes, it is just the part that recognized the APIC access page's address in kvm_arch_mmu_invalidate_page and asked the vCPUs to update the physical address. That must be placed in one of the range-based callbacks. I need to review your patch more closely first though, I am not actually sure it's range_end.

A last-minute MMU notifier change

Posted Sep 6, 2017 17:48 UTC (Wed) by atelszewski (guest, #111673)

Hi,

In "Adam Borowsk", "Borowsk" should be "Borowski" (note the "i" at the end).
;-)

--
Best regards,
Andrzej Telszewski (with "i" at the end;-))

A last-minute MMU notifier change

Posted Sep 6, 2017 19:11 UTC (Wed) by jake (editor, #205)

> In "Adam Borowsk", "Borowsk" should be "Borowski" (note the "i" at the end).

Indeed it should ... and that got pointed out in review, but somehow it still slipped through ... fixed now, thanks.

jake


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds