Memory management notifiers

By Jonathan Corbet
January 23, 2008

Virtualized guests running under Linux like to think that they are doing their own memory management. The truth of the matter, though, is that the host system cannot allow guests to directly modify the page tables used by the hardware; allowing that sort of access would compromise the security of the host. So, somehow, the host must be involved in the guest's memory management. One common technique is the use of shadow page tables. Guest systems maintain their own page tables, but those are not the tables used by the memory management unit. Instead, whenever the guest makes a change to its tables, the host system intercepts the operation, checks it for validity, then mirrors the change in the real page tables, which "shadow" those maintained by the guest.
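The intercept, check, and mirror sequence can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not how KVM actually implements shadow paging: the structure and function names (shadow_mmu, shadow_set_pte()) are hypothetical, the page tables are flat arrays, and "validity" is reduced to a bounds check on the frames the guest is assumed to own.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 16               /* size of the toy guest address space, in pages */

struct pte {
    bool present;
    uint32_t frame;
};

/* Hypothetical model: one guest-visible table and the shadow the MMU would use. */
struct shadow_mmu {
    struct pte guest[NPAGES];   /* table the guest believes it controls */
    struct pte shadow[NPAGES];  /* table the hardware would actually walk */
    uint32_t first_frame;       /* first host frame owned by this guest (assumption) */
    uint32_t nr_frames;         /* number of host frames owned by this guest */
};

/*
 * Stand-in for the host's trap handler: the guest has tried to map virtual
 * page 'vpn' to guest frame 'guest_frame'.  Returns 0 if the change was
 * accepted and mirrored, -1 if it was rejected.
 */
static int shadow_set_pte(struct shadow_mmu *mmu, size_t vpn, uint32_t guest_frame)
{
    /* Validity check: the guest may only name pages and frames it owns. */
    if (vpn >= NPAGES || guest_frame >= mmu->nr_frames)
        return -1;

    /* Record the change in the guest's own table... */
    mmu->guest[vpn].present = true;
    mmu->guest[vpn].frame = guest_frame;

    /* ...and mirror it, translated to a host frame, into the shadow table. */
    mmu->shadow[vpn].present = true;
    mmu->shadow[vpn].frame = mmu->first_frame + guest_frame;
    return 0;
}
```

A rejected update leaves both tables untouched, which is the property that keeps a misbehaving guest from reaching memory belonging to the host.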

One problem with this technique, as implemented in Linux currently, is that there is no easy way for the host to feed page table changes back to the guest. In particular, if the host system decides that it wants to push a given page out to swap, it can't tell the guest that the page is no longer resident. So virtualization mechanisms like KVM avoid the problem altogether by pinning pages in memory when they are mapped in shadow page tables. That solves the problem, but it makes it impossible to swap processes running KVM-based virtual machines out of main memory.

This seems like a good thing to fix. And a fix exists, in the form of the MMU notifiers patch posted by Andrea Arcangeli (from his shiny new Qumranet address). This patch allows an interested subsystem to be notified whenever specific memory management events take place. The process starts by setting up a set of callbacks:

    struct mmu_notifier_ops {
        void (*release)(struct mmu_notifier *mn,
                        struct mm_struct *mm);
        int (*age_page)(struct mmu_notifier *mn,
                        struct mm_struct *mm,
                        unsigned long address);
        void (*invalidate_page)(struct mmu_notifier *mn,
                                struct mm_struct *mm,
                                unsigned long address);
        void (*invalidate_range)(struct mmu_notifier *mn,
                                 struct mm_struct *mm,
                                 unsigned long start, unsigned long end);
    };

These callbacks are bundled into an mmu_notifier structure:

    struct mmu_notifier {
        struct hlist_node hlist;
        const struct mmu_notifier_ops *ops;
    };

The interested code then registers its notifier with:

    void mmu_notifier_register(struct mmu_notifier *mn, 
                               struct mm_struct *mm);

Here, mm is the mm_struct structure associated with a given address space. It is not expected that anybody will be interested in all memory management events, so notifiers are associated with specific address spaces. Once the notifier is in place, the callbacks will be invoked when interesting things happen:
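The per-address-space association can be modeled in user space as follows. This is a sketch only: the toy_ names are hypothetical stand-ins rather than the kernel's API, a plain singly-linked list replaces the hlist, and only one callback is modeled. It is meant to show that a notifier registered against one address space hears nothing about events in another.

```c
#include <stddef.h>

struct toy_mm;
struct toy_notifier;

/* Reduced callback table: only invalidate_page is modeled here. */
struct toy_notifier_ops {
    void (*invalidate_page)(struct toy_notifier *mn,
                            struct toy_mm *mm,
                            unsigned long address);
};

struct toy_notifier {
    struct toy_notifier *next;          /* stands in for the kernel's hlist_node */
    const struct toy_notifier_ops *ops;
};

/* Stand-in for mm_struct: each address space carries its own notifier list. */
struct toy_mm {
    struct toy_notifier *notifiers;
};

static void toy_notifier_register(struct toy_notifier *mn, struct toy_mm *mm)
{
    mn->next = mm->notifiers;
    mm->notifiers = mn;
}

/* What the memory management core would do when it reclaims a page of 'mm'. */
static void toy_invalidate_page(struct toy_mm *mm, unsigned long address)
{
    for (struct toy_notifier *mn = mm->notifiers; mn; mn = mn->next)
        if (mn->ops->invalidate_page)
            mn->ops->invalidate_page(mn, mm, address);
}

/* Example subscriber that just records what it was told. */
struct counting_notifier {
    struct toy_notifier mn;             /* first member, so the cast below is valid */
    unsigned long last_address;
    int calls;
};

static void count_invalidate(struct toy_notifier *mn, struct toy_mm *mm,
                             unsigned long address)
{
    struct counting_notifier *c = (struct counting_notifier *)mn;

    (void)mm;
    c->last_address = address;
    c->calls++;
}

static const struct toy_notifier_ops counting_ops = {
    .invalidate_page = count_invalidate,
};
```

A subscriber would fill in a counting_notifier with .mn.ops pointing at counting_ops, pass its mn member to toy_notifier_register() along with the address space of interest, and from then on see each toy_invalidate_page() call made against that address space.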

release() is called when the relevant mm_struct is about to go away, so it will be the last callback made to that notifier.
age_page() indicates that the memory management subsystem wants to clear the "referenced" flag on the page associated with the given address. This callback should return the previous value of the referenced bit, or the closest approximation available on the host architecture.
invalidate_page() and invalidate_range() are both ways of telling the guest that the given address(es) are no longer valid - the page has been reclaimed. Upon return from this callback, the affected address range should not be referenced by the guest.
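What a subscriber might do inside these callbacks can be sketched against a toy secondary page table. Everything here is an assumption made for illustration: the sec_ names, the flat array of entries, the 4096-byte page size, and the treatment of the range as half-open, [start, end), which the description above does not specify.

```c
#include <stdbool.h>
#include <stddef.h>

#define SEC_PAGE_SHIFT 12       /* assumes 4096-byte pages */
#define SEC_NPAGES 8            /* size of the toy secondary page table */

struct sec_pte {
    bool present;
    bool accessed;              /* the "referenced" bit for this mapping */
};

struct sec_mmu {
    struct sec_pte pte[SEC_NPAGES];
};

static struct sec_pte *sec_lookup(struct sec_mmu *s, unsigned long address)
{
    unsigned long idx = address >> SEC_PAGE_SHIFT;

    return idx < SEC_NPAGES ? &s->pte[idx] : NULL;
}

/* age_page(): clear the referenced bit and report its previous value. */
static int sec_age_page(struct sec_mmu *s, unsigned long address)
{
    struct sec_pte *e = sec_lookup(s, address);
    int young;

    if (!e || !e->present)
        return 0;
    young = e->accessed;
    e->accessed = false;
    return young;
}

/* invalidate_page(): drop the mapping so it cannot be used after return. */
static void sec_invalidate_page(struct sec_mmu *s, unsigned long address)
{
    struct sec_pte *e = sec_lookup(s, address);

    if (e) {
        e->present = false;
        e->accessed = false;
    }
}

/* invalidate_range(): drop every mapping in [start, end) (assumed half-open). */
static void sec_invalidate_range(struct sec_mmu *s,
                                 unsigned long start, unsigned long end)
{
    unsigned long first = start >> SEC_PAGE_SHIFT;
    unsigned long last = end >> SEC_PAGE_SHIFT;

    if (end & ((1UL << SEC_PAGE_SHIFT) - 1))
        last++;                 /* include a partially covered final page */
    if (last > SEC_NPAGES)
        last = SEC_NPAGES;      /* clamp to the size of the toy table */
    for (unsigned long idx = first; idx < last; idx++)
        sec_invalidate_page(s, idx << SEC_PAGE_SHIFT);
}
```

A real subscriber would also need to flush any cached translations and serialize against concurrent faults before returning; the model leaves both out.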

For the curious, the KVM patches (showing how these notifiers are used there) have also been posted.

While this patch set is aimed at KVM, there has been some interest from other directions as well - virtual machines are not the only places where separate (but related) page tables are maintained. Graphics processing units on contemporary video cards are an example - they have their own memory management units and have some interesting management issues of their own. Remote DMA (RDMA) engines are another possible user. So these patches have attracted comments from a few potential users, and have changed significantly since their first posting. The discussion is still ongoing, so further changes may come about before the notifiers find their way into the mainline.

Posted Jan 24, 2008 18:39 UTC (Thu) by bronson (subscriber, #4806)

This would only benefit guest kernels that have been modified to take advantage of it, right?

Posted Jan 25, 2008 20:11 UTC (Fri) by giraffedata (guest, #1954)

"This would only benefit guest kernels that have been modified to take advantage of it, right?"

I think that's obvious in, "The interested code then registers its notifier with:"

But what the article doesn't say is why a guest kernel would be interested. It says that because the guest kernel can't know when the host has invalidated a page, the host must never invalidate a page (i.e. keep the memory pinned). I guess I don't know how KVM works, but I've worked with virtual machines that don't have this issue.

That swapped out page should still be virtually resident. The guest's page table says so, and, consistent with that, when the guest does a load from its virtual address, the instruction completes without the guest seeing any page fault (because the host takes a page fault, reads the data in, and updates the real page table).

Posted Jan 30, 2008 0:49 UTC (Wed) by roelofs (guest, #2599)

"But what the article doesn't say is why a guest kernel would be interested."

Seems like primarily a performance issue to me. If the guest kernel doesn't know when its "RAM" is really swap, it's not going to be able to manage its memory as effectively as it might like. For example, it might be able to predict memory-usage patterns where the host kernel can't. Wasn't there a recent article(s) about a patch to do speculative read-in of swapped-out memory, specifically for the use-case where some automated overnight process pushes out OpenOffice/Firefox/etc., causing the user significant delays upon his/her return in the morning? (Perhaps even one of Con Kolivas' patches?)

Greg