Tweaks to madvise() and posix_fadvise()

[Posted February 14, 2006 by corbet]

A couple of Linux-specific additions to the memory-related system call API have recently found their way into the -mm tree. There is a bit of pressure to get them into 2.6.16, though that may be unlikely at this late date. This may be a good time to look at the proposed changes, however, along with the pressures which motivated them.

Prepare yourself, as your editor is about to inflict his primitive drawing skills upon the world again. Consider a situation which, with some imagination, could be described by the diagram to the right. A process has a particular memory page of interest, pointed to by a page table entry. That process has arranged with a device driver to exchange data through this page; as a result, the driver has a pointer to the associated page structure, possibly obtained with get_user_pages(). At this stage, all is working well. [Diagram]

But then the process decides to reproduce. The resulting call to fork() has a number of consequences beyond the creation of a child process. That call will attempt to avoid copying the parent process's memory since, for much of the memory range, there is unlikely to ever be a reason to do so. Instead, both parent and child will be set up with page table entries pointing to the same physical page in memory, but that page will now be write protected. As long as neither process attempts to write to the page, the situation can remain as shown in the diagram to the left. Both processes - and the driver - can share the same physical page. If either process calls fork() again, the result will be a third process also sharing that page, and so on. Often, no process will attempt to write to the page for as long as it is in this shared state, and no copy will ever have to be performed. [Diagram]

Life is not always so easy, however. If the parent process makes a change to the page - writing some new data to be passed through to the driver, for example - the hardware will trap the write attempt. The kernel will respond by allocating a new page, copying the old page's contents there, and pointing the parent process's page table entry to the new, write-enabled page. At that point, the write attempt can go forward, and everybody will be happy. [Diagram]
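The copy-on-write behavior described so far can be observed from user space. What follows is a minimal sketch, assuming a Linux system with anonymous private mappings; the helper name child_keeps_old_page() is hypothetical and no driver is involved. The parent writes to a page after fork(), and the child then reports whether it still sees the old contents.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a child still sees the pre-fork
 * contents after the parent writes to the page, 0 if it sees the
 * parent's write, and -1 on error.
 */
int child_keeps_old_page(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);                 /* the page size is always known on Linux */
    size_t len = (size_t)ps;
    int go[2];                      /* parent -> child: "the write is done" */

    char *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    strcpy(page, "old");

    if (pipe(go) < 0) {
        munmap(page, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(go[0]);
        close(go[1]);
        munmap(page, len);
        return -1;
    }
    if (pid == 0) {
        char c;
        close(go[1]);
        /* Block until the parent has modified its view of the page. */
        if (read(go[0], &c, 1) != 1)
            _exit(2);
        _exit(strcmp(page, "old") == 0 ? 0 : 1);
    }

    close(go[0]);
    strcpy(page, "new");            /* write to a shared, write-protected page */
    ssize_t sent = write(go[1], "x", 1);
    close(go[1]);

    int status;
    pid_t done = waitpid(pid, &status, 0);
    munmap(page, len);
    if (sent != 1 || done != pid || !WIFEXITED(status))
        return -1;

    int code = WEXITSTATUS(status);
    return code == 0 ? 1 : (code == 1 ? 0 : -1);
}
```

A return value of 1 means the parent's write landed on a private copy while the child kept the original contents.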

Or maybe not. The copy-on-write operation described above will break the parent process's connection with the old page. But there is no way to inform the driver of that change. The result will be the situation shown on the right: the driver retains a reference to the page which now belongs exclusively to the child process(es). The parent process and the driver will no longer be able to communicate with each other. Additionally, if the parent had used mlock() to lock the original page into memory, that lock, too, will remain with the original page. The page which the parent had thought was pinned into RAM will become pageable, with potentially bad effects on performance and security.

One could try to address this problem by changing the copy-on-write logic to always maintain the connection between the parent process and its original pages. That would require the COW code to find any other processes with references to the page and assign the copied page to them. Such a change would slow the code and invite interesting race conditions; remember that there could be a large number of other processes with references to the page. So the solution proposed by Michael Tsirkin takes a different approach.

If a process has pages which it has locked into memory or set up to be shared with a device driver, chances are that it never wants its children to have access to that memory in the first place. So Michael's patch adds a couple of new flags to the madvise() system call. A process with special memory can call madvise() with the new MADV_DONTFORK flag; the kernel will respond by setting the VM_DONTCOPY flag in the associated virtual memory area structure; thereafter, any newly-created child process simply will not see that part of the address space. There is also a MADV_DOFORK flag which cancels the effect of MADV_DONTFORK.
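As a rough illustration of the new flag, here is a sketch that assumes a kernel and C library which define MADV_DONTFORK; the helper name child_lacks_region() is hypothetical. It marks one anonymous page, forks, and has the child touch the page. If the region really is absent from the child's address space, the child should die with SIGSEGV.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a child forked after MADV_DONTFORK
 * faults when touching the region, 0 if the child can read it, and -1
 * on error.
 */
int child_lacks_region(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);                 /* the page size is always known on Linux */
    size_t len = (size_t)ps;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 'x';

    /* Ask the kernel not to give this range to future children. */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, len);
        return -1;
    }
    if (pid == 0) {
        /* The mapping should not exist here, so this read should fault. */
        char c = *(volatile char *)buf;
        (void)c;
        _exit(0);
    }

    int status;
    pid_t done = waitpid(pid, &status, 0);
    munmap(buf, len);
    if (done != pid)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
        return 1;
    return WIFEXITED(status) ? 0 : -1;
}
```

Calling madvise() on the same range with MADV_DOFORK before the fork() would be expected to restore the usual behavior, so that the child sees the page again.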

Meanwhile, another change found in current -mm came as a result of a complaint about the behavior of the msync() system call, which is used to flush modified parts of a memory-mapped file back to disk. In particular, the complainer, whose real name is unclear, just noticed that msync() changed its semantics between 2.4 and 2.6. In the 2.4 kernel, a call to msync(..., MS_ASYNC) would mark the indicated memory range as being dirty and begin the process of writing those pages to disk. In 2.6, by contrast, no I/O is started directly from msync(); the pages simply remain dirty in the page cache until the virtual memory subsystem gets around to flushing them out.
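For reference, the call in question looks like this in practice. This is a minimal sketch with a hypothetical helper, update_mapped_file(), and a caller-supplied path. The code is the same on either kernel series; only what happens behind the msync() call differs, and the program cannot observe that difference directly.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or truncate) the file at path, write
 * text into it through a shared mapping, and request an asynchronous
 * flush. Returns 0 on success and -1 on error.
 */
int update_mapped_file(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);
    size_t len = strlen(text);
    if (len == 0)
        return -1;                  /* mmap() rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, text, len);         /* dirties the mapped page(s) */

    /* MS_ASYNC returns without waiting for the data to reach disk. */
    int rc = msync(map, len, MS_ASYNC);

    munmap(map, len);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

An application that must know the data is on disk before continuing would use MS_SYNC instead, at the cost of blocking until the write completes.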

The original complainer asked that the old behavior be restored, but that seems unlikely to happen. For most workloads, the best performance is achieved by letting the kernel decide just when to write each part of the file back to disk. But there was also some recognition that an option to start I/O immediately (without necessarily waiting for it) would be a useful thing in some situations. The answer, as implemented by Andrew Morton, leaves the msync() call alone, however; instead, Andrew has added a couple of new options to the posix_fadvise() system call:

LINUX_FADV_ASYNC_WRITE will start write I/O on the given range of pages. If some of those pages are already under I/O, the operation will not be restarted, leaving open the possibility that late changes might not make it to disk.
LINUX_FADV_WRITE_WAIT will wait for any I/O currently in progress on the given range of pages, but does not actually start any I/O.

In practice, these calls will often need to be made in combination. An application which needs to assure itself that all modified pages are on disk must perform three calls in order: a wait call (thus ensuring that all pages under I/O are written), then a write call (to start I/O on remaining dirty pages), and finally a second wait call (to allow that I/O to complete). But any application wanting the 2.4 msync() behavior can get it with a single LINUX_FADV_ASYNC_WRITE call.
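The two flags above come from the -mm tree and should not be assumed to exist in installed headers. The sketch below therefore swaps in sync_file_range(), a separate Linux-specific system call (available since 2.6.17) whose flags map onto the same wait, write, wait steps. The helper names start_writeback() and flush_range() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: start write I/O on dirty pages in the range and
 * return without waiting, roughly the single "write" step.
 */
int start_writeback(int fd, off_t off, off_t len)
{
    assert(fd >= 0);
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/*
 * Hypothetical helper: the full sequence. Wait for I/O already in
 * flight, start I/O on whatever is still dirty, then wait for that
 * I/O to complete.
 */
int flush_range(int fd, off_t off, off_t len)
{
    assert(fd >= 0);
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

Note that sync_file_range() deals only with data pages in the given range and does not write file metadata, so it is not a drop-in replacement for fsync() where durability guarantees matter.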

Chances are good that both of these changes could land in the mainline in the 2.6.17 time frame.

Tweaks to madvise() and posix_fadvise()

Posted Feb 16, 2006 12:46 UTC (Thu) by mst@mellanox.co.il (guest, #27097)

Good article. Two minor nits:
1. The article says:
"Additionally, if the parent had used mlock() to lock the original page into memory, that lock, too, will remain with the original page. The page which the parent had thought was pinned into RAM will become pageable, with potentially bad effects on performance and security."

AFAIK this is not 100% true: I think the page stays locked for parent, too.
However
- Parent will still get a fault on write access.
- Child has a copy of the page, along with any secret information
parent kept there.

2. There's another possible use for MADV_DONTFORK: to speed up fork
by not copying the irrelevant vmas, ptes etc.
This might become more important if plans to add support for early-copy
on fork materialize.

Tweaks to madvise() and posix_fadvise()

Posted Feb 17, 2006 3:36 UTC (Fri) by jgsack@san.rr.com (guest, #33287)

I think I'm missing a point, somewhere in the following para:

"""
The copy-on-write operation described above will break the parent process's connection with the old page. But there is no way to inform the driver of that change. The result will be the situation shown on the right: the driver retains a reference to the page which now belongs exclusively to the child process(es). The parent process and the driver will no longer be able to communicate with each other.
"""

Forgive my naivety, if there is something unstated that I _should have known_, but.. it seems to me that the parent/child haven't lost anything. They never _were_ able to actually communicate, were they? -- since neither could write something new and have it visible to the other.

..jim

Tweaks to madvise() and posix_fadvise()

Posted Feb 17, 2006 3:53 UTC (Fri) by roelofs (guest, #2599)

"I think I'm missing a point, somewhere in the following para:"

I think the only thing you missed is that driver != child. The article speaks of parent-driver communication; you conflated that with parent-child communication (or so it appears).

Apologies if I've misunderstood your misunderstanding. ;-)

Greg

Tweaks to madvise() and posix_fadvise()

Posted Feb 17, 2006 4:14 UTC (Fri) by jgsack@san.rr.com (guest, #33287)

Ohhh .. then, nevermind <grin>.
Thx,
..jim

Tweaks to madvise() and posix_fadvise()

Posted Apr 6, 2006 3:59 UTC (Thu) by sazzala (guest, #31506)

Lots of drivers do DMA to pinned-down pages. This COW problem should be pervasive, and could have resulted in many corruption-related bugs. Since the driver lost the connection to the parent page, the driver is now copying data into the child's page. Loss of data to the parent can be looked at as a corruption bug.

I ran into this problem with the 2.4 kernel. I would think that this problem is quite widespread. However, since this bug was not addressed for such a long time, I will have to assume that this is not such a common problem.