Tweaks to madvise() and posix_fadvise()
Prepare yourself, as your editor is about to inflict his primitive drawing
skills upon the world again. Consider a situation which, with some
imagination, could be described by the diagram to the right. A process has
a particular memory page of interest, pointed to by a page table entry.
That process has arranged with a device driver to exchange data through
this page; as a result, the driver has a pointer to the associated
page structure, possibly obtained with get_user_pages().
At this stage, all is working well.
But then the process decides to reproduce. The resulting call to fork() has
a number of consequences beyond the creation of a child process. That
call will attempt to avoid copying the parent process's
memory since, for much of the memory range, there is unlikely to ever be a
reason to do so. Instead, both parent and child will be set up with page table
entries pointing to the same physical page in memory, but that page will
now be write protected. As long as neither process attempts to write to
the page, the situation can remain as shown in the diagram to the left.
Both processes - and the driver - can share the same physical page. If
either process calls fork() again, the result will be a third
process also sharing that page, and so on. Often, no process will attempt
to write to the page for as long as it is in this shared state, and no copy
will ever have to be performed.
Life is not always so easy, however. If the parent process makes a change
to the page - writing some new data to be passed through to the driver, for
example - the hardware will trap the write attempt. The kernel will
respond by allocating a new page, copying the old page's contents there,
and pointing the parent process's page table entry to the new,
write-enabled page. At that point, the write attempt can go forward, and
everybody will be happy.
Or maybe not. The copy-on-write operation described above will break the parent process's connection with the old page. But there is no way to inform the driver of that change. The result will be the situation shown on the right: the driver retains a reference to the page which now belongs exclusively to the child process(es). The parent process and the driver will no longer be able to communicate with each other. Additionally, if the parent had used mlock() to lock the original page into memory, that lock, too, will remain with the original page. The page which the parent had thought was pinned into RAM will become pageable, with potentially bad effects on performance and security.
One could try to address this problem by changing the copy-on-write logic to always maintain the connection between the parent process and its original pages. That would require the COW code to find any other processes with references to the page, however, and assign the copied page to them. That change would slow the code and invite interesting race conditions, however; remember that there could be a large number of other processes with references to the page. So the solution proposed by Michael Tsirkin takes a different approach.
If a process has pages which it has locked into memory or set up to be shared with a device driver, chances are that it never wants its children to have access to that memory in the first place. So Michael's patch adds a couple of new flags to the madvise() system call. A process with special memory can call madvise() with the new MADV_DONTFORK flag; the kernel will respond by setting the VM_DONTCOPY flag in the associated virtual memory area structure; thereafter, any newly-created child process simply will not see that part of the address space. There is also a MADV_DOFORK flag which cancels the effect of MADV_DONTFORK.
Meanwhile, another change found in current -mm came as a result of this complaint about the behavior of the msync() system call, which is used to flush modified parts of a memory-mapped file back to disk. In particular, the complainer, whose real name is unclear, just noticed that msync() changed its semantics between 2.4 and 2.6. In the 2.4 kernel, a call to msync(..., MS_ASYNC) would mark the indicated memory range as being dirty and begin the process of writing those pages to disk. In 2.6, instead, no I/O is started directly from msync(); instead, the pages will remain dirty in the page cache until the virtual memory subsystem gets around to flushing them out.
The original complainer asked that the old behavior be restored, but that seems unlikely to happen. For most workloads, the best performance is achieved by letting the kernel decide just when to write each part of the file back to disk. But there was also some recognition that an option to start I/O immediately (without necessarily waiting for it) would be a useful thing in some situations. The answer, as implemented by Andrew Morton, leaves the msync() call alone, however; instead, Andrew has added a couple of new options to the posix_fadvise() system call:
- LINUX_FADV_ASYNC_WRITE will start write I/O on the given
range of pages. If some of those pages are already under I/O, the
operation will not be restarted, leaving open the possibility that
late changes might not make it to disk.
- LINUX_FADV_WRITE_WAIT will wait for any I/O currently in progress on the given range of pages, but does not actually start any I/O.
In practice, these calls will often need to be made in combinations. An application which needs to assure itself that all modified pages are on disk must first perform a wait call (thus ensuring that all pages under I/O are written), a write call (to start I/O on remaining dirty pages), and a second wait call (to allow that I/O to complete). But any application wanting the 2.4 msync() behavior can get it with a single LINUX_FADV_ASYNC_WRITE call.
Chances are good that both of these changes could land in the mainline in
the 2.6.17 time frame.
Index entries for this article | |
---|---|
Kernel | Memory management |
Kernel | posix_fadvise() |
Posted Feb 16, 2006 12:46 UTC (Thu)
by mst@mellanox.co.il (guest, #27097)
[Link]
AFAIK this is not 100% true: I think the page stays locked for parent, too.
2. There's another possible use for MADV_DONTFORK: to speed up fork
Posted Feb 17, 2006 3:36 UTC (Fri)
by jgsack@san.rr.com (guest, #33287)
[Link] (2 responses)
"""
Forgive my naivety, if there is something unstated that I _should have known_, but.. it seems to me that the parent/child haven't lost anything. They never _were_ able to actually communicate, were they? -- since neither could write something new and have it visible to the other.
..jim
Posted Feb 17, 2006 3:53 UTC (Fri)
by roelofs (guest, #2599)
[Link] (1 responses)
I think the only thing you missed is that driver != child. The article speaks of parent-driver communication; you conflated that to parent-child communication (or so it appears).
Apologies if I've misunderstood your misunderstanding. ;-)
Greg
Posted Feb 17, 2006 4:14 UTC (Fri)
by jgsack@san.rr.com (guest, #33287)
[Link]
Posted Apr 6, 2006 3:59 UTC (Thu)
by sazzala (guest, #31506)
[Link]
I ran into this problem with the 2.4 kernel. I would think that this problem is quite widespread. However, since this bug was not addressed for such a long time, I will have to assume that this is not such a common problem.
Good article. Two minor nits:Tweaks to madvise() and posix_fadvise()
1. The article says:
"Additionally, if the parent had used mlock() to lock the original page into memory, that lock, too, will remain with the original page. The page which the parent had thought was pinned into RAM will become pageable, with potentially bad effects on performance and security."
However
- Parent will still get a fault on write access.
- Child has a copy of the page, along with any secret information
parent kept there.
by not copying the irrelevant vmas, ptes etc.
This might become more important if plans to add support for early-copy
on fork materialize.
I think I'm missing a point, somewhere in the following para:Tweaks to madvise() and posix_fadvise()
The copy-on-write operation described above will break the parent process's connection with the old page. But there is no way to inform the driver of that change. The result will be the situation shown on the right: the driver retains a reference to the page which now belongs exclusively to the child process(es). The parent process and the driver will no longer be able to communicate with each other.
"""
I think I'm missing a point, somewhere in the following para:
Tweaks to madvise() and posix_fadvise()
Ohhh .. then, nevermind <grin>.Tweaks to madvise() and posix_fadvise()
Thx,
..jim
Lots of drivers do dma to pinned down pages. This cow problem should be pervasive, and could have resulted in many corruption related bugs. Since the driver lost the connection to the parent page, the driver is now copying data into the child's page. Loss of data to the parent can be looked at as a corruption bug.Tweaks to madvise() and posix_fadvise()