Tweaks to madvise() and posix_fadvise()
[Posted February 14, 2006 by corbet]
A couple of Linux-specific additions to the memory-related system call API
have recently found their way into the -mm tree. There is a bit of
pressure to get them into 2.6.16, though that may be unlikely at this late
date. This may be a good time to look at the proposed changes, however,
along with the pressures which motivated them.
Prepare yourself, as your editor is about to inflict his primitive drawing
skills upon the world again. Consider a situation which, with some
imagination, could be described by the diagram to the right. A process has
a particular memory page of interest, pointed to by a page table entry.
That process has arranged with a device driver to exchange data through
this page; as a result, the driver has a pointer to the associated
page structure, possibly obtained with get_user_pages().
At this stage, all is working well.
But then the process decides to reproduce. The resulting call to fork() has
a number of consequences beyond the creation of a child process. That
call will attempt to avoid copying the parent process's
memory since, for much of the memory range, there is unlikely to ever be a
reason to do so. Instead, both parent and child will be set up with page table
entries pointing to the same physical page in memory, but that page will
now be write protected. As long as neither process attempts to write to
the page, the situation can remain as shown in the diagram to the left.
Both processes - and the driver - can share the same physical page. If
either process calls fork() again, the result will be a third
process also sharing that page, and so on. Often, no process will attempt
to write to the page for as long as it is in this shared state, and no copy
will ever have to be performed.
Life is not always so easy, however. If the parent process makes a change
to the page - writing some new data to be passed through to the driver, for
example - the hardware will trap the write attempt. The kernel will
respond by allocating a new page, copying the old page's contents there,
and pointing the parent process's page table entry to the new,
write-enabled page. At that point, the write attempt can go forward, and
everybody will be happy.
Or maybe not. The copy-on-write operation described above will break the
parent process's connection with the old page. But there is no way to
inform the driver of that change. The result will be the situation shown
on the right: the driver retains a reference to the page which now belongs
exclusively to the child process(es). The parent process and the driver
will no longer be able to communicate with each other. Additionally, if
the parent had used mlock() to lock the original page into memory,
that lock, too, will remain with the original page. The page which the
parent had thought was pinned into RAM will become pageable, with
potentially bad effects on performance and security.
One could try to address this problem by changing the copy-on-write logic
to always maintain the connection between the parent process and its
original pages. That would require the COW code to find any other
processes with references to the page, however, and assign the copied
page to them. That change would slow the code and invite interesting race
conditions, however; remember that there could be a large number of other
processes with references to the page. So the solution proposed by Michael Tsirkin takes a
different approach.
If a process has pages which it has locked into memory or set up to be
shared with a device driver, chances are that it never wants its children
to have access to that memory in the first place. So Michael's patch adds
a couple of new flags to the madvise() system call. A process
with special memory can call madvise() with the new
MADV_DONTFORK flag; the kernel will respond by setting the
VM_DONTCOPY flag in the associated virtual memory area structure;
thereafter, any newly-created child process simply will not see that part
of the address space. There is also a MADV_DOFORK flag which
cancels the effect of MADV_DONTFORK.
Meanwhile, another change found in current -mm came as a result of this complaint about the behavior of the
msync() system call, which is used to flush modified parts of a
memory-mapped file back to disk. In particular, the
complainer, whose real name is unclear, just noticed that msync()
changed its semantics between 2.4 and 2.6. In the 2.4 kernel, a call to
msync(..., MS_ASYNC) would mark the indicated memory range as
being dirty and begin the process of writing those pages to disk. In 2.6,
instead, no I/O is started directly from msync(); instead, the
pages will remain dirty in the page cache until the virtual memory
subsystem gets around to flushing them out.
The original complainer asked that the old behavior be restored, but that
seems unlikely to happen. For most workloads, the best performance is
achieved by letting the kernel decide just when to write each part of the
file back to disk. But there was also some recognition that an option to
start I/O immediately (without necessarily waiting for it) would be a
useful thing in some situations. The answer, as implemented by Andrew Morton, leaves the
msync() call alone, however; instead, Andrew has added a couple of
new options to the posix_fadvise() system call:
- LINUX_FADV_ASYNC_WRITE will start write I/O on the given
range of pages. If some of those pages are already under I/O, the
operation will not be restarted, leaving open the possibility that
late changes might not make it to disk.
- LINUX_FADV_WRITE_WAIT will wait for any I/O currently in
progress on the given range of pages, but does not actually start any
I/O.
In practice, these calls will often need to be made in combinations. An
application which needs to assure itself that all modified pages are on
disk must first perform a wait call (thus ensuring that all pages under I/O
are written), a write call (to start I/O on remaining dirty pages), and a
second wait call (to allow that I/O to complete). But any application
wanting the 2.4 msync() behavior can get it with a single
LINUX_FADV_ASYNC_WRITE call.
Chances are good that both of these changes could land in the mainline in
the 2.6.17 time frame.
(
Log in to post comments)