By Jonathan Corbet
August 21, 2013
Back in 2007, the kernel developers
realized that the maintenance of the last-accessed
time for files ("atime") was a significant performance problem. Atime
updates turned every read operation into a write, slowing the I/O subsystem
significantly. The response was to add the "relatime" mount option that
reduced atime updates to the minimum frequency that did not risk breaking
applications. Since then, little thought has gone into the performance
issues associated with file timestamps.
Until now, that is. Unix-like systems actually manage three timestamps
for each file: along with atime, the system maintains the time of the last
modification of the file's contents ("mtime") and the last metadata change
("ctime"). At a first glance, maintaining these times would appear to be
less of a performance problem; updating mtime or ctime requires writing the
file's inode back to disk, but the operation that causes the time to be
updated will be causing a write to happen anyway. So, one would think, any
extra cost would be lost in the noise.
It turns out, though, that there is a situation where that is not the
case — uses where a file is written through a mapping created with
mmap(). Writable memory-mapped files are a bit of a challenge for
the operating system: the application can change any part of the file with
a simple memory reference without notifying the kernel. But the kernel
must learn about the write somehow so that it can eventually push the
modified data back to persistent storage. So, when a file is mapped for
write access and a page is brought into memory, the kernel will mark that
page (in the hardware) as being read-only. An attempt to write that page
will generate a fault, notifying the kernel that the page has been
changed. At that point, the page can be made writable so that further
writes will not generate any more faults; it can stay writable until the
kernel cleans the page by writing it back to disk. Once the page is clean,
it must be marked read-only once again.
The problem, as explained by Dave Chinner,
is this: as soon as the kernel receives the page fault and makes the page
writable, it must update the file's timestamps, and, for some filesystem
types, an associated
revision counter as well. That update is done synchronously in a filesystem
transaction as part of the process of handling the page fault and allowing
write access. So a quick operation to make a page writable turns into a
heavyweight filesystem operation, and it happens every time the application
attempts to write to a clean page. If the application writes large numbers
of pages that have been mapped into memory, the result will be a painful
slowdown. And most of that effort is wasted; the timestamp updates
overwrite each other, so only the last one will persist for any useful
period of time.
As it happens, Andy Lutomirski has an application that is affected badly by
this problem. One of his
previous attempts to address the associated performance problems —
MADV_WILLWRITE — was covered here
recently. Needless to say, he is not a big fan of the current behavior
associated with mtime and ctime updates. He also asserted that the current
behavior violates the Single Unix Specification, which states that those
times must be updated between any write to a page and either the next
msync() call or the writeback of the data in question. The
kernel, he said, does not currently implement the required behavior.
In particular,
he pointed out that the timestamp updates happen after the first
write to a given page. After that first reference, the page is
left writable and the kernel will be unaware of any subsequent modifications until
the page is written back. If the page remains in memory for a long time
(multiple seconds) before being written back — as is often the case — the
timestamp update will incorrectly reflect the time of the first write, not
the last one.
In an attempt to fix both the performance and correctness issues, Andy has
put together a patch set that changes the
way timestamp updates are handled. In the new scheme, timestamps are not
updated when a page is made writable; instead, a new flag
(AS_CMTIME) is set in the associated address_space
structure. So there is no longer a filesystem transaction that must be done when
the page is made writable. At some future time, the kernel will call the
new flush_cmtime() address space operation to tell the filesystem
that an inode's times should be updated; that call will happen in response
to a writeback operation or an msync() call. So, if thousands of
pages are dirtied before writeback happens, the timestamp updates will be
collapsed into a single transaction, speeding things considerably.
Additionally, the timestamp will reflect the time of the last update
instead of the first.
There have been some quibbles with this approach. One concern is that
there are tight requirements around the handling of timestamps and revision
numbers in filesystems that are exported via NFS. NFS clients use those
timestamps to learn when cached copies of file data have gone stale; if the
timestamp updates are deferred, there is a risk that a client could work
with stale data for some period of time. Andy claimed that, with the current scheme, the
timestamp could be wrong for a far longer period, so, he said, his patch
represents
an improvement, even if it's not perfect. David Lang suggested that perfection could be reached by
updating the timestamps in memory on the first fault but not flushing that
change to disk; Andy saw merit in the idea, but has not implemented it thus
far.
As of this writing, the responses to the patch set itself have mostly been
related
to implementation details. Andy will have a number of things to change in
the patch; it also needs filesystem implementations beyond just ext4 and a
test for the xfstests package to show that things work correctly. But the
core idea no longer seems to be controversial. Barring a change of opinion
within the community, faster write fault handling for file-backed pages
should be headed toward a mainline kernel sometime soon.
(
Log in to post comments)