Can you give a reference for that? Thanks.
I'd also be interested to hear whether you believe that I should also opendir() and fsync() the directory, for portable code.
btrfs fscked up, too?
Posted Mar 16, 2009 14:29 UTC (Mon) by masoncl (subscriber, #47138)
The contents of those files are determined by what other operations you've done on them, and in the past we've always expected the fsync.
The main places that list data integrity in the standard are the fsync page and O_SYNC/O_DSYNC portions of the write page. They are the only ways to make sure things are on disk.
The close man page explicitly states the data isn't on disk yet when the close happens unless you run fsync.
I do understand why app writers want a rename that works differently from what we're providing today, and it is important for filesystems to be able to grow new APIs to work with today's applications.
Re: do you need an fsync on the directory? I honestly don't know what other operating systems require. The last time I looked through various mail servers, the directory fsync was under a linux-is-evil kind of #ifdef.
Posted Mar 16, 2009 16:58 UTC (Mon) by endecotp (guest, #36428)
But I don't care whether the data is on the disk.
I just care that any changes to the disk, as I observe them after a crash, happened in the order that I did them. As far as I'm aware POSIX says nothing about this, hence my request for a reference.
Posted Mar 17, 2009 0:23 UTC (Tue) by bojan (subscriber, #14302)
The best you can find is the rename manual page, which talks only about processes that are currently running on the system always seeing the file. That's it. No further guarantees are made.
Now, given that a directory is a file, you have to fsync that if you want to see what you wrote there on disk (i.e. rename). Similarly, you have to fsync the data of the file if you want to see it on disk. Combine the two with the fact that nothing is specified as to the order of commits in the absence of fsync and you get the unordered rename (when it comes to the picture on disk), which is still atomic when it comes to running processes. But, you'll never know if your process is seeing the buffers or what's on disk (for both the data and the directory) unless you actually fsync beforehand.
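To make the two fsyncs concrete, here is a minimal sketch in Python of the write, fsync, rename, fsync-the-directory sequence described above. It assumes a POSIX system where a directory can be opened read-only and passed to fsync (true on Linux, not guaranteed everywhere). The names `fsync_directory` and `replace_durably` are made up for illustration.

```python
import os
import tempfile


def fsync_directory(dirpath):
    """Flush a directory's entries (e.g. a completed rename) to disk."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def replace_durably(path, data):
    """Replace `path` with `data`, syncing both the file and its directory."""
    dirpath = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that rename()
    # never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=dirpath)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # Python's buffer -> kernel
            os.fsync(f.fileno())   # kernel -> disk (file data)
        os.rename(tmp_path, path)  # atomic as seen by running processes
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    fsync_directory(dirpath)       # kernel -> disk (the new directory entry)
```

Without the first fsync, the on-disk ordering of data and rename is unspecified, which is the case discussed in this thread. Note that `mkstemp` creates the file with mode 0600, so a real program may also want to copy the old file's permissions.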
Posted Mar 16, 2009 17:10 UTC (Mon) by forthy (guest, #1525)
do you need an fsync on the directory?
If you have synchronous metadata updates, you don't "need" to fsync
the directory - it is updated synchronously, anyways (same with sync
mounted devices: no fsync needed, either ;-). I can understand the "linux
is evil" #ifdef, when you consider how ext2 works: Gather all data and
metadata updates for some seconds, and then flush them out in random
order. If you don't sync anything, you have a good chance that the
atomicity is maintained (unless, of course, the crash happens during the
short write period). If you sync data and directory, you have a very good
chance that durability is maintained (unless the whole ext2 file system
exploded, and now half of the files are in lost+found, and the others are
BTW mail server: A mail server needs to fsync, because durability is
required. If you receive a mail, you write it to the inbox (or indir in
case of an mdir storage system), fsync, and then reply to the smtp client
that the message has been accepted. The smtp client now considers the
message as passed, and can remove it from its spool - if it doesn't get
an ok, it has to retry later.
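As a rough sketch of that ordering (store, fsync, only then acknowledge), in Python. The function names and the one-file-per-message layout are assumptions for illustration, not taken from any real mail server; the reply strings are the standard SMTP 250 and 451 codes.

```python
import os


def store_message(inbox_dir, name, body):
    """Write one message and return only after fsync() has succeeded."""
    path = os.path.join(inbox_dir, name)
    # O_EXCL: refuse to overwrite an existing message with the same name.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(body)
        while view:                       # os.write may be partial
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # durability point for the data
    finally:
        os.close(fd)
    return path


def handle_delivery(inbox_dir, name, body):
    """Return the reply line to send to the SMTP client."""
    try:
        store_message(inbox_dir, name, body)
    except OSError:
        # Nothing was promised, so the client keeps the message and retries.
        return "451 Requested action aborted: local error in processing"
    return "250 OK"
```

The point is only the ordering: the "250 OK" is produced strictly after fsync returns, so the client never drops a message from its spool that the server has not flushed. This sketch syncs the file only, not the directory.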
The question now is: Should you sync the directory? In an mdir case
(that's where the directory matters, mboxes keep their name), you create
a new entry in inbox/new. fsync only writes out the data, thus it
reorders create-write-close on disk into write-close-...-create. Only the
inode-related metadata is flushed with fsync. From the man-page it looks
like POSIX cares only about durability in fsync, not about atomicity.
Therefore, fsync is allowed to reorder operations (fsynced files end up
earlier on the disk). To maintain atomicity in the operations on the mbox,
first fsync the directory, then the file. If you are a mail reader, and
e.g. take files out of new/ and move them to cur/, first fsync cur, then
new. Otherwise, your mails may be orphaned (duplicates may be annoying,
Would all be a lot easier if the filesystem had a transaction monitor
behind it. You would say "new transaction, create new/msgid, write data,
close, make durable, close transaction" for the delivery and "new
transaction, rename new/msgid -> cur/msgid, add line to index file, make
durable, close transaction" for the IMAP client. If the transaction
succeeds, tell the client ok, if it fails or is incomplete after a crash,
it will be rolled back. Note that a transaction monitor only needs to
maintain ordering within a transaction, and can reorder transactions as
it sees fit (and it can even abort transactions).
Posted Mar 16, 2009 18:07 UTC (Mon) by masoncl (subscriber, #47138)
In ext3, ext4, reiserfs, xfs, and btrfs (and probably jfs), you only need to fsync the file. The journals include the directory data because the directory mods happened along with the file creation, and it actually isn't possible to get one without the other.
The btrfs log is a little different, but it explicitly goes out and finds the directory changes to make sure they are logged with the file during the fsync.
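For code that has to run on filesystems where this is not known to hold, one option is to make the directory fsync a switch. A minimal Python sketch, assuming a POSIX system where a directory can be opened read-only and fsynced; `create_durably` and its `sync_directory` parameter are hypothetical names:

```python
import os


def create_durably(path, data, sync_directory=True):
    """Create a new file at `path` and flush it to disk.

    sync_directory=False relies on the filesystem logging the new
    directory entry together with the file's fsync, as described above.
    """
    # Mode "x" fails if the file already exists instead of overwriting it.
    with open(path, "xb") as f:
        f.write(data)
        f.flush()               # Python's buffer -> kernel
        os.fsync(f.fileno())    # kernel -> disk
    if sync_directory:
        dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dirfd)     # flush the directory entry as well
        finally:
            os.close(dirfd)
```

On the filesystems listed above the extra call should be redundant; the cost of leaving it on is one more fsync per created file.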
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds