
btrfs fscked up, too?


Posted Mar 16, 2009 13:34 UTC (Mon) by endecotp (guest, #36428)
In reply to: btrfs fscked up, too? by masoncl
Parent article: Garrett: ext4, application expectations and power management

> skipping the fsync is explicitly what the standard says won't work

Hi Chris,

Can you give a reference for that? Thanks.

I'd also be interested to hear whether you believe that I should also opendir() and fsync() the directory, for portable code.



btrfs fscked up, too?

Posted Mar 16, 2009 14:29 UTC (Mon) by masoncl (subscriber, #47138) [Link] (4 responses)

I think the real disconnect here is which operations people expect to be atomic. The rename is atomic because after the operation is complete the directory entry either points to the old file or the new file.

The contents of those files are determined by what other operations you've done on them, and in the past we've always expected the fsync.

The main places that list data integrity in the standard are the fsync page and O_SYNC/O_DSYNC portions of the write page. They are the only ways to make sure things are on disk.

The close man page explicitly states the data isn't on disk yet when the close happens unless you run fsync.

I do understand why app writers want a rename that works differently from what we're providing today, and it is important for filesystems to be able to grow new APIs to work with today's applications.

Re: do you need an fsync on the directory? I honestly don't know what other operating systems require. The last time I looked through various mail servers, the directory fsync was under a linux-is-evil kind of #ifdef.
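For readers following along, the "fsync, then rename" sequence discussed in this comment can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `replace_file` and `write_all` are hypothetical helper names, the caller supplies a temporary path on the same filesystem as the target, and the directory itself is not fsynced here.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all len bytes to fd, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: replace `path` with new contents.
 * The data is flushed with fsync() before the rename, so the name
 * never points at a file whose contents were not yet written out. */
int replace_file(const char *path, const char *tmp_path,
                 const char *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write_all(fd, data, len) < 0 || fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }
    /* rename() atomically switches the directory entry to the new file. */
    if (rename(tmp_path, path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

Whether the rename itself survives a crash depends on the filesystem and on whether the containing directory is also synced, which is the open question in the last paragraph above.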

btrfs fscked up, too?

Posted Mar 16, 2009 16:58 UTC (Mon) by endecotp (guest, #36428) [Link] (1 responses)

> The close man page explicitly states the data isn't on disk yet when
> the close happens unless you run fsync.

But I don't care whether the data is on the disk.

I just care that any changes to the disk, as I observe them after a crash, happened in the order that I did them. As far as I'm aware POSIX says nothing about this, hence my request for a reference.

btrfs fscked up, too?

Posted Mar 17, 2009 0:23 UTC (Tue) by bojan (subscriber, #14302) [Link]

How are people supposed to find a reference if POSIX doesn't say anything about the behaviour you desire? Exactly - it isn't there - it is unspecified. It can happen any which way.

The best you can find is the rename manual page, which talks about processes that are currently running on the system always seeing the file. That's it. No further guarantees are made.

Now, given that a directory is a file, you have to fsync that if you want to see what you wrote there on disk (i.e. rename). Similarly, you have to fsync the data of the file if you want to see it on disk. Combine the two with the fact that nothing is specified as to the order of commits in the absence of fsync and you get the unordered rename (when it comes to the picture on disk), which is still atomic when it comes to running processes. But, you'll never know if your process is seeing the buffers or what's on disk (for both the data and the directory) unless you actually fsync beforehand.
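To make the directory-fsync step concrete, here is a minimal sketch in C. `fsync_dir` is a hypothetical helper name, not something from POSIX or from this thread. It assumes the platform allows opening a directory read-only and calling fsync() on that descriptor, which holds on Linux but is not guaranteed on every system.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: flush a directory's entries (creates, renames,
 * unlinks) to disk. Returns 0 on success, -1 on failure with errno set. */
int fsync_dir(const char *dir_path)
{
    int fd = open(dir_path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    int saved_errno = errno;   /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;
    return rc;
}
```

A caller that wants a rename to be on disk would call this on the directory containing the renamed entry, after having fsynced the file's data.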

btrfs fscked up, too?

Posted Mar 16, 2009 17:10 UTC (Mon) by forthy (guest, #1525) [Link] (1 responses)

> do you need an fsync on the directory?

If you have synchronous metadata updates, you don't "need" to fsync the directory - it is updated synchronously anyway (same with sync mounted devices: no fsync needed, either ;-). I can understand the "linux is evil" #ifdef, when you consider how ext2 works: Gather all data and metadata updates for some seconds, and then flush them out in random order. If you don't sync anything, you have a good chance that the atomicity is maintained (unless, of course, the crash happens during the short write period). If you sync data and directory, you have a very good chance that durability is maintained (unless the whole ext2 file system exploded, and now half of the files are in lost+found, and the others are completely missing).

BTW mail server: A mail server needs to fsync, because durability is required. If you receive a mail, you write it to the inbox (or indir in case of an mdir storage system), fsync, and then reply to the SMTP client that the message has been accepted. The SMTP client now considers the message as passed, and can remove it from its spool - if it doesn't get an ok, it has to retry later.

The question now is: Should you sync the directory? In an mdir case (that's where the directory matters, mboxes keep their name), you create a new entry in inbox/new. fsync only writes out the data, thus it reorders create-write-close on disk into write-close-...-create. Only the inode-related metadata is flushed with fsync. From the man page it looks like POSIX cares only about durability in fsync, not about atomicity. Therefore, fsync is allowed to reorder operations (fsynced files end up earlier on the disk). To maintain atomicity in the operations on the mbox, first fsync the directory, then the file. If you are a mail reader, and e.g. take files out of new/ and move them to cur/, first fsync cur, then new. Otherwise, your mails may be orphaned (duplicates may be annoying, but harmless).
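The mail-reader case at the end of this paragraph (move a message from new/ to cur/, then sync cur before new) could look roughly like the sketch below. The names `mark_seen` and `sync_dir` and the fixed 4096-byte path buffers are assumptions made for illustration. The ordering follows the comment's suggestion, so that a crash between the two syncs leaves a duplicate rather than a lost message.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Flush one directory's entries to disk (hypothetical helper). */
static int sync_dir(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/* True if snprintf() succeeded without truncating. */
static int fits(int n, size_t cap)
{
    return n >= 0 && (size_t)n < cap;
}

/* Hypothetical helper: move <maildir>/new/<name> to <maildir>/cur/<name>,
 * then sync the destination directory before the source directory. */
int mark_seen(const char *maildir, const char *name)
{
    char from[4096], to[4096], new_dir[4096], cur_dir[4096];

    if (!fits(snprintf(new_dir, sizeof new_dir, "%s/new", maildir), sizeof new_dir) ||
        !fits(snprintf(cur_dir, sizeof cur_dir, "%s/cur", maildir), sizeof cur_dir) ||
        !fits(snprintf(from, sizeof from, "%s/new/%s", maildir, name), sizeof from) ||
        !fits(snprintf(to, sizeof to, "%s/cur/%s", maildir, name), sizeof to))
        return -1;

    if (rename(from, to) < 0)
        return -1;
    if (sync_dir(cur_dir) < 0)   /* first make the new entry durable */
        return -1;
    return sync_dir(new_dir);    /* then the removal of the old entry */
}
```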

It would all be a lot easier if the filesystem had a transaction monitor behind it. You would say "new transaction, create new/msgid, write data, close, make durable, close transaction" for the delivery and "new transaction, rename new/msgid -> cur/msgid, add line to index file, make durable, close transaction" for the IMAP client. If the transaction succeeds, tell the client ok, if it fails or is incomplete after a crash, it will be rolled back. Note that a transaction monitor only needs to maintain ordering within a transaction, and can reorder transactions as it sees fit (and it can even abort transactions).

btrfs fscked up, too?

Posted Mar 16, 2009 18:07 UTC (Mon) by masoncl (subscriber, #47138) [Link]

We're mixing up a bunch of concepts here, but for a mailserver workload, in ext2 you have to fsync both the directory and the file in order to make sure a newly created file is on disk.

In ext3, ext4, reiserfs, xfs, and btrfs (and probably jfs), you only need to fsync the file. The journals include the directory data because the directory mods happened along with the file creation, and it actually isn't possible to get one without the other.

The btrfs log is a little different, but it explicitly goes out and finds the directory changes to make sure they are logged with the file during the fsync.
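As an illustration of the ext2 case described above, a delivery routine that syncs both the new file and its directory might be sketched like this. `create_durable` and `write_all` are hypothetical names, and the use of openat() assumes a POSIX.1-2008 system. On the journaling filesystems listed above the directory fsync would be redundant according to this comment, but issuing it anyway should be harmless.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write all len bytes to fd, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: create dir/name with the given contents and
 * return 0 only once both the file and the directory entry are synced. */
int create_durable(const char *dir, const char *name,
                   const char *data, size_t len)
{
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;

    /* O_EXCL: fail rather than overwrite an existing message. */
    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0) {
        close(dfd);
        return -1;
    }

    int ok = write_all(fd, data, len) == 0 && fsync(fd) == 0;
    if (close(fd) < 0)
        ok = 0;
    if (ok && fsync(dfd) < 0)   /* sync the directory entry as well */
        ok = 0;
    close(dfd);
    return ok ? 0 : -1;
}
```

A mail server built this way would send its "accepted" reply only after `create_durable` returns 0.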


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds