
btrfs fscked up, too?

Posted Mar 16, 2009 13:19 UTC (Mon) by masoncl (subscriber, #47138)
In reply to: btrfs fscked up, too? by forthy
Parent article: Garrett: ext4, application expectations and power management

Btrfs isn't log structured in the traditional fill-up-a-segment-at-a-time sense. It shares many of the properties of log structured filesystems in that it does copy on write for all writes of both metadata and data.

Filesystems tend to break operations up into relatively large transactions. These include all the metadata changes to the FS over a 5 or 30 second interval. A big part of controlling the latencies of FS operations is to control the latencies of transaction commits. Regardless of how much of the commit you try to do in the background, there are always corner cases that break down to: wait for commit X to finish. Every FS has these, including ext3/4 (such as when the ext log wraps).

In the ext3 data=ordered model, the commit waits for all of the data writes during that transaction. If we assume the worst case of applications doing random data writes on slow spinning media, writing out all the data can take a very long time. This is what people noticed in the now famous firefox-fsync bug.

To limit transaction latencies, btrfs only updates file metadata after the file data IO is complete. This allows us to make atomic extent replacements in the file without having to flush all the data writes for a transaction before the commit can complete. xfs does something similar, but it only needs to make sure i_size updates are done after the extent is on disk.

This works well for single file overwrites. The rename case is different because the operations are between two different files.

I agree with Ted that fsync is the right answer, not just because it is what the standard says to do, but because skipping the fsync is explicitly what the standard says won't work. Adding these kinds of undocumented tricks to the filesystems today is sure to cause many problems for the next set of filesystem developers, who probably won't remember the famous firefox fsync bug or its evil twin the ubuntu gamer data loss on crash.
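For reference, the write-fsync-rename sequence being argued over can be sketched as follows. This is a minimal illustration in Python, assuming a POSIX system; the helper name `replace_file_durably` and the `.tmp-` prefix are hypothetical and not taken from any application discussed in this thread.

```python
import os
import tempfile


def replace_file_durably(path: str, data: bytes) -> None:
    """Replace `path` with `data` via a temporary file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push the data to disk
        # Only after the data is flushed do we switch the directory entry.
        os.rename(tmp_path, path)
    except BaseException:
        # On any failure, leave the old file alone and drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Leaving out the `os.fsync()` call gives the sequence that applications were relying on: the rename itself stays atomic, but nothing in the standard then orders the new file's data before the new directory entry.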

With all of that said, btrfs can give the ext3 behavior with little practical performance impact. fsyncs in btrfs almost always use a dedicated logging mechanism and don't have to wait for the full transaction commit.

So, I'll have patches in 2.6.30 that fix things in btrfs. This way we as a linux community can either document the new rename requirements or change the applications, and btrfs can move on to other problems ;)


btrfs fscked up, too?

Posted Mar 16, 2009 13:34 UTC (Mon) by endecotp (guest, #36428) [Link] (5 responses)

> skipping the fsync is explicitly what the standard says won't work

Hi Chris,

Can you give a reference for that? Thanks.

I'd also be interested to hear whether you believe that I should also opendir() and fsync() the directory, for portable code.

btrfs fscked up, too?

Posted Mar 16, 2009 14:29 UTC (Mon) by masoncl (subscriber, #47138) [Link] (4 responses)

I think the real disconnect here is which operations people expect to be atomic. The rename is atomic because after the operation is complete the directory entry either points to the old file or the new file.

The contents of those files are determined by what other operations you've done on them, and in the past we've always expected the fsync.

The main places that list data integrity in the standard are the fsync page and O_SYNC/O_DSYNC portions of the write page. They are the only ways to make sure things are on disk.
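As a rough illustration of the second route, the sketch below opens a file with O_DSYNC so that each write returns only after the data has been handed to stable storage. It assumes Python on a platform where `os.O_DSYNC` is defined (Linux, for example); the function name `append_record_sync` is made up for this example.

```python
import os


def append_record_sync(path: str, record: bytes) -> int:
    """Append `record` to `path` using synchronous data writes (O_DSYNC)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC
    fd = os.open(path, flags, 0o644)
    try:
        # With O_DSYNC the write call itself waits for the data to be
        # written out, so no separate fsync() is issued here.
        return os.write(fd, record)
    finally:
        os.close(fd)
```

The trade-off is that every write pays the synchronous cost, whereas a single fsync() after several buffered writes pays it once.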

The close man page explicitly states the data isn't on disk yet when the close happens unless you run fsync.

I do understand why app writers want a rename that works differently from what we're providing today, and it is important for filesystems to be able to grow new APIs to work with today's applications.

Re: do you need an fsync on the directory? I honestly don't know what other operating systems require. The last time I looked through various mail servers, the directory fsync was under a linux-is-evil kind of #ifdef.

btrfs fscked up, too?

Posted Mar 16, 2009 16:58 UTC (Mon) by endecotp (guest, #36428) [Link] (1 responses)

> The close man page explicitly states the data isn't on disk yet when
> the close happens unless you run fsync.

But I don't care whether the data is on the disk.

I just care that any changes to the disk, as I observe them after a crash, happened in the order that I did them. As far as I'm aware POSIX says nothing about this, hence my request for a reference.

btrfs fscked up, too?

Posted Mar 17, 2009 0:23 UTC (Tue) by bojan (subscriber, #14302) [Link]

How are people supposed to find a reference if POSIX doesn't say anything about the behaviour you desire? Exactly - it isn't there - it is unspecified. It can happen any which way.

The best you can find is rename manual page, which talks about processes that are currently running on the system always seeing the file. That's it. No further guarantees are made.

Now, given that a directory is a file, you have to fsync that if you want to see what you wrote there on disk (i.e. rename). Similarly, you have to fsync the data of the file if you want to see it on disk. Combine the two with the fact that nothing is specified as to the order of commits in the absence of fsync and you get the unordered rename (when it comes to the picture on disk), which is still atomic when it comes to running processes. But, you'll never know if your process is seeing the buffers or what's on disk (for both the data and the directory) unless you actually fsync beforehand.

btrfs fscked up, too?

Posted Mar 16, 2009 17:10 UTC (Mon) by forthy (guest, #1525) [Link] (1 responses)

> do you need an fsync on the directory?

If you have synchronous metadata updates, you don't "need" to fsync the directory - it is updated synchronously, anyway (same with sync mounted devices: no fsync needed, either ;-). I can understand the "linux is evil" #ifdef, when you consider how ext2 works: Gather all data and metadata updates for some seconds, and then flush them out in random order. If you don't sync anything, you have a good chance that the atomicity is maintained (unless, of course, the crash happens during the short write period). If you sync data and directory, you have a very good chance that durability is maintained (unless the whole ext2 file system exploded, and now half of the files are in lost+found, and the others are completely missing).

BTW mail server: A mail server needs to fsync, because durability is required. If you receive a mail, you write it to the inbox (or indir in case of an mdir storage system), fsync, and then reply to the smtp client that the message has been accepted. The smtp client now considers the message as passed, and can remove it from its spool - if it doesn't get an ok, it has to retry later.
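The ordering described here (store, fsync, and only then acknowledge) might look like the sketch below. It is an illustration only: `accept_message`, the spool layout, and the choice of reply strings are assumptions for the example, not the behavior of any particular mail server.

```python
import os


def accept_message(spool_dir: str, msgid: str, body: bytes) -> str:
    """Store a message and return the SMTP reply to send to the client."""
    path = os.path.join(spool_dir, msgid)
    try:
        # O_EXCL: refuse to overwrite an existing message with the same id.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            remaining = memoryview(body)
            while remaining:
                # os.write may write fewer bytes than requested.
                remaining = remaining[os.write(fd, remaining):]
            os.fsync(fd)  # durability first ...
        finally:
            os.close(fd)
    except OSError:
        # Transient failure: the client keeps the message and retries later.
        return "451 Requested action aborted: local error in processing"
    return "250 OK"       # ... acknowledgement second
```

The point of the ordering is that the client may delete its copy as soon as it sees the success reply, so that reply must never be sent before the fsync has returned.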

The question now is: Should you sync the directory? In an mdir case (that's where the directory matters, mboxes keep their name), you create a new entry in inbox/new. fsync only writes out the data, thus it reorders create-write-close on disk into write-close-...-create. Only the inode-related metadata is flushed with fsync. From the man-page it looks like POSIX cares only about durability in fsync, not about atomicity. Therefore, fsync is allowed to reorder operations (fsynced files end up earlier on the disk). To maintain atomicity in the operations on the mbox, first fsync the directory, then the file. If you are a mail reader, and e.g. take files out of new/ and move them to cur/, first fsync cur, then new. Otherwise, your mails may be orphaned (duplicates may be annoying, but harmless).

This would all be a lot easier if the filesystem had a transaction monitor behind it. You would say "new transaction, create new/msgid, write data, close, make durable, close transaction" for the delivery and "new transaction, rename new/msgid -> cur/msgid, add line to index file, make durable, close transaction" for the IMAP client. If the transaction succeeds, tell the client ok, if it fails or is incomplete after a crash, it will be rolled back. Note that a transaction monitor only needs to maintain ordering within a transaction, and can reorder transactions as it sees fit (and it can even abort transactions).

btrfs fscked up, too?

Posted Mar 16, 2009 18:07 UTC (Mon) by masoncl (subscriber, #47138) [Link]

We're mixing up a bunch of concepts here, but for a mailserver workload, in ext2 you have to fsync both the directory and the file in order to make sure a newly created file is on disk.

In ext3, ext4, reiserfs, xfs, and btrfs (and probably jfs), you only need to fsync the file. The journals include the directory data because the directory mods happened along with the file creation, and it actually isn't possible to get one without the other.

The btrfs log is a little different, but it explicitly goes out and finds the directory changes to make sure they are logged with the file during the fsync.
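For code that cannot assume a journaling filesystem, the conservative ext2-style sequence (fsync the file, then fsync its directory) can be sketched like this. It assumes Python on a POSIX system where a directory can be opened read-only and passed to fsync; `create_file_durably` is a hypothetical helper name.

```python
import os


def create_file_durably(path: str, data: bytes) -> None:
    """Create `path`, then fsync both the file and its parent directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        remaining = memoryview(data)
        while remaining:
            # os.write may write fewer bytes than requested.
            remaining = remaining[os.write(fd, remaining):]
        os.fsync(fd)      # file data and inode
    finally:
        os.close(fd)
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # the new directory entry
    finally:
        os.close(dir_fd)
```

On the journaling filesystems listed above the directory fsync is redundant according to this comment, but it is what makes the sequence safe on ext2.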

btrfs fscked up, too?

Posted Mar 16, 2009 13:54 UTC (Mon) by forthy (guest, #1525) [Link] (3 responses)

I still think Ted misreads the standard. fsync is about durability, rename is about atomicity. Those are two different things: fsync is not necessary to make rename atomic, because POSIX file system metadata operations are already atomic. Atomic metadata operations are a poor man's transaction, but reordering them and data operations breaks that promise, even though only during a crash (outside the scope of POSIX).

Note that collecting lots of atomic operations and performing them all in one go is not necessarily breaking the order of all these updates. A true log-structured file system collects all operations in order, and writes them in one go - atomic and delayed. btrfs should share most of these properties, even though the internal design is quite different. As it shows, implementing it "right" is not costly. Thanks, Chris, for being responsive.

What we might need further are real transactions. Now, real transactions are harder; with a filesystem like btrfs, which has a snapshot facility, we get a step closer (but only one step). It's not that easy: in a real transaction monitor, you create a private "snapshot" at the start of a transaction, perform the transaction, and then commit this snapshot. If the commit finds a conflict (e.g. a file changed during the transaction has been changed by somebody else in the meantime), the transaction will be aborted. Also, if another transaction has already been merged and changed a file that is read accessed during the transaction, this transaction will be aborted, too.
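The snapshot-and-commit scheme described here can be illustrated with a toy, in-memory model. Everything in it (`VersionedStore`, `Transaction`, `ConflictError`, per-file version counters) is invented for the illustration; it is single-threaded and says nothing about how btrfs snapshots actually work.

```python
class ConflictError(Exception):
    """Raised when a transaction touched a file that changed underneath it."""


class VersionedStore:
    def __init__(self):
        self.data = {}      # file name -> contents
        self.versions = {}  # file name -> number of committed changes

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self._store = store
        # Private "snapshot" taken at the start of the transaction.
        self._snap_data = dict(store.data)
        self._snap_versions = dict(store.versions)
        self._reads = set()
        self._writes = {}

    def read(self, name):
        self._reads.add(name)
        if name in self._writes:
            return self._writes[name]
        return self._snap_data.get(name)

    def write(self, name, contents):
        self._writes[name] = contents

    def commit(self):
        store = self._store
        # Abort if any file we read or wrote was changed by another
        # transaction that committed after our snapshot was taken.
        for name in self._reads | set(self._writes):
            if store.versions.get(name, 0) != self._snap_versions.get(name, 0):
                raise ConflictError(name)
        # No conflicts: merge all writes in one step.
        for name, contents in self._writes.items():
            store.data[name] = contents
            store.versions[name] = store.versions.get(name, 0) + 1
```

Both abort cases from the paragraph above fall out of the single version check: a write-write conflict and a read of a file that another committed transaction has since changed.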

btrfs fscked up, too?

Posted Mar 16, 2009 18:48 UTC (Mon) by tytso (✭ supporter ✭, #9993) [Link] (2 responses)

Your mistake is assuming that the atomicity of rename() is about anything other than the directory pathnames. If you read the rename specification, you will see that it is talking explicitly about directory entries, and nothing at all about the contents of the inodes involved. For example, here's just a tiny sample from the rename(2) specification:
> If the old argument points to the pathname of a file that is not a directory, the new argument shall not point to the pathname of a directory. If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other processes throughout the renaming operation and refer either to the file referred to by new or old before the operation began. Write access permission is required for both the directory containing old and the directory containing new.

To understand the history of the comment in the Rationale section of POSIX's rename() specification regarding atomicity, it's helpful to understand how rename functionality had been implemented in V7 Unix --- via a combination of the link() and unlink() commands. Back in the bad old days, it was possible while renaming a directory to end up having two links to a directory, if the system crashed after link()'ing the new name of the directory, and before the old name of the directory was unlink()'ed.

But to say that this atomicity requirement, which was only about the functionality of the rename(2) system call being atomic, would somehow extend to an open-write-close-rename sequence, is a gross misreading of the POSIX specification. And given that I implemented POSIX TTY Job Control from the specification back in the 0.12 days of Linux in fall of 1991, I rather suspect that I have a bit more experience reading the POSIX specification than you do...

btrfs fscked up, too?

Posted Mar 17, 2009 1:43 UTC (Tue) by bojan (subscriber, #14302) [Link]

When I tried to suggest exactly this in another thread, bullshit was called: http://lwn.net/Articles/323430/. So, thank you very much for posting this explanation here.

btrfs fscked up, too?

Posted Mar 17, 2009 10:22 UTC (Tue) by forthy (guest, #1525) [Link]

Sorry, you still refer to the fact that POSIX "allows" to replace all files with pumpkins in case of a crash (especially for squashfs, this is the "obviously right" action ;-). That's not the issue, that's not what we discuss here - we are talking about file systems doing something actually reasonable in case of a crash, which is similar to well specified behavior under normal operation. If you rename a file during operation, and at the same time open that file in another process, and read it, you either read the old data or the new data, but no empty files, no garbage files, no pumpkins (unless, of course, the file deliberately contains a pumpkin image). It is obvious that file metadata is closely tied to actual file data.

BTW reading: I've implemented two Forth compilers from a standard back in the early 90s, and I've been doing my best to implement reasonable behavior in those corner cases where the standard says "an ambiguous condition exists...". The Forth community was quite picky back then about all those ambiguous conditions in the standard, because many people were used to well-defined behavior from their particular systems they used - however, this well-defined behavior might not have been portable. The result of the discussion was that I first started to make one of these two Forth compilers a "model implementation", which had well-defined behavior on those parts where the standard was just sloppy without proper reason. This continued over time, and now the community is revising the standard, and we are now trying to be more precise and less ambiguous (the draft standard document now even includes a test suite). So now, I'm not just reading standard documents, I'm writing them.

What resulted from this activity on my side is a different view upon standard documents, and how to read them. Standards encode common practice. People have not always been careful when implementing things. A standard document is a compromise between different systems. If you implement your system, it's not your job to find excuses for unreasonable behavior, it's your job to find reasonable ways to deal with ambiguous conditions. And if you are really good at it, it's your job to implement these things in a way that can serve as a model for others (it's always the duty of those who are good to serve as an example). Take the compiler example again: If a symbol encountered by the compiler is neither a number nor a pre- or user-defined function or variable, this is an ambiguous condition. The compiler is "allowed" by the standard to transform the user into a pumpkin (by magic, of course), teaching him a final lesson about proper programming. The reasonable action on a syntax error however is to print a message which states file, source line and position within that line, plus a meaningful error message about the problem. No language standard will define this action. Yet, most compilers in the world (regardless of the language) stick to that behavior, and even use a similar output format to make IDEs happy.

I hope you now understand why I say you didn't read POSIX, but you duck behind it. With "reading" I mean: Try to find out what best practice would be in a case where POSIX indeed does not really define how it should be. And "best practice" is both what your users will be happy with and what serves as good example for other file system writers (pumpkins are no option). If you raise the bar of expectation, do it.

