JLS2009: A Btrfs update

Posted Nov 8, 2009 1:27 UTC (Sun) by butlerm (subscriber, #13312)
In reply to: JLS2009: A Btrfs update by giraffedata
Parent article: JLS2009: A Btrfs update

There would also need to be a synchronous fadvise call, or the equivalent,
with the semantics of "wait on all the pseudo-synchronous fsync operations
that were just initiated". Otherwise the semantics wouldn't be fsync-like
at all.

For example, suppose you want to do a write-rename-replace for a set of
files. On many filesystems, the rename metadata operation will commit
before the data from the preceding write commits, so the only safe way to do
this is to fsync the new version before calling rename. Otherwise, after a
crash you may get neither the old version nor the new version, just a
zero-length file.
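
A minimal sketch of that ordering in C, for a single file. The function name
replace_file and the tmp_path argument are made up for illustration; the
temporary file is assumed to live on the same filesystem as the target, and
EINTR handling is left out.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace `path` with `data` so that a crash leaves either the old or the
 * new contents. `tmp_path` is a hypothetical scratch name on the same
 * filesystem. Returns 0 on success, -1 on error. */
int replace_file(const char *path, const char *tmp_path,
                 const void *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Force the new data to stable storage before the rename is issued. */
    if (fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* rename() atomically points the name at the new file. */
    if (rename(tmp_path, path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

Called once per file, this pays for one synchronous wait per file, which is
the cost the rest of this comment is concerned with.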

If you are doing this with lots of files, a synchronous commit (or the
equivalent) of the data for the whole group, prior to the renames for the
whole group, is the only efficient way to go. Short of that, you would need
to spawn a large number of threads, issue an fsync followed by a rename in
each one, and wait for them all to finish.



JLS2009: A Btrfs update

Posted Nov 8, 2009 1:35 UTC (Sun) by giraffedata (subscriber, #1954)

There would also need to be a synchronous fadvise call or the equivalent that had the semantics of "wait on all the pseudo-synchronous fsync operations that were just initiated".

All you need is fsync. Do it on each file in turn, after having done the fadvise on every file. The last fsync will complete at the same time as a single hypothetical "wait on all these files" would.

JLS2009: A Btrfs update

Posted Nov 8, 2009 2:52 UTC (Sun) by butlerm (subscriber, #13312)

I understand what you mean now, and that would be a considerable improvement
over serial fsyncs alone. I think you can more or less do the same thing now
on Linux with sync_file_range(..., SYNC_FILE_RANGE_WRITE). Without additional
flags, that schedules asynchronous writeout of the specified part of the
file. Then, when you are all done, call fsync on every fd in the list, as you
say.
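
Roughly, in C, and only as a sketch: the function name flush_batch is made up
for illustration, and I am assuming the descriptors are regular files that
have already been written. A failure from sync_file_range is deliberately
ignored here, since that call is only a hint to start writeout early and the
fsync pass is what gives the guarantee.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* sync_file_range() is Linux-specific */
#endif
#include <fcntl.h>
#include <unistd.h>

/* Start data writeout on every descriptor, then wait on each one.
 * Returns 0 if every fsync succeeded, -1 otherwise. */
int flush_batch(const int *fds, int nfds)
{
    int rc = 0;

    /* Pass 1: queue writeout for each whole file without waiting.
     * Offset 0 with nbytes 0 means "from the start through end of file".
     * Errors are ignored on purpose: this pass is only an optimization. */
    for (int i = 0; i < nfds; i++)
        (void)sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);

    /* Pass 2: fsync each file; the data should already be in flight, so
     * the waits overlap instead of running strictly one after another. */
    for (int i = 0; i < nfds; i++)
        if (fsync(fds[i]) < 0)
            rc = -1;

    return rc;
}
```

For the write-rename-replace case, the renames for the whole group would be
issued only after flush_batch returns 0.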

That is still somewhat problematic, though, since sync_file_range will not
initiate writeout of the metadata, which could be significant. Depending on
the way the filesystem handles metadata, you could have a very similar
problem, with a journal write and a synchronous wait for every fsync. So
something like an fadvise option that schedules data and/or metadata for
immediate writeout would be helpful there.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds