That massive filesystem thread

Posted Apr 1, 2009 0:36 UTC (Wed) by bojan (subscriber, #14302)
Parent article: That massive filesystem thread

I love Linus' so called reality check:

> You may wish that was what they did, but reality is that "open(filename, O_TRUNC | O_CREAT, 0666)" thing.

Which is exactly what ext4 already works around. In line with reality.
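
For readers who want to see why the quoted open() call is risky, here is a minimal sketch in Python, whose os module wraps the same system calls. The function name truncate_and_rewrite and the observe hook are made up for illustration. It shows that O_TRUNC empties the file as soon as open() returns, so a crash before the new data reaches disk can leave a zero-length file.

```python
import os
import tempfile


def truncate_and_rewrite(path, data, observe=None):
    """Rewrite path in place, the way the quoted open() call does."""
    # O_TRUNC throws away the old contents as soon as open() returns.
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC | os.O_CREAT, 0o666)
    with os.fdopen(fd, "wb") as f:
        if observe is not None:
            # Hypothetical hook: report the file size before anything is written.
            observe(os.fstat(f.fileno()).st_size)
        f.write(data)


def demo():
    """Overwrite a small file and return (sizes seen mid-update, final contents)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "settings")
        with open(path, "wb") as f:
            f.write(b"old settings\n")
        seen = []
        truncate_and_rewrite(path, b"new settings\n", observe=seen.append)
        with open(path, "rb") as f:
            return seen, f.read()
```

Between the open() and the write, neither the old nor the new contents exist in the file, which is the window the thread is arguing about.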

> Harsh, I know. And in the end, even the _good_ applications will decide that it's not worth the performance penalty of doing an fsync(). In git, for example, where we generally try to be very very very careful, 'fsync()' on the object files is turned off by default.

Ah, thinking of doing fsync() after all, are we?

> Why? Because turning it on results in unacceptable behavior on ext3.

Chuckle :-)

And then, the real reality:

> Now, admittedly, the git design means that a lost new DB file isn't deadly, just potentially very very annoying and confusing - you may have to roll back and re-do your operation by hand, and you have to know enough to be able to do it in the first place.

Meaning, make your apps in such a way that an odd crash here and there cannot take out the whole thing.



That massive filesystem thread

Posted Apr 1, 2009 3:39 UTC (Wed) by ajross (guest, #4563) [Link]

Don't be snide, it wrecks the S/N ratio of this site. No doubt you've already made yourself heard in the other flame wars on the subject.

And to be fair, there's a difference in designing around "the odd crash here and there" and a 30 Second Window of Doom for every file creation.

That massive filesystem thread

Posted Apr 1, 2009 4:19 UTC (Wed) by bojan (subscriber, #14302) [Link]

> Don't be snide, it wrecks the S/N ratio of this site. No doubt you've already made yourself heard in the other flame wars on the subject.

What is your point here exactly? That I should not post because you may not like reading it? If you are a moderator of the site, please feel free to remove my post.

I make no apologies for my snideness - I think it was well deserved. Essentially, just because one file system does something in an idiotic way, we should now drop a perfectly good system call. Shouldn't we instead FIX what's broken so that all system calls and all file systems can be used as designed?

Similarly, we have seen heaps of new system calls introduced into Linux in recent times (dup3 and friends + other, backup related stuff from Ulrich Drepper), which all have to do with files. Why? Because they were needed. No complaints there. I thought the deal was that they would never get used? (see, being snide again).

> And to be fair, there's a difference in designing around "the odd crash here and there" and a 30 Second Window of Doom for every file creation.

And to be fair, there is a difference between designing around complete system lockups for a number of seconds and committing data when required.

That massive filesystem thread

Posted Apr 1, 2009 8:34 UTC (Wed) by nix (subscriber, #2304) [Link]

Those new system calls (notably *at()) are present because 1) they fill a real hole in the API without which it is *impossible* to read files in particular directories (e.g. deeply nested ones) in a threaded app, and because 2) they allow real security holes in e.g. 'rm -r' to be fixed. Also they already existed in Solaris (hence the horrible misnaming of some of them, also inherited from Solaris). They are new syscalls hence there are no problems with people seeing old examples of their misuse.

They're not really intended for use by everyman, anyway.
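
As an illustration of the *at() family mentioned above, here is a small sketch using Python's dir_fd parameter, which maps to openat(2) on platforms that support it (Linux is assumed here). The helper name read_relative and the file names are invented for the example. Since the working directory is shared by the whole process, a thread cannot safely chdir() into a deep directory; holding its own directory descriptor avoids that.

```python
import os
import tempfile


def read_relative(dir_fd, name):
    """Read `name` relative to an already-open directory descriptor."""
    # With dir_fd set, os.open() resolves `name` against that directory
    # instead of the process-wide current working directory.
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    with os.fdopen(fd, "rb") as f:
        return f.read()


def demo():
    """Create a nested directory, then read a file in it via a directory fd."""
    with tempfile.TemporaryDirectory() as top:
        nested = os.path.join(top, "a", "b", "c")
        os.makedirs(nested)
        with open(os.path.join(nested, "note.txt"), "wb") as f:
            f.write(b"hello\n")
        dir_fd = os.open(nested, os.O_RDONLY | os.O_DIRECTORY)
        try:
            return read_relative(dir_fd, "note.txt")
        finally:
            os.close(dir_fd)
```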

The problem with what one might call the fsync() RANDOMLY_LOSE option is that it is something which must be used by everyman to avoid data loss, which if you get it wrong there is no sign unless you lose power at exactly the right time, and which nearly all programs you might clap eyes on other than Emacs have historically got wrong, and which many utility programs *cannot* get right no matter what, because there's no way they can tell if the data they are operating on is 'important', and thus should be fsync()ed, or not. (Sure, you could add a new command-line option to tell them, but that option is not in POSIX so portable applications can't rely on it for a long long time).

That's a big difference.

That massive filesystem thread

Posted Apr 1, 2009 10:24 UTC (Wed) by bojan (subscriber, #14302) [Link]

> They're not really intended for use by everyman, anyway.

You are kidding, right? dup3() is not for general use?

> That's a big difference.

Look, I'm not really bent on a particular mechanism of actually making sure that programmers have a reliable interface for doing this. Using fsync() before close() is the only portable solution now, but it is far from optimal. I think there is very little doubt about that. And we all know it sucks to high heaven on ext3 in ordered mode.
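
A minimal sketch of the "fsync() before close(), then rename" pattern, written in Python for brevity. The name replace_durably is hypothetical, and a POSIX system is assumed. It does not fsync the containing directory, which the Linux fsync(2) manual page says is also needed if the rename itself must be on disk.

```python
import os
import tempfile


def replace_durably(path, data):
    """Replace path so that readers see the old or the new contents, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory, so the final rename
    # stays within one file system. Note that mkstemp() creates it mode 0600.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to push the data to storage
        os.replace(tmp, path)     # rename(2) on POSIX: atomic swap of the name
    except BaseException:
        try:
            os.unlink(tmp)        # do not leave the temporary file behind
        except FileNotFoundError:
            pass
        raise


def demo():
    """Write a file twice and return (final contents, directory listing)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "settings")
        replace_durably(path, b"version 1\n")
        replace_durably(path, b"version 2\n")
        with open(path, "rb") as f:
            return f.read(), sorted(os.listdir(d))
```

The cost being complained about in this thread is the os.fsync() line: it forces the data out now, even when the caller only cares about ordering.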

I don't know what the best way is: new call, some kind of flag to open that says O_ALWAYSDATABEFOREMETADATA, rename2(), close_with_magic() or whatever. But, saying that application programmers cannot grok this kind of stuff is just not true. They can and they will, only if given the tools. Just like they did dup3() and friends (and as you point out, there is little danger of misuse - these are new calls).

As I said many times before, overloading the current pattern with non-portable behaviour is dangerous, because it provides a false sense of robustness and ties one to a particular FS and kernel. If we can get POSIX updated so that rename() actually means "always data before metadata, but don't put on disk now", then it may even fly. But, I don't know how that's going to make guarantees retroactively, when even Linux features file systems that don't do that (e.g. ext3 in writeback mode).

Also, having things like delayed allocation, where metadata can legitimately be committed before data, is really useful. Most short lived temporary files will never see disk platters, therefore making things faster and disks last longer. Meaning, keeping the old cruft around ain't that bad.

As for utility programs that are called from scripts, you can use dd with conv=fsync or conv=fdatasync in your pipe to commit files to disk today. On FreeBSD, they already have a standalone fsync program for that. Yeah, I know. It sucks. But, your usual tools don't have to make any decisions on fsync()-ing - you can.
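
To make the "standalone fsync program" idea concrete, here is a rough Python sketch of such a helper. It is not the FreeBSD utility; the name fsync_paths is invented. It assumes that fsync() is accepted on a descriptor opened read-only, which holds on Linux but may not hold everywhere.

```python
import os
import tempfile


def fsync_paths(paths):
    """Call fsync() on each named file and return how many were synced."""
    count = 0
    for path in paths:
        # The file was written by some other tool; we only need a descriptor.
        fd = os.open(path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
        count += 1
    return count


def demo():
    """Write a file without fsync, then sync it after the fact."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "output")
        with open(path, "wb") as f:
            f.write(b"written by a tool that never calls fsync\n")
        return fsync_paths([path])
```

A script would run its usual tools first and then call something like this on only the outputs it considers important.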

That massive filesystem thread

Posted Apr 1, 2009 18:09 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

By your logic, we should never fix bugs. Remember the 25 year old readdir bug? Don't you agree it was good to fix that? What if a program, somewhere, depended on that behavior? In reality, programs use rename for atomic replacement. POSIX doesn't say anything about guarantees after a hard system crash, and it's just disingenuous to think that by punishing application authors by giving them as little robustness as possible, you're doing them some kind of portability favor.

That massive filesystem thread

Posted Apr 1, 2009 20:55 UTC (Wed) by bojan (subscriber, #14302) [Link]

I will just answer this one comment, so that nobody gets "offended".

Quite the opposite. I'm all for fixing bugs and giving application programmers the _right_ tools for the job. If some Linux developers took a second to lift their noses out of the specifics of Linux and actually looked around, this could be fixed for _everyone_, not just for some Linux specific file systems. That is my point, in case you didn't get it by now.

That massive filesystem thread

Posted Apr 1, 2009 21:37 UTC (Wed) by man_ls (guest, #15091) [Link]

It is a worthless effort. Each filesystem must keep its house clean. Why invent a new system call which cannot (by necessity) be honored by ext2, or ext4 without a journal? Everything is working now fine in ext3, and if it doesn't work right in ext4 people will just look for a different filesystem.

Reading that Linus is not pulling from Mr Ts'o's trees made me suspicious. Well, now that Ts'o's commit rights have been officially revoked I think that the whole discussion is moot. I wonder if the next ext4 head maintainer will learn from this painful experience and just do the right thing.

ext4 trees

Posted Apr 1, 2009 21:46 UTC (Wed) by corbet (editor, #1) [Link]

I'm confused. The article said that Ted's trees had not been pulled yet. In fact, that happened today; a bunch of ext4 work went into the mainline, including a number of patches which increase robustness for applications which don't use fsync(). I dunno what you were trying to link to, but it didn't work. I've not seen anything about revocation of commit rights. (It's hard to "revoke commit rights" in a distributed system in any case; at worst you can refuse to pull from somebody else's repository.)

Maybe it's an April 1 post that went over my head?

Recursive linking

Posted Apr 2, 2009 6:21 UTC (Thu) by man_ls (guest, #15091) [Link]

Sorry, it was a stupid attempt from a foreigner at an April Fools' prank :D I was hoping that the recursive link would give it away, but maybe it was too plausible altogether.

Will try to do better next time :D)

That massive filesystem thread

Posted Apr 1, 2009 22:38 UTC (Wed) by bojan (subscriber, #14302) [Link]

Just a few points, so please don't get offended. I apologise in advance to all sensitive LWN readers for any injury caused by this post.

> Why invent a new system call which cannot (by necessity) be honored by ext2, or ext4 without a journal?

Even if there was some kind of magical law that said that you could not order commits on the non-journaled file system this way, it can always be trivially implemented through - wait for it - fsync(), which has acceptable performance characteristics on such file systems.

> Everything is working now fine in ext3

Sure. Except fsync(), which locks the whole system for a few seconds. Hopefully, this will get fixed (or at least its effect reduced) as a result of the hoopla.

> Well, now that Ts'o's commit rights have been officially revoked I think that the whole discussion is moot.

Now you are really making a fool of yourself.

That massive filesystem thread

Posted Apr 2, 2009 23:16 UTC (Thu) by anton (subscriber, #25547) [Link]

> The problem with what one might call the fsync() RANDOMLY_LOSE option is that it is something which must be used by everyman to avoid data loss, which if you get it wrong there is no sign unless you lose power at exactly the right time, and which nearly all programs you might clap eyes on other than Emacs have historically got wrong

s/other than/including/. However, I don't agree that this application behaviour is wrong; if the application wants to jump through hoops to get a little bit of extra safety on low-quality file systems, that's ok, but if it doesn't, that's also ok. It's up to the users to choose which applications they run and on which file system.

The end of LWN comment dialog?

Posted Apr 1, 2009 5:31 UTC (Wed) by ncm (subscriber, #165) [Link]

This is what the beginning of the end for unmoderated LWN commenting looks like: "Please be polite." "You can't make me." It's really astonishing that LWN has lasted this long. It's not an accident. bojan, you are striking a beautiful, fragile object with a hammer. If you don't understand how destructive you are being, please stop and think until you do understand it.

The end of LWN comment dialog?

Posted Apr 1, 2009 6:07 UTC (Wed) by bojan (subscriber, #14302) [Link]

If you read my original post in this thread, you will find that I am pointing at inconsistencies of what Linus describes as reality check. So, I ridicule (among other things) his conclusion that: ext3 sucks at doing fsync(), hence we should drop fsync().

What exactly is not polite about that? Is sarcasm now verboten on LWN? I see plenty of it. Daily.

In a post not so long ago, someone accused me of hiding behind Ted's authority (although I actually used documentation to support my case - which many don't bother to read, of course). This time, I point out what to me is nonsense coming from an even bigger authority, but that's no good either. I'm not sure what position of mine would satisfy fragile sensibilities here. Only silence, I guess.

This time I was being accused of making snide remarks. So, I replied to ajross using his terminology, although I do not actually agree with that qualification (which you can see from my sarcastic: "see, being snide again" remark) and I should have used "so called snideness" in my reply instead. I am really just being sarcastic, because we are all supposed to rally behind the high priest or something.

Sure, Linus is a genius, but that doesn't mean that whatever he says is beyond criticism. And, I do not see how I am not being polite by exercising criticism with a hint of sarcasm.

What is it exactly that you have the issue with in my posts? What exactly is impolite?

Yup. It's the beginning of the end.

Posted Apr 1, 2009 7:54 UTC (Wed) by khim (subscriber, #9252) [Link]

> If you read my original post in this thread, you will find that I am pointing at inconsistencies of what Linus describes as reality check.

Nope. You are being 100% smart-ass. Linus's reality check is not inconsistent. It's a description of reality, and reality is not consistent. When was it ever? You have different factors, and in different but quite real situations different factors prevail.

> So, I ridicule (among other things) his conclusion that: ext3 sucks at doing fsync(), hence we should drop fsync().

That's a different facet of reality. When you consider reality from kernel developer POV what the applications are doing is your "unchangeable fact", your "speed of light", when you consider reality from application developer POV what the kernel does is "unchangeable fact" and you should deal with it. This is true even if kernel developer and application developer is the same person. You can only think differently if your application is designed to only be used "in-house" and you can always guarantee control over both kernel and userspace - and git was not designed to only be used "in-house"...

> And, I do not see how I am not being polite by exercising criticism with a hint of sarcasm.

You are exercising ignorance with a hint of sarcasm. That's different.

Yup. It's the beginning of the end.

Posted Apr 1, 2009 8:29 UTC (Wed) by bojan (subscriber, #14302) [Link]

> When you consider reality from kernel developer POV what the applications are doing is your "unchangeable fact", your "speed of light", when you consider reality from application developer POV what the kernel does is "unchangeable fact" and you should deal with it.

Let me review.

When another Unix kernel (or Linux) holds your data in buffers and commits metadata only (because it is allowed to), you, as an application developer, deal with it by ignoring that fact.

And, when your file system does crazy things with the perfectly good system call, you also ignore it as a kernel developer.

WOW, is that now the new "very special relativity"? We pick whichever behaviour is the most narrow to a specific file system and go with that?

Yup. It's the beginning of the end.

Posted Apr 1, 2009 14:22 UTC (Wed) by drag (subscriber, #31333) [Link]

> When another Unix kernel (or Linux) holds your data in buffers and commits metadata only (because it is allowed to), you, as an application developer, deal with it by ignoring that fact.

POSIX allows you never to write data to disk at all. That will make your file system very fast. After all you can have a POSIX-compliant file system that operates off of ramdisk quite easily.

POSIX file system access is designed to describe the interface layer between userland and the file system. The actual integration between the file system and the hardware, as well as the internals of the file system itself, is left up to the developer of the OS.

It is like if you discovered all of a sudden a network service provided by an Apache-based web app uses SSL badly so that all usernames and passwords are transmitted over the Web in plain text... then you complain about it and the developer says back to you that his application's behavior is allowed by TCP/HTTP/SSL and that you should be changing your password with each usage, like people who use his app correctly do. Then he emails you some documentation from a security expert that says you should change your password frequently and that many other protocols like telnet or ftp send your username and password over the network in plain text.

Yup. It's the beginning of the end.

Posted Apr 1, 2009 16:10 UTC (Wed) by foom (subscriber, #14868) [Link]

This is starting to get very repetitive... all these points have been made already at least once in one of the other article's threads. I'd like to suggest that it might be in everyone's interest to move on to more useful pastimes than rehashing the same arguments over and over again every time there's an update on the subject.

sticks & stones

Posted Apr 2, 2009 23:17 UTC (Thu) by xoddam (subscriber, #2322) [Link]

> In a post not so long ago, someone accused me of hiding behind Ted's authority

I plead guilty and I apologise. That was immediately after replying to someone else's post the gist of which was "Ted wrote ext2 and ext3 in the first place, he is therefore above criticism." It concluded with the words "Know your place", which got me riled.

[proverb: in the midst of great anger, never answer anyone's letter]

Your words were not so condescending but they had much the same emphasis: all ur filesystems are belong to POSIX (not users) 'cos POSIX is the law, and by the way Ted's interpretation is the only correct one because he's the primary implementor.

I hope you understand where I was coming from. Forgive me.

sticks & stones

Posted Apr 2, 2009 23:56 UTC (Thu) by bojan (subscriber, #14302) [Link]

Nothing to forgive. All is perfectly fine. I enjoy a robust discussion.

The end of LWN comment dialog?

Posted Apr 8, 2009 0:05 UTC (Wed) by jschrod (subscriber, #1646) [Link]

Well, I just decided to give you feedback, from someone who is subscribed to LWN quite a bit longer than you and who did not participate in this topic after you took over all its discussion threads: You showed that LWN really needs a KILL file feature where one can put a poster in it; you, in particular. Others have succinctly explained why, no need to repeat this.

But your self-righteousness doesn't allow you to understand this, obviously. Luckily, there are still some discussion threads where you don't try to take over. I hope the likes of you will remain few on LWN in the future, this is not Slashdot, after all.

The end of LWN comment dialog?

Posted Apr 1, 2009 15:46 UTC (Wed) by GreyWizard (guest, #1026) [Link]

The comment from ajross above was not some gentle plea for polite discourse. What he actually said included this: "No doubt you've already made yourself heard in the other flame wars on the subject." A more accurate summary would have been, "Be polite you jerk."

People get nasty in the comments here all the time. If there's something beautiful and fragile here it's already in a thousand jagged pieces. But people hector one another about being polite all the time too. That also wrecks the signal-to-noise ratio and solves nothing.

The end of LWN comment dialog?

Posted Apr 4, 2009 9:05 UTC (Sat) by jospoortvliet (subscriber, #33164) [Link]

Hmmm. Just say whatever your brain produces, and if somebody has a problem
with what comes out, it's on their own plate.

Living in a country where that mode of thinking is the norm, I can tell
you it also has disadvantages... If only because the resulting hurt
feelings can muddy the discussion more than you might think. Besides, it
chases people away who would otherwise have contributed constructively -
it's not acceptable behavior in all cultures. Ever wondered why the FOSS
community is still predominantly western, despite many smart developers in
countries like India?

A little decency now and then doesn't hurt. I know people who, knowing how
blunt they can be, ask someone else to read certain emails before sending
them. After all, reality is that people DO have feelings.

The end of LWN comment dialog?

Posted Apr 5, 2009 3:34 UTC (Sun) by GreyWizard (guest, #1026) [Link]

Pardon me for saying so but you have not understood what I wrote. I did not urge anyone to "say whatever [their] brain produces" or anything equivalent. Nor did I issue a rallying cry against decency. Elevating the level of discourse would be a wonderful thing and if you've got an effective suggestion for how to do so I would be glad to read it.

But saying "be polite you jerk" merely drags things even further down into the muck.

The end of LWN comment dialog?

Posted Apr 5, 2009 12:43 UTC (Sun) by jospoortvliet (subscriber, #33164) [Link]

I disagree on your argument that saying 'be polite you jerk' merely drags things down, for two reasons.

First of all, some people don't notice their behavior is unnecessarily impolite. Pointing it out can help them (if they are willing to be reasonable in the first place). Never pointing out somebody's failures will make them fail forever.

Second, it shows you care about being polite. If others show they care too, a culture of 'you should be polite' can be maintained. As you might have noticed from the differences between FOSS communities, culture is important and heavily influential. And it can be changed.

Some things to note:
- people DO care about what others think of them. No matter how much they scream 'no I don't', they do. It is our nature.
- people should know their arguments are not supported by being mean - it is the other way around.
- I agree that a 'be polite you jerk' might not always be the best way to correct someone. A personal mail can do more. However, it won't show up in public (unless an apology is made), thus it does not do much to influence others who might think it is acceptable behavior because the guy got away with it. Of course, giving a good example is better than anything else.
- Of course discussing without end whether somebody was polite enough or not muddies the discussion and lowers the SNR.

The end of LWN comment dialog?

Posted Apr 5, 2009 15:42 UTC (Sun) by GreyWizard (guest, #1026) [Link]

Both of your arguments could reasonably be applied to a comment that says "please be polite" but both fail for "be polite you jerk." Someone who is accidentally rude is much more likely to respond to the "you jerk" part than the "be polite" part. The contradiction immediately destroys the credibility of the person posting because he or she is not willing to be held to the standard set for others.

A truly polite request for more courtesy might help but it's difficult to be sure because such things are quite rare. Giving in to the temptation to scold even just a little makes the comment worse than useless. Unless you are absolutely certain you can do it right it's better to focus on substantive issues and avoid appointing yourself a courtesy cop.

The end of LWN comment dialog?

Posted Apr 5, 2009 16:20 UTC (Sun) by jospoortvliet (subscriber, #33164) [Link]

True, if you meant to point out that 'be polite YOU JERK' isn't very effective, I agree. I do however think that it's better than nothing. Nothing changes nothing.

The end of LWN comment dialog?

Posted Apr 5, 2009 16:27 UTC (Sun) by GreyWizard (guest, #1026) [Link]

It's better to change nothing than to make the situation worse.

The end of LWN comment dialog?

Posted Apr 5, 2009 17:11 UTC (Sun) by jospoortvliet (subscriber, #33164) [Link]

Maybe. Probably for this one response. However, as I pointed out, there are wider repercussions to be expected from such behavior, and it is worth showing, as a community, that we disapprove of such ways of communicating. Even if that is done in a rather unfriendly manner.

On re-reading the thread, I think you are right in that ajross was more impolite than bojan, which often leads to a downward spiral and isn't helpful... bojan's post wasn't that far off from the normal tone on this site.

Anyway. This went pretty far off-topic, and I think we mostly agree. For as far as we don't, we at least agree on that ;-)

That massive filesystem thread

Posted Apr 1, 2009 6:27 UTC (Wed) by njs (guest, #40338) [Link]

> Meaning, make your apps in such a way that an odd crash here and there cannot take out the whole thing.

Well, yes, it's a nice goal. The problem is that *you can't* without calling fsync. When the guy who wrote the system calls it "very very annoying and confusing", then it's not really a great example of how we can make all our apps more awesome and usable in general. Unfortunately.

(During the whole ext4 discussion I spent some time trying to figure out how to abuse Ted's patches to create a transactional system that doesn't require rewriting whole databases on every write, and uses rename(2) for its write barrier instead of fsync(2). But I think block layer reordering makes it impossible. Maybe if there were an ioctl to trigger an immediate roll-over of the journal transaction.)

That massive filesystem thread

Posted Apr 1, 2009 7:15 UTC (Wed) by bojan (subscriber, #14302) [Link]

> The problem is that *you can't* without calling fsync.

fsync() sucks because it is a "commit now" thing. Not everyone wants to commit now - I fully understand that. I'm a notebook user and I don't want my disk being spun up unnecessarily. But, current semantics are what they are, so ignoring them is looking for trouble elsewhere. Sucks - yes, but living in denial doesn't help either. And, as you say, not a great way to make our apps more awesome. Just a necessary evil right now. Some of it can be avoided with backup files, but the underlying badness will persist.

It would be nice to have a system call that guarantees "data before metadata, but not necessarily now", so that other systems interested in it may also implement it. Then the apps could comfortably rely on this when renaming files over the top of other ones. I was even thinking that we should ask POSIX to standardise that fsync(-fd) means exactly that (because fd is always positive, but we use int for it, which can also have negative values), but this may confuse things even more and is probably stupid.

Sure, some Linux file systems will continue making it more comfortable even with the undefined order of current semantics, which will please users (BTW, this is really interesting: http://lwn.net/Articles/326583/). But, the long term solution in the world of Unix should probably be a bit more inclusive.

PS. To be fair to fsync(), it is an indispensable tool for databases, so making it work as fast as possible is most definitely a good thing. What ext3 in ordered mode does with it is an abomination.

That massive filesystem thread

Posted Apr 1, 2009 7:50 UTC (Wed) by ebiederm (subscriber, #35028) [Link]

POSIX/UNIX semantics guarantees that renames are atomic.

POSIX/UNIX semantics do not make guarantees about the filesystem state after an OS crash.

Not having to do fsck after a filesystem crash gives the illusion that the filesystem is not corrupted.

It turns out that, at least with extN, after a crash we see filesystem states that are illegal during normal operation. That is, despite not needing to run fsck, the filesystem was corrupted.

It would be nice if there were a filesystem that could guarantee that, whenever fsck did not need to be run, the visible state of the filesystem was:
- A legal state for the filesystem in normal operation.
- Everything that was fsynced was available.

Does anyone know of a journaling filesystem that guarantees not to give me a corrupt filesystem if fsck does not need to be run?

That massive filesystem thread

Posted Apr 1, 2009 8:05 UTC (Wed) by michaeljt (subscriber, #39183) [Link]

Yes, fsync does look like too blunt an instrument for many purposes. Your particular problem could be solved though, if the system owner (i.e. you) was able to take decisions about whether fsync should be honoured or not or partly or whatever, rather than having one filesystem do it, one not. Having said that, you as the system owner are also in a position to choose a filesystem that works well with the behaviour you need...

That massive filesystem thread

Posted Apr 1, 2009 8:39 UTC (Wed) by nix (subscriber, #2304) [Link]

fsync() sucks not because it is a 'commit now' operation (that would be fbarrier()) but because it is a 'commit and force to disk now' operation.

Actually on many OSes it's a 'start a background force to disk now and return before it's done' operation; on Linux it's a 'lob it at the disk controller so it can cache it instead' operation. Still not necessarily useful (although that is changing to optionally emit a barrier to the disk controller too.)

(Speaking as the owner of an Areca RAID card with a quarter-gig of battery-backed cache, using non-RAIDed filesystems purely as an fs-cache storage area, I *like* the ability to turn off barriers: all they do is slow my system down with no reliability gain at all.)

That massive filesystem thread

Posted Apr 1, 2009 8:55 UTC (Wed) by bojan (subscriber, #14302) [Link]

By commit now, I meant force to disk. I think that was clear from the "disk being spun up unnecessarily" bit.

fbarrier vs. fsync

Posted Apr 1, 2009 12:02 UTC (Wed) by butlerm (guest, #13312) [Link]

"fbarrier(fd)" is not a "commit now" operation - that would make it
indistinguishable from fsync. It is a "commit data before metadata"
request.

The real technical problem here is that from the application perspective,
the meta data update must take place immediately, i.e. before the system
call returns. However, from a recovery perspective, it is highly desirable
that the persistent meta data state not be committed until after the data
has been committed. Unless a filesystem maintains two versions of its
metadata (a la soft updates), that is an unusually difficult requirement to
meet without serious performance problems.

The alternative that I would really like to see is undo records for a few
critical operations like rename replacement, such that the physical data /
meta data ordering requirements are removed, and on recovery the filesystem
un-does rename replacements where the replacement data has not been
committed to disk. That replaces the ideal of point-in-time recovery with
the more practical ideal of consistent version recovery.

That massive filesystem thread

Posted Apr 1, 2009 12:17 UTC (Wed) by butlerm (guest, #13312) [Link]

Or in other words, fbarrier is a completely different kind of barrier than
the one that the "barrier=1" mount option requests. The latter is a low
level block I/O write barrier usually implemented with a full write cache
flush (barring some sort of battery backup), the former is a data before
meta data barrier.

That massive filesystem thread

Posted Apr 2, 2009 20:23 UTC (Thu) by iabervon (subscriber, #722) [Link]

Linus actually overstated git's use of fsync(). There are three relevant cases:

  • git writes to a brand new filename, and then updates an existing file to contain the new filename instead of an old filename. It will optionally do a fsync() between these two operations.
  • git writes all of the data from several existing files to a single new file, and then removes the existing files. It will always do a fsync() between these two operations.
  • git updates an existing file by writing to a temporary file and renaming over the existing file. It will never do a fsync() between these two operations.

That is, git relies on the assumption that a rename() is atomic with respect to the disk and dependent on all operations previously issued on the inode that is being renamed. It uses fsync() only to make sure that operations to different files happen in the order that it wants.
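
The second case in the list above (combine several files into one, fsync, then remove the originals) can be sketched roughly as follows. This is an illustration in Python, not git's code or its pack format; pack_then_remove and the file names are hypothetical.

```python
import os
import tempfile


def pack_then_remove(sources, packed_path):
    """Concatenate sources into packed_path, fsync it, then delete the sources."""
    with open(packed_path, "wb") as out:
        for src in sources:
            with open(src, "rb") as f:
                out.write(f.read())
        out.flush()
        # The combined file is forced to disk first...
        os.fsync(out.fileno())
    for src in sources:
        # ...so the only other copies of the data are removed afterwards.
        os.unlink(src)


def demo():
    """Pack three small files and return (packed contents, directory listing)."""
    with tempfile.TemporaryDirectory() as d:
        sources = []
        for i in range(3):
            p = os.path.join(d, "loose-%d" % i)
            with open(p, "wb") as f:
                f.write(b"object %d\n" % i)
            sources.append(p)
        packed = os.path.join(d, "packed")
        pack_then_remove(sources, packed)
        with open(packed, "rb") as f:
            return f.read(), sorted(os.listdir(d))
```

Here the fsync() is about ordering between different files: without it, the unlinks could reach disk before the new file's data does.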

Now, obviously, if you want to be really sure to keep some data, write it once and never replace it at all. That'll do a good job of protecting against everything, including bugs where you do something like "open(), fsync(), close(), rename()" but forget or mess up the "write()". This isn't an option in a lot of situations, but it's what git does for the most important data.
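
One way to picture "write it once and never replace it" is a store where each file is named after a hash of its contents. The sketch below is an assumption-laden illustration in Python, not git's implementation: store_once is a made-up name, SHA-1 is chosen only as an example hash, and hard links are assumed to work on the file system.

```python
import hashlib
import os
import tempfile


def store_once(root, data):
    """Store data under a name derived from its contents; never overwrite."""
    name = hashlib.sha1(data).hexdigest()
    final = os.path.join(root, name)
    fd, tmp = tempfile.mkstemp(dir=root, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        try:
            # link() fails if the name already exists, so an existing
            # object is never replaced.
            os.link(tmp, final)
        except FileExistsError:
            pass  # same name means same contents: keep the original
    finally:
        os.unlink(tmp)
    return name


def demo():
    """Store the same data twice; return both names, the listing, and contents."""
    with tempfile.TemporaryDirectory() as root:
        first = store_once(root, b"important data\n")
        second = store_once(root, b"important data\n")
        with open(os.path.join(root, first), "rb") as f:
            return first, second, sorted(os.listdir(root)), f.read()
```

Because the final name only ever appears once the contents are complete, there is no existing file for a crash to damage.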


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds