Wishful thinking

Posted Mar 16, 2009 21:46 UTC (Mon) by bojan (subscriber, #14302)
In reply to: Wishful thinking by jamesh
Parent article: Garrett: ext4, application expectations and power management

> Could you elaborate?

Yes. Say you just renamed a very big temporary file (think GBs), and it so happens that this would be a good time for the kernel to sync the file's directory to disk, because another file in that directory changed and somebody asked to fsync the directory (or some other condition that the kernel finds appropriate - doesn't matter). If you guarantee the order with rename(), you first need to commit a few GBs of data in order to do this, which would otherwise never happen, because the file is temporary and would go away a bit later. If you don't guarantee the order, you just commit the directory and you are done.

In other words, the kernel currently has the freedom to do what it finds most appropriate, within what POSIX allows. Ordered renames restrict that freedom.



Wishful thinking

Posted Mar 16, 2009 21:57 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (7 responses)

Yes, the kernel has the freedom to do what's appropriate, but uses that freedom to do mind-numbingly stupid things.

You're grasping at straws. First of all, you're most likely never going to see gigabytes of dirty blocks for a single file. They'll have been flushed well before your rename! Second, even if we do end up in your scenario, that file's blocks will be flushed in very short order anyway, so you're going to incur the penalty for that IO whether you do it before or after the rename.

As for the case of immediately unlinking the file --- that's a rather unlikely scenario. Can you give me one non-contrived real-life example where this would actually happen?

Remember:

  • You have a large temporary file with many unflushed data blocks
  • This temporary file has just been renamed
  • This temporary file is about to be unlinked
  • Between the rename of the large temporary file and its unlinking, fsync must be called on the directory
When would this plausibly happen?
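
For concreteness, the four conditions above correspond to roughly this order of calls. This is only a sketch, assuming a POSIX system; `contested_sequence`, the `.tmp-` prefix and the `scratch.dat` name are hypothetical, and user code cannot control whether the data blocks are really still dirty when the rename happens.

```python
import os
import tempfile


def contested_sequence(directory, payload):
    """Run the disputed steps in order and return the (now unlinked) path."""
    # Condition 1: a temporary file whose data is written but never fsynced.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(payload)

    # Condition 2: the temporary file is renamed.
    final_path = os.path.join(directory, "scratch.dat")  # hypothetical name
    os.rename(tmp_path, final_path)

    # Condition 4: fsync on the directory, after the rename and before
    # the unlink.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    # Condition 3: the file is unlinked shortly afterwards.
    os.unlink(final_path)
    return final_path
```

Whether the directory fsync in the middle also forces the file's data out is exactly the point under dispute; the sketch only fixes the order of the calls.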

Wishful thinking

Posted Mar 16, 2009 22:07 UTC (Mon) by bojan (subscriber, #14302) [Link] (6 responses)

> First of all, you're most likely never going to see gigabytes of dirty blocks for a single file.

You are right. And nobody will ever need more than 640 kB of RAM ;-)

> They'll have been flushed well before your rename!

Not if someone calls fsync on that directory. Or the kernel decides (for whatever reason) that this directory must go out to disk.

But look, you obviously don't want to accept that:

1. This happens.
2. POSIX says what it says.
3. Kernel is allowed to do what POSIX says.

That's OK. The documentation is crystal clear on this. It is all in the manual pages for close() and rename(). Unfortunately, people choose to ignore it.

Sure, it would be nice to have a call that guarantees all this, but thinking that rename() is that call is simply false.
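
For reference, the sequence that does not rely on rename() ordering anything looks roughly like this. It is a minimal sketch, assuming a POSIX system where a directory can be opened and fsynced; `atomic_write` and the `.tmp-` prefix are made-up names for illustration.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) via write, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory, so the rename
    # stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to put the data on disk
        os.rename(tmp_path, path)   # rename(2): swap the name atomically
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the new directory entry is durable as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The fsync before the rename is the step the applications in question leave out; with it in place, the result does not depend on whether the file system orders data ahead of the rename.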

Wishful thinking

Posted Mar 16, 2009 22:12 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (5 responses)

And you clearly don't want to accept that we can improve over POSIX with little danger and a large upside. POSIX doesn't make any guarantees about what happens on an unclean shutdown. The kernel could overwrite all your files with pictures of carrots and it'd be POSIX-compliant. Would you be okay with that outcome? After all,
  1. This happens.
  2. POSIX says what it says.
  3. Kernel is allowed to do what POSIX says.

Please stop justifying poor behavior by resorting to POSIX.

Wishful thinking

Posted Mar 16, 2009 22:17 UTC (Mon) by bojan (subscriber, #14302) [Link] (4 responses)

> little danger

Yeah, we've seen that with applications that were losing files on a perfectly good FS.

> Please stop justifying poor behavior by resorting to POSIX.

The behaviour is not poor. It is there for a reason, which you don't want to admit.

Luckily, Ted is a nice man, so he put workarounds in place for folks who want to continue using sloppy code on ext4.

Wishful thinking

Posted Mar 16, 2009 22:24 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (3 responses)

> Yeah, we've seen that with applications that were losing files on a perfectly good FS.

What are you talking about? A reordering rename is strictly safer than your broken one.

> The behaviour is not poor. It is there for a reason, which you don't want to admit.

No, the behavior is dangerous and unintuitive, and there is no sound reason for it to work that way other than to make metadata write-out a little simpler. There is no performance or correctness upside to rename working the way you insist. You don't want to admit that POSIX may allow something that is nevertheless nonsensical.

Wishful thinking

Posted Mar 16, 2009 22:42 UTC (Mon) by bojan (subscriber, #14302) [Link] (2 responses)

> There is no performance or correctness upside to rename working the way you insist.

I'll just answer about correctness. If you take a broken application to a perfectly good system that doesn't order renames, because it doesn't have to, you will lose data. So, there is an upside to programming correctly and according to spec.

I think I answered the performance bit elsewhere, but you don't want to accept it. Which is fine by me.

> You don't want to admit that POSIX may allow something that is nevertheless nonsensical.

POSIX is not nonsensical; it is completely asynchronous and unordered, which is what you don't seem to like. Sure, we could have another mechanism for ordered renames - I don't deny that. It's just that the current rename() isn't it, which you don't seem to be able to understand.

Wishful thinking

Posted Mar 16, 2009 22:53 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (1 responses)

> If you take a broken application to a perfectly good system that doesn't order renames, because it doesn't have to, you will lose data.

Adhering to POSIX is not the same as being "perfectly good".

> POSIX is not nonsensical, it is completely asynchronous and unordered

No, it's perfectly synchronous, ordered, and atomic with respect to a running system. It's only in the case of an unclean shutdown that we disagree, and POSIX really doesn't say much at all about that scenario. My behavior, your behavior, and overwriting with carrots are all perfectly POSIX-compliant with respect to a system that's been shut down uncleanly. To the greatest extent possible, the state of a system after an unclean shutdown should reflect the state the system was in shortly before that shutdown, and an ordered rename goes a long way toward achieving that.

Wishful thinking

Posted Mar 16, 2009 23:29 UTC (Mon) by bojan (subscriber, #14302) [Link]

> Adhering to POSIX is not the same as being "perfectly good".

I love how you think that your or my opinion actually matters here. What matters is what's been written in the documentation for years now. That is the _only_ objective thing programmers on both ends of this can rely on.

> No, it's perfectly synchronous, ordered, and atomic with respect to a running system.

Yeah, confuse the issue some more, when you don't have anything new to add.

What we are discussing here is the "ordering of writes to disk on rename". With respect to this, POSIX is asynchronous and unordered. Heck, you cannot even tell if two consecutive writes will be written to disk in that order.
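
If an application does care about the on-disk order of two writes, the usual approach is to put an fsync between them. A minimal sketch, assuming a POSIX system; `write_in_order` is a hypothetical helper name, not an existing API.

```python
import os


def write_in_order(path, first, second):
    """Append `first`, then `second`, with an fsync between the two."""
    with open(path, "ab") as f:
        f.write(first)
        f.flush()             # hand the bytes to the kernel
        os.fsync(f.fileno())  # request that `first` is on disk before going on
        f.write(second)
        f.flush()
        os.fsync(f.fileno())  # same for `second`
```

Without the first fsync, nothing stops the kernel from writing the second chunk out before the first.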

> My behavior, your behavior, and overwriting with carrots are all perfectly POSIX-compliant with respect to a system that's been shut down uncleanly.

Your behaviour is truly what you would like it to be. What you are describing as my behaviour is what the documentation actually says, so it's not mine at all.

As for carrots, that is wrong, because you never fsynced carrots to disk, so they should not be there. But sure - implement it ;-)

> To the greatest extent possible, the state of a system after an unclean shutdown should reflect the state the system was in shortly before that shutdown, and an ordered rename goes a long way toward achieving that.

And also encourages application writers to keep writing broken code, file system writers to put hacks into the system to work around that broken code etc.

