
Garrett: ext4, application expectations and power management

Here is Matthew Garrett's contribution to the ext4 debate. Not for anybody who is offended by profanity. "Dear filesystem writers - application developers like writing lots of tiny files, because it makes a large number of things significantly easier. This is fine because sheer filesystem performance is not high on the list of priorities of a typical application developer. The answer is not 'Oh, you should all use sqlite'. If the only effective way to use your filesystem is to use a database instead, then that indicates that you have not written a filesystem that is useful to typical application developers who enjoy storing things in files rather than binary blobs that end up with an entirely different set of pathological behaviours."


Garrett: ext4, application expectations and power management

Posted Mar 15, 2009 16:04 UTC (Sun) by smoogen (subscriber, #97) [Link] (30 responses)

I am confused... when did fsync() stop being a standard operation when writing to disk? Even when writing tons of little files? That was pretty much drummed into us in the 1980s and early 1990s: write(), fsync(), write(), fsync(). And always fsync() before a close().

Garrett: ext4, application expectations and power management

Posted Mar 15, 2009 16:27 UTC (Sun) by drag (guest, #31333) [Link] (29 responses)

The trick here is that you do NOT want your application to write data to disk all the time.

When you go:
1. Create file 1
2. Write file 1
3. (do processing for some amount of time)
4. Create file 2
5. Write file 2
6. Rename file 2 over file 1

This is the application telling the OS that it wants either good old data or good new data: in a crash, the application developer is trying to ensure that the data is not lost entirely.

Also, this sort of technique is used for lots of other reasons. You can run into situations where multiple applications want to write the same file, and this sort of technique helps make sure that they don't corrupt each other's data. Also this is useful for applications that may crash themselves. If they segfault or otherwise go apeshit while writing out a new file, this allows them to fail ungracefully without corrupting the old file.

So it's not really much to do with how the file system works at the block-to-filesystem layer of things; it's meant to deal with how the OS works at the filesystem-to-application layer of things. The applications should be unaware of and not really care a whole lot about the lower levels, as long as they don't do any pathologically bad behavior.

So remember the goal is "either good new data or good old data".
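
In C, the pattern above looks roughly like this (a minimal sketch; the file names are only illustrative and error handling is trimmed):

/* Minimal sketch of the replace-via-rename pattern above: no fsync(),
 * so the kernel decides when the data actually reaches the disk.
 * Names are illustrative only. */
#include <fcntl.h>
#include <unistd.h>

int save_settings(const char *buf, size_t len)
{
    int fd = open("settings.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len) {  /* short writes ignored for brevity */
        close(fd);
        unlink("settings.new");
        return -1;
    }
    if (close(fd) < 0)
        return -1;
    /* Either the old "settings" or the complete new one should survive a crash. */
    return rename("settings.new", "settings");
}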

When my desktop is writing out all those .gconf files I don't really care about that data. If I crash my system I expect to lose those preferences. That's normal, but I don't want to end up with a desktop that won't work at all due to a bunch of corrupt "registry" files.

If you use fsync() on all of that, then every time I make a change to the UI or some settings my disk is going to spin up. Using fsync() tells the OS "I only want good new data", and this is a much more stringent requirement and thus has a much heavier impact on the system as a whole.

The only time I, as a user, expect and want that behavior is when writing out important files or doing important system tasks. Like writing out a file with OO.org or editing /etc files with vim or whatnot.

When that happens I don't care so much about the disk spinning up, or that I am using an entire block of my flash drive to store a 1k file, because that small file is important...

-------------------

It took me a while to understand this. Fsync is just too big a hammer for what many applications need or desire to do.

Garrett: ext4, application expectations and power management

Posted Mar 15, 2009 16:33 UTC (Sun) by smoogen (subscriber, #97) [Link] (27 responses)

Ah ok. But renaming files should always have an fsync in it... shouldn't it? I mean that's a point where you WANT the new data, not the old data.

On the other hand, reading the notes in his blog I can see why people have complained about xfs 'corruptions' with some of their applications that they didn't see with ext3. I wonder what reiserfs and jfs do with this sort of data write. [I think I know what ntfs does... it's in the same hole as xfs.]

Garrett: ext4, application expectations and power management

Posted Mar 15, 2009 16:49 UTC (Sun) by drag (guest, #31333) [Link]

> Ah ok. But renaming files should always have an fsync in it... shouldn't it? I mean that's a point where you WANT the new data, not the old data.

Well, it depends. If your data is not that important then it's not that important. If the data you're dealing with is not that important, then losing the latest changes isn't going to be the end of the world. Who cares, really, if Epiphany missed that last bit of history in the last 60 seconds or so? Who cares that I just set the default font to Arial in the last 30 seconds? Just as long as my preferences and history are not wiped out completely or corrupted to the point where nothing will run without me going in and cleaning up the bad files.

It's a trade-off.

If the data is more important than good battery life, good disk performance, well-laid-out data on the block device, or the long life of your flash drive (etc.), then running fsync() all the time is what you want.

Some data is that important, other data not so much.

If all data were equally critically important, then it's just better to run in "sync" mode all the time and take the hit.

-------------

And remember that on a Linux system with stable drivers and decent hardware, running fsync() before a rename gains you almost nothing.

Garrett: ext4, application expectations and power management

Posted Mar 15, 2009 19:00 UTC (Sun) by kasperd (guest, #11842) [Link] (25 responses)

But renaming files should always have an fsync in it... shouldn't it? I mean that's a point where you WANT the new data, not the old data.
No, even at that point it might not be important to get the new data to disk right away. But it is still important to get either the old or the new data. However, ext4 leaves the disk contents in a state that didn't exist at any point during the changes the process made; this is because ext4 did the rename before writing the data.

The problem is that there exists no API that guarantees exactly the level of integrity needed in many cases. You used to be able to create a file and then rename it on top of an existing file to get what you wanted. The change forced you to sync to get the guarantee that you needed, but it gave you more than you wanted and was slower because of that.

So there are three possibilities:
  1. You want to get the new data on disk as quickly as possible and wait for it to get there.
  2. You want some good data on disk, either the old or the new, but you don't care which you get if the system crashes in the next minute or so, as long as it doesn't take too long for the data to make it to disk.
  3. You don't care about the data at all; it is perfectly fine for it to get lost or corrupted.
I'd say 3 is unlikely to be desired by many applications; why would you be writing the file in the first place if you didn't care about the data? But right now there is no API to give you 2, and the one that used to give you that now gives you 3.

Garrett: ext4, application expectations and power management

Posted Mar 15, 2009 20:19 UTC (Sun) by drag (guest, #31333) [Link] (10 responses)

I was thinking a bit more about it.

Basically people want 3 levels of data integrity in applications (paraphrasing what you and other people are saying):

1. High priority: Write data _now_. All data is safe in case of system failure.

2. Normal priority: Ensure no corruption of existing data in case of system failure.

3. Low priority: temporary data that will get used for a session. No requirements for preserving data in case of system failure.

Ext4 (as it existed) can only provide 1 or 3, but not 2.

Garrett: ext4, application expectations and power management

Posted Mar 15, 2009 20:32 UTC (Sun) by smoogen (subscriber, #97) [Link] (3 responses)

Of the file systems, is EXT3 the only one that gives that promise (and only by accident, as it was an unintended consequence)?

xfs would seem not to, and btrfs not to (going from the original blog post). I don't know about all types of reiserfs or jfs.

I am not saying the 'promise' is not important... but it might be one that file-system developers should be aware that people want, versus what they think people should expect :)

Garrett: ext4, application expectations and power management

Posted Mar 15, 2009 23:41 UTC (Sun) by drag (guest, #31333) [Link] (2 responses)

Ya. It seems to me that Ext3 only works that way by accident.

But it seems that for consumer devices this sort of behavior could actually be a fundamental design improvement over the way file systems have traditionally worked, and could be advertised as an actual selling point (that is, being able to deliver promise #2 reliably).

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 16:12 UTC (Mon) by jspaleta (subscriber, #50639) [Link]

I've always wondered.. how many of the more important or more impactful improvements in technology in the long view of history were simply uncharacteristically happy accidents versus premeditated "design" decisions.

-jef

Garrett: ext4, application expectations and power management

Posted Mar 19, 2009 23:28 UTC (Thu) by jzbiciak (guest, #5246) [Link]

Not really by accident. I believe the necessary dependence is established by the "data=ordered" mount option. That's pretty much what we need to fix this issue: Make sure that the data is on the disk before you write the updated metadata.

That doesn't mean you need to flush things to the disk early. It just means that things have to happen in a particular order.

The three levels of write priority

Posted Mar 16, 2009 7:41 UTC (Mon) by rvfh (guest, #31018) [Link]

I like the three levels of commit priority you set out, and I would rather this were an open() option than an application decision to call fsync() (and when would it call that in case 2?):

1. O_COMMITQUICK: commit to disk every 5 seconds
2. O_COMMITNORMAL: commit to disk every 30 seconds
3. O_COMMITLAZY: commit to disk only if need be, or maybe after 300 seconds

Just my 0.02€
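
To be clear, these flags are purely hypothetical - nothing like O_COMMIT* exists in any kernel - but as a sketch of how the proposal might look to an application (the #define is only a placeholder so the fragment is self-contained):

/* Hypothetical sketch: O_COMMITNORMAL does not exist in any kernel.
 * The placeholder define only makes this fragment compile. */
#include <fcntl.h>

#ifndef O_COMMITNORMAL
#define O_COMMITNORMAL 0   /* a real flag would come from the kernel headers */
#endif

int open_prefs_for_rewrite(void)
{
    /* Ask for "commit within ~30 seconds" per the proposal above; the usual
     * write-then-rename sequence would follow as before. */
    return open("prefs.new", O_WRONLY | O_CREAT | O_TRUNC | O_COMMITNORMAL, 0644);
}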

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 9:58 UTC (Mon) by mjthayer (guest, #39183) [Link] (3 responses)

I have asked this a couple of times but not yet got a good answer. I presume that the kernel knows what has been written back and what not. Can't it optionally keep its own log - either in a file on the filesystem or in pre-allocated blocks on a swap device - where it writes details of any transaction which the target filesystem won't write back within a certain maximum timeframe? When the filesystem does do the writeback, the transaction can be purged from the log. This could be enabled or disabled for the entire system, regardless of what filesystems are in use, and would not require Ted to add code he doesn't like.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 10:52 UTC (Mon) by MathFox (guest, #6104) [Link] (2 responses)

Michael, yes, the kernel could do it, but such a log would have to be written to disk... But then it would be more efficient to write that log directly to the file system.
You'll create similar issues wrt. performance and commit intervals with a kernel-based log, but with the added overhead of writing data twice.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 10:54 UTC (Mon) by mjthayer (guest, #39183) [Link] (1 responses)

Would that apply even if the blocks for the log were reserved in advance and their location known to the kernel?

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 10:55 UTC (Mon) by mjthayer (guest, #39183) [Link]

I will answer my own question - presumably yes, because the kernel can't assume that the filesystem does a simple block to disk mapping.

Garrett: ext4, application expectations and power management

Posted Mar 18, 2009 17:19 UTC (Wed) by rich0 (guest, #55509) [Link]

I agree with your points. And putting fsyncs all over the place in applications is not very helpful.

My mythtv backend (which does a lot of other stuff as well) used to have lots of problems with ivtv buffer overruns. It turns out that mythtv uses a fairly small cache, and when it writes to disk it does an fsync on every write. That means that the disk write cache is almost constantly getting flushed and the ability of the kernel to re-order writes is compromised, which then causes I/O waiting when the system is busy with other stuff as well.

When I increased the buffer moderately and got rid of the fsync everything worked great. So, if I lose power maybe I might lose an extra 10 seconds of the TV show I was recording. However, before the fix I was getting glitches in the video all the time due to overruns.

The role of the OS should be to allow applications to indicate the sensitivity of data and then the OS should figure out how to balance contention for the disk taking into account this kind of weighting. Applications should not be micro-managing the disk cache - that defeats the ability of the kernel to optimize the cache.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 3:14 UTC (Mon) by Nick (guest, #15060) [Link] (13 responses)

> No, even at that point it might not be important to get the new data to disk right away.
> But it is still important to get either the old or the new data. However, ext4 leaves the disk
> contents in a state that didn't exist at any point during the changes the process made;
> this is because ext4 did the rename before writing the data.

> The problem is that there exists no API that guarantees exactly the level of integrity needed in
> many cases. You used to be able to create a file and then rename it on top of an existing file to
> get what you wanted. The change forced you to sync to get the guarantee that you needed, but
> it gave you more than you wanted and was slower because of that.

rename is a metadata operation, which is atomic. If you have not guaranteed the data is on disk
with fsync, then seeing the new file with no data after a crash is one obvious outcome.

And if you rely on btrfs to flush on rename, or ext3 semantics or whatever, then the app is still
broken.

"write, fsync, rename" is the sequence you need for correctness. If you don't need the new data
right away, then defer the fsync,rename part until the point at which you do need it. If you see or
percieve some performance problem with sequence required for correctness, then the answer is
absolutely not to destroy correctness or hope to rely on some undocumented implementation
detail. Raise the issue on lkml, provide details, suggest additional APIs etc.
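
A minimal sketch of that write, fsync, rename sequence (illustrative paths, abbreviated error handling):

/* Sketch of the write, fsync, rename sequence described above. */
#include <fcntl.h>
#include <unistd.h>

int replace_file(const char *tmp, const char *target, const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);
    /* Only after the data is known to be on disk is the old name replaced. */
    return rename(tmp, target);
}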

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 3:34 UTC (Mon) by drag (guest, #31333) [Link] (12 responses)

What is so wrong with a file system honoring the order of operations?

I mean if an application does a write then rename, why not wait to commit the rename to disk until after the write is committed?

Nobody cares if the data is flushed to the drive immediately on a rename; just that the data is on the disk by the time the rename is on the disk. That way if the system crashes then your old copy of the data is still valid.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 4:07 UTC (Mon) by k8to (guest, #15413) [Link]

Because write is not a write to disk, and rename is not a rename to disk. They do occur in order as far as the application can perceive.

That they do not occur in order on disk is what you would want for the usual case.

This is a situation where the apis should be enhanced so that the application can tell the system what it needs.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 5:02 UTC (Mon) by dlang (guest, #313) [Link] (2 responses)

because providing the ordering that you want would kill performance. it would mean that you could not reorder I/O from the order that the various programs happened to ask for it to something that the storage system can do more efficiently. it would mean that the storage system would (in most cases) not be able to combine separate I/O operations into a smaller number of them.

and as a result, it would also cause the drives to wear out faster as they seek across the entire drive more.

you may think that you want that sort of guarantee, but you really don't. if you did then the 5 second window that ext3 has would be completely unacceptable to you as well.

Partial Ordering and Disk I/O

Posted Mar 16, 2009 13:21 UTC (Mon) by Pc5Y9sbv (guest, #41328) [Link]

I wish someone deeply familiar with file system design would give a detailed answer to this question. I am a computer scientist and software architect but don't have practical experience writing or optimizing general purpose file systems. I would, however, love to see pointers to more detailed reading.

But intuitively, I don't think it is as bad as you state. To honor the POSIX ordering all the way to disk would introduce a partial order on write operations, easily imagined as a queue-like structure comprised of a DAG of requests sequenced by write barrier relationships. Each set of siblings and descendants may be reordered, and this need only be maintained in system RAM and mapped to write barriers in the final queued I/O layer to disk. The kernel I/O scheduling would make some of the ordering decisions in mapping the DAG into a stream with write barriers, and leave the rest up to the disk controller. (Examples of mapping the DAG to the stream include deciding how bands of unordered writes from two different streams would be merged into the same band of the final stream, where that band is a set of writes between two write barriers, versus staggered out at different rates to adjust throughput of different streams.)

The sources of this partial ordering information could be explicit syscall/API extensions for write-barriers, but could also be heuristics for cases like that under discussion: maintain ordering with respect to batches of inode-file content writes and inode-linking metadata writes, and related atomic actions like separate relinks of the same file inode or directory inode. This would cover the broad range of "make file content available under a name" crash-recovery semantics and then some...

Coming from a scientific computing background, I suspect most more complex file writing scenarios, such as shared write access from multiple processes, would already have taken into account more elaborate rollback and recovery strategies for the file content in the case of crashes.

What is so wrong with a file system honoring the order of operations?

Posted Mar 20, 2009 19:06 UTC (Fri) by anton (subscriber, #25547) [Link]

because providing the ordering that you want would kill performance. it would mean that you could not reorder I/O from the order that the various programs happened to ask for it to something that the storage system can do more efficiently. it would mean that the storage system would (in most cases) not be able to combine separate I/O operations into a smaller number of them.
No (to each of these statements). A file system could combine many operations into one large batch, write out the batch in any order and with as few I/O operations as it (or the drive) likes, then commit the whole batch by writing one commit block. That would be efficient. Of course this means that no old block must be overwritten before the commit block is written, but that can be achieved by using a journal or a copy-on-write file system.

And yes, I want that guarantee, I really do, and I don't care if the file system loses 5 or 30 seconds of operations in the case of a crash, but I do care if what it gives me is a state that never logically existed before the crash.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 8:47 UTC (Mon) by Nick (guest, #15060) [Link] (7 responses)

> What is so wrong with a file system honoring the order of operations?

There is nothing wrong with it, what is wrong is an application ignoring the documented
standards and assuming it will "honour" some semantics that they happen to think are
reasonable.

Historically the reason why they don't do this is performance. POSIX as far as I can see encoded
existing semantics in this regard, rather than a case of some particular OS or filesystem
developers making some legal interpretation of the document that goes against the spirit of it.

> I mean if an application does a write then rename, why not wait to commit the rename to
> disk until after the write is committed?

You could, but that's not a trivial thing to do for a lot of filesystems (without resorting to an
fsync), and it would also cost performance for apps that don't want it.

> Nobody cares if the data is flushed to the drive immediately on a rename; just that the
> data is on the disk by the time the rename is on the disk. That way if the system crashes
> then your old copy of the data is still valid.

The way to do that is with fsync. If some filesystem happens to honour flush on rename, you are
still going to need fsync in order to have a correct and portable app, unfortunately. If you just
want the ordering but not the synchronous write that fsync gives, then you need to propose a
new syscall API for this (which would degenerate to fsync if a particular filesystem can't handle it
nicely).

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:25 UTC (Mon) by jamesh (guest, #1159) [Link] (6 responses)

> There is nothing wrong with it, what is wrong is an application ignoring
> the documented standards and assuming it will "honour" some semantics
> that they happen to think are reasonable.

Which standard is the application not honouring? The POSIX standard leaves behaviour over system crashes undefined so they can't rely on that one.

Absent some other standard to define the behaviour on crash, applications are left to assume that the implementation defined behaviour is sane.

Given the POSIX defined behaviour of rename() when the system isn't crashing and real world behaviour of ext3, zfs, etc, having the filesystem attempt to preserve the atomic "old content or new content" behaviour seems desirable.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:45 UTC (Mon) by k8to (guest, #15413) [Link] (5 responses)

They are not honouring the requirement for them to express that the data be on the disk when the rename is applied.

That's not wrong. It's just wrong if the application requires that the data be on disk after crash, which is what everyone is bitching about.

In the replace-with-rename pattern, it's wrong.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:59 UTC (Mon) by jamesh (guest, #1159) [Link]

> They are not honouring the requirement for them to express that the data
> be on the disk when the rename is applied.

Right. There doesn't seem to be a way to do this without requiring that the data be written to disk right now. In these cases, the application is fine with delayed writes -- they just want the ordering of the write and the rename to be preserved.

> That's not wrong. It's just wrong if the application requires that the
> data be on disk after crash, which is what everyone is bitching about.

That isn't what the applications require though. The behaviour they are after is for the rename to be recorded only if the associated writes are also recorded.

It is acceptable if the rename is lost by a system crash. What is not acceptable is for the rename to occur but not the write.

If the application wanted to be sure that the data had been flushed, before the rename, then yes they should call fsync().

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 15:03 UTC (Mon) by drag (guest, #31333) [Link]

> That's not wrong. It's just wrong if the application requires that the data be on disk after crash, which is what everyone is bitching about.

Well they want either the old data or new data to be in a file system after recovering from a crash. Not files full of zeroes...

People are willing to put up with missing X number of seconds of work from the vast majority of applications they are using.

It's actually rare that people want data immediately written to disk. Stuff they want saved very carefully and immediately is generally going to be user-generated data (what you're editing with Emacs) and not automatically generated data (my application remembering the position of icons in my windows).

Forcing a commit immediately to disk seems to be a much bigger hammer than what is wanted. They just want the OS not to corrupt files if it can be helped.

If fsync() is the only way to have the OS not randomly blow away files on my hard drive, then so be it. It just seems like there should be a better way.

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 10:09 UTC (Tue) by malor (guest, #2973) [Link] (2 responses)

What the author is arguing, and I agree with him, is that applications need a method to guarantee that the data on disk is always good, whatever version it is, but without the penalty of a full fsync. That may not matter _that_ much on a server or desktop, but on a laptop it means the drive absolutely has to spin up from sleep, or can't sleep in the first place. This is a substantial battery hit. I don't have any easy way to test it, but hard drive spinups are expensive as hell (and slow), so it wouldn't shock me if this ext4 behavior change singlehandedly wiped out a good chunk of the work done to improve kernel power usage on laptops.

Atomic rename is not the same thing as fsync. Telling application authors that they have to use fsync is yet another example of, when something is hard to do in Linux, telling the user that what he or she wants is wrong and stupid. This pattern goes way, way back.

Once upon a time, in the early days of Linux, I commented on Slashdot that ext2 was a bad filesystem, and would lose data if the computer crashed or lost power. I was informed, by numerous people, that the data loss was my fault because the computer wasn't on a UPS, and that I should 'simply' have manually run a disk editor and restored a backup superblock to recover the corrupted files. Seriously: lost data, they claimed, was my fault because I didn't understand the layout of ext2 well enough to fire up a hex editor when it crashed.

Well, sometime in the next year or two, journaling showed up, and suddenly everyone was all about how wonderful it was, how horrible ext2 was in comparison, and how no sane person would use ext2 in production. But when I'd said that, when there was no other option, I was wrong and stupid for wanting reliability in my filesystem.

I see this argument the same way; by accident, the ext3 writers provided a very useful feature. Atomic rename isn't fsync; it's much lighter weight. People are not wrong and stupid for wanting it, but because it's hard, that's practically the first thing out of people's mouths. "You can't do that on ext4. That's not the POSIX semantics, and you're foolish to expect this behavior."

I disagree vehemently. It's a very good feature, and even if it "isn't the Posix standard", you guys should bring this behavior forward. Doing it via the regular rename operation might be a good choice, because it's backwards-compatible with the original accidental feature. Or, perhaps you'll instead want to add an explicit atomic rename operation, so that filesystems like xfs won't surprise users unpleasantly. That would require more pain on the part of application developers, but would make the guarantee explicit instead of implicit, which is probably better from a design perspective.

But telling people to use fsync instead of atomic rename, and that they're wrong and stupid for wanting a feature that's hard to do, is just a tired repetition of a very old game indeed.

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 15:37 UTC (Tue) by smoogen (subscriber, #97) [Link] (1 responses)

As far as I can tell... the only way you are going to get what you want is an fsync() or a battery-backed cache. Disk drives are limited to writing or reading and are pretty much 'linear' devices in that regard.

In the past, the fsync sort of happened every 5 seconds so you never really spun down your disk. It was the reason why people considered ext3 a slow filesystem compared to xfs, etc etc. One can get better performance, but at the price of reliability.

Garrett: ext4, application expectations and power management

Posted Mar 18, 2009 7:54 UTC (Wed) by malor (guest, #2973) [Link]

It's not the 5-second thing. Rather, something about how ext3 orders writes means that, purely by accident, a rename of a file will always be done after the data blocks of the file have been written to disk. I have no idea why this happens, and it obviously wasn't an intended feature, but that's how it actually works out in practice. The fact that xfs doesn't do this, in fact, is one of the reasons it's considered unreliable by people who've used it on the desktop.

Even if disk spinups were once every five minutes instead of every five seconds, you would still get that behavior; all the data blocks of a given file would be written to disk before that file was renamed over another one.

This means that you're guaranteed to always have either the old data OR the new data. You don't know which you have, after a kernel crash or power failure, but you have one or the other. And this happens without needing to do an fsync, which is a different logical thing, and which absolutely requires a drive spinup. This sync-and-rename functionality is much lighter weight, and can happen pretty much anytime. It doesn't add to the power burden of using the disk, but still guarantees a form of data integrity that many applications find very useful.

Either good old data OR good new data is not the same as fsync. Telling programmers to use fsync is forcing them to use the hammer that's convenient, instead of the screwdriver that would better solve the problem.

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 15:10 UTC (Tue) by kjp (guest, #39639) [Link]

Wow. This thread is soaking up a huge portion of my work day. I posted this on Ted's blog:

>> in order to get very high performance levels, file systems usually combine 5-30 seconds worth of
>> file system operations into a single transaction commit.

Ted, thanks for continuing the dialogue here. It’s been educational. Thanks for putting the rename ‘kludge’ in, but I do think it’s absolutely necessary. I will give you my use case.

I use the write-close-rename pattern all the time with configuration files that I don’t care if I lose the last N seconds of changes to. This is easy and thread-safe and has worked for years on ext3. At the same time, I have a daemon that is continually writing out large log files. We also run on cheap IDE hardware due to cost pressures. It seems that fsyncs will force your above-described ‘mega transaction’ to complete, which involves seeking all over the disk to our other dirty log files. If I made multiple changes to a conf file during a 5-30 second ext4 flush interval, fsync will cause more seeks than not using it, which will wear out our disks.

You are right that the FS is not a database; for certain things I do not care about instant durability. So we can:
1. have the FS support write barriers on rename or other new api
2. do writes to a user space cache daemon that only flushes when necessary
3. make every app much more complicated and cache its own data

#1 already works.

ordered(tm) brand

Posted Mar 15, 2009 16:33 UTC (Sun) by szh (guest, #23558) [Link] (15 responses)

ext4 should not call its journalling mode "ordered"(tm) if it can't support the same (THE BEST) semantics as ext3. By calling it "ordered" they abused the brand and fooled users, probably unintentionally.

I have a case where I intentionally don't use fsync (several-MB files), in order not to wait. I have a few auto-backup-restore copies instead.

ordered(tm) brand

Posted Mar 15, 2009 20:10 UTC (Sun) by nybble41 (subscriber, #55106) [Link] (9 responses)

As best I can tell, the semantics *are* the same as ext3. The only difference is in the interval in which an unclean shutdown can result in application-level inconsistency. For ext3 that window was about five seconds; for ext4 it's significantly longer. In both cases data journaling (data=journal) should preserve the same order on disk as exists in RAM, though at a significant cost in performance.

The default setting, data=ordered, only guarantees that any given file's contents are committed to disk before that same file's metadata -- essentially just the file's size -- thus ensuring that the on-disk version of the file never contains uninitialized data. It doesn't make any guarantees, in ext3 or ext4, regarding the ordering of writes to separate files or directories. Similarly, the semantics of rename() are such that atomicity is guaranteed only with respect to the directory entry, not data or metadata associated with the files themselves.

The semantics are NOT the same

Posted Mar 15, 2009 20:56 UTC (Sun) by khim (subscriber, #9252) [Link]

From the user's POV the semantics are wildly different: with Ext3 I'm guaranteed my P2P state will not be corrupt - no five-second window. With Ext4 I'm almost guaranteed it'll be destroyed in a crash. Big difference.

Granted - it looked like "more-or-less" the same mode from the filesystem developer's POV, but that's no excuse... Maybe it's an accident, maybe not - but the semantics of the ext3 and ext4 ordered cases were quite different...

ordered(tm) brand

Posted Mar 16, 2009 12:35 UTC (Mon) by nye (guest, #51576) [Link] (7 responses)

>The default setting, data=ordered, only guarantees that any given file's contents are committed to disk before that same file's metadata

That's true for ext3 but not for ext4, which is the point of the discussion.

ordered(tm) brand

Posted Mar 16, 2009 15:24 UTC (Mon) by nybble41 (subscriber, #55106) [Link] (6 responses)

Could you give an example of a case where it isn't true for ext4? The point of the discussion seems to me to be that some application developers assumed that data=ordered implied full on-disk ordering of filesystem operations, when in fact it only guarantees limited ordering within individual inodes (ensuring that any given file's contents on disk are never uninitialized).

Will ext4 increase a file's size before committing its contents to disk? If so, that would be a separate issue independent of the write()/rename() ordering issue discussed thus far. If not, then the guarantee is the same as with ext3.

ordered(tm) brand

Posted Mar 16, 2009 17:19 UTC (Mon) by nye (guest, #51576) [Link] (5 responses)

I think we have a semantic misunderstanding.

From your statement that the semantics for data=ordered were the same between ext3 and ext4, I assumed that you were using the word 'guarantee' to describe the known outcome of a particular operation, but I now realise that you meant the word more literally, as in 'this is the behaviour of the filesystem, and *this* is the subset of that behaviour which we guarantee will always hold true'. Since the actual behaviour was documented and predictable, and differs from ext4, I disagree that the semantics of the setting are the same, but concede that the actual guarantees made are.

Just so we can be clear about what's happening, I'm going to go off on a tangent and try to work this through (assuming the no-fsync rename case) as I understand it. This isn't really directed at the parent poster:

The case in question is that you create a file (a.new). It gets an inode describing a length of zero, and a directory entry linking that name to that inode. Is this true? Does it get that inode and directory entry? I'm not sure what steps are skipped in the case of delayed allocation, but both of these things appear to happen given that a zero-length file is indeed created upon a crash. Call this point A.

You write data to that file, and here the delayed allocation comes into play. The cached copy of the inode is updated. Is that true? I don't know if the in-memory representation corresponds to the on-disk representation. Clearly there can't be any real block pointers as the allocation hasn't happened yet.

Then 'a.new' is renamed to 'a', effectively unlinking the existing 'a' and 'a.new' and creating a new 'a' pointing to an inode previously known as 'a.new'. Call this point B.

At some point the inodes and directory entries are written, but because allocation is delayed, you are saving the version of the inode with size zero and no block pointers. Call this point C.

At this point you pray that there isn't a power cut.

At some point in the future the allocation is committed, and the inode for the new 'a' is updated to reflect this. Call this point D.

The question is, why is data created at points A and B committed to disk at (or by) point C given that it is already known, with certainty, that that data is useless until after point D? It appears that this will no longer happen in this particular case come 2.6.30, but is this not a specific case of some more general behaviour? Why commit the inode and directory entry for a file whose allocation hasn't happened yet?

ordered(tm) brand

Posted Mar 16, 2009 18:06 UTC (Mon) by nybble41 (subscriber, #55106) [Link] (4 responses)

"The question is, why is data created at points A and B committed to disk at (or by) point C given that it is already known, with certainty, that that data is useless until after point D?"

The thing is, this *isn't* known at the filesystem level. The application knows that the rename() is useless until the write() has been committed, but there is no API to communicate this information to the filesystem. Perhaps there should be, but the lack of appropriately fine-grained userspace APIs is not the fault of the filesystem authors. All existing filesystems, ext3 included, assume that the rename() and write() operations are independent; the cases where the ordering happens to be correct from the application's point-of-view are purely accidental.

The more general issue is that application writers are depending on filesystems to provide full data journaling, which is a major performance killer and was never actually guaranteed. Metadata journaling, as used by ext3 and ext4 by default, is only a replacement for the fsck process; as with all asynchronous, non-journaled filesystems, the state of the recovered filesystem after fsck or journal playback will be internally consistent, but may not match any state which actually existed in RAM before the crash.

ordered(tm) brand

Posted Mar 17, 2009 11:10 UTC (Tue) by nye (guest, #51576) [Link]

I really think it is known by the filesystem, because it knows that the file has unallocated data at the time that it makes that write. However, Ted has answered the question here: http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsyn...

Guarantees and the belt-and-braces of journaling

Posted Mar 18, 2009 0:16 UTC (Wed) by xoddam (subscriber, #2322) [Link] (2 responses)

The use-case of writing a new file and renaming it to replace an existing one is documented and recommended by expert practitioners[citation needed] as the best, nay the only, way to achieve atomic operations on a POSIX system.

Specifically, whenever a rename replaces an old file with a recently-written one, the application developer's intention is to achieve atomic replacement of the file's contents. Invariably. No exceptions. Even if you "kill -9 1", this will not cause corruption or truncation of the target file.

POSIX *does* guarantee this atomicity, with precisely *one* exception -- if the system crashes, behaviour is undefined.

A journaling filesystem exists for one reason only: to provide reasonable behaviour in the event of a system crash, i.e. to extend the guarantees POSIX provides and reduce the need to recover data.

In a couple of instances, users have observed that particular up-and-coming journaling filesystems make it more (not less!) likely for them to need to recover files than the status quo. It is only sane that they should report this as a bug. It has *nothing* to do with application developers, who are using the recommended pattern and generally don't have much influence over what happens when their users' computers crash.

It is wonderful news, then, that the developers of both filesystems (to my knowledge) that did exhibit such behaviour have listened to the requests of their users and let the journal extend the POSIX guarantee of atomic replacement on rename across system failures.

Hurrah and thank you.

Discussion of fsync is a complete red herring. On older POSIX-conforming filesystems there is NO GUARANTEE AT ALL that the filesystem will be accessible after a system crash, fsync or no. On some implementations, fsync can indeed make this particular kind of data loss less likely (and application developers in-the-know have used it for this purpose). There is still no POSIXLY_CORRECT guarantee that data will not be lost, so for a filesystem developer to say that his users don't really deserve to benefit from the safety that journaling can afford until application developers have jumped through an extra latency-imposing hoop is a bit rude. Not to say, putting the cart before the horse.

By the way, Ted Ts'o is 90% correct to say application developers shouldn't fear fsync, and 100% wrong to say that fsync is the correct way to achieve atomic replacement with rename. Rename alone is supposed to achieve this; if a filesystem is technically capable of preserving this guarantee even across system failures then it should do so.

Guarantees and the belt-and-braces of journaling

Posted Mar 20, 2009 13:28 UTC (Fri) by regala (guest, #15745) [Link] (1 responses)

Journaling is not here to preserve data, but to preserve integrity. You would be pleased if the data were on disk, but the filesystem got broken and unrepairable...
People need to know why journaling was introduced, and clearly it is not here to preserve your little settings you got smashed because you wanted to play World of Goo. Get serious.

Guarantees and the belt-and-braces of journaling

Posted Mar 20, 2009 15:41 UTC (Fri) by foom (subscriber, #14868) [Link]

> People need to know why journaling was introduced, and clearly it is not here to preserve
> your little settings you got smashed because you wanted to play World of Goo. Get serious.

If journaling is not for that, then I want something which is! And yes, I do want to play World of Goo!

Why should I give a damn about the filesystem structure except as a prerequisite to being able to
get to my files? I want my files, and that means file *content*. So I want a system which does a
reasonably reliable job of ensuring that content doesn't disappear. Ext3 is such a system. Maybe it
was unintentional at the time it was designed, but now that it's recognized as a good idea, let's
*keep* making systems that work as well as it does!

ordered(tm) brand

Posted Mar 16, 2009 0:09 UTC (Mon) by njs (subscriber, #40338) [Link] (4 responses)

ext3's "ordered" mode does *not* have the best semantics; it (accidentally) makes very strong ordering guarantees -- in fact, much stronger than are needed for atomic-rename -- and the consequence of those strong guarantees is that fsync() becomes unbearably slow.

And one direct consequence of this is that on ext3, firefox *cannot* guarantee the safety of, e.g., your browsing history -- there is no way to do it that is fast enough for users to put up with (they tried). ext4 + Ted's flush-on-rename patch provide all the useful parts of ext3's semantics, while also making fsync fast and thus putting *less* data at risk than ext3.

ordered(tm) brand

Posted Mar 16, 2009 2:38 UTC (Mon) by mjg59 (subscriber, #23239) [Link] (3 responses)

Well, no, if it flushes on rename then I disagree that it provides all the useful parts of ext3's semantics.

ordered(tm) brand

Posted Mar 16, 2009 4:45 UTC (Mon) by njs (subscriber, #40338) [Link] (2 responses)

"flush-on-rename" is probably a misleading description (I just don't know a better short phrase for it). From Ted's blog post, it seems pretty clear that ext4+patch gives the same behavior for write-then-rename as ext3, i.e., it ensures that the new file data lands on the disk at the same time as the new metadata (whenever that ends up being). So... flush-on-rename-taking-effect, or flush-from-virtual-pages-to-allocated-pages or something like that.

(I think that answers your objection; it's a bit terse.)

ordered(tm) brand

Posted Mar 16, 2009 13:24 UTC (Mon) by nye (guest, #51576) [Link] (1 responses)

I didn't interpret Ted's post the way you did, though now I've read through it again I can see that you may be right. Specifically, Ted says:
>These three patches (with git id’s bf1b69c0, f32b730a, and 8411e347) will cause a file to have any delayed allocation blocks to be allocated immediately when a file is replaced.

I interpreted that to mean that those blocks would be written to disk as if fsync() had been used, but is that incorrect?

Am I correct in believing that your interpretation is as follows:
When a file is replaced, it is not marked for delayed allocation, so its data will be written immediately before its metadata, *whenever that happens to be*. In other words, the disk will be spun up only at the same time it would have been without those patches, but now the data will be written in addition to the metadata.

If that's correct, then it appears to be the correct resolution to me.

ordered(tm) brand

Posted Mar 16, 2009 21:29 UTC (Mon) by njs (subscriber, #40338) [Link]

Yeah, Ted's speaking filesystemdeveloperese there, but if you read other comments about the difference between ext3 and ext4 there's enough context to decipher. What he's basically saying is that at rename time they will now force the kernel to *decide* which disk blocks the data will eventually end up on (that's what "allocation" means), and then it's a pre-existing rule that before the metadata transaction commits (i.e., that's when the rename will hit disk) you have to write out all "allocated" blocks to disk.

So your interpretation is correct.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 2:37 UTC (Mon) by Tara_Li (guest, #26706) [Link] (12 responses)

I'm at a loss on one thing...

When I start up Gnome, or KDE, I'm not making any changes to my configuration - so why are all of the config files getting re-written? For that matter, why are they using a truckload of tiny files, instead of one big file of key=value lines? That way, you only update the metadata once, you pull it all into memory, edit it however you need to, and once you're done editing it, dump it back to the HD in one single write operation...

This trash of tiny little files buried down in a tree under a .directory, often containing binary data that makes it a PITA to manually edit, is a complete Windows-ism. Looks just like the stupid Windows Registry that is at the heart of 90% of the problems in Windows, as far as I can tell.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 4:38 UTC (Mon) by flewellyn (subscriber, #5047) [Link] (8 responses)

> When I start up Gnome, or KDE, I'm not making any changes to my
> configuration - so why are all of the config files getting re-written?
> For that matter, why are they using a truckload of tiny files, instead
> of one big file of key=value lines? That way, you only update the
> metadata once, you pull it all into memory, edit it however you need to,
> and once you're done editing it, dump it back to the HD in one single
> write operation...

You would think, wouldn't you? Lots-of-tiny-files is not a good idea for performance reasons, although I can see the advantage in not having to do any parsing, just traverse the directory tree to find the file with the setting you want.

But truncating and recreating them with (possibly) the same data? That's just insane. There is NO good reason for doing that. Even if the files are used the way I described above, and you need to make a change, there is no good reason to use O_TRUNC. Open a new file, write, fsync, close, rename. That's the safe way to do it. And don't recreate a file that isn't changing, for heaven's sakes! What are those desktop environment developers thinking?

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 5:24 UTC (Mon) by bojan (subscriber, #14302) [Link] (6 responses)

> there is no good reason to use O_TRUNC

O_TRUNC is used on the new file:

3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)

That's because a file by the same name (i.e. baz.new) may already exist and contain random garbage, which is then removed by O_TRUNC.

> And don't recreate a file that isn't changing, for heaven's sakes!

Yeah, that's one of the key things. Don't touch what didn't change.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 6:12 UTC (Mon) by flewellyn (subscriber, #5047) [Link] (5 responses)

Sorry, I should have been more specific. There's no good reason to use O_TRUNC on a file that the program itself created. Especially a config file.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 8:46 UTC (Mon) by gdt (subscriber, #6284) [Link] (4 responses)

So how do you propose to retain ownership, permission bits, creation time and other attributes? Or is yours one of those annoying applications which alter the permissions that I set manually?

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:40 UTC (Mon) by flewellyn (subscriber, #5047) [Link] (3 responses)

Reading the ownership, permissions, and other attributes from the existing file is not an onerous task. As for creation time, if it's a config file in the "one setting per file" mode, that really doesn't need to be preserved.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 15:40 UTC (Mon) by cortana (subscriber, #24596) [Link] (2 responses)

It is very onerous!

You have to replicate *all* of the following!

* user and group owner
* mode
* ACLs
* XFS ACLs (it uses its own non-posix API I think)
* attributes (some of which only root may set)
* extended attributes (some of which only root may set, and some of which only root may read!)
* user extended attributes
* XML user extended attributes (again, it uses its own non-posix API)
* reiserfs extended attributes (again, I believe it uses its own non-posix API. And probably reiser3 and reiser4 use different APIs...)

And this is just for apps for Linux. If you want your program to run on Windows, Mac OS X, FreeBSD, etc. etc., you have an entirely different set of tasks to perform...
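
Even the easy POSIX subset of that list (owner, group and mode) takes a few extra calls; a rough sketch, ignoring ACLs and extended attributes entirely:

/* Sketch: copy only the basic POSIX attributes (owner, group, mode) from
 * the file being replaced onto its replacement.  ACLs and extended
 * attributes are not handled; changing the owner normally needs privilege. */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int open_replacement(const char *oldpath, const char *newpath)
{
    struct stat st;
    if (stat(oldpath, &st) != 0)
        return -1;
    int fd = open(newpath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (fchown(fd, st.st_uid, st.st_gid) != 0 ||
        fchmod(fd, st.st_mode & 07777) != 0) {
        close(fd);
        unlink(newpath);
        return -1;
    }
    return fd;   /* caller writes the data, then renames newpath over oldpath */
}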

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 12:13 UTC (Tue) by nye (guest, #51576) [Link] (1 responses)

Surely you can make a copy of the file and truncate that - or am I being naive?

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 13:23 UTC (Tue) by cortana (subscriber, #24596) [Link]

Is there a system call to duplicate a file and all those properties?

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 15:21 UTC (Mon) by nix (subscriber, #2304) [Link]

Does KDE recreate everything when you start it? I see no sign of this.
Indeed the ksycoca system takes exactly the approach you suggest,
*reading* all the KDE configuration (heaps of tiny files, some
in /usr/share, some in ~/.kde) and dumping it into a binary file
in /var/tmp for use by the KDE apps.

Regular rewriting of the configuration files certainly doesn't happen
after that, because KDE maintains inotify watches on all the configuration
files and reruns kbuildsycoca whenever they change. This takes several
seconds so you do *not* want to trigger it unnecessarily, by doing
pointless writes.

Looking at my KDE session now, only one process has any files under ~/.kde
open, and that's akregator and those are its archive files which are most
definitely *not* small files, nor rewritten constantly (they come to about
100Mb altogether).

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 5:13 UTC (Mon) by drag (guest, #31333) [Link] (1 responses)

> Looks just like the stupid Windows Registry that is at the heart of 90% of the problems in Windows, as far as I can tell.

The Windows registry is a real database that stores its data in a few hive files.

The developers for the Linux desktop are using the file system as the backing store for their settings, which is different. This way is much superior in terms of stability and usability, but tends to tax the file system.

Of course Linux is Unix, right? Very file oriented... And thus it should be able to handle lots of small files easily, right?

Because I REALLY REALLY dislike the tendency for developers to be lazy and simply dump their configurations into a SQLite file. Flat-file access is actually usually much faster and more robust. The single-file or SQL approach is actually very Windows-like. The wrong bit out of order and you're toast.

What Ts'o recommended desktop developers do, which is to use a database or a single file for storing everything... that is actually what Windows does. (Maybe the registry concept makes better sense now? Considering it was designed for the FAT32 days.)

------------------------

You say one big file for storing hundreds of applications' configs would be superior, but I don't think so.

Not any more than the mbox format is superior to maildir.

Remember, you're not dealing with a single application, or even a dozen applications. The Gnome or KDE folks are trying to design a generic API that can be accessed and used by thousands of programs without them stomping on each other and without the application developer really having to focus on low-level features.

With Gconf I can 'reset' an application simply by logging out and then deleting its configuration from my home directory. And it's much easier to deal with than having each application trying to dream up its own .*rc file format (mostly because I only have to learn the syntax once rather than a thousand different syntaxes, and almost all the key/value pairs are self-documenting via the gconf-editor interface... it quite often gives you possible values and an English description of what they do.)

Now I don't know how KDE deals with stuff like that. KDE has always been much harder for me to deal with than Gnome... but to each their own.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 15:27 UTC (Mon) by nix (subscriber, #2304) [Link]

KDE has a daemon that keeps an inotify watch on all the configuration
directories, and rebuilds its binary-form system configuration cache from
the configuration state whenever that state changes.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 15:33 UTC (Mon) by cortana (subscriber, #24596) [Link]

Gconf will actually quite happily write to a single data file. But for some reason, the default is for a million little directories, each with a data file containing just the values for that directory in the Gconf hierarchy; and there is no user-accessible way to switch from one format to the other.

I just checked gconf's README.Debian file and it says,

> By default, the home directory structure is created as a tree layout
> since it improves write performance. If you want to use a merged tree
> on the home directory, you should run the following command:
> gconf-merge-tree ~/.gconf/

It occurs to me that the write performance would not be an issue if it didn't write out the entire file each time an entry was changed, but instead collected up the logical writes and did a single rewrite of the %gconf-tree.xml file after a certain period of idle time (or 5 seconds maximum). I should probably file a bug about that...
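
A rough sketch of that coalescing idea (the one-second idle threshold here is an assumption; the 5-second cap is from the comment above):

/* Sketch: mark the tree dirty on each logical change and rewrite the file
 * only after a quiet period, or after 5 seconds at most.  maybe_flush()
 * would be called periodically from the application's main loop. */
#include <stdbool.h>
#include <time.h>

static bool   dirty;
static time_t first_change, last_change;

void note_change(void)
{
    time_t now = time(NULL);
    if (!dirty)
        first_change = now;
    last_change = now;
    dirty = true;
}

void maybe_flush(void (*rewrite_whole_file)(void))
{
    time_t now = time(NULL);
    if (dirty && (now - last_change >= 1 || now - first_change >= 5)) {
        rewrite_whole_file();   /* single rewrite of the merged tree */
        dirty = false;
    }
}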

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 3:36 UTC (Mon) by neilbrown (subscriber, #359) [Link] (9 responses)

The bulk of that post seems to be saying something quite different from (though not incompatible with) the footnote that has been quoted.

The main point of the article seems to be something about power management (hence the title). Forcing a 'sync' on 'rename' implies the drive has to be written to before each rename. If instead the filesystem imposes an ordering between the flush and the rename, but doesn't necessarily hurry either of them along, then you get the guarantees which (it is claimed) application writers want, without the power costs that Matthew is (justifiably) concerned about.

In contrast, the quoted footnote is a somewhat aggressive way of saying "let's sit down and develop an API for telling the filesystem that a given collection of files should be optimised for 'database-like access'", which means (I think) "expect small files, don't worry about hard links or differing access modes, etc".

In response to the first point, I agree that it might be nice, but I don't envy the various filesystem designers the task of implementing it. Ordering considerations are fairly fundamental to the design of a journaling system. Adding extra requirements at the last minute would be quite non-trivial. If we as a community really want stronger ordering rules than POSIX provides, then we should really have a broad and open discussion about that, rather than ranting about some recently-apparent breakage.

In response to the second, we again need open and constructive conversation. Supporting "lots of small files" and still allowing hard links and chmod and extended attributes would be a significant challenge for a filesystem. I suspect that the easiest approach would be to use a "database-like" approach for files in a directory until some operation is attempted which doesn't fit, and then move that file out of the "database". e.g. store file contents inside the directory until the file exceeds 512 bytes, or a hard link is created, or it is renamed to a different directory, or a chmod/chown is performed.

For this to be truly useful there would need to be general agreement about what operations are allowed to "break" the database. Hence the need for an API. The API doesn't need to mean new syscalls or new fcntl calls. It just needs to be an agreement between filesystem developers and application developers.

The overlap between these two considerations (power-friendly data integrity and small-file optimisation) is the question of how to provide transaction semantics across a set of small files. One idea that occurs to me is to allow file locking to be applied to directories. If an application takes an exclusive lock on a directory, then we could arrange that no changes made are externally visible until the lock is voluntarily
released. If the lock is released by application-exit or system-crash, then the contents of the directory remain unchanged. If any operation is attempted on a locked directory which would break the "it is a database" property, that operation is disallowed.

I wonder if that could be made to work... and if it would actually be useful. It would certainly be a challenge to export some of this via
NFS :-)

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 4:47 UTC (Mon) by flewellyn (subscriber, #5047) [Link] (8 responses)

Or, we could use filesystems like, well, filesystems. And if we want databases, we use databases.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 4:58 UTC (Mon) by neilbrown (subscriber, #359) [Link] (5 responses)

What would you suggest is the key difference between the two, which would allow people to decide which to use in a particular situation? In my view they have fairly independent strengths, and if we could unify them that would be useful.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 9:49 UTC (Mon) by mjthayer (guest, #39183) [Link]

I think that a database lends itself better to indexing than most filesystems, notwithstanding indexing daemons running in the background and holding millions of inotify fds.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 17:56 UTC (Mon) by flewellyn (subscriber, #5047) [Link] (2 responses)

That's a fair question. I think the main issue is the level of abstraction: databases are more abstract than filesystems, and depend on properly working filesystems for their correct operation. A database is one particular application of data storage, while a filesystem is a general mechanism for data storage.

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 14:46 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

Databases depend on filesystems for their correct operation?

What about native Pick, where the database IS the filesystem?

Or Oracle, where it's configured to use raw partitions for data storage?

Cheers,
Wol

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 17:58 UTC (Tue) by flewellyn (subscriber, #5047) [Link]

I'd argue that those databases implement their own filesystems.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 18:44 UTC (Mon) by larryr (guest, #4030) [Link]

Good support for using the filesystem as an efficient mechanism for a persistent hierarchical collection of named values is what I would like. Where the names are the filenames and the values are the file contents. Similar to sysfs. I think a lot of people are using sqlite or dbm files for this because using filesystem operations takes too long.

Larry@Riedel.org

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:57 UTC (Mon) by flammon (guest, #807) [Link] (1 responses)

Too bad Hans is in jail, because I think that Reiser4 was designed to address the many-small-files problem with something called block sub-allocation (http://en.wikipedia.org/wiki/Block_suballocation). Maybe we can salvage a few ideas from Reiser4.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 18:30 UTC (Mon) by job (guest, #670) [Link]

Btrfs has that feature as well, according to documentation.

Wishful thinking

Posted Mar 16, 2009 3:39 UTC (Mon) by bojan (subscriber, #14302) [Link] (58 responses)

> It's simple. open(),write(),close(),rename() and open(),write(),fsync(),close(),rename(), are not semantically equivalent. One is "give me either the original data or the new data"[2]. The other is "always give me the new data". This is an important distinction. fsync() means that we've sent the data to the disk[3]. And, in general, that means that we've had to spin the disk up.

Yeah, it would be nice if the semantics of rename were defined that way, wouldn't it? Alas, they are not.

I vote for having rename2(), which would do exactly that. Easy to detect in configure, so if the system has it, use it. Otherwise, fsync() and then rename().
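In the meantime, the portable fallback path looks roughly like this. A sketch only: "config.tmp" and "config" are placeholder names and error handling is abbreviated:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write the new contents to a temporary file, force them to disk,
     * then atomically replace the old file. */
    int save_atomically(const char *data, size_t len)
    {
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink("config.tmp");
            return -1;
        }
        if (close(fd) != 0) {
            unlink("config.tmp");
            return -1;
        }
        return rename("config.tmp", "config");  /* "always give me the new data" */
    }

With a hypothetical rename2(), the fsync() call would simply be dropped and the final call replaced.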

Wishful thinking

Posted Mar 16, 2009 4:27 UTC (Mon) by k8to (guest, #15413) [Link] (53 responses)

how about
sync_and_rename('src', 'dst')

Or something else that actually says what it does.

srename() if we have to keep things inscrutable.

Wishful thinking

Posted Mar 16, 2009 8:19 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (52 responses)

When would an application ever want the non-robust rename variant? It doesn't make any sense to want that, ever.

Wishful thinking

Posted Mar 16, 2009 8:56 UTC (Mon) by Nick (guest, #15060) [Link]

If there is no new API then the app has to do fsync then rename, and then any filesystem that could cheaply order the rename after previous data writes can't avoid the fsync overhead.

fbarrier(), fdatabarrier(), etc. would IMO be a nicer, simpler API for ordering. They might be harder to implement in a general way than a data/rename barrier, however.

Wishful thinking

Posted Mar 16, 2009 9:20 UTC (Mon) by bojan (subscriber, #14302) [Link] (50 responses)

Think temporary files and performance.

Wishful thinking

Posted Mar 16, 2009 14:45 UTC (Mon) by jamesh (guest, #1159) [Link] (49 responses)

Could you elaborate? In what situations do you imagine an application renaming a temporary file over an existing file name and not wanting the data flushed to disk before the rename is flushed?

For truly temporary files (as opposed to those used to stage atomic changes to long lived files), the application usually leaves them with their initial name, right?

Wishful thinking

Posted Mar 16, 2009 14:50 UTC (Mon) by k8to (guest, #15413) [Link] (39 responses)

Because rename does not always get used to replace files....

Wishful thinking

Posted Mar 16, 2009 15:02 UTC (Mon) by jamesh (guest, #1159) [Link] (2 responses)

The file system knows when it is being used to replace a file though, right?

So which cases where an application writes to a file then uses rename() to replace another file would they not want the ordering of the write and rename to be preserved?

Wishful thinking

Posted Mar 16, 2009 15:31 UTC (Mon) by endecotp (guest, #36428) [Link] (1 responses)

It's the converse question you need to ask: "which cases where an application writes to a file then uses rename() NOT to replace another file would WANT the ordering to be preserved?"

The first time you save a config file would be one example. Say my config file is a hundred bytes or so of XML. Each time I save it I first save it to a .new file and then rename it. The very first time I do this I am not overwriting an existing file, but every subsequent time I am.

Now in this case you can argue that the "old or new" guarantee is not really needed; if I open the file and find it doesn't contain valid XML (e.g. it's a block of zeros, or it's empty) then I can ignore the file and use defaults, and get the desired behaviour. But this needs extra work; I have to check the validity of the input in a way that is not needed for any other case, and, as we all know, rarely-executed code is a good place to find bugs. [My server crashed on the leap second, demonstrating the textbook example of this.]

Here's another example: I download a file into a .partial file. When the download is complete I do some sanity checks on the file (checksum maybe) and then rename it. My naive expectation is that when the application next starts it will either find a .partial file from an abandoned download, which it can delete, or it will find a data file that it can trust to be valid because it was sanity-checked before renaming. I really don't want to have to re-do the checking for all previous downloads when the app starts because users already complain about start-up time.

So unfortunately, simply adding extra safety in the case where the target of the rename already exists is not sufficient for these real cases.
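For the download case, the sequence would look something like this. A sketch: checksum_ok() is a stand-in for whatever sanity check the application performs, the names are illustrative, and error handling is abbreviated:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int checksum_ok(int fd)
    {
        (void)fd;
        return 1;   /* stand-in: a real implementation would verify a checksum */
    }

    /* Promote "file.partial" to its final name only once its contents check out. */
    int finish_download(const char *partial, const char *final_name)
    {
        int fd = open(partial, O_RDONLY);
        if (fd < 0)
            return -1;
        if (!checksum_ok(fd)) {
            close(fd);
            unlink(partial);             /* corrupt or abandoned: throw it away */
            return -1;
        }
        fsync(fd);                       /* make sure the verified bytes are on disk... */
        close(fd);
        return rename(partial, final_name);  /* ...before the name promises they are good */
    }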

Wishful thinking

Posted Mar 16, 2009 15:51 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

Then do it for all renames. The behavior is still sane and consistent, and handles your partial file case just fine.

Wishful thinking

Posted Mar 16, 2009 15:28 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (35 responses)

No, it doesn't, but that doesn't matter. Give me a concrete example use of rename in which the application developer didn't intend to insert a write barrier for the data blocks. Just one example.

I bet you can't -- because that's insane behavior! Without the write barrier, rename alone is useless.

Files that won't matter after a crash!

Posted Mar 16, 2009 16:03 UTC (Mon) by Pc5Y9sbv (guest, #41328) [Link]

The write-barrier ONLY serves a purpose for preserving IO order after a crash. For the processes using the file before the crash, the kernel is already providing the abstraction of ordered IO, and the on-disk ordering does not matter at all.

So, for many temporary files which will be purged/ignored on crash recovery, the write ordering doesn't matter at all. The only reason the files ever need to go to disk is for backing store in case the block cache is flushed.

So while I agree it would be nice if write-barriers could be on by default, this might need to be a configurable policy if it cannot be proven to have negligible performance penalty for all cases. And if it can ever be disabled as the default, then applications need a way to request it explicitly to override a default "fast but unsafe" treatment.

Wishful thinking

Posted Mar 16, 2009 16:46 UTC (Mon) by endecotp (guest, #36428) [Link]

I can't think of an example, and I would also be interested to hear of any that others can come up with.

However, what you can find are cases where the user doesn't really care much about robustness at all if it impacts performance. (Compiles, for example; I can imagine a compiler "safely" overwriting the output object file so that it doesn't leave a broken environment after ctrl-C. But the user might be happy to "make clean" after a system crash.)

So it may be legitimate for a filesystem to behave in this way iff there is a performance benefit.

Wishful thinking

Posted Mar 16, 2009 21:38 UTC (Mon) by bojan (subscriber, #14302) [Link] (32 responses)

Any file whose data you don't care about having on disk, and which can be discarded, is such a file.

Essentially, you are giving the kernel more room to manage writes the way it sees fit, which is always good for performance.

Wishful thinking

Posted Mar 16, 2009 21:50 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (31 responses)

No, not really. I'm talking about rename being a write barrier, and not making rename flush data to disk at the instant it's called. The performance impact is minimal.

In a common case, the file being renamed has no dirty blocks, so no more work will be caused by writing the data blocks before the rename.

When the file being renamed has dirty blocks, these blocks will have to be written anyway. Forcing them out before, instead of after, the metadata will have a negligible impact on performance, especially since the elevator can combine these writes with other ones at the time of its choosing, not the application's.

And actually, forcing application developers to call fsync is worse for performance than making rename a write barrier. If rename isn't a write barrier, rename without fsync is dangerous, and therefore applications will add fsync. These fsync calls are worse for system performance than the rename being a write barrier: fsync forces out-of-order, non-coalesced disk flushes and vast increases in application latency. It also diminishes the effectiveness of laptop_mode --- forcing the disk to spin up.

Now, sometimes you need temporary files. But when do you go around renaming temporary files you don't intend to keep after dumping lots of data in them? If they have lots of data, the write ordering forced by the barrier won't matter much. If they have a little data, they're probably not going to get written out at all before they're unlinked. Either way, rename-as-write-barrier doesn't affect performance of temporary files.

Besides: if temporary files are a bottleneck, metadata journaling will hurt far more than the write barrier anyway. When you really, really want to optimize the performance of operations on temporary files, either use tmpfs or a freshly-created ext2 filesystem.

Making rename a write barrier is a performance win. Avoid fsync.

Wishful thinking

Posted Mar 16, 2009 22:00 UTC (Mon) by bojan (subscriber, #14302) [Link] (26 responses)

If the directory you renamed the file in needs to be synced to disk (because another file in it changed and someone asked to fsync it, for instance; or because it has to be evicted from cache etc.) and rename is ordered, you need to commit the data in the file first, because that is your guarantee now. If that file is big and would not normally get committed (because it's temporary and would get removed a bit later), you just caused a performance hit.

For such a file, the application may not care that this particular temporary file is empty or corrupt if the machine crashes just then. It may just remove it and continue.

So, current POSIX rename semantics are there for a good reason - to allow kernel to order writes as it sees fit. Sure, it would be good to have a call such as rename2(), for which the order is guaranteed.

Wishful thinking

Posted Mar 16, 2009 22:08 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (25 responses)

That's an obscure corner case, and I've already addressed it in another reply. Shuffling around huge files with large numbers of outstanding blocks and then immediately deleting them is a contrived case that doesn't occur in real life.

Freedom isn't always a good thing. Making rename work consistently is far more important. The non-ordered rename case just doesn't make a whole lot of sense. It's not a useful degree of freedom, unlike synchronous-versus-cached, atomic-versus-not, and so on. The performance gain is minimal, and the danger large. Making rename ordered with respect to the data blocks in the file is a huge win.

Wishful thinking

Posted Mar 16, 2009 22:13 UTC (Mon) by bojan (subscriber, #14302) [Link] (24 responses)

> Making rename work consistently is far more important.

rename() already works consistently with documentation.

What you addressed in the other thread is that you would like rename() to work the way you _think_ it should work. We know that already.

Wishful thinking

Posted Mar 16, 2009 22:17 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (23 responses)

Let me repost here something I posted to Theodore Tso's blog.

With that out of the way, let’s talk about rename. It should create a conceptual write barrier for the data blocks of the file involved. It’s not inventing a filesystem semantic out of thin air any more than writing a zero-length file is: POSIX doesn’t say much at all about what happens after a crash, and so this whole discussion is uncharted territory. It’d be perfectly fine for a POSIX system to overwrite all your files with pictures of donuts on an unclean shutdown. This is not an issue of standards conformance: it is a quality of implementation issue. The standard allowing you to do something terrible isn’t an excuse for that behavior. It’s like saying “yes, it’s perfectly fine that I live off of Crisco and tequila. The law allows it!”

Now, first of all: there’s a lot of historical precedent for rename writing data blocks before metadata: not only does ext3 do it, but many older filesystems too. Certainly, many programs are written under the assumption that my rename semantics hold: and these programs work fine (in fact, better) on a running system.

Second, your rename behavior will lead to bugs now and forever: open-write-close-rename will work just fine on a running system, and there’s a good chance it’ll appear to work even if the developer takes the unusual step of testing during a system crash. Because this sequence will seem to work just fine most of the time, plenty of programs will have hidden data-loss bugs. That’s not a world I want to live in.

Third, there’s the issue of API parsimony. Your semantics change rename from a hard to misuse API to one that’s very prone to misuse. See http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html (”How Do I Make This Hard to Misuse?”) and http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html (”What If I Don’t Actually Like My Users?”). On that scale, you’ve moved rename from a very good 7 (”the obvious use is (probably) the correct one”) to an appalling -5 (”do it right and it will sometimes break at runtime”).

On a running system, of course, a rename is atomic with respect to both the filename and its contents — otherwise it’d be useless. Under your semantics, however, you’ve effectively made rename without fsync a useless and dangerous, yet very conceptually tempting operation. Scolding application programmers to insert fsync calls will lead to confusion and frustration: fsync, as “make the data hit the disk now” doesn’t have anything conceptually to do with atomic replacement except as an arcane filesystem implementation detail. Anything that appears to work in the typical case, but that does something dangerous in special corner cases, is broken by design.

When an application performs a rename, the *INTENT* is to insert a write barrier for the data blocks of the file involved. When is this interpretation ever wrong? When is it ever useful to be able to tell the system to replace a file’s contents and its name, except when the system crashes, in which case you just want its name and some garbage? Nobody ever wants that.

But for the sake of argument, let’s bite our tongue and insert this fsync. The system can’t tell the difference between an fsync intended to ensure, say, message receipt, and an fsync that ensures after-crash consistency across a rename. Because we’re blocking and waiting for disk IO, application latency greatly increases (by up to three seconds, apparently). Users begin to complain. Now, the application developer has two choices: either implement the threaded solution you mention, or remove the fsync.

The threaded solution gives the correct behavior, but is horribly complicated, or requires libraries on which the developer might not want to depend, especially for a small operation. Look what you’ve done now: not only have you made the correct code non-obvious, but you’ve made the correct code under-performing as well. It’s absolutely ludicrous to expect every program that wants to correctly replace a file’s contents to spawn off a worker thread.

Thus, most application developers will just remove the fsync. (Or do the moral equivalent, as KDE has done, and provide a knob to turn the fsync back on.) Now we’ve created a deliberate rare data-loss bug because the correct code is far too complicated.

Now, this situation in itself would be bad. But add laptop_mode, and now we’ve made an API the very contemplation of which drives men to unspeakable acts. We’ve added fsync everywhere, and we find it’s causing problems: the disk spins up all the time, as it must in order to maintain fsync’s semantics. So the solution is to neuter the very fsync you’ve implored application developers to add? Because you, the one making fsync a no-op, know that most of these fsyncs are there to maintain data consistency, you can have laptop_mode trade durability for battery life.

But some fsyncs are there to ensure application-level durability. Imagine an SMTP server. So, you create an “fsync-really-means-fsync” inheritable process flag. If an application developer has an *important* fsync call, he’ll just set this process flag and call fsync. Now, since that flag is contrary to 20 years of established use and will be a footnote on a newish version of the fsync manual, most application developers won’t actually know about it. Oh, they’ll call fsync, and their programs will appear to work just fine, even after dutiful hard-reboot testing.

Except when someone using laptop_mode has an unexpected power failure. Now that user has lost, and it wasn’t his fault. (”How the hell wasn’t the message on the disk? fsync returned success. Must be a bad disk. [hours pass] Oh, what’s this laptop_mode? Changes fsync? @!#$%!@#%”) And before you say “caveat modulator:” you shouldn’t need to be an expert on the data retention needs of each of your programs to extract good battery life.

Now when an application developer needs to actually use the *real* fsync, he turns on this process flag. Except he’s also dutifully using fsync to ensure rename consistency, so he has to create plumbing to manage the state of the magic fsync flag across different parts of his program so only the fsyncs that need to be real fsyncs are real. Let’s imagine this program also runs arbitrary other programs: it then needs to unset the magic inheritable fsync flag before fork, otherwise programs that don’t really need it will be running with the real fsync. That’s a non-trivial amount of work.

Also, application developers everywhere need to add autoconf tests for the magic process flag. Older programs will actually be broken, through no fault of their own. It’s either that, or rewrite the initialization scripts for programs that need the real fsync. (And in that case, the program may very well run far more real fsyncs than needed.)

Now you’ve made *two* traditional, long-standing system calls, rename and fsync, act dangerously in certain hard-to-test boundary cases, with elaborate and arcane workarounds that are so counter-intuitive (”fsync almost always means fsync?”) that developers will almost certainly get it wrong, at least the first time. Correct behavior might as well be in a disused lavatory behind a “beware of the leopard” sign.

What’s the alternative? fsync_and_i_mean_it? You could create an
fbarrier system call that applications would use to ensure data
consistency while preserving fsync’s current role. fbarrier might come
in handy in other contexts too. But of course that system call
wouldn’t be portable — but wait… we’ve already established that
when an application calls rename, it *means* to insert a write
barrier. fbarrier might be useful, but we can also infer it from a
rename call, and with perfect accuracy: when does an application *not*
want this behavior on rename?

So, just make rename include an implicit call to a conceptual
fbarrier. Existing applications work. Today. With no changes, or even
a recompile. Applications that call fsync before a rename at least do
no harm. rename remains an intuitive, powerful, and simple way for an
application developer to express what he wants to do (instead of being
a tasty-looking landmine). fsync doesn’t have to be treated specially
in certain bizarre modes. And you don’t really lose any efficiency,
because under your scheme, every correct application would have to
call fsync anyway — and I bet fbarrier would be far less expensive
than an outright fsync. (Or, if fsync really is cheap enough on a
given filesystem, make fbarrier *be* fsync.)

How often do you get to improve performance *and* safety at the same
time?

Wishful thinking

Posted Mar 16, 2009 22:20 UTC (Mon) by bojan (subscriber, #14302) [Link] (19 responses)

> It should

I didn't want to read past that, sorry. I can also imagine that things should be some way or the other. They are, however, not.

Wishful thinking

Posted Mar 16, 2009 22:27 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (18 responses)

Now you're just trolling. The whole core of your argument is that "since POSIX allows this, and some filesystems do it that way, we should keep doing it that way". It doesn't logically follow. Maybe if you were to read my posts, you'd get a sense for what logic looks like.

Wishful thinking

Posted Mar 16, 2009 22:36 UTC (Mon) by bojan (subscriber, #14302) [Link] (16 responses)

> It doesn't logically follow.

Actually it does. If you _don't_ allow for what POSIX specifies in your applications (which is where the problem is), then there will be consequences (i.e. the applications will lose files).

This can be properly fixed in two ways:

1. By calling fsync() from the application when required.
2. By introducing something new that does what you keep talking about.

Overloading specified behaviour with unspecified things is dangerous, because it encourages application writers to do the wrong thing. We've seen that before with XFS, and the wrong people got blamed that time too.

Sure, Ted is a practical person, so he doesn't want to break things unnecessarily. I admire him for keeping his cool.

Wishful thinking

Posted Mar 16, 2009 23:32 UTC (Mon) by jamesh (guest, #1159) [Link] (14 responses)

This whole issue is about what should happen in a case that POSIX doesn't specify, so I don't know why you keep on bringing this up.

In cases where POSIX does not specify behaviour, it is left up to the implementation. If the choice is between trying to provide the runtime atomic rename guarantee over a crash or slightly higher performance, I'd pick the first option. After all, that's why I am running a crash-resistant file system in the first place.

Are you seriously saying you can't understand the benefits of delaying IO but preserving the order of certain operations over a "do it now" fsync() call?

Wishful thinking

Posted Mar 16, 2009 23:42 UTC (Mon) by bojan (subscriber, #14302) [Link] (13 responses)

The problem is with portability. If you write your applications to _not_ do what POSIX requires, they will be broken when they go to a different system which happens to have an FS that doesn't order renames on disk.

> Are you seriously saying you can't understand the benefits of delaying IO but preserving the order of certain operations over a "do it now" fsync() call?

1. Yes, I can understand it.
2. No, this is not what rename() specifies.

So, when an application writer thinks that it will be like that everywhere, he/she is wrong and the application may lose data. That is bad.

Hence, I'm suggesting that for the cases where ordered rename is warranted, we should have a new API.

PS. As I explained elsewhere, unordered rename has its use as well, so one cannot just assume that everyone should drop that and do ordered. It is also not practical to demand that, because too many systems would have to be audited and changed to achieve it. And before you say "but don't we have to fix more apps already" - well, the applications are buggy right now according to specification - not the other way around.

Wishful thinking

Posted Mar 17, 2009 0:18 UTC (Tue) by nix (subscriber, #2304) [Link] (10 responses)

You're acting as if POSIX is set in stone and can never change to account
for new de-facto standards, when in reality that is the *only* way it ever
changes (and often Linux is the source of such changes).

Ten years ago, would you have been arguing that programs that relied on
symlinks were broken because POSIX did not require them?

Wishful thinking

Posted Mar 17, 2009 0:31 UTC (Tue) by bojan (subscriber, #14302) [Link] (8 responses)

> Ten years ago, would you have been arguing that programs that relied on symlinks were broken because POSIX did not require them?

If the programs correctly tested to see if the support is there and then refused to work if symlinks were not there, there would be nothing wrong with them. So, by all means, if you write an application that tests that the underlying FS has ordered renames and refuses to work otherwise with a sloppy open()/write()/close()/rename() sequence, that's perfectly OK. You just need to write even _more_ code to do this than if you just used fsync(). Up to you.

Wishful thinking

Posted Mar 17, 2009 1:24 UTC (Tue) by nix (subscriber, #2304) [Link] (7 responses)

The vast majority of programs, even when symlinks were optional, assumed
their presence, because the enormous majority of the installed base had
them.

This is actually worse. If you get the open()/write()/fsync()/close()/
rename() sequence wrong, by leaving out the fsync(), the visible effect
during development is *nil*, even on filesystems like pre-patch ext4,
because this is a change which only has an effect when something goes
really wrong and the OS crashes or you lose power at the wrong instant,
and if that happens, any data loss will be written off to the power
failure, like as not.

Expecting any but the most skilled developers to remember that fsync()
when omitting it has *no visible negative consequence* in normal operation
is a complete and total pipe-dream. You can wish all you will, but only a
few percent will ever conform.

It is much better to arrange to do the right thing in the filesystem,
which *does* have especially skilled people hacking at it, than in the
vast mass of wildly-varying-in-quality code out there in the real world.

Wishful thinking

Posted Mar 17, 2009 2:17 UTC (Tue) by bojan (subscriber, #14302) [Link] (6 responses)

> The vast majority of programs, even when symlinks were optional, assumed their presence, because the enormous majority of the installed base had them.

WOW! Programs have bugs. Imagine that ;-)

> Expecting any but the most skilled developers to remember that fsync() when omitting it has *no visible negative consequence* in normal operation is a complete and total pipe-dream.

The no negative visible consequence applies to one file system in one mode _only_ (and according to some, not even on it all the time). The rest - it depends.

If you ever tried to debug a race condition, you'd know that it can be really hard to do, because the system doesn't get into such conditions all the time. Did someone guarantee to you that programming was going to be easy? I must have missed that lesson ;-)

Oh, and for all the forgetful unskilled developers: man 2 close. I sure needed it :-(

> You can wish all you will, but only a few percent will ever conform.

And their applications will still suck and they will still rely on hacks in file systems to work. And of course, people doing this will be the ones loudest complaining that "file system is broken" when they encounter problems on another platform. Not even my six year old is this childish. But, hey - that's life.

> It is much better to arrange to do the right thing in the filesystem, which *does* have especially skilled people hacking at it, than in the vast mass of wildly-varying-in-quality code out there in the real world.

All you need to do is this:

1. Convince all FS writers to only use new semantics.
2. Convince POSIX folks to change the spec.

Good luck doing that.

PS. The vast majority of people do not program using APIs we are talking about here. They are using libraries that wrap all this up, other programming languages that have calls that wrap all this up etc. These will be written by people familiar with lower level POSIX APIs we are talking about here. For a good example, see: http://mail.gnome.org/archives/gtk-devel-list/2009-March/...
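For what it's worth, GLib's g_file_set_contents() is exactly that kind of wrapper (I assume that is the function the linked thread is about): the caller never sees the temp-file-and-rename dance. A trivial usage example; "prefs.ini" and the contents are made up:

    #include <glib.h>

    int main(void)
    {
        GError *error = NULL;
        const gchar *settings = "show_toolbar=true\n";

        /* Writes to a temporary file and renames it into place on success. */
        if (!g_file_set_contents("prefs.ini", settings, -1, &error)) {
            g_printerr("saving prefs failed: %s\n", error->message);
            g_error_free(error);
            return 1;
        }
        return 0;
    }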

Wishful thinking

Posted Mar 17, 2009 2:23 UTC (Tue) by bojan (subscriber, #14302) [Link]

> people doing this

Of course, I mean your supposed vast majority that won't do the fsync here.

Wishful thinking

Posted Mar 17, 2009 2:26 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (3 responses)

The POSIX spec doesn't need to change one bit. Both behaviors entirely conform to POSIX.

And as for getting filesystems to change -- that's going to be the case. Any widely-used filesystem will encounter the same problem that started this mess, and will either implement the same fix or suffer the fate of XFS.

Wishful thinking

Posted Mar 17, 2009 2:35 UTC (Tue) by bojan (subscriber, #14302) [Link] (2 responses)

I see FS implementers shaking in their boots :-)

BTW, people already started fixing the code. Or didn't you read that GTK thread?

PS. Even Ted's workarounds in ext4 do not do full ordered rename in all cases. These are only for the cases of the most widely known application breakage. But, if you keep insisting, he may do the lockup-on-fsync for you, ext3 style, just so that you can get that nice UI feeling in properly written apps ;-)

Wishful thinking

Posted Mar 17, 2009 2:37 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (1 responses)

Care to link to this thread?

Wishful thinking

Posted Mar 17, 2009 2:44 UTC (Tue) by bojan (subscriber, #14302) [Link]

Already have. You have to go a few posts up.

Wishful thinking

Posted Mar 17, 2009 20:37 UTC (Tue) by nix (subscriber, #2304) [Link]

>> Expecting any but the most skilled developers to remember that fsync()
>> when omitting it has *no visible negative consequence* in normal
>> operation is a complete and total pipe-dream.
>
> The no negative visible consequence applies to one file system in one
> mode _only_ (and according to some, not even on it all the time). The
> rest - it depends.

I repeat: omitting fsync() has no negative visible consequence *in normal
operation* on *any* POSIX-compliant system. Turning off the power or
locking up the box is *not* 'normal operation'.

I know of no developers of anything other than full-blown databases who do
anything like that to test their programs. Thus, for nearly all programs,
omitting fsync() is harmless during the development and testing phase.
Thus, it will regularly be omitted, *no matter what* you might wish.

... and, um, changing POSIX really isn't that hard. Make a good case that
some behaviour is common enough and POSIX will bend. The Austin Group is
populated with normal human beings^W^Wraging pedants like you or I, not
gods. (There are some demigods there, though.)

It is quite possible to convince them that a change is needed, and POSIX
regularly changes semantics in new releases.

Wishful thinking

Posted Mar 17, 2009 0:33 UTC (Tue) by bojan (subscriber, #14302) [Link]

Oh, and if you want to change POSIX, please do so. I have no objection. As if my opinion mattered here ;-)

Wishful thinking

Posted Mar 17, 2009 8:35 UTC (Tue) by jamesh (guest, #1159) [Link]

> If you write your applications to _not_ do what POSIX requires, they will
> be broken when they go to a different system which happens to have an FS
> that doesn't order renames on disk.

We are talking about a case that POSIX leaves undefined here. An OS can wipe the disk on system crash and still be POSIX compliant.

We are in the realm of implementation defined behaviour, so talking of "applications doing what POSIX requires" doesn't really make sense. Claiming that the applications are buggy in a case where the specification offers no guidance doesn't help anyone.

Ext4's crash resistance is a desirable feature that exceeds the minimum requirements needed for POSIX conformance. Preserving atomic renames over a crash also exceeds those minimum requirements.

I'd be willing to pay the performance penalty from providing this behaviour in the same way I am willing to pay the performance penalty from metadata journaling.

A filesystem's job is not to punish users for application developers' oversights.

Posted Mar 18, 2009 0:52 UTC (Wed) by xoddam (subscriber, #2322) [Link]

This is *so* not about application developers or POSIX!

The *only* behaviour under discussion is recoverability across system failures. That's what POSIX doesn't (can't) guarantee, and it's what a journaling filesystem is supposed to provide *in addition* to the POSIX guarantees.

System administrators and users choose to run journaling filesystems so they don't waste time cleaning up after a crash. A journaling filesystem that makes it more, not less, likely for users to lose data isn't doing its job.

POSIX guarantees atomicity of rename -- while the system is running. Application developers code to that guarantee, without particular reference to what happens when the power is cut or some video driver scribbles on the kernel heap. If the system crashes, there is no POSIXLY_CORRECT guarantee that anything will be recoverable at all. Whether you use fsync or not.

A journaling filesystem is supposed to provide more reasonable behaviour FOR USERS. Its job is not to punish users for the corner cases that application developers didn't consider.

Wishful thinking

Posted Mar 18, 2009 0:14 UTC (Wed) by dvdeug (subscriber, #10998) [Link]

What POSIX specifies is that a compliant system, upon a system crash, can hunt down all hard copies that have been made and burn them, after overwriting the data on the disk seven times with zeros, ones, and random data. I'm not sure how an application is supposed to allow for that.

Wishful thinking

Posted Mar 17, 2009 20:44 UTC (Tue) by man_ls (guest, #15091) [Link]

> Now you're just trolling.

Now?

Wishful thinking

Posted Mar 16, 2009 22:37 UTC (Mon) by dlang (guest, #313) [Link] (1 responses)

quote:
When an application performs a rename, the *INTENT* is to insert a write barrier for the data blocks of the file involved. When is this interpretation ever wrong? When is it ever useful to be able to tell the system to replace a file’s contents and its name, except when the system crashes, in which case you just want its name and some garbage? Nobody ever wants that.

one very good example of when you would want to do a rename, but don't need to do a write barrier is when you are rolling logs.

this is usually done where one program has the file open and is writing data to it, another program renames the file, and then tells the first program to close the file and re-open the original name.

there is no need for a write barrier anywhere in this case.

laptop mode explicitly breaks normal expected (failsafe) behavior in the interest of saving power. if the distro turns it on by default and never tells the user they are doing so (along with allowing the user to specify the 'how much data am I willing to lose' parameter) the distro is at fault, but that can be fixed.

adding the ability to selectively mask this, so that you can have some programs (say your word processor) go ahead and wake up the drive to save its data, but keep other non-critical things from doing so (even if those things _think_ that they are critical) would be a very good thing.
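for illustration, the log-rolling scheme above looks roughly like this on the writer's side. a sketch only: the file name and the choice of SIGHUP are just the usual convention, and error handling is omitted.

    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t reopen_requested;

    static void on_sighup(int sig)
    {
        (void)sig;
        reopen_requested = 1;
    }

    int main(void)
    {
        int logfd = open("daemon.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

        signal(SIGHUP, on_sighup);
        for (;;) {
            if (reopen_requested) {
                /* the rotator has done rename("daemon.log", "daemon.log.1")
                 * and sent SIGHUP; reopen so new writes go to a fresh file */
                close(logfd);
                logfd = open("daemon.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
                reopen_requested = 0;
            }
            write(logfd, "still alive\n", 12);
            sleep(60);
        }
    }

neither the old nor the new log file's contents are being replaced here, so nothing in this scheme needs a data-before-rename barrier.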

Wishful thinking

Posted Mar 16, 2009 22:42 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

> there is no need for a write barrier anywhere in this case.

You don't need a write barrier for a completely unmodified file either. But a write barrier hurts neither case.

> adding the ability to selectively mask this, so that you can have some programs (say your word processor) go ahead and wake up the drive to save its data, but keep other non-critical things from doing so (even if those things _think_ that they are critical) would be a very good thing.

That's a nightmare API that makes it very difficult to determine whether you're actually writing to the disk or not. If an application's data aren't critical, it shouldn't be calling fsync in the first place. And if the property of whether the data are critical can change, the application itself should provide a knob to control that. A process-level flub is both coarse and crude as a means of controlling that.

Wishful thinking

Posted Mar 16, 2009 23:56 UTC (Mon) by jlokier (guest, #52227) [Link]

By the way, Mac OS X really has an "fsync-and-I-really-mean-it" flag!

Look up F_FULLFSYNC, and why Linux fsync isn't a proper fsync anyway on most of its filesystems.
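A sketch of how an application can ask for the stronger behaviour where it exists and fall back to plain fsync() elsewhere:

    #include <fcntl.h>
    #include <unistd.h>

    int full_sync(int fd)
    {
    #ifdef F_FULLFSYNC
        /* Mac OS X: also ask the drive to flush its write cache. */
        if (fcntl(fd, F_FULLFSYNC) == 0)
            return 0;
        /* Not supported on this filesystem: fall through to fsync(). */
    #endif
        return fsync(fd);
    }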

Wishful thinking

Posted Mar 16, 2009 22:26 UTC (Mon) by dlang (guest, #313) [Link] (3 responses)

the problem is that making the write barrier be part of the rename requires changes to all filesystems on all operating systems, and applications are only safe if they are running on a new enough version of an OS.

doing the fsync before rename works on all filesystems on all operating systems, but requires changing the applications.

if the applications push this into the OS/filesystem they will need to document that they are unsafe on any but (...) which is a list that will change over time without any control (and probably without the knowledge) of the application developers. but if they put in the fsync they work with everything that's on the market today.

they don't have to do the fsync of the directory if they don't care which version of the file exists after the crash, just doing it for the data is enough.

Wishful thinking

Posted Mar 16, 2009 22:35 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (2 responses)

> the problem is that making the write barrier be part of the rename requires changes to all filesystems on all operating systems, and applications are only safe if they are running on a new enough version of an OS.

True enough, but applications work without fsync already, so upgrades will be required in either case. But there are far fewer filesystems than applications, so from a purely logistic point of view, it makes more sense to change the filesystem.

> if the applications push this into the OS/filesystem they will need to document that they are unsafe on any but (...) which is a list that will change over time without any control (and probably without the knowledge) of the application developers. but if they put in the fsync they work with everything that's on the market today.

There are plenty of applications that don't include fsync. Making this change turns existing incorrect applications into correct ones, and doesn't harm any application that already uses fsync. Besides -- how many of these applications are portable to other systems anyway? A certain degree of unportability is required to drive change. Sometimes you don't need to worry about supporting every POSIX-compliant system and can take advantage of better functionality.

> they don't have to do the fsync of the directory if they don't care which version of the file exists after the crash, just doing it for the data is enough.

...which causes performance problems. In fact, Theodore Tso recommends running fsync in a separate thread, which drastically increases the complexity of simple applications. Applications are left to choose one of incorrect, slow, or complicated. They shouldn't be required to make that choice.

Not to mention the problems rampant fsync would cause for laptop_mode.

Wishful thinking

Posted Mar 16, 2009 22:39 UTC (Mon) by dlang (guest, #313) [Link] (1 responses)

no, applications _sometimes_ or _mostly_ work without fsync on some OS/filesystem/mount option combinations

that's far from what you are asserting "applications work without fsync already"

Wishful thinking

Posted Mar 16, 2009 22:44 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

You can add those qualifiers to almost any statement. The fact is that these applications are written to assume rename inserts a write barrier, and the vast majority of the time, these applications get what they want. With pre-patch ext4 and XFS, this assumption breaks. The breakage was noticed very quickly after ext4 entered wide use. If these applications' assumptions were so unreliable, there would have been an uproar well before ext4 was released.

Wishful thinking

Posted Mar 16, 2009 21:46 UTC (Mon) by bojan (subscriber, #14302) [Link] (8 responses)

> Could you elaborate?

Yes. Say you just renamed a very big temporary file (think GBs) and it just so happens that it would be a good time for the kernel to sync the directory your file is in to disk, because another file in that directory changed and somebody asked to fsync the directory (or some other condition that kernel finds appropriate - doesn't matter). If you guarantee the order with rename(), you then first need to commit a few GBs of data in order to do this (which would otherwise never happen, because the file is temporary and would go away a bit later). If you don't guarantee the order, you then just commit the directory and you are done.

In other words, kernel currently has the freedom to do what it finds most appropriate and is allowed to by POSIX. Ordered renames restrict that freedom.

Wishful thinking

Posted Mar 16, 2009 21:57 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (7 responses)

Yes, the kernel has the freedom to do what's appropriate, but uses that freedom to do mind-numbingly stupid things.

You're grasping at straws. First of all, you're most likely never going to see gigabytes of dirty blocks for a single file. They'll have been flushed well before your rename! Second, even if we do end up in your scenario, that file's blocks will be flushed in very short order anyway, so you're going to incur the penalty for that IO whether you do it before or after the rename.

As for the case of immediately unlinking the file --- that's a rather unlikely scenario. Can you give me one non-contrived real-life example where this would actually happen?

Remember:

  • You have a large temporary file with many unflushed data blocks
  • This temporary file has just been renamed
  • This temporary file is about to be unlinked
  • Between the rename of the large temporary file and its unlinking, fsync must be called on the directory
When would this plausibly happen?

Wishful thinking

Posted Mar 16, 2009 22:07 UTC (Mon) by bojan (subscriber, #14302) [Link] (6 responses)

> First of all, you're most likely never going to see gigabytes of dirty blocks for a single file.

You are right. And nobody will ever need more than 640 kB of RAM ;-)

> They'll have been flushed well before your rename!

Not if someone calls fsync on that directory. Or kernel decides (for whatever reason) that this directory must go out to disk.

But look, you obviously don't want to accept that:

1. This happens.
2. POSIX says what it says.
3. Kernel is allowed to do what POSIX says.

That's OK. The documentation is crystal clear on this. It is all in the manual pages for close() and rename(). Unfortunately, people choose to ignore it.

Sure, it would be nice to have a call that guarantees all this, but thinking that rename() is that call is simply false.

Wishful thinking

Posted Mar 16, 2009 22:12 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (5 responses)

And you clearly don't want to accept that we can improve over POSIX with little danger and a large upside. POSIX doesn't make any guarantees about what happens on an unclean shutdown. The kernel could overwrite all your files with pictures of carrots and it'd be POSIX-compliant. Would you be okay with that outcome? After all,
  1. This happens.
  2. POSIX says what it says.
  3. Kernel is allowed to do what POSIX says.

Please stop justifying poor behavior by resorting to POSIX.

Wishful thinking

Posted Mar 16, 2009 22:17 UTC (Mon) by bojan (subscriber, #14302) [Link] (4 responses)

> little danger

Yeah, we've seen that with applications that were losing files on a perfectly good FS.

> Please stop justifying poor behavior by resorting to POSIX.

The behaviour is not poor. It is there for a reason, which you don't want to admit.

Lucky Ted is a nice man, so he put workarounds in place for folks that want to continue using sloppy code on ext4.

Wishful thinking

Posted Mar 16, 2009 22:24 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (3 responses)

> Yeah, we've seen that with applications that were losing files on a perfectly good FS.

What are you talking about? A reordering rename is strictly safer than your broken one.

> The behaviour is not poor. It is there for a reason, which you don't want to admit.

No, the behavior is dangerous and unintuitive, and there is no sound reason for it to work that way other than to make metadata write-out a little simpler. There is no performance or correctness upside to rename working the way you insist. You don't want to admit that POSIX may allow something that is nevertheless nonsensical.

Wishful thinking

Posted Mar 16, 2009 22:42 UTC (Mon) by bojan (subscriber, #14302) [Link] (2 responses)

> There is no performance or correctness upside to rename working the way you insist.

I'll just answer about correctness. If you take a broken application to a perfectly good system that doesn't order renames, because it doesn't have to, you will lose data. So, there is an upside to programming correctly and according to spec.

I think I answered the performance bit elsewhere, but you don't want to accept it. Which is fine by me.

> You don't want to admit that POSIX may allow something that is nevertheless nonsensical.

POSIX is not nonsensical, it is completely asynchronous and unordered, which is what you don't seem to like. Sure, we could have another mechanism for ordered renames - I don't deny that. It's just that current rename() isn't it, which you don't seem to be able to understand.

Wishful thinking

Posted Mar 16, 2009 22:53 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (1 responses)

> If you take a broken application to a perfectly good system that doesn't order renames, because it doesn't have to, you will lose data.

Adhering to POSIX is not the same as being "perfectly good".

> POSIX is not nonsensical, it is completely asynchronous and unordered

No, it's perfectly synchronous, ordered, and atomic with respect to a running system. It's only in the case of an unclean shutdown that we disagree, and POSIX really doesn't say much at all about that scenario. My behavior, your behavior, and overwriting with carrots are all perfectly POSIX-compliant with respect to a system that's been shut down uncleanly. To the greatest extent possible, the state of a system after an unclean shutdown should reflect the state the system was in shortly before that shutdown, and an ordered rename goes a long way toward achieving that.

Wishful thinking

Posted Mar 16, 2009 23:29 UTC (Mon) by bojan (subscriber, #14302) [Link]

> Adhering to POSIX is not the same as being "perfectly good".

I love it how you think that your or my opinion actually matters here. What matters is what's been written in the documentation for years now. That is the _only_ objective thing programmers on both ends of this can rely on.

> No, it's perfectly synchronous, ordered, and atomic with respect to a running system.

Yeah, confuse the issue some more, when you don't have anything new to add.

We are discussing here the "ordering of writes to disk on rename". In respect to this, POSIX is asynchronous and unordered. Heck, you cannot even tell if two consecutive writes will be written in that order to disk.

> My behavior, your behavior, and overwriting with carrots are all perfectly POSIX-compliant with respect to a system that's been shut down uncleanly.

Your behaviour is truly what you would like it to be. What you are describing as my behaviour is what the documentation actually says, so it's not mine at all.

As for carrots, that is wrong, because you never fsynced carrots to disk, so they should not be there. But sure - implement it ;-)

> To the greatest extent possible, the state of a system after an unclean shutdown should reflect the state the system was in shortly before that shutdown, and an ordered rename goes a long way toward achieving that.

And also encourages application writers to keep writing broken code, file system writers to put hacks into the system to work around that broken code etc.

Wishful thinking

Posted Mar 16, 2009 5:20 UTC (Mon) by alonz (subscriber, #815) [Link] (2 responses)


> Yeah, it would be nice if the semantics of rename were defined
> that way, wouldn't it? Alas, they are not.

Unfortunately, POSIX doesn't actually define any relation between operations on file contents and those on file metadata. File-system developers prefer to pretend that this means the operations should be independent; most application developers prefer the interpretation that any file-system operations (whether they refer to data or metadata) are part of the same “transaction space”.

Personally, I believe the application developers have the saner perspective here, and file-system developers are taking the narrow view. But then, I'm a systems engineer, I like broad views :).

Wishful thinking

Posted Mar 16, 2009 5:33 UTC (Mon) by bojan (subscriber, #14302) [Link]

POSIX rename defines that processes should see a consistent picture (which ext4 provides, with or without patches). Text from the rename man page:

> If newpath already exists it will be atomically replaced (subject to a few conditions; see ERRORS below), so that there is no point at which another process attempting to access newpath will find it missing.

Unfortunately, there is no room for interpretation here. The requirements are clearly set.

Wishful thinking

Posted Mar 16, 2009 7:04 UTC (Mon) by njs (subscriber, #40338) [Link]

>Unfortunately, POSIX doesn't actually define any relation between operations on file contents and those on file metadata. File-system developers prefer to pretend that this means the operations should be independent

There are a lot of misunderstandings about how filesystems actually work in these threads... hopefully I won't add to them :-)

But I think it's more like: POSIX doesn't actually define any relation between operations, whether on file contents or on file metadata or both. File-system developers tend to create a linear ordering on file metadata changes because that makes it easier to implement filesystems that can survive a crash without destroying your whole partition, but they prefer not to impose any other ordering guarantees, because when they do, the users whine about how unbearably slow the filesystem is. (Also, they've never made those guarantees before, and somehow computers have worked.)

In particular, note that when it comes to crash recovery, unless you use data=journal, there is no "transaction space" for data writes at all. You may find any arbitrary subset of your writes have completed, and some may have completed partially -- only the middle of your write buffer has made it to disk -- and etc. That's just how it works.

What we're seeing here is some very limited ordering guarantees being added in for particular heuristically defined sequences of operations, where it turns out they don't hurt performance much. But apps that rely on those guarantees will still be broken when running on any other filesystems. And that's going to bite the folks who develop the next round of filesystems, because they don't know what random non-standard guarantees apps will expect them to provide.

Wishful thinking

Posted Mar 16, 2009 10:53 UTC (Mon) by mjg59 (subscriber, #23239) [Link]

If the best way to add support for this is with additional API, then I'm absolutely all for that.

Ted speaks again

Posted Mar 16, 2009 6:37 UTC (Mon) by bojan (subscriber, #14302) [Link] (13 responses)

Ted speaks again

Posted Mar 16, 2009 10:19 UTC (Mon) by regala (guest, #15745) [Link] (12 responses)

please restrain yourself if all your contribution is a rant against a fellow developer who devoted the last 18 or so years to GNU/Linux. I don't like the way this is turning out, with people in open-source ecosystems being somewhat mean and wild (as in beasts), ready to massacre those who fed them...

So to you, I'd say, "your mouth spits again". You all rant against the fact that you never read programming guidelines (POSIX standards) and that makes me sick. Know your place, for christ's sake.

Ted speaks again

Posted Mar 16, 2009 12:29 UTC (Mon) by forthy (guest, #1525) [Link] (5 responses)

Maybe some people didn't read the POSIX standard. But this doesn't actually matter, because Ted Ts'o didn't read it either. He's just ducking behind it, because POSIX makes no promise in case of a system crash. That's anal-retentive, because POSIX makes promises about ordering (ordering is strong), and a supposed-to-be-reliable file system should keep that order even after a crash. If ext4 doesn't (ext3 in data=ordered mode does, and btrfs should according to its FAQ - and, bugs happening, actually will as of 2.6.30), then users will simply not use ext4.

In the long thread of the previous discussion about this topic, the semantics behind the different operations were clearly described. POSIX promises atomicity of operations like rename(). Should this atomicity be preserved in case of a crash? Sane file system design says: yes. People who use create-write-close-rename want atomicity; if they also want durability (i.e. to know that the new file is actually committed), they need fsync, too. To be precise: fsync on the file and on the directory. Atomic operations are part of the POSIX file system semantics, durability is part of fsync's semantics.

When do you need durability? E.g. in a networked ordering system - if you receive an order over the network, you update your books, make the first half of the booking durable, confirm the order, and when you know that the confirmation is out, you finalize your booking and make that durable as well (double handshake). You don't need durability if you just update your configuration settings, but you need atomicity to avoid loss of all configuration settings.

Note that POSIX still does not guarantee anything in case of a crash - complete loss of data and metadata is "allowed". Whether I or anyone else actually wants to use such a file system is a completely different question.
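To make the distinction concrete, here is a minimal C sketch of the create-write-close-rename pattern discussed above; the function name, file names and lack of error handling are purely illustrative, not anyone's actual code. The rename() alone gives "old data or new data"; the fsync() calls are the part that buys durability:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch only: atomically replace final_name with new contents. */
    int replace_file(const char *final_name, const char *tmp_name,
                     const char *dir_name, const char *contents)
    {
        int fd = open(tmp_name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        write(fd, contents, strlen(contents));   /* new data */
        fsync(fd);           /* durability: data + inode on disk before rename */
        close(fd);

        if (rename(tmp_name, final_name) < 0)    /* atomic: old file or new file */
            return -1;

        int dirfd = open(dir_name, O_RDONLY);    /* make the rename itself durable */
        if (dirfd >= 0) {
            fsync(dirfd);
            close(dirfd);
        }
        return 0;
    }

Called as replace_file("settings.conf", "settings.conf.tmp", ".", "key=value\n"), this replaces the file in place; dropping the two fsync() calls trades durability for fewer disk writes, which is exactly the trade-off being argued about in this thread.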

Ted speaks again

Posted Mar 16, 2009 13:17 UTC (Mon) by kleptog (subscriber, #1183) [Link] (2 responses)

As far as I can see all this is just changing expectations. Just a few years ago we were *happy* that our filesystems were readable after a crash (after running fsck). Then we progressed to being happy that after a crash we could use the filesystem without waiting hours for the fsck.

Now we're at the stage of worrying about exactly what the files should look like after a crash. Give it a few years and I'm sure we'll find something else to worry about. Also, POSIX was written a long time ago and deliberately vague on some points because they wanted to support many existing systems which all worked slightly differently.

NB: ISTM the solution to the 'lots of little files on ext3' problem is obvious. Create all the new files, then fsync them (fsync on ext3 may be slow, but it wouldn't be as much of a problem this way because all the data would be written out for all the files in one go). Finally rename them all.
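As a rough illustration of that batching idea, here is one way it could look in C; the helper name, the arrays and the missing error handling are purely for illustration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch of the batched variant: write all the temporary files first,
     * fsync them in a second pass (so the data for all of them goes out
     * together), and only then rename them into place. */
    void write_many(const char *tmp[], const char *final[],
                    const char *data[], int n)
    {
        int fds[n];

        for (int i = 0; i < n; i++) {            /* pass 1: create and write */
            fds[i] = open(tmp[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
            write(fds[i], data[i], strlen(data[i]));
        }
        for (int i = 0; i < n; i++) {            /* pass 2: flush all the data */
            fsync(fds[i]);
            close(fds[i]);
        }
        for (int i = 0; i < n; i++)              /* pass 3: atomic renames */
            rename(tmp[i], final[i]);
    }

On ext3 the first fsync() in pass 2 would do most of the flushing for the whole batch, which is why grouping the writes before the first fsync helps.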

Ted speaks again

Posted Mar 16, 2009 15:25 UTC (Mon) by drag (guest, #31333) [Link] (1 responses)

> Now we're at the stage of worrying about exactly what the files should look like after a crash. Give it a few years and I'm sure we'll find something else to worry about. Also, POSIX was written a long time ago and deliberately vague on some points because they wanted to support many existing systems which all worked slightly differently.

Well ya. That's progress I guess. People always want better, demand better.

In the case of Linux you're traditionally dealing with half-way decent hardware running on a UPS and run by professionals. That is, you're designing the OS to perform well and reliably when managed by a person who knows, understands, and cares quite a bit about the hardware they are using.

Now with consumer-oriented Linux devices you're dealing with people constantly putting excessive demands and loads on the system (especially graphics, which has been a weak point in stability for all systems, including Linux), devices that are cheap and mass produced, run by people who don't even understand what an OS is, that have to operate with as low a power usage as possible, and that have users with very low tolerance for anything really technical.

In this specific case you have Ubuntu users using unstable graphics drivers with development versions of the operating system. They were crashing their systems frequently, sometimes several times a day. They are doing weird things like overclocking RAM and all that crap.

They were finding that Ext4 was eating a significant portion of their file system, whereas with Ext3 it didn't.

But that is just the tip of the iceberg. You're going to deal with mobile phones with batteries that just 'crap out'. You're going to deal with mobile internet devices that get used in abusive environments. You're going to deal with hand-held devices that suspend to RAM a dozen times a minute.

Try explaining to your grandma or to the guy down the street running a Moblin netbook that their system is not bootable anymore, or they can't use most of their applications, because POSIX doesn't give a shit that users get half their file system blown away when they shut their devices down incorrectly.

I don't know the best way to fix it, whether it's best to:
* Get the Kernel developers to care about maintaining a consistent file system image on the disk at all times
or
* Get the biggest clue stick in the world and collectively drive the "fsync is your friend" point home to all potential Linux developers.
or
* third option

I don't know.

But certainly demands and expectations change. Just like everything else in the computing landscape changes.

Ted speaks again

Posted Mar 16, 2009 16:54 UTC (Mon) by kleptog (subscriber, #1183) [Link]

> Try explaining to your grandma or to the guy down the street running a Moblin netbook that their system is not bootable anymore, or they can't use most of their applications, because POSIX doesn't give a shit that users get half their file system blown away when they shut their devices down incorrectly.

Honestly, I don't see why POSIX should care. It's a standard that describes an API that can be used by programs that wish to be portable. In principle it could be implemented on anything from the smallest handheld to the largest mainframe. Reliability after a crash is outside the purview of POSIX since the requirements are vastly different in different situations. People writing software for embedded devices don't rely on POSIX to give them crash safety, they read the manuals for the device to see what the manufacturers say they should do.

POSIX compliance is a property of the OS-userspace boundary, crash-safety is a property of an entire system. They're largely orthogonal.

In my opinion it's wrong for people to say that either behaviour is mandated by POSIX. IMHO it's neither mandated nor forbidden. Crash reliability is a contract between you and the OS+hardware+kernel. A ramdisk can be POSIX compliant yet is clearly not crash safe. Leave POSIX out of it, decide what Linux wants to guarantee. POSIX provides a way of guaranteeing a certain reliability but Linux is free to provide additional guarantees if it sees fit.

Maybe something for LSB? I'd like to see the language lawyers work out a way of defining "crash-safety" in a way that doesn't exclude things like ramdisks and several existing filesystems.

Ted speaks again

Posted Mar 16, 2009 14:54 UTC (Mon) by k8to (guest, #15413) [Link]

The idea that tytso hasn't read the posix standard. hah! A good one sir!

Ted speaks again

Posted Mar 17, 2009 0:08 UTC (Tue) by jlokier (guest, #52227) [Link]

> POSIX promises atomicity of operations like rename()

It promises atomicity of the directory modification done by rename, and every version of ext4 provides that. Renaming is equivalent to an atomic sequence of unlink() and link() calls.

You're confusing atomicity of the directory modification with serialising against the file content modification. POSIX doesn't promise anything about that in the absence of fsync() or fdatasync() used as a barrier between them. [I can't tell from the standard if fdatasync() is sufficient.]

Ted speaks again

Posted Mar 16, 2009 15:34 UTC (Mon) by nix (subscriber, #2304) [Link] (1 responses)

Er, I at least was pointing out that 99% of Unix app developers don't read
POSIX in the kind of detail required to figure out how to write a file
without losing it if the power fails (it's not just fsync()). Thus, if we
don't want our systems to be regarded as unreliable by virtually everyone,
we have to do what the applications developers expect, even if they *are*
boneheaded ignorant morons (as indeed ext4 and btrfs now do).

And, 'know your place'? WTF? This isn't an empire and Ted is not King:
although he is surely worthy of respect, we peons are not forbidden to
talk to the Mighty. This is a *good* thing, ffs.

Ted speaks again

Posted Mar 20, 2009 14:53 UTC (Fri) by regala (guest, #15745) [Link]

> although he is surely worthy of respect

reread all this. Respect left this thread a long time ago...

Ted speaks again

Posted Mar 16, 2009 21:33 UTC (Mon) by bojan (subscriber, #14302) [Link] (1 responses)

Chill dude! I just wanted to point out that Ted did speak about the problem again and that there is a link that people should go to and read. Hence the title of the post "Ted speaks again". As in, Ted speaks, we should _listen_.

PS. If you read any of my comments, you would know that I agree that Ted's interpretation of what is permitted by POSIX is correct.

Ted speaks again

Posted Mar 20, 2009 13:49 UTC (Fri) by regala (guest, #15745) [Link]

sorry about that, really, but this long thread gets on my nerves as I see some "I-know-better-because-I-lost-my-data-not-yours" whiners getting over the hill... and now I am really sick. Thank you for all the support you provide here to a constantly gentle tytso. :)

Know your place

Posted Mar 18, 2009 1:08 UTC (Wed) by xoddam (subscriber, #2322) [Link] (1 responses)

> Know your place, for christ's sake.

Talk like that makes me sick.

Know your place

Posted Mar 20, 2009 13:53 UTC (Fri) by regala (guest, #15745) [Link]

I am so sorry. Can you just sit down, people, and realize what you are doing? You are insulting someone who, as far as I can see, did every little crazy thing you wanted. Maybe he should invent some magical thing that could guarantee that by the micro-second _before_ the crash, everything is on the disk, and if what ends up on disk after that is not what you wanted, a simple blink of your eyes will revert it...
As we say in France, you want "le beurre _et_ l'argent du beurre" (the butter and the money for the butter - to have it both ways).

btrfs fscked up, too?

Posted Mar 16, 2009 9:42 UTC (Mon) by forthy (guest, #1525) [Link] (11 responses)

I'm confused. btrfs is advertised as a sort-of log-structured, data-and-metadata COW file system. How on earth can it have the same problem? Either Chris Mason does not understand what a log-structured file system is, or he has been infected by the bonehead synchronous-metadata idea that exists in Unix file system developers' brains.

The whole point about a log structured file system is that you write all updates of data and metadata into a sequential stream. This may cause updates to be delayed significantly, but it should not cause them to be reordered. For a definition of log structured file systems, look on Wikipedia. The article is quite comprehensive.

Actually, btrfs is not implemented as a real log-structured file system, but the COW logging of data and metadata should effectively give the same semantics as a log-structured file system. If it doesn't, these semantics are broken - either because Chris Mason is a bonehead as well, or because btrfs is not ready for prime time yet.

I think the main point here that counters the argument "POSIX allows this" is that POSIX actually defines an order of operations on file systems, at least as long as the system is running: operations are performed in order, and metadata operations are atomic. If you advertise your file system as robust, as both ext4 and, to an even higher degree, btrfs do, you have to preserve this order and keep metadata operations atomic (the latter is something ReiserFS doesn't do). Everything else breaks the promise of being robust. Under POSIX, it is actually allowed to have non-robust file systems. This is deliberately meant for bonehead filesystem implementors, because there are so many ;-). Actually, it is meant as a performance tradeoff. However, with delayed allocation of data blocks, delaying metadata updates as well gives a performance improvement, too!

btrfs fscked up, too?

Posted Mar 16, 2009 13:19 UTC (Mon) by masoncl (subscriber, #47138) [Link] (10 responses)

Btrfs isn't log structured in the traditional fill up a segment at a time sense. It shares many of the properties of log structured filesystems in that it does copy on write for all writes of both metadata and data.

Filesystems tend to break operations up into relatively large transactions. These include all the metadata changes to the FS over a 5 or 30 second interval. A big part of controlling the latencies of FS operations is to control the latencies of transaction commits. Regardless of how much of the commit you try to do in the background, there are always corner cases that break down to: wait for commit X to finish. Every FS has these, including ext3/4 (such as when the ext log wraps).

In the ext3 data=ordered model, the commit waits for all of the data writes during that transaction. If we assume the worst case of applications doing random data writes on slow spinning media, writing out all the data can take a very long time. This is what people noticed in the now famous firefox-fsync bug.

What btrfs does to limit transaction latencies is it only updates file metadata after file data IO is complete. This allows us to make atomic extent replacements in the file without having to flush all the data writes for a transaction before the commit can complete. xfs does something similar, but it only needs to make sure i_size updates are done after the extent is on disk.

This works well for single file overwrites. The rename case is different because the operations are between two different files.

I agree with Ted that fsync is the right answer, not just because it is what the standard says to do, but because skipping the fsync is explicitly what the standard says won't work. Adding these kinds of undocumented tricks to the filesystems today is sure to cause many problems for the next set of filesystem developers, who probably won't remember the famous firefox fsync bug or its evil twin the ubuntu gamer data loss on crash.

With all of that said, btrfs can give the ext3 behavior with little practical performance impact. fsyncs in btrfs almost always use a dedicated logging mechanism and don't have to wait for the full transaction commit.

So, I'll have patches in 2.6.30 that fix things in btrfs. This way we as a linux community can either document the new rename requirements or change the applications, and btrfs can move on to other problems ;)

btrfs fscked up, too?

Posted Mar 16, 2009 13:34 UTC (Mon) by endecotp (guest, #36428) [Link] (5 responses)

> skipping the fsync is explicitly what the standard says won't work

Hi Chris,

Can you give a reference for that? Thanks.

I'd also be interested to hear whether you believe that I should also opendir() and fsync() the directory, for portable code.

btrfs fscked up, too?

Posted Mar 16, 2009 14:29 UTC (Mon) by masoncl (subscriber, #47138) [Link] (4 responses)

I think the real disconnect here is which operations people expect to be atomic. The rename is atomic because after the operation is complete the directory entry either points to the old file or the new file.

The contents of those files are determined by what other operations you've done on them, and in the past we've always expected the fsync.

The main places that list data integrity in the standard are the fsync page and O_SYNC/O_DSYNC portions of the write page. They are the only ways to make sure things are on disk.

The close man page explicitly states the data isn't on disk yet when the close happens unless you run fsync.

I do understand why app writers want a rename that works differently from what we're providing today, and it is important for filesystems to be able to grow new APIs to work with today's applications.

Re: do you need an fsync on the directory? I honestly don't know what other operating systems require. The last time I looked through various mail servers, the directory fsync was under a linux-is-evil kind of #ifdef.

btrfs fscked up, too?

Posted Mar 16, 2009 16:58 UTC (Mon) by endecotp (guest, #36428) [Link] (1 responses)

> The close man page explicitly states the data isn't on disk yet when
> the close happens unless you run fsync.

But I don't care whether the data is on the disk.

I just care that any changes to the disk, as I observe them after a crash, happened in the order that I did them. As far as I'm aware POSIX says nothing about this, hence my request for a reference.

btrfs fscked up, too?

Posted Mar 17, 2009 0:23 UTC (Tue) by bojan (subscriber, #14302) [Link]

How are people supposed to find a reference if POSIX doesn't say anything about the behaviour you desire? Exactly - it isn't there - it is unspecified. It can happen any which way.

The best you can find is rename manual page, which talks about processes that are currently running on the system always seeing the file. That's it. No further guarantees are made.

Now, given that a directory is a file, you have to fsync that if you want to see what you wrote there on disk (i.e. rename). Similarly, you have to fsync the data of the file if you want to see it on disk. Combine the two with the fact that nothing is specified as to the order of commits in the absence of fsync and you get the unordered rename (when it comes to the picture on disk), which is still atomic when it comes to running processes. But, you'll never know if your process is seeing the buffers or what's on disk (for both the data and the directory) unless you actually fsync beforehand.

btrfs fscked up, too?

Posted Mar 16, 2009 17:10 UTC (Mon) by forthy (guest, #1525) [Link] (1 responses)

> do you need an fsync on the directory?

If you have synchronous metadata updates, you don't "need" to fsync the directory - it is updated synchronously anyway (same with sync-mounted devices: no fsync needed, either ;-). I can understand the "linux is evil" #ifdef, when you consider how ext2 works: gather all data and metadata updates for some seconds, and then flush them out in random order. If you don't sync anything, you have a good chance that atomicity is maintained (unless, of course, the crash happens during the short write period). If you sync data and directory, you have a very good chance that durability is maintained (unless the whole ext2 file system exploded, and now half of the files are in lost+found, and the others are completely missing).

BTW mail server: A mail server needs to fsync, because durability is required. If you receive a mail, you write it to the inbox (or into the incoming directory in the case of maildir storage), fsync, and then reply to the SMTP client that the message has been accepted. The SMTP client now considers the message as passed, and can remove it from its spool - if it doesn't get an OK, it has to retry later.

The question now is: should you sync the directory? In the maildir case (that's where the directory matters; mboxes keep their name), you create a new entry in inbox/new. fsync only writes out the data, thus it reorders create-write-close on disk into write-close-...-create. Only the inode-related metadata is flushed with fsync. From the man page it looks like POSIX cares only about durability in fsync, not about atomicity. Therefore, fsync is allowed to reorder operations (fsynced files end up earlier on the disk). To maintain atomicity in the operations on the mailbox, first fsync the directory, then the file. If you are a mail reader and, e.g., take files out of new/ and move them to cur/, first fsync cur, then new. Otherwise, your mails may be orphaned (duplicates may be annoying, but harmless).

It would all be a lot easier if the filesystem had a transaction monitor behind it. You would say "new transaction, create new/msgid, write data, close, make durable, close transaction" for the delivery and "new transaction, rename new/msgid -> cur/msgid, add line to index file, make durable, close transaction" for the IMAP client. If the transaction succeeds, tell the client OK; if it fails or is incomplete after a crash, it will be rolled back. Note that a transaction monitor only needs to maintain ordering within a transaction, and can reorder transactions as it sees fit (and it can even abort transactions).
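Going back to the delivery sequence above, here is a minimal C sketch of it, under the assumption that the file is fsync()ed and the new directory entry made durable before the SMTP acknowledgement goes out; the helper name and paths are placeholders, and error handling is trimmed:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch: maildir-style delivery with durability before the "250 OK". */
    int deliver_message(const char *maildir, const char *uniq, const char *msg)
    {
        char tmp_path[4096], new_path[4096], new_dir[4096];
        snprintf(tmp_path, sizeof(tmp_path), "%s/tmp/%s", maildir, uniq);
        snprintf(new_path, sizeof(new_path), "%s/new/%s", maildir, uniq);
        snprintf(new_dir,  sizeof(new_dir),  "%s/new",    maildir);

        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return -1;
        write(fd, msg, strlen(msg));
        fsync(fd);                            /* message contents durable */
        close(fd);

        if (rename(tmp_path, new_path) < 0)   /* atomically publish into new/ */
            return -1;

        int dfd = open(new_dir, O_RDONLY);    /* make the new directory entry durable */
        if (dfd >= 0) {
            fsync(dfd);
            close(dfd);
        }
        return 0;                             /* only now ack the SMTP client */
    }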

btrfs fscked up, too?

Posted Mar 16, 2009 18:07 UTC (Mon) by masoncl (subscriber, #47138) [Link]

We're mixing up a bunch of concepts here, but for a mailserver workload, in ext2 you have to fsync both the directory and the file in order to make sure a newly created file is on disk.

In ext3, ext4, reiserfs, xfs, and btrfs (and probably jfs), you only need to fsync the file. The journals include the directory data because the directory mods happened along with the file creation, and it actually isn't possible to get one without the other.

The btrfs log is a little different, but it explicitly goes out and finds the directory changes to make sure they are logged with the file during the fsync.

btrfs fscked up, too?

Posted Mar 16, 2009 13:54 UTC (Mon) by forthy (guest, #1525) [Link] (3 responses)

I still think Ted misreads the standard. fsync is about durability, rename is about atomicity. Those are two different things; fsync is not necessary to make rename atomic, because POSIX file system metadata operations are already atomic. Atomic metadata operations are a poor man's transaction, but reordering them relative to data operations breaks that promise, even if only during a crash (outside the scope of POSIX).

Note that collecting lots of atomic operations and performing them all in one go is not necessarily breaking the order of all these updates. A true log-structured file system collects all operations in order, and writes them in one go - atomic and delayed. btrfs should share most of these properties, even though the internal design is quite different. As it shows, implementing it "right" is not costly. Thanks, Chris, for being responsive.

What we might need further is real transactions. Now, real transactions are harder; with a filesystem like btrfs, which has a snapshot facility, we get a step closer (but only one step). It's not that easy: in a real transaction monitor, you create a private "snapshot" at the start of a transaction, perform the transaction, and then commit this snapshot. If the commit finds a conflict (e.g. a file changed during the transaction has been changed by somebody else in the meantime), the transaction will be aborted. Also, if another transaction has already been merged and changed a file that was read-accessed during this transaction, this transaction will be aborted, too.

btrfs fscked up, too?

Posted Mar 16, 2009 18:48 UTC (Mon) by tytso (✭ supporter ✭, #9993) [Link] (2 responses)

Your mistake is assuming that the atomicity of rename() is about anything other than the directory pathnames. If you read the rename specification, you will see that it is talking explicitly about directory entries, and says nothing at all about the contents of the inodes involved. For example, here's just a tiny sample from the rename(2) specification:
If the old argument points to the pathname of a file that is not a directory, the new argument shall not point to the pathname of a directory. If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other processes throughout the renaming operation and refer either to the file referred to by new or old before the operation began. Write access permission is required for both the directory containing old and the directory containing new.

To understand the history of the comment in the Rationale section of POSIX's rename() specification regarding atomicity, it's helpful to understand how rename functionality had been implemented in V7 Unix --- via a combination of the link() and unlink() commands. Back in the bad old days, it was possible while renaming a directory to end up with two links to a directory, if the system crashed after link()'ing the new name of the directory and before the old name of the directory was unlink()'ed.

But to say that this atomicity requirement, which was only about the functionality of rename(2) system call being atomic, would somehow extend to a open-write-close-rename sequence, is a gross misreading of the POSIX specification. And given that I implemented POSIX TTY Job Control from the specification back in the 0.12 days of Linux in fall of 1991, I rather suspect that I have a bit more experience reading the POSIX specification than you do...

btrfs fscked up, too?

Posted Mar 17, 2009 1:43 UTC (Tue) by bojan (subscriber, #14302) [Link]

When I tried to suggest exactly this in another thread, bullshit was called: http://lwn.net/Articles/323430/. So, thank you very much for posting this explanation here.

btrfs fscked up, too?

Posted Mar 17, 2009 10:22 UTC (Tue) by forthy (guest, #1525) [Link]

Sorry, you still refer to the fact that POSIX "allows" replacing all files with pumpkins in case of a crash (especially for squashfs, this is the "obviously right" action ;-). That's not the issue, and not what we are discussing here - we are talking about file systems doing something actually reasonable in case of a crash, something similar to the well-specified behavior under normal operation. If you rename a file during operation, and at the same time open that file in another process and read it, you either read the old data or the new data, but no empty files, no garbage files, no pumpkins (unless, of course, the file deliberately contains a pumpkin image). It is obvious that file metadata is closely tied to actual file data.

BTW reading: I've implemented two Forth compilers from a standard back in the early 90s, and I've been doing my best to implement reasonable behavior in those corner cases where the standard says "an ambiguous condition exists...". The Forth community was quite picky back then about all those ambiguous conditions in the standard, because many people were used to well-defined behavior from their particular systems they used - however, this well-defined behavior might not have been portable. The result of the discussion was that I first started to make one of these two Forth compilers a "model implementation", which had well-defined behavior on those parts where the standard was just sloppy without proper reason. This continued over time, and now the community is revising the standard, and we are now trying to be more precise and less ambiguous (the draft standard document now even includes a test suite). So now, I'm not just reading standard documents, I'm writing them.

What resulted from this activity on my side is a different view of standard documents, and of how to read them. Standards encode common practice. People have not always been careful when implementing things. A standard document is a compromise between different systems. If you implement your system, it's not your job to find excuses for unreasonable behavior; it's your job to find reasonable ways to deal with ambiguous conditions. And if you are really good at it, it's your job to implement these things in a way that can serve as a model for others (it's always the duty of those who are good to serve as examples). Take the compiler example again: if a symbol encountered by the compiler is neither a number nor a pre- or user-defined function or variable, this is an ambiguous condition. The compiler is "allowed" by the standard to transform the user into a pumpkin (by magic, of course), teaching him a final lesson about proper programming. The reasonable action on a syntax error, however, is to print a message which states the file, source line and position within that line, plus a meaningful error message about the problem. No language standard will define this action. Yet most compilers in the world (regardless of the language) stick to that behavior, and even use a similar output format to make IDEs happy.

I hope you now understand why I say you didn't read POSIX, but you duck behind it. With "reading" I mean: Try to find out what best practice would be in a case where POSIX indeed does not really define how it should be. And "best practice" is both what your users will be happy with and what serves as good example for other file system writers (pumpkins are no option). If you raise the bar of expectation, do it.

Levels of reality

Posted Mar 16, 2009 13:59 UTC (Mon) by itvirta (guest, #49997) [Link] (1 responses)

It seems to me that there are two different levels of reality regarding filesystem updates. And in all of this discussion, the two levels seem to be quite confused.

A) There's the OS level which is in effect while the system is running. This is the domain for which POSIX gives all those nice guarantees about the ordering of operations and rename being atomic etc. The OS might buffer things, but it's also quite capable of reading back things from the buffer.

B) There's the storage level, the one which you see if you look at the data written to the disk. This level only becomes apparent if the OS crashes because then (and only then) the buffers are lost. Also, apparently POSIX doesn't give any guarantees in this domain.

Because of this difference, it's completely moot to say things like "POSIX allows the fs to show a renamed but empty file after a crash". Of course it does; it also allows the complete fs to get trashed after a crash. I think everyone agrees that the latter isn't a good idea. But the first one isn't either.

So POSIX semantics do not matter in case of a crash; something else is needed.

Actually, I think all this makes fsync() quite odd all in all. If POSIX doesn't guarantee anything after a crash, then who cares about fsync(). Ok, fsync() might commit the data to the disk, but it's still allowed for the whole fs to be destroyed after the crash. So, if an fs developer says something like "call fsync, because POSIX allows things to go wrong otherwise", he is already giving guarantees above those given by POSIX. And that, I think, is slightly contradictory.

What the application developers would seem to like is for the ordering of operations (writes and renames) to be consistent within a file(*) even in case of a crash. They don't care when something happens, they care about it happening in order. It's not required by POSIX, but neither is any kind of saving after a crash, like journaling. Journaling is already there only for the convenience of users (rather than for compliance with standards).
(*) I mean "file" from the application's point of view, with the directory entry included.

Doing the data-metadata commits in order doesn't rule out delayed allocation or any such. The fs can still delay as it likes. It also doesn't mean that everything should happen in order, just that things relevant to the same file need to. (And if the file gets deleted before anything is committed, well, good, no need to write.) But committing something that was called later (in the famous example, this would be the rename) before something that was called earlier (this would be the write) seems a bit silly, and counter-intuitive.

Ok, now yell at me for being horribly wrong.

Levels of reality

Posted Mar 17, 2009 0:21 UTC (Tue) by jlokier (guest, #52227) [Link]

The *precise* meaning of fsync has some wiggle-room, as you noticed.

It's physically impossible to be absolutely sure of retrieving your data after any kind of crash. No storage device is immune to weird failures.
So it would be pointless to define fsync to mean that - it could never be implemented.

fsync is just a way of asking "please do what you reasonably can - flush delayed writes, etc., so I can expect the file to be in its current state after a crash following the fsync provided no really weird mega-corruption failure happened".

It's obviously very useful.

The POSIX talk is pointless

Posted Mar 16, 2009 23:15 UTC (Mon) by sbergman27 (guest, #10767) [Link] (6 responses)

The situation here, from a bird's eye view, is quite simple. If ext4, with real applications running in the real world, exhibits this behavior, then ext4 and Linux, not necessarily in that order, are going to come to be considered unstable crap. Those real applications include old abandoned FOSS code as well as proprietary code used by businesses that is not likely to get rewritten any time soon. Linux and ext4, and not the application developers, will take the fall. If the solution involves hurting ext4 in the benchmarks, then so be it. If it means a little more fragmentation, so be it. I cannot believe that there is any question about this. If nodelalloc needs to be made the default, with a big warning slapped onto the option that enables delayed allocation, then so be it. It seems absolutely surreal to see extX devs debating whether an easily demonstrable and severe lack of data integrity might be worth it to get better benchmark numbers. (And it makes me start wondering how much I should trust ext4 in general.) I guess maybe they've just been obsessing so much over benchmark figures that they've totally lost perspective. Hopefully, it is just a temporary condition.

The POSIX talk is pointless

Posted Mar 16, 2009 23:45 UTC (Mon) by foom (subscriber, #14868) [Link] (5 responses)

> If ext4, with real applications running in the real world, exhibits this behavior then ext4 and Linux, not necessarily in that order, are going to come to be considered unstable crap.

Please note that they already made and applied patches to restore the behavior people are arguing
for. You don't have to convince anybody that the most practical thing to do right now is fix this
behavior in the filesystem. So ext4 does *not* exhibit this behavior, anymore.

The only question remaining is if the filesystems *should* have had to make this change, or if what
they were doing before is really perfectly okay, except for all those darn applications people wrote
all of which are broken.

The POSIX talk is pointless

Posted Mar 17, 2009 0:40 UTC (Tue) by sbergman27 (guest, #10767) [Link] (4 responses)

I started to post again to clarify that point. But decided not to.

My understanding is that those patches keep existing files from being zapped, but still provide significantly less in the way of guarantees for the data integrity of new files. I simply don't see any room for reliability regressions at default settings, relative to ext3, in the name of performance, period. Old files, new files, whatever. If delayed allocation makes your data less safe and that can't be fixed, then it needs to be turned off by default. People can turn it on to get their benchmark numbers.

One of the major benefits of delayed allocation is supposed to be reduced fragmentation. Well, we're supposed to be getting an online defragger, aren't we? And I thought ext filesystems were already supposed to be doing well enough on fragmentation avoidance that we didn't really have to worry about it anyway.

The POSIX talk is pointless

Posted Mar 17, 2009 7:09 UTC (Tue) by rahulsundaram (subscriber, #21946) [Link] (3 responses)

The patches fix the behaviour to match application usage patterns, and reliability doesn't seem to be affected any more than Ext3 for the typical use cases, plus you get a nice performance benefit.

Ext3 avoids fragmentation to a good extent, but it is still possible to run into situations where the filesystem is badly fragmented, especially as you get closer to reaching your storage limits. Ext4 improves on the fairly simple and stupid allocation method in Ext3 with a more intelligent one. An online defragmenter is an additional feature and not a replacement for the allocator.

The POSIX talk is pointless

Posted Mar 17, 2009 13:46 UTC (Tue) by sbergman27 (guest, #10767) [Link] (2 responses)

"""
...reliability doesn't seem to be affected anymore than Ext3 for the typical use cases...
"""

That sounds like a "weasel phrase". "Typical Use Cases"? The lofty goal of ext4 is now to preserve user data in "typical use cases"?

If I mount two filesystems, one ext3 and one ext4, both using the defaults, then open a new file on both, write some data to both, close both, wait 30 seconds, and pull the power plug, what happens?

If the answer is "the same thing for both" or "ext4 is more reliable in that scenario" then great. If the answer is "ext4 is less reliable" then ext4 will never see my production systems, with or without nodelalloc.
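For what it's worth, the writer side of that test could be as small as this C sketch (the file name is a placeholder; the plug is pulled by hand during the sleep):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch of the crash test described above: create a brand-new file,
     * write to it, close it, and deliberately skip fsync(); power is cut
     * externally while the program sleeps. */
    int main(void)
    {
        const char *msg = "data written well before the power is pulled\n";
        int fd = open("newfile.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, msg, strlen(msg));
        close(fd);                 /* no fsync, on purpose */
        sleep(60);                 /* pull the plug during this window */
        return 0;
    }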

The POSIX talk is pointless

Posted Mar 17, 2009 17:04 UTC (Tue) by rahulsundaram (subscriber, #21946) [Link] (1 responses)

I can't speak on behalf of the Ext4 developers, but no filesystem will preserve your data in all cases. The principle is to optimize for the common use cases and then add things to cover the corner cases as well, to the extent possible. That is what has happened in Ext4 as well.

With the current patches, you should see the same behaviour in both filesystems. Feel free to test and report if you see any changes.

The POSIX talk is pointless

Posted Mar 17, 2009 17:31 UTC (Tue) by sbergman27 (guest, #10767) [Link]

"""
...no filesystem will preserve your data in all cases...
"""

Ext3 with data=journal will come quite close to it, though. In fact, ext3 with data=ordered comes impressively close to it.

"""
With the current patches, you should see the same behaviour in both filesystems.
"""

I'll be running exactly this scenario when 2.6.30 is released to see what happens.

