
ext4 and data loss

Posted Mar 12, 2009 16:15 UTC (Thu) by iabervon (subscriber, #722)
In reply to: ext4 and data loss by ricwheeler
Parent article: ext4 and data loss

I think there are plenty of applications that expect rename() to atomically replace an old file with a new one, so that there is no point at which a system crash and restart will leave another process that accesses the path finding it missing, or with contents other than the old file or the new one. POSIX doesn't specify anything about system crashes, but it does specify that a concurrently-running process can't see anything other than the state where the rename hasn't happened or the state where the rename has happened after the data is written. I'm not sure that POSIX even specifies that fsync() or fdatasync() will be particularly useful in a system crash; it does specify that your data will have been written when the system call returns, but that doesn't mean that the system crash won't completely, or even selectively, destroy your filesystem.

So the sensible thing to do is to treat the rename as dependent on the data write. Any program that truncates a file and then writes to it will tend to produce 0-length files in a system crash, but it also tends to produce 0-length files in an application crash, or in a race with other code; at best it's a case where it's okay and expected not to find any particular file contents. If your desktop software actually erases all of your files and then hopes to be able to write new contents into them before anybody notices, that is an application bug, and using ext3 won't change the fact that the approach is failure-prone even aside from filesystem robustness.
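To make that expectation concrete, here is a minimal sketch in C of the write-new-then-rename pattern the comment describes, with error handling abbreviated and the file names invented for illustration: the new contents go into a temporary file, fsync() is called so the data reaches stable storage before the rename, and only then does rename() atomically replace the old file.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Replace the file at `path` with new contents so that, at every
     * instant, the path names either the complete old file or the
     * complete new one.  The ".tmp" suffix is a hypothetical choice. */
    int replace_file(const char *path, const char *data, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        if (write(fd, data, len) != (ssize_t)len)
            goto fail;

        /* Harden the new contents before the rename, so a crash
         * cannot leave the path pointing at an empty file. */
        if (fsync(fd) != 0)
            goto fail;
        if (close(fd) != 0) {
            unlink(tmp);
            return -1;
        }

        /* Atomic replacement: other processes (and, one hopes, the
         * post-crash filesystem) see only the old or the new file. */
        if (rename(tmp, path) != 0) {
            unlink(tmp);
            return -1;
        }
        return 0;

    fail:
        close(fd);
        unlink(tmp);
        return -1;
    }

Whether the fsync() is actually required here, or whether the filesystem should order the data write before the rename on its own, is precisely the tradeoff this thread is arguing about.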



ext4 and data loss

Posted Mar 13, 2009 0:11 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

I'm not sure that POSIX even specifies that fsync() or fdatasync() will be particularly useful in a system crash; it does specify that your data will have been written when the system call returns, but that doesn't mean that the system crash won't completely or even selectively destroy your filesystem.

It doesn't talk about system crashes (it wouldn't be practical to specify what a system does when it's broken), but it heavily implies crash-related function. It also does not specify that data will have been written after fsync -- POSIX is more abstract than that. The POSIX user doesn't know what a cache is; he doesn't know there's a disk drive holding his files. In POSIX, write() writes to a file. It doesn't schedule a write for later, it writes it immediately. But it allows (by implication) that certain kinds of system failures can cause previously written data to disappear from a file. It then goes on to introduce the concept of "stable storage" -- fsync() causes previously written data to be stored that way. fsync() isn't about specific I/O operations; what it does is harden previously written data so that these certain kinds of system failures can't destroy it.
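As a small illustration of that reading of the standard (the file name and record are invented for the example), a successful write() already updates the file as every running process sees it; fsync() adds only the "stable storage" guarantee, and its return value is the caller's only indication that the hardening happened:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical log file used for illustration. */
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        const char rec[] = "state: committed\n";
        /* After a successful write(), every process already sees the
         * new bytes; POSIX says nothing here about caches or disks. */
        if (write(fd, rec, sizeof(rec) - 1) != (ssize_t)(sizeof(rec) - 1)) {
            perror("write");
            return 1;
        }

        /* fsync() hardens the previously written data against the
         * (implementation-defined) class of failures that "stable
         * storage" survives.  Only its return says it worked. */
        if (fsync(fd) != 0) {
            perror("fsync");
            return 1;
        }
        return close(fd) != 0;
    }
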

POSIX is, incidentally, notoriously silent on just how stable stable is, leaving it up to the designer's imagination which system failures it hardens against. And there is a great spectrum of stability. For example, I know of no implementation where fsync hardens data against a disk drive head crash. I do know of implementations where it doesn't harden it against a data center power outage.

ext4 and data loss

Posted Mar 13, 2009 2:14 UTC (Fri) by iabervon (subscriber, #722) [Link]

Beyond POSIX, I think that what users of a modern enterprise-quality *nix OS writing to a good-reliability filesystem expect is that operations which POSIX says are atomic with respect to other processes are usually atomic with respect to processes after a crash (mostly of the unexpected-halt variety), and that fsync() forces those other processes to see the operation as having happened.

That is, you can think of "stable storage" as a process that reads the filesystem from time to time and, after a crash, repopulates it with what it last read; fsync() returns only once such a read has happened after the point where you called it. You don't know exactly what "stable storage" read, and it can suffer all the same race conditions and time skew as any other concurrent process. If the post-crash filesystem matches some such snapshot, anything lost is the user's or application's carelessness; if it doesn't match any such snapshot, that's crash-related filesystem damage.

ext4 and data loss

Posted Mar 13, 2009 2:51 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

Beyond POSIX, I think that what users of a modern enterprise-quality *nix OS writing to a good-reliability filesystem expect is that operations which POSIX says are atomic with respect to other processes are usually atomic with respect to processes after a crash (mostly of the unexpected-halt variety)
In an ideal world, that would be exactly what you'd see: after a cold restart, the system would come up in some state the system was in at a time close to the crash, not some made-up non-existent state the filesystem cobbles together from bits of wreckage. Most filesystems weaken this guarantee somewhat, but leaving NULL-filled and zero-length files that never actually existed on the running system is just unacceptable.

fsync() forces the other processes to see the operation having happened
Huh? fsync has nothing to do with what other processes see. fsync only forces a write to stable storage; it has no effect on the filesystem as seen from a running system. In your terminology, it just forces the conceptual "filesystem" process to take a snapshot at that instant.

ext4 and data loss

Posted Mar 13, 2009 15:26 UTC (Fri) by iabervon (subscriber, #722) [Link]

In an ideal world, that would be exactly what you'd see: after a cold restart, the system would come up in some state the system was in at a time close to the crash, not some made-up non-existent state the filesystem cobbles together from bits of wreckage.
The model works if you include the fact that, in a system crash, unintended things are, by definition, happening. Any failure of the filesystem to make up a possible state afterwards appears as fallout from the crash. Maybe some memory corruption changed your file descriptors, and your successful writes and successful close went to some other file (but the subsequent rename found the original names). Maybe something managed to write zeros over your file lengths. How often undefined behavior leads to noticeable problems is not a matter of standards, but it is a matter of quality.
fsync has nothing to do with what other processes see. fsync only forces a write to stable storage; it has no effect on the filesystem as seen from a running system. In your terminology, it just forces the conceptual "filesystem" process to take a snapshot at that instant.
That's what I meant to say: it makes the "filesystem" process see everything that had already happened (and, by extension, the processes that run after the system restarts and look at the filesystem recovered from stable storage).

ext4 and data loss

Posted Mar 13, 2009 16:04 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

In an ideal world, that would be exactly what you'd see: after a cold restart, the system would come up in some state the system was in at a time close to the crash, not some made-up non-existent state the filesystem cobbles together from bits of wreckage. Most filesystems weaken this guarantee somewhat, but leaving NULL-filled and zero-length files that never actually existed on the running system is just unacceptable.

You mean undesirable. It's obviously acceptable, because you and most of your peers accept it every day. Even ext3 comes back after a crash with the filesystem in a state it was not in at any instant before the crash. The article points out that it does so to a lesser degree than some other filesystem types because of its 5-second flush interval instead of the more normal 30 (I think), and because two particular kinds of updates are serialized with respect to each other.

And since you said "system" instead of "filesystem", you have to admit that gigabytes of state are different after every reboot. All the processes have lost their stack variables, for instance. Knowing this, applications write their main memory to files occasionally. Knowing that even that data isn't perfectly stable, some of them also fsync now and then. Knowing that even that isn't perfectly stable, some go further and take backups and such.

It's all a matter of where you draw the line -- what you're willing to trade.

