
Better guidance for database developers

Posted Sep 25, 2019 21:44 UTC (Wed) by nybble41 (subscriber, #55106)
In reply to: Better guidance for database developers by rweikusat2
Parent article: Better guidance for database developers

You seem to be arguing that POSIX compliance requires fsync() on a file to imply an fsync() on the parent directory, and potentially all other ancestor directories up to the root of the filesystem. Or possibly *multiple* parent directories and their ancestors in the case of hard links. Do you have any examples of POSIX-style operating systems which make such guarantees?

Personally I'd say that the Linux implementation is perfectly compliant. The fsync() call ensures that the data and metadata for the target file (i.e., inode) is written to the backing device. After reset and recovery any process with a reference to the file will read the data which was present at the time of the fsync() call (unless it was overwritten later). This is enough to satisfy the requirements. In order to get such a reference, however, you need directory entries to associate a path with that inode. Those directory entries are not part of the file, and the creation of a directory entry is not an I/O operation on the file, so an fsync() call on the file itself does not guarantee anything about the directory. For that you need to fsync() the directory.
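A minimal C sketch of that pattern, assuming a hypothetical file at /a/file and omitting error handling for brevity:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Flush the file's own data and metadata (the inode). */
        int fd = open("/a/file", O_WRONLY | O_CREAT, 0644);
        write(fd, "payload", 7);
        fsync(fd);                     /* durable: inode + data blocks */
        close(fd);

        /* The directory entry is separate; flush it explicitly. */
        int dfd = open("/a", O_RDONLY | O_DIRECTORY);
        fsync(dfd);                    /* durable: name -> inode mapping */
        close(dfd);
        return 0;
    }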



Better guidance for database developers

Posted Sep 25, 2019 22:12 UTC (Wed) by rweikusat2 (subscriber, #117920)

As I quoted in an earlier post:

The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.

You're correct insofar as this doesn't explicitly demand that the data which was recorded can ever be retrieved again after such an event, IOW, that an implementation which effectively causes it to be lost is perfectly compliant :-). But that's sort of a moot point as any "synchronous I/O capability" is optional, IOW, loss of data due to write-behind caching of directory operations is just a "quality" of (certain) Linux implementations of this facility. I'm - however - pretty convinced that the idea was that the data can be retrieved after a sudden "cache catastrophe", not that it just sits on the disk as a magnetic ornament. In any case, POSIX certainly doesn't "mandate" this "feature".

Better guidance for database developers

Posted Sep 26, 2019 20:29 UTC (Thu) by nybble41 (subscriber, #55106)

> I'm - however - pretty convinced that the idea was that the data can be retrieved after a sudden "cache catastrophe", not that it just sits on the disk as a magnetic ornament.

Even if you mandated that fsync() == sync(), so that *all* filesystem data was written to disk before fsync() returns, it still wouldn't guarantee that there is actually a directory entry pointing to that file. For example, it could have been unlinked by another process, in which case the data on disk really would be nothing more than a "magnetic ornament".

Let's say process A creates a file with path "/a/file", writes some data to it, and calls fsync(). While this is going on, another process hard-links "/a/file" to "/b/file" and then unlinks "/a/file" prior to the fsync() call. Would you expect the fsync() call to synchronize both directories, or just the second directory?
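Roughly, with the hypothetical paths from above and the two processes' steps shown sequentially for readability:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Process A: create the file and write to it. */
        int fd = open("/a/file", O_WRONLY | O_CREAT, 0644);
        write(fd, "data", 4);

        /* Process B, meanwhile: give the inode a second name,
         * then remove the first one. */
        link("/a/file", "/b/file");
        unlink("/a/file");

        /* Process A: the descriptor refers only to the inode; it
         * carries no path, so which directory would this flush? */
        fsync(fd);
        close(fd);
        return 0;
    }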

Better guidance for database developers

Posted Sep 26, 2019 20:55 UTC (Thu) by rweikusat2 (subscriber, #117920)

I'm sorry but you're just dancing around the issue. UNIX(*) file systems used to do directory modifications synchronously in order to guarantee (to the extent this was possible) file system integrity in case of a sudden loss of cache contents. And that's what the people who wrote the POSIX text had in mind: a situation where there's file data in the filesystem but no directory entry pointing to it cannot occur. Hence, ensuring that all file data and metadata is written, as per the definition of fsync, is sufficient to guarantee that the file won't be lost.

The Linux ext2 file system introduced write-behind caching of directory operations in order to improve performance at the expense of reliability in situations deemed to be rare. Because of this, depending on the filesystem being used, fsync on a file descriptor is not sufficient to make a file crash-proof on Linux: an application would need to determine the file's absolute path, fsync every directory between the root of the file system and the file, and then call fsync on the file descriptor itself. This is obviously not a requirement applications will realistically meet in practice.
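A rough sketch of such a helper, assuming an absolute path (it walks leaf to root rather than root to leaf; the order of the fsyncs doesn't change the outcome):

    #include <fcntl.h>
    #include <libgen.h>
    #include <string.h>
    #include <unistd.h>

    /* fsync() every ancestor directory of an absolute path,
     * from the file's parent up to "/". */
    static int fsync_ancestors(const char *path)
    {
        char buf[4096];
        char *dir = buf;

        strncpy(buf, path, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        for (;;) {
            dir = dirname(dir);        /* strip one path component */
            int fd = open(dir, O_RDONLY | O_DIRECTORY);
            if (fd < 0)
                return -1;
            if (fsync(fd) < 0) {
                close(fd);
                return -1;
            }
            close(fd);
            if (strcmp(dir, "/") == 0) /* reached the root */
                return 0;
        }
    }

Only after all of that succeeded would the application fsync the file descriptor itself.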

Possibly 'hostile' activities of other processes (as in "Let's say ...") are of no concern here, because that's not a situation fsync is supposed to handle.

