Fast commits for ext4
Posted Jan 15, 2021 18:49 UTC (Fri) by NYKevin (subscriber, #129325)
Parent article: Fast commits for ext4
1. As usual, Ted's reasoning seems eminently sensible to me. Apps that really want to sync the whole filesystem should *already* be using syncfs(2).
2. Still, I could imagine some apps not knowing about the "and also you have to fsync the directory" rule (see fsync(2)), which is currently not really enforced since fsync in practice flushes "everything." It might be worth special-casing that, or providing a flag. But hard links make this trickier (which directory do you want to fsync?).
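For what it's worth, the "and also fsync the directory" rule from fsync(2) looks roughly like this in practice (a minimal Python sketch using the os module; the function name is mine, and error handling is abbreviated):

```python
import os

def create_durably(path: str, data: bytes) -> None:
    # Write and fsync the file itself...
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    # ...then fsync the containing directory, so that the new
    # directory entry (not just the file's data) survives a crash.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

With hard links the second half is ambiguous, as noted: the file has several directory entries, and you would have to pick one (or several) directories to fsync.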
3. It would be really nice if rename(2) would function as a write barrier, or at least not be reordered before any writes to the file that is renamed, but I'm not sure where current filesystems stand on that... I know this has definitely been discussed in the past, though (see for example this article: https://lwn.net/Articles/351422/).
4. Maybe if this gets performant enough, we can rename O_DSYNC to O_PONIES.
Posted Jan 16, 2021 7:46 UTC (Sat)
by khim (subscriber, #9252)
[Link] (4 responses)
Hyrum's Law says that what Ted says is kind of irrelevant. If the implementation provided a full sync instead of what the documentation says, then it should continue to do so. You may provide optional support for something else and then, over the course of many years, test more and more apps with it. Maybe, just maybe, it may become the new default, eventually. But that's not guaranteed… Doing anything else is just asking for trouble, unfortunately.
Posted Jan 16, 2021 19:39 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
See for example the line in https://www.freebsd.org/cgi/man.cgi?query=fcntl&sekti... about the "completely stupid semantics of System V" (yes, they really put that in a man page). IIRC someone looked into it, and:
1. It had been standardized that way because that's what one particular implementation decided to do. There was never a proper rationale or anything.
2. Nobody could find any apps which actually rely on this behavior.
3. They were able to find at least one app which *accidentally* triggers this behavior and gets buggy file locking as a result.
If this wasn't already standardized in POSIX, I would be heavily in favor of Linux etc. just changing the behavior to something more sensible. Unfortunately, it's probably a Bad Idea to deliberately violate POSIX, even when POSIX is obviously silly, so we're likely stuck with it. But it would have been nice if someone had caught this earlier, and I think that Hyrum's Law would not have been helpful in that situation.
Moving on to the actual topic of discussion here. Applications, in general, don't use fsync much at all. Many of them do the open/write/rename ritual, which technically requires an fsync (after the write and before the rename) to guarantee a safe ordering, but I don't think many people bother with that. I am of the opinion that Hyrum's Law ought to require the rename to function as a barrier in this case, as I wrote in my initial comment. I don't think many apps are using an fsync of /path/to/foo.txt in order to guarantee durability of /path/to/bar.txt, but we would need to do a survey to be sure. If I'm right about that, then I don't think we need to apply Hyrum's Law here as (a) few apps would be broken in practice, (b) there's already an interface (syncfs) to do what those apps want (so it'd be easy to fix), and (c) the performance wins for all of the apps which use fsync correctly could potentially be quite large. Also (d) we don't want to penalize apps written for other Unices for using the standard interface, so introducing a Linux-only "no really, just fsync this one file" syscall is a Bad Idea.
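The open/write/rename ritual mentioned above, with the fsync step that many applications skip, can be sketched like this (a minimal Python sketch; the function and temporary-file names are mine):

```python
import os

def atomic_replace(path: str, data: bytes) -> None:
    tmp = path + ".new"   # any unique temporary name in the same directory works
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)      # the step many applications don't bother with
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomically replaces path with the new contents
```

Without the fsync, nothing in POSIX stops the rename from becoming durable before the file's data does, which is exactly the ordering question at issue here.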
Posted Jan 18, 2021 19:24 UTC (Mon)
by tytso (subscriber, #9993)
[Link]
1) Write the new contents of foo.c to foo.c.new
2) Fsync foo.c.new --- check the error return from the fsync(2) as well as the close(2)
3) Delete foo.c.bak
4) Create a hard link from foo.c to foo.c.bak
5) Rename foo.c.new on top of foo.c
This doesn't require an fsync of the directory, but it guarantees that /path/to/foo.c will either have the original contents of foo.c, or the new contents of foo.c, even if there is a crash at any time during the above process. If you want portability to other Posix operating systems, including people running, say, retro versions of BSD 4.3, this is what you should do. It's what emacs and vi do, and some of the "ritual", such as making sure you check the error return from close(2), is there because otherwise you might lose data if you run into a quota overrun on the Andrew File System (the distributed file system developed at CMU, and used at MIT Project Athena, as well as several National Labs and financial institutions).
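The five steps above translate almost one-to-one into os-module calls (a sketch under the stated assumptions; the function name is mine, and error handling is abbreviated to the checks the steps call out):

```python
import os

def portable_replace(path: str, data: bytes) -> None:
    new, bak = path + ".new", path + ".bak"
    # 1) Write the new contents to foo.c.new
    fd = os.open(new, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        # 2) Fsync foo.c.new; fsync and close both raise OSError on failure
        os.fsync(fd)
    finally:
        os.close(fd)
    # 3) Delete foo.c.bak (it may not exist on the first run)
    try:
        os.unlink(bak)
    except FileNotFoundError:
        pass
    # 4) Hard-link foo.c to foo.c.bak, keeping the old inode reachable
    os.link(path, bak)
    # 5) Rename foo.c.new on top of foo.c
    os.rename(new, path)
```

Because step 4 gives the old contents a second name before step 5 replaces the first, a crash at any point leaves either the old or the new contents reachable.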
That being said, rename is not a write barrier. But as part of the O_PONIES discussion, on a close(2) of a file which was opened with O_TRUNC, or on a rename(2) where the destination file is getting overwritten, an immediate write-out of the file being closed, or of the source file of the rename, will be initiated. It's not going to block the rename(2) operation from returning, but it narrows the race window from 30 seconds to however long it takes to accomplish the writeout, which is typically less than a second. It's also something that was implemented informally by all of the major file systems at the time of the O_PONIES controversy, but it doesn't necessarily account for what newer file systems (for example, bcachefs and f2fs) might decide to do, and of course, it is not applicable to what other operating systems such as MacOS might be doing.
The compromise is something that was designed to minimize performance impact, since users and applications also depend upon --- and get cranky about --- performance regressions, while still papering over most of the problems caused by careless applications. From the file system developers' perspective, the ultimate responsibility is on application writers if they think a particular file write is precious and must not be lost after a system or application crash. After all, an application may be doing something really stupid, such as overwriting a precious file in place using open(2) with O_TRUNC, because it's too much of a pain to copy over ACL's and extended attributes, so it's simpler to just use O_TRUNC, overwrite the data file, and cross your fingers. There is absolutely no way the file system can protect against application writer stupidity, but we can try to minimize the risk of damage, while not penalizing the performance of applications which are doing the right thing, and are writing, say, a scratch file.
Posted Jan 19, 2021 8:15 UTC (Tue)
by LtWorf (subscriber, #124958)
[Link] (1 responses)
If it's an ext4 quirk then any software relying on that would already break just by virtue of moving to a different filesystem.
Posted Jan 20, 2021 7:48 UTC (Wed)
by viiru (subscriber, #53129)
[Link]
> If it's an ext4 quirk then any software relying on that would already break just by virtue of moving to a different filesystem.
In my understanding that is exactly what it is. Ext3 shares the behavior, but for example XFS does not. This caught many application developers by surprise, but that happened a couple of decades ago and has most likely been fixed in any sensible application.