A journalling commit is more like an "asynchronous delayed commit", in database terms, when fsync() isn't used. Journal commits protect the integrity of the filesystem structure itself; they are not meant for application-level transactional changes. Sometimes that weaker kind of commit is fine, and the performance gain is large. fsync() makes it more like a standard database commit, where the data is supposed to be durable before the call returns.

This is one area where traditional databases can learn from filesystems. Some workloads don't actually need the database to commit quickly - the commit can take as long as it needs, batching and optimising I/O. All you need then is consistent rollback. Databases holding cached calculations are like this, for example.

Your point about partial writes on power failure and not using overlapping blocks (will sectors do?) is valid, and I'd like to know more about what database professionals have discovered about what exactly is and isn't safe. For example, can a failure to write sector N+1 corrupt sector N, written long ago? Is the "failure block size" larger than a single sector when using O_DIRECT (when that really works)? Is it larger than a filesystem/block-device block size when not using O_DIRECT? Why does Oracle use different journal block sizes on different OSes?

I think the filesystem implementors do know about that effect. Journal entries are finished with a commit block, which isolates the commit into its own block, untouched by the next transaction. I think your two/three ping-pong blocks correspond to the journal's finite wraparound length on a filesystem - do say more if that's not so.
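To make the fsync() point concrete: the usual application-level recipe for a database-style durable commit is write-to-temp, fsync the file, rename over the old copy, then fsync the directory so the rename itself survives a crash. This is a minimal sketch of that common pattern, not something from the comment above; the function name is my own.

```python
import os
import tempfile

def durable_commit(path, data):
    """Replace `path` with `data` so that, once this returns, the new
    contents are on stable storage - the database-commit guarantee.
    Without the fsync() calls, a journalling filesystem only promises
    that its own metadata stays consistent, not that `data` survived."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force the file data to disk first
        os.rename(tmp, path)       # atomic switch to the new contents
        dfd = os.open(dirname, os.O_RDONLY)
        try:
            os.fsync(dfd)          # make the rename itself durable
        finally:
            os.close(dfd)
    except BaseException:
        os.unlink(tmp)             # roll back: old contents untouched
        raise
```

Note the rollback property: because the old file is replaced atomically by rename(), a crash at any point leaves either the old contents or the new ones, never a partial mix - the "consistent rollback" half of the bargain, even if you choose to skip the fsync() calls for speed.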
Copyright © 2018, Eklektix, Inc.