What about other filesystems?

Posted Jan 17, 2021 21:03 UTC (Sun) by matthias (subscriber, #94967)
In reply to: What about other filesystems? by Wol
Parent article: Fast commits for ext4

>> Actually this is very much what could happen, and still can happen, with non-journaling filesystems. Pull the power cord at the wrong moment and you have a broken filesystem that cannot be mounted. With a bit of luck you can get your files back with fsck. Fortunately, the situation has improved. With a journaling filesystem you should in any case only lose data that was being modified.

>And this is pretty much the perfect example of what is wrong with the current setup. The filesystem journal is there to protect the filesystem, to hell with the user's data.

If the filesystem gets corrupted, then all user data is lost, so indeed the journal is there to protect the data.

>So HOW as a database developer am I supposed to protect my database (other than writing to a raw partition!) if I can't trust the filesystem to protect my user-space journal!

You can trust it. Do an fsync on the user-space journal. Anything else will not work and has never worked. Ok, especially if you are developing databases, O_DIRECT might be a viable alternative. And writing to a raw partition also does not guarantee that the writes are in order. You still have to flush the caches to ensure that something has actually hit the disk, or use synchronous IO from the beginning.
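A minimal sketch of what "do an fsync on the user-space journal" can look like, in Python for brevity. The function name append_durably and the record format are made up for illustration; only os.open, os.write and os.fsync are real calls.

```python
import os
import tempfile


def append_durably(journal_path: str, record: bytes) -> None:
    """Append one record and return only after fsync() has succeeded."""
    fd = os.open(journal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(record)
        while view:
            # os.write() may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data and metadata to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "journal.log")
        append_durably(path, b"BEGIN 1\n")
        append_durably(path, b"COMMIT 1\n")
```

Only records whose append_durably call has returned should be relied upon after a crash. The sketch does not fsync the parent directory, which a newly created journal file would also need for its directory entry to be durable.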

>That's why ext3 journal="ordered" was so good - it gave the APPLICATION DEVELOPERS the guarantee that, after a crash, writes *appeared* to have been written to disk in the order that they were made. That's ALL a developer needs!

This is clearly wrong. data=ordered only ensures that data is written before the related meta-data. Writes to different files are not guaranteed to be ordered. Overwriting a file is not guaranteed to be ordered. The blocks can hit the disk in a random order. What data=ordered does guarantee is that when appending to a file, the new data hits the disk before the new blocks are added to the inode, i.e., it is not possible that the file contains garbage instead of the new data. And I think it is guaranteed that a rename only is done after the renamed file has hit the disk. This of course helps those people who update data using the "create new file, then rename" model. But there is not much that really helps for databases.
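For reference, the "create new file, then rename" model can be sketched like this in Python. The function name replace_file_atomically is hypothetical, and the explicit fsync calls are there so that the sketch does not depend on the filesystem ordering the rename after the data on its own.

```python
import contextlib
import os
import tempfile


def replace_file_atomically(path: str, data: bytes) -> None:
    """Replace `path` so that readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()                # push Python's buffer into the kernel
            os.fsync(f.fileno())     # push the kernel's cache to the device
        os.replace(tmp_path, path)   # rename(2): swap in the new file
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    # Sync the containing directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        replace_file_atomically(os.path.join(d, "settings"), b"key=value\n")
```

Note that mkstemp creates the file with mode 0600, so real code would probably want to copy the permissions of the file being replaced.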

>(Oh, and I don't think my work on raid would help that, even if raid could provide those guarantees which I hope it can, that an application could rely on it unless the filesystem ALSO provided those guarantees.)

The work on raid is essential, as of course all layers below the filesystem have to provide certain data safety features. In particular, the filesystem requires some kind of transactional semantics for its own journal.

Best,
Matthias

P.S.: Actually database developers are probably among the first to scream in terror if a filesystem will guarantee strong ordering properties because of the immense performance penalties, which they do not want to buy. At least the big systems know pretty well which data has which ordering requirements, which data should be in memory cache or will likely not be used again soon and want to control all these aspects themselves. If they do not use a raw partition from the beginning, they will just use the filesystem to reserve a bunch of blocks and then use synchronous IO.

What about other filesystems?

Posted Jan 17, 2021 21:18 UTC (Sun) by Wol (subscriber, #4433)

> Actually database developers are probably among the first to scream in terror if a filesystem will guarantee strong ordering properties

I did say the *appearance* of strong ordering guarantees :-)

> If the filesystem gets corrupted, then all user data is lost, so indeed the journal is there to protect the data.

But if the data in flight is corrupted, then the only way to get the system back to a sane (for the user) state may be "format, recover backup". Yes, making sure the filesystem itself is consistent is important, but it's only part of the picture. If I can't trust the state of the data, I need to recover from backup.

Cheers,
Wol

What about other filesystems?

Posted Jul 23, 2021 17:55 UTC (Fri) by andresfreund (subscriber, #69562)

You can trust the data if you tell the OS you want to pay the price (use fsync, or O_SYNC/O_DSYNC) and you only rely on data known to be flushed after a crash.

The alternative you're proposing basically implies that writes cannot be buffered for any meaningful amount of time. Oh, your browser updated its history database? Let's just make that wait for all temporary files being written out, the file being copied concurrently, etc. And what's worse, do not allow any concurrent sync writes to finish (like file creation), because that would violate ordering.

Ext3's ordering guarantees were weaker and yet led to horrible stalls all the time. There were constant complaints about Firefox's SQLite databases stalling the system etc.

The OS/FS aren't magic.

What about other filesystems?

Posted Jul 23, 2021 18:01 UTC (Fri) by andresfreund (subscriber, #69562)

> The OS/FS aren't magic.

In particular they have no reliable way of knowing which files (or even parts of files) are related and need constrained ordering between writes, and which are unrelated and thus can be handled independently.

What about other filesystems?

Posted Jul 23, 2021 13:10 UTC (Fri) by Defenestrator (guest, #153400)

> And I think it is guaranteed that a rename only is done after the renamed file has hit the disk.

This is often true in practice (in particular, in ext3 and in ext4 outside of early versions), but not always explicitly guaranteed. See the auto_da_alloc option added to ext4 for more info and background.

What about other filesystems?

Posted Jul 23, 2021 17:43 UTC (Fri) by andresfreund (subscriber, #69562)

> Do an fsync on the user-space journal.

With a small bit of care fdatasync() should be enough, and will often be a lot faster (no synchronous updating of unimportant filesystem metadata like mtime, turning a single write into multiple).
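One possible reading of the "small bit of care", sketched in Python: if the journal file is filled to its final size once and fsynced, later writes do not change the file length, so fdatasync() has no size update to push out along with the data. The class name PreallocatedJournal and its layout are hypothetical; this is an illustration under that assumption, not a description of any particular database.

```python
import os
import tempfile


def pwrite_all(fd: int, data: bytes, offset: int) -> None:
    """Write all of `data` at `offset`, looping over short writes."""
    view = memoryview(data)
    while view:
        n = os.pwrite(fd, view, offset)
        view = view[n:]
        offset += n


class PreallocatedJournal:
    """Hypothetical fixed-size journal whose appends only need fdatasync()."""

    def __init__(self, path: str, size: int) -> None:
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
        self.size = size
        self.offset = 0
        # Zero-fill to the final size and sync once, metadata included,
        # so that later writes never have to extend the file.
        pwrite_all(self.fd, bytes(size), 0)
        os.fsync(self.fd)

    def append(self, record: bytes) -> None:
        if self.offset + len(record) > self.size:
            raise ValueError("journal full")
        pwrite_all(self.fd, record, self.offset)
        os.fdatasync(self.fd)   # data only; mtime and the like may lag behind
        self.offset += len(record)

    def close(self) -> None:
        os.close(self.fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        journal = PreallocatedJournal(os.path.join(d, "wal"), 4096)
        journal.append(b"BEGIN;COMMIT;")
        journal.close()
```

A real journal would also need a way to find the end of the valid records after a crash (checksums or similar), which is left out here.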

> Anything else will not work and has never worked. Ok, especially if you are developing databases, O_DIRECT might be a viable alternative.

O_DIRECT on its own does not remove the need for an fsync/fdatasync. Devices have volatile write caches, and they do lose data on power loss/reset. O_DIRECT alone only avoids issues with the OS write cache. It also makes the f[data]sync cheaper, because it will often only have to send a cache flush (and transparently avoid that on devices without volatile write caches).

Alternatively it can be combined with O_DSYNC to achieve actually durable writes - but if one isn't careful that can tank throughput. It either adds the FUA flag to each write or does separate sync commands after each write, which can end up being more cache flushes for a workload. It can be significantly faster to do a series of writes and then a single cache flush.

It's hardware dependent too whether FUA or separate cache flush commands are faster :(. Dear Samsung, please fix your drives.

My testing says that on most NVMe devices with a volatile cache DSYNC wins if the queue depth is very low and latency is king (only one roundtrip needed). fdatasync wins if there are more than a few writes happening at once, especially if all/most need to complete before the "user transaction" finishes - the lower number of flushes saves more than the additional roundtrip costs.
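To make the two strategies concrete, here is a rough Python sketch. Both function names are made up, and O_DIRECT is deliberately left out (it needs aligned buffers), so these are buffered writes and not a faithful reproduction of the setup described above. The point is only where the durability cost is paid: per write, or once per batch.

```python
import os
import tempfile


def write_each_dsync(path: str, blocks: list, block_size: int) -> None:
    """O_DSYNC: every pwrite() returns only once its data is durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        for i, block in enumerate(blocks):
            os.pwrite(fd, block, i * block_size)
    finally:
        os.close(fd)


def write_batch_then_flush(path: str, blocks: list, block_size: int) -> None:
    """Plain writes followed by a single fdatasync() covering all of them."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for i, block in enumerate(blocks):
            os.pwrite(fd, block, i * block_size)
        os.fdatasync(fd)   # one flush for the whole batch
    finally:
        os.close(fd)


if __name__ == "__main__":
    blocks = [bytes([i]) * 4096 for i in range(8)]
    with tempfile.TemporaryDirectory() as d:
        write_each_dsync(os.path.join(d, "dsync"), blocks, 4096)
        write_batch_then_flush(os.path.join(d, "batch"), blocks, 4096)
```

Both end up with the same bytes on disk. Which one is faster depends on the device and the workload, as described above, so timing them on the actual target hardware is the only way to know.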

> Actually database developers are probably among the first to scream in terror if a filesystem will guarantee strong ordering properties because of the immense performance penalties, which they do not want to buy.

Indeed! And there's no realistic way the FS can do better.