What about other filesystems?

Posted Jan 17, 2021 17:40 UTC (Sun) by Wol (subscriber, #4433)
In reply to: What about other filesystems? by NYKevin
Parent article: Fast commits for ext4

What I find hard to understand is this: if the database (SQLite, whatever) is using Linux syscalls, how does it know the data has actually been written? Or does it do loads of sync()s, and then pause all writes for ten seconds or so while it waits for the data to flush?

I can see how databases can provide 99.999% reliability. I'm active on the RAID list, so I know all about disk timeouts, disks lying, how long things take to get flushed, and so on. What I simply do not see is how an application can guarantee safety.

As for "why should it be in the kernel" - because LOTS of developers will benefit from the ability to reason about the state of a system in a crash scenario. Why should all the database developers be forced to duplicate each other's work?

And frankly, if I commit something to the filesystem for saving, surely I should be able to ask the filesystem "have you saved it?" AND BE ABLE TO RELY ON THE ANSWER! (Yep, I know disks lie, and I don't expect the file system necessarily to deal with that, but it really should be held responsible for its own actions!)

Cheers,
Wol

What about other filesystems?

Posted Jan 17, 2021 22:13 UTC (Sun) by NYKevin (subscriber, #129325)

The process that SQLite uses is documented in great detail at https://sqlite.org/atomiccommit.html.

TL;DR: They make a copy ("rollback journal") of the data they are about to overwrite, fsync that copy, overwrite the data, fsync the database itself, and finally delete the rollback journal.
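
To make that sequence concrete, here is a rough Python sketch of the same ordering for a single byte range. It is not SQLite's code: the function names atomic_overwrite and recover, and the journal layout (an 8-byte offset followed by the original bytes), are made up for illustration. The real journal format has a header, page counts and checksums so that a torn or garbage journal can be detected, which this sketch does not attempt. It also assumes a POSIX system where fsync() on a directory descriptor works, and that the disk honours fsync at all.

```python
import os


def atomic_overwrite(db_path, offset, new_bytes):
    """Overwrite bytes at `offset` in db_path, guarded by a rollback journal."""
    journal_path = db_path + "-journal"  # hypothetical layout, not SQLite's format
    dir_fd = os.open(os.path.dirname(os.path.abspath(db_path)), os.O_RDONLY)
    try:
        db_fd = os.open(db_path, os.O_RDWR)
        try:
            # 1. Copy the bytes we are about to overwrite into the journal.
            original = os.pread(db_fd, len(new_bytes), offset)
            j_fd = os.open(journal_path,
                           os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            try:
                os.write(j_fd, offset.to_bytes(8, "big") + original)
                # 2. fsync the journal before touching the database.
                os.fsync(j_fd)
            finally:
                os.close(j_fd)
            # Also flush the directory so the journal's name survives a crash.
            os.fsync(dir_fd)

            # 3. Overwrite the data, 4. fsync the database itself.
            os.pwrite(db_fd, new_bytes, offset)
            os.fsync(db_fd)
        finally:
            os.close(db_fd)

        # 5. Deleting the journal is the commit point.
        os.unlink(journal_path)
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def recover(db_path):
    """On startup: if a journal is left over, put the old bytes back."""
    journal_path = db_path + "-journal"
    if not os.path.exists(journal_path):
        return False  # last write either committed or never started
    with open(journal_path, "rb") as j:
        data = j.read()
    offset = int.from_bytes(data[:8], "big")
    original = data[8:]
    db_fd = os.open(db_path, os.O_RDWR)
    try:
        os.pwrite(db_fd, original, offset)
        os.fsync(db_fd)
    finally:
        os.close(db_fd)
    os.unlink(journal_path)
    return True
```

The point of the ordering is that at every instant either the database is still untouched or a complete, already-flushed copy of the old bytes exists on disk. The application never needs to wait and hope; it only needs each fsync to return before it takes the next step, and it needs to call something like recover() before trusting the file after a crash.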