Ensuring data reaches disk
Posted Sep 16, 2011 10:35 UTC (Fri) by andresfreund (subscriber, #69562)
In reply to: Ensuring data reaches disk by scheck
Parent article: Ensuring data reaches disk
For that you need to issue some special commands - which e.g. fsync() knows how to do.
Besides, an O_DIRECT write doesn't guarantee that metadata updates have reached stable storage.
Posted Nov 8, 2020 21:55 UTC (Sun) by yzou93 (guest, #142976)
My question about fsync() is how the OS can control, or know about, the device-internal caching behavior. When designing block device hardware - for example, if Samsung wants to design a new SSD - is cache-control support for the fsync() commands issued by the OS required?

Thank you.
Posted Nov 8, 2020 23:15 UTC (Sun) by Wol (subscriber, #4433)
Cheers,
Wol
Posted Nov 9, 2020 9:56 UTC (Mon) by farnz (subscriber, #17727)
Yes, such a command is needed, and the various interface specs (ATA, SCSI, NVMe) all have standardised commands for flushing the cache.
At a minimum, you get a FLUSH CACHE or SYNCHRONIZE CACHE type command, which is specified as not completing until all data in the cache is in persistent storage; this is enough to implement fsync() behaviour. Beyond that, you can also have forced unit access (FUA) writes, which do not complete until the data written is on the persistent media, and even partial flush commands that only affect some sections of the drive.
There's an added layer of complexity in that some standards have queued flushes which act as straight barriers (all commands before the flush complete, then the flush happens, then the rest of the queue is processed); others have queued flushes that only affect commands issued before the flush on the same queue (and can over-flush by also flushing data from later commands in the queue); and yet others only have unqueued flushes, which require you to idle the interface, wait for the flush to complete, and then resume issuing commands.
Posted Nov 9, 2020 10:17 UTC (Mon) by Wol (subscriber, #4433)
If you can't be sure what has or hasn't hit the disk - the nightmare scenario is "part of the log, and part of the data" - then you get the hoops that I believe SQLite and PostgreSQL go through :-(
Cheers,
Wol
Posted Nov 9, 2020 17:11 UTC (Mon) by zlynx (guest, #2285)
I had to rebuild a btrfs volume because my laptop battery ran down in the bag, and on reboot the drive contained blocks saying writes had completed, but those data blocks had old data in them. In other words, data that had been committed to physical storage (or so the drive claimed) was no longer present after power loss. The drive probably had to run the equivalent of fsck on its flash FTL and lost some bits.
btrfs gets very upset about that.
I guess this behavior is still better than some older SSDs, which had to be secure-erased and reformatted after losing their entire FTL.
Posted Nov 9, 2020 18:27 UTC (Mon) by farnz (subscriber, #17727)
To be fair to btrfs, that's its USP compared to ext4 - when hardware fails, it lets you know that your data has been eaten at the time of the issue, and not months down the line.
And knowing consumer hardware, chances are very high that it did commit everything properly, and then had a catastrophic failure when there was a surprise power-down. Unfortunately, unless you have an acceptance lab verifying that kit complies with the intent of the spec, it often complies with the letter of the spec (if you're lucky) and no more :-(