Atomic writes without tears
Posted Jan 21, 2025 21:33 UTC (Tue) by zblaxell (subscriber, #26385)
In reply to: Atomic writes without tears by alkbyby
Parent article: Atomic writes without tears
If you only want the filesystem to get out of the way of writing directly to your block device, yes. But you are missing some things.
The average cost isn't necessarily the worst-case cost. The checksum blocks can be amortized across multiple data blocks if the application follows the pattern "it does multiple buffered writes, then calls fdatasync() to make that data persistent and to detect if there are any problems." Efficiency depends on how much data is batched up before the fdatasync(). There's also no rule that says the metadata write has to go to the same device as the data write in a cow filesystem, which may affect the cost analysis (i.e. even in the one-data-block, one-metadata-block write case, one of the writes may be much cheaper than the other).
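A rough sketch of that pattern, to make the amortization concrete. The names write_batch and metadata_overhead, the 4096-byte block size, and the "one metadata block per sync" cost model are all assumptions for illustration; no real filesystem accounts for metadata this simply.

```python
import os

BLOCK_SIZE = 4096  # assumed block size, for illustration only

# os.fdatasync() is not available on every platform; fall back to os.fsync().
_sync = getattr(os, "fdatasync", os.fsync)


def write_batch(path, blocks):
    """Do several buffered writes, then one sync to make the batch persistent.

    Any I/O problem the kernel reports at sync time surfaces here as OSError.
    Returns the number of blocks written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for block in blocks:
            view = memoryview(block)
            while len(view) > 0:
                # os.write() may write fewer bytes than requested.
                written = os.write(fd, view)
                view = view[written:]
        _sync(fd)
    finally:
        os.close(fd)
    return len(blocks)


def metadata_overhead(data_blocks, metadata_blocks=1):
    """Extra metadata blocks written per data block (hypothetical cost model)."""
    return metadata_blocks / data_blocks
```

Under this toy model, syncing after every block costs one extra metadata write per data write (overhead 1.0), while syncing after eight blocks drops that to 0.125.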
API-level compatibility is sometimes more important than performance. Do we want to force application developers to implement two backends, one for cow and one for non-cow filesystems, by failing in cases where there is no direct hardware support? Or do we prefer a model like "use a thin filesystem if the hardware supports atomic writes, or use a cow filesystem if not", and keep only the backend that uses atomic writes? If there's already an ioctl associated with this, it seems we could have one that says what we can expect from the backend (no support, emulated software support, full hardware support, or "one or more of the lower block layers does not implement this ioctl"). I don't expect PostgreSQL to drop non-atomic-write backends any time soon, but new projects might not bother supporting the old ways if some other part of the system will do the work for them.
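From the application side, the answer to such a query might be consumed something like this. This is purely hypothetical: AtomicWriteSupport and can_use_atomic_backend are made-up names, and no existing ioctl is claimed to return these values; the four cases are just the ones listed above.

```python
from enum import Enum


class AtomicWriteSupport(Enum):
    """Hypothetical answers to a "what can the backend do?" query."""
    NONE = "no support"
    EMULATED = "emulated software support"
    HARDWARE = "full hardware support"
    UNKNOWN = "one or more lower block layers does not implement the query"


def can_use_atomic_backend(support):
    """A single atomic-write backend works if the guarantee exists at all,
    whether the hardware or a cow filesystem provides it."""
    return support in (AtomicWriteSupport.EMULATED, AtomicWriteSupport.HARDWARE)
```

An application that only needs the guarantee would treat EMULATED and HARDWARE the same; one that cares about performance could still distinguish them.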
Administrator controls are sometimes more important than performance. For example, if we have bought all the good hardware we can and still need more, we can end up with some terrible storage devices. The sysadmin might force all applications to run on a cow filesystem with checksums so that device failures (particularly unreported data corruption) can be detected. That covers "data integrity is more important than performance" too.
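A toy illustration of how per-block checksums catch corruption that the device never reports. It uses CRC32 from Python's zlib module and a 4096-byte block size as stand-ins; the function names are invented for this sketch, and real checksumming filesystems have their own algorithms and on-disk layouts.

```python
import zlib


def checksum_blocks(data, block_size=4096):
    """Return one CRC32 per block of data (the last block may be short)."""
    return [
        zlib.crc32(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data, expected, block_size=4096):
    """Return the indexes of blocks whose checksum no longer matches."""
    actual = checksum_blocks(data, block_size)
    return [
        index
        for index, (got, want) in enumerate(zip(actual, expected))
        if got != want
    ]
```

The checksums are recorded at write time and compared on a later read or scrub, so a bit flipped by the device shows up as a mismatch on that block even though the read itself returned no error.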
Filesystems that have checksums can verify writes long after the fact, and cow filesystems can do write ordering cheaply. In some use cases that makes the fdatasync() unnecessary (mostly the same cases where async commit would be acceptable in the database). We'd still want to make sure that some obscure part of the kernel doesn't try to merge writes in a way that ends up tearing or reordering them before they get to the filesystem.
