
Why write twice

Posted Feb 20, 2025 19:55 UTC (Thu) by jengelh (guest, #33263)
Parent article: Support for atomic block writes in 6.13

>To work around this possibility, databases employ an additional technique called "double write".

Why do databases (still) need a double write? E.g. git, btrfs, and liblmdb seem to do just fine writing the data block and then (atomically) updating a pointer.
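
For illustration, a minimal sketch in C of that "write the data, then flip a pointer" pattern (the file names are hypothetical, and a real implementation would also fsync the containing directory; git does roughly this with object files and refs, using rename(2) as the atomic step):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int publish(const char *data, size_t len)
    {
        /* 1. Write the new data under its own, new name. */
        int fd = open("data.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, data, len) != (ssize_t)len || fsync(fd) < 0)
            return -1;
        close(fd);

        /* 2. Write a small "pointer" file naming the new data. */
        const char *ptr = "data.new\n";
        fd = open("current.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, ptr, strlen(ptr)) < 0 || fsync(fd) < 0)
            return -1;
        close(fd);

        /* 3. Atomically replace the old pointer: readers see either the
         *    old version or the new one, never a torn mix of the two.
         *    (A real implementation would also fsync the directory.) */
        return rename("current.tmp", "current");
    }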



Why write twice

Posted Feb 20, 2025 22:21 UTC (Thu) by butlerm (subscriber, #13312)

If you have atomic block writes you do not need to write two different versions of the block in a "double write". That said, most databases use a redo log, and block updates are committed to the redo log long before the data blocks themselves are rewritten by a checkpoint of some sort.
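
As a rough sketch of that ordering (hypothetical helpers and descriptors, not any particular engine's code): a change record is appended to the redo log and fsynced at commit time; only later, at checkpoint time, is the dirty page written out, and without atomic block writes the page image goes through a doublewrite area first so a torn in-place write can be repaired:

    #include <sys/types.h>
    #include <unistd.h>

    #define PAGE_SIZE 8192

    /* Commit: the change record only has to reach the redo log. */
    int commit(int redo_fd, const void *rec, size_t len)
    {
        if (write(redo_fd, rec, len) != (ssize_t)len)
            return -1;
        return fsync(redo_fd);               /* durable once this returns */
    }

    /* Checkpoint: write the dirty page twice - doublewrite area first,
     * then its home location - so a torn home write is recoverable.
     * With atomic block writes, the first step could be dropped. */
    int checkpoint_page(int dw_fd, int data_fd, const void *page, off_t home_off)
    {
        if (pwrite(dw_fd, page, PAGE_SIZE, 0) != PAGE_SIZE || fsync(dw_fd) < 0)
            return -1;                       /* 1. doublewrite copy */
        if (pwrite(data_fd, page, PAGE_SIZE, home_off) != PAGE_SIZE)
            return -1;                       /* 2. in-place update */
        return fsync(data_fd);
    }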

Filesystems like the NetApp filesystem, zfs, and btrfs use a phase tree (copy-on-write) approach that is not yet common for (relational) databases. One of the reasons it is not common is that most relational databases were, and are, designed to give acceptable performance on spinning-rust hard disks: internal pointers in the database - in database indexes in particular - need, in most designs, to translate to the physical address of the referenced block without any additional physical disk I/O.
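
To make that pointer-translation point concrete, a small sketch (field and constant names hypothetical): a classic rowid encodes (datafile, block, slot), so resolving an index entry to a disk address is pure arithmetic with no further I/O - which only works if blocks stay at fixed offsets, exactly what a copy-on-write tree does not guarantee:

    #include <stdint.h>

    #define DB_BLOCK_SIZE 8192

    struct rowid {
        uint16_t file_no;    /* which datafile */
        uint32_t block_no;   /* block within that datafile */
        uint16_t slot_no;    /* row slot within the block */
    };

    /* Resolving an index entry is arithmetic, not I/O: the block lives
     * at a fixed offset in its datafile, and the slot is then found
     * from a small directory in the block header. */
    static inline uint64_t rowid_to_offset(struct rowid r)
    {
        return (uint64_t)r.block_no * DB_BLOCK_SIZE;
    }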

That means they run more efficiently on directly mapped, non-phase-tree filesystems like xfs, ext4, or ntfs - if not on actual raw disk devices, which used to be quite common in some environments. Put a typical relational database on a filesystem like btrfs or zfs and it will slow down dramatically for that reason. It can be done, of course, especially with something like zfs, but most people don't do it. That goes for Oracle, MySQL, PostgreSQL, DB2, MS SQL Server, Sybase, and a number of other older relational databases that are not so popular anymore.

If you want to design a relational (or similar) database to use a phase tree approach internally, the place you probably ought to start is with the typical B-tree or hash indexes, which are already multiversioned in most designs - sometimes with versions that last a considerable amount of time, so that indexes can be rebuilt without taking a table offline.
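
The copy-on-write flavor of such a multiversioned index can be sketched like this (deliberately simplified to a single root-to-leaf level, ignoring splits and persistence; names are hypothetical): an update never modifies an existing node, it writes new copies up to a new root, and readers holding the old root keep seeing a consistent older version:

    #include <stdlib.h>

    #define FANOUT 64

    struct node {
        int nkeys;
        long keys[FANOUT];
        struct node *child[FANOUT];       /* NULL in leaf nodes */
    };

    /* Build a new version of the tree with `key` added to leaf number
     * `leaf_idx`; the old tree rooted at `root` is never modified, so
     * readers still using it see a consistent older index version. */
    struct node *cow_insert(const struct node *root, int leaf_idx, long key)
    {
        struct node *new_leaf = malloc(sizeof *new_leaf);
        struct node *new_root = malloc(sizeof *new_root);
        if (!new_leaf || !new_root) {
            free(new_leaf);
            free(new_root);
            return NULL;
        }
        *new_leaf = *root->child[leaf_idx];        /* copy, don't touch */
        new_leaf->keys[new_leaf->nkeys++] = key;   /* (splits ignored) */

        *new_root = *root;                         /* copy the root too */
        new_root->child[leaf_idx] = new_leaf;      /* point at new leaf */
        return new_root;                           /* caller publishes it */
    }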

And although it is usually slower, it is possible to store primary key references in secondary indexes instead of physical or logical database block references, and to use an index-organized table that basically puts the row data in the leaves of a B-tree (or similar tree), so that rows are ordinarily accessed only by primary key value rather than by something like (datafile, block, row). Then of course it doesn't really matter if data blocks and rows have several versions at new file/block offsets, because the database would not generally access them by file/block/row number anyway, except at a very low level.
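
A sketch of the difference in secondary-index entries (types hypothetical): the rowid form points straight at a physical location and goes stale if rows move, while the primary-key form costs a second tree descent into the index-organized table but does not care where the current row version physically lives:

    #include <stdint.h>

    /* Physical row address: (datafile, block, slot). */
    struct rowid {
        uint16_t file_no;
        uint32_t block_no;
        uint16_t slot_no;
    };

    /* Heap-table style: secondary key -> physical location.  Fast (one
     * descent plus one block read), but stale if the row is relocated. */
    struct sec_entry_rowid {
        long         sec_key;
        struct rowid loc;
    };

    /* Index-organized style: secondary key -> primary key value, resolved
     * by a second descent of the primary-key B-tree, so it does not care
     * which block currently holds the row version. */
    struct sec_entry_pk {
        long sec_key;
        long primary_key;
    };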

PostgreSQL might be more amenable to this because, if I recall correctly, Postgres table data is index organized already, and old row versions are stored inline and have to be vacuumed, or essentially garbage collected, later. Oracle stores the old row versions needed for multiversion read consistency in separate areas referred to as "rollback segments", which is one of the reasons why, although it supports very long-running transactions, it originally had a hard time keeping up with simpler non-MVCC designs like DB2, Ingres, Informix, Sybase, and MS SQL Server - and (of course) MySQL, especially before MySQL even had support for transactions and ACID properties in the first place. There was a major tradeoff there for years that was usually solved by throwing lots of disk spindles at the problem: separate disks or RAID 1 mirrored pairs for different tablespaces, data files, index segments, rollback segments, control files, and redo logs.

